Tjelvar Olsson2020-05-15T20:53:59+00:00http://tjelvarolsson.comTjelvar OlssonUsing relative paths in Linux scripts2020-05-15T00:00:00+00:00http://tjelvarolsson.com/blog/using-relative-paths-in-linux-scripts<p>In the <a href="/blog/relative-and-absolute-paths-in-linux/">previous post</a>
I discussed the difference between absolute and relative paths.</p>
<p><em>So what is better absolute or relative paths? Which one should be used when
one needs to refer to a file in a script?</em></p>
<p>Let me prefix my answer with the caveat that all paths are a pain. However,
absolute paths are more of a pain than relative paths. This is because absolute
paths make it difficult to restructure the way that directories are organised
on your computer. They also make it difficult for you to share your scripts
with collaborators because they would need their computer to be structured in
exactly the same way as yours. It is possible to get around some of these
issues by using relative paths.</p>
<p>This post will show you how to create scripts that have a clear separation between
raw and derived data using relative paths. As a bonus the scripts will also be
more portable and less fragile with respect to reorganisations of your
directory structure.</p>
<p>This post makes use of some more advanced Linux skills including the use of
environment variables, the creation of a Bash script and adding execute
permissions to the script. You don’t need to worry too much about these details
if they are new to you. An environment variable is a means to store a piece of
information for use later, a Bash script is a text file with commands to run,
and execute permissions allow the script to be run by referencing its path. If
you would like to learn more about these topics please let me know.</p>
<iframe width="560" height="315" src="https://www.youtube.com/embed/YsyPTvIq57s" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen=""></iframe>
<p>First of all, let us recap and get set up to where the previous post ended.
We need two subdirectories <code class="language-plaintext highlighter-rouge">raw_data</code> and <code class="language-plaintext highlighter-rouge">scripts</code>. The <code class="language-plaintext highlighter-rouge">mkdir</code> command
below will create these if they do not already exist (the <code class="language-plaintext highlighter-rouge">-p</code> flag means
that no errors are generated if the directories already exist).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir -p raw_data scripts
</code></pre></div></div>
<p>We also need a file with raw data. The command below creates this file if
it does not exist and <strong>overwrites</strong> it if it already exists.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo "Raw data isn't baked data" > raw_data/raw_data.txt
</code></pre></div></div>
<p>To illustrate the use of relative paths in scripts create a file named
<code class="language-plaintext highlighter-rouge">analysis.sh</code> in your <code class="language-plaintext highlighter-rouge">scripts</code> directory, i.e. with the relative path
<code class="language-plaintext highlighter-rouge">./scripts/analysis.sh</code>, and copy and paste the code below into it.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c">#!/bin/bash</span>
<span class="c"># Save the current working directory in an environment variable.</span>
<span class="nv">INITIAL_WORKING_DIRECTORY</span><span class="o">=</span><span class="si">$(</span><span class="nb">pwd</span><span class="si">)</span>
<span class="c"># This line changes the current working directory to where</span>
<span class="c"># the analysis.sh file is.</span>
<span class="nb">cd</span> <span class="s2">"</span><span class="si">$(</span><span class="nb">dirname</span> <span class="s2">"</span><span class="nv">$0</span><span class="s2">"</span><span class="si">)</span><span class="s2">"</span>
<span class="c"># Create an environment variable with the relative path to the</span>
<span class="c"># derived data directory.</span>
<span class="nv">DERIVED_DATA_DIRECTORY</span><span class="o">=</span>../derived_data
<span class="c"># Create the derived data directory if it does not already exist.</span>
<span class="nb">mkdir</span> <span class="nt">-p</span> <span class="nv">$DERIVED_DATA_DIRECTORY</span>
<span class="c"># This code streams the content of the raw data file into the sed</span>
<span class="c"># stream editor. The sed stream editor is used to edit the content</span>
<span class="c"># of the stream. Finally, the output of sed is redirected to a</span>
<span class="c"># derived_data.txt file in the derived data directory.</span>
<span class="nb">cat</span> ../raw_data/raw_data.txt <span class="se">\</span>
| <span class="nb">sed</span> <span class="nt">-e</span> <span class="s2">"s/Raw/Fudged/"</span> <span class="se">\</span>
| <span class="nb">sed</span> <span class="nt">-e</span> <span class="s2">"s/isn't/is/"</span> <span class="se">\</span>
<span class="o">></span> <span class="nv">$DERIVED_DATA_DIRECTORY</span>/derived_data.txt
<span class="c"># Go back to where we were before changing into the</span>
<span class="c"># scripts directory.</span>
<span class="nb">cd</span> <span class="s2">"</span><span class="nv">$INITIAL_WORKING_DIRECTORY</span><span class="s2">"</span>
</code></pre></div></div>
<p>The code above works with relative paths. The paths are relative to the
<code class="language-plaintext highlighter-rouge">scripts</code> directory. That means that the outcome of the script will be
independent of the directory one is in when running the script, i.e. no nasty
side effects of input files not being found or output files being written to
the wrong directory.</p>
<p>To achieve this the script first makes a note of the directory you are
currently in and stores it in the <code class="language-plaintext highlighter-rouge">INITIAL_WORKING_DIRECTORY</code> environment
variable. The script then changes the working directory to be that of the
<code class="language-plaintext highlighter-rouge">analysis.sh</code> script. At this point the script can start working with paths
relative to the <code class="language-plaintext highlighter-rouge">scripts</code> directory.</p>
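<p>The following is a minimal sketch of what <code class="language-plaintext highlighter-rouge">dirname</code> does with the path used to invoke the script. The variable names <code class="language-plaintext highlighter-rouge">script_path</code> and <code class="language-plaintext highlighter-rouge">script_dir</code> are hypothetical; <code class="language-plaintext highlighter-rouge">script_path</code> simply stands in for the value that <code class="language-plaintext highlighter-rouge">$0</code> would hold inside the script.</p>

```shell
# Hypothetical stand-in for $0, i.e. the path used to invoke the script.
script_path="./scripts/analysis.sh"

# dirname strips the final component and leaves the containing directory.
script_dir="$(dirname "$script_path")"

echo "$script_dir"   # prints ./scripts
```

<p>Changing into that directory is what makes the later <code class="language-plaintext highlighter-rouge">../</code> paths resolve relative to the <code class="language-plaintext highlighter-rouge">scripts</code> directory.</p>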
<p>The details of the analysis in this script do not really matter. It creates a
directory for derived data (<code class="language-plaintext highlighter-rouge">../derived_data</code>) if it does not already exist.
It then takes the raw data as input and transforms it before writing it to a file
in the derived data directory.</p>
<p>Finally, and importantly, the script changes the working directory back to
whatever it was before the script was invoked.</p>
<p>To test the script, ensure that you are not in the <code class="language-plaintext highlighter-rouge">scripts</code> directory.
In the command below I change into my home directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd /home/olssont
</code></pre></div></div>
<p>In the above I’m using an absolute path to make it clear which directory I am
referring to. Depending on your setup this path may be different. For clarity,
I am referring to the directory in which you created the <code class="language-plaintext highlighter-rouge">raw_data</code> and
<code class="language-plaintext highlighter-rouge">scripts</code> directories.</p>
<p>Before we run the script we need to use the <code class="language-plaintext highlighter-rouge">chmod</code> command to make it
executable.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>chmod +x ./scripts/analysis.sh
</code></pre></div></div>
<p>We can then run the script by calling its path.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>./scripts/analysis.sh
</code></pre></div></div>
<p>This will have created a directory called <code class="language-plaintext highlighter-rouge">derived_data</code> at the same level
as the <code class="language-plaintext highlighter-rouge">scripts</code> directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls
derived_data raw_data scripts
</code></pre></div></div>
<p>Let us also use the <code class="language-plaintext highlighter-rouge">cat</code> command to look at the content of the
<code class="language-plaintext highlighter-rouge">./derived_data/derived_data.txt</code> file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat ./derived_data/derived_data.txt
Fudged data is baked data
</code></pre></div></div>
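<p>If you want to convince yourself that the result does not depend on where the script is run from, something along the lines of the sketch below may help. It is an illustration only: it works in a temporary directory, and <code class="language-plaintext highlighter-rouge">mini_analysis.sh</code> is a hypothetical cut-down stand-in for <code class="language-plaintext highlighter-rouge">analysis.sh</code> that performs just one substitution.</p>

```shell
# Work in a temporary directory so nothing in your own directories is touched.
workdir="$(mktemp -d)"
mkdir -p "$workdir/raw_data" "$workdir/scripts"
echo "Raw data isn't baked data" > "$workdir/raw_data/raw_data.txt"

# A cut-down stand-in for analysis.sh (hypothetical file name).
cat > "$workdir/scripts/mini_analysis.sh" <<'EOF'
#!/bin/bash
# Change to the directory containing this script.
cd "$(dirname "$0")" || exit 1
mkdir -p ../derived_data
sed -e "s/Raw/Fudged/" ../raw_data/raw_data.txt > ../derived_data/derived_data.txt
EOF

# First run: from the directory that contains raw_data and scripts.
cd "$workdir" && bash ./scripts/mini_analysis.sh
first_run="$(cat "$workdir/derived_data/derived_data.txt")"
rm -r "$workdir/derived_data"

# Second run: from an unrelated directory.
cd / && bash "$workdir/scripts/mini_analysis.sh"
second_run="$(cat "$workdir/derived_data/derived_data.txt")"

echo "$first_run"
echo "$second_run"
```

<p>Both runs should write the same content to the same <code class="language-plaintext highlighter-rouge">derived_data</code> directory, next to <code class="language-plaintext highlighter-rouge">scripts</code>.</p>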
<p>In a <a href="/blog/data-management-for-biologists/">previous post</a>
about data management I talked about the need to keep raw data separate from
derived data. In this post I have given you some tips on how you can accomplish
this. Setting up scripts in the fashion outlined above also has the benefit
that it is easier to rename and reorganise directories without your scripts
breaking. Furthermore, it will make it easier for you to share your scripts
with collaborators.</p>
Relative and absolute paths in Linux2020-05-05T00:00:00+00:00http://tjelvarolsson.com/blog/relative-and-absolute-paths-in-linux<p>Paths is a topic that causes a lot of confusion for people that want to learn
how to make use of the command line in Linux. In this post I will explain
what paths are, and the difference between absolute and relative paths.
By the end of this post you should be able to understand the diagram below.</p>
<p><img src="/images/linux_paths.png" alt="Linux paths" /></p>
<p>The file system on Unix-like systems is built up like a tree starting from the
root directory (<code class="language-plaintext highlighter-rouge">/</code>). One can view the content of the root directory by
typing in the command <code class="language-plaintext highlighter-rouge">ls /</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls /
bin boot dev etc home lib lib64 media mnt opt proc root
run sbin srv sys tmp usr var
</code></pre></div></div>
<p>The term “path” refers to the name of a file or directory that can be used to
uniquely identify it in the file system. For example, the command <code class="language-plaintext highlighter-rouge">pwd</code>
prints the name of the current working directory, in this case my home directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pwd
/home/olssont
</code></pre></div></div>
<p>In the above <code class="language-plaintext highlighter-rouge">/home/olssont</code> is the path to the directory that I am currently
in. To illustrate this further we can create a file in this directory. The
command below creates a file named <code class="language-plaintext highlighter-rouge">raw_data.txt</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo "Raw data isn't baked data" > raw_data.txt
</code></pre></div></div>
<p>We can see the files in the working directory using the command <code class="language-plaintext highlighter-rouge">ls</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ls
raw_data.txt
</code></pre></div></div>
<p>Depending on how many files you have in your working directory you may see more
output from the command above.</p>
<p>To print the content of the file one can use the command <code class="language-plaintext highlighter-rouge">cat</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat raw_data.txt
Raw data isn't baked data
</code></pre></div></div>
<p>In the above the text <code class="language-plaintext highlighter-rouge">raw_data.txt</code> could be used to find the file that we
just created. This was because the file was present in the working directory.</p>
<p>So how can we refer to a file if it is not present in the working directory?
This is where the use of <em>absolute</em> and <em>relative</em> paths comes into play.</p>
<p>To illustrate this we will use the command <code class="language-plaintext highlighter-rouge">mkdir</code> to create a new directory
called <code class="language-plaintext highlighter-rouge">raw_data</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir raw_data
</code></pre></div></div>
<p>Then we will use the command <code class="language-plaintext highlighter-rouge">mv</code> to move the file into that directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mv raw_data.txt raw_data/
</code></pre></div></div>
<p>Before illustrating the correct way to refer to the file let us see
what happens if we use the same command as previously.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat raw_data.txt
cat: raw_data.txt: No such file or directory
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">cat</code> command is no longer able to find the file (or rather the file is
no longer there, because we moved it into the <code class="language-plaintext highlighter-rouge">raw_data</code> directory). There
are two methods that you could use to refer to the file in the <code class="language-plaintext highlighter-rouge">raw_data</code>
directory. The first is to use the absolute path, in this case
<code class="language-plaintext highlighter-rouge">/home/olssont/raw_data/raw_data.txt</code>. Note that the absolute path to your
file will be different if your username is different to mine and/or if you are
not working in your home directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat /home/olssont/raw_data/raw_data.txt
Raw data isn't baked data
</code></pre></div></div>
<p>The second method is to make use of a relative path, that means a path that
is relative to your current working directory, in this case <code class="language-plaintext highlighter-rouge">raw_data/raw_data.txt</code>.
The forward slash (<code class="language-plaintext highlighter-rouge">/</code>) is the symbol used to separate directories and files.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat raw_data/raw_data.txt
Raw data isn't baked data
</code></pre></div></div>
<p>A different way to represent this relative path is to prepend it with <code class="language-plaintext highlighter-rouge">./</code>, where
the dot is a symbol that is used to represent the current working directory. Some
people prefer this way because it makes it easier to see that it is a relative path.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat ./raw_data/raw_data.txt
Raw data isn't baked data
</code></pre></div></div>
<p>To illustrate the concept of relative paths further we will create another
directory called <code class="language-plaintext highlighter-rouge">scripts</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir scripts
</code></pre></div></div>
<p>Now we will change our working directory to be <code class="language-plaintext highlighter-rouge">scripts</code> using the <code class="language-plaintext highlighter-rouge">cd</code>
(change directory) command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd scripts
</code></pre></div></div>
<p>Let us see what happens if we run the previous <code class="language-plaintext highlighter-rouge">cat</code> command again (you
should be able to get back to this by using the up and down arrows on your
keyboard to navigate through the command line history).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat ./raw_data/raw_data.txt
cat: ./raw_data/raw_data.txt: No such file or directory
</code></pre></div></div>
<p>The command above fails because it expects to be able to find a directory named
<code class="language-plaintext highlighter-rouge">raw_data</code> in the current working directory and because we have moved into
the <code class="language-plaintext highlighter-rouge">scripts</code> directory there is no such directory.</p>
<p>In order to make this work we need to be able to specify that we want to go
up one level in the directory tree. This can be achieved using double dots,
i.e. using the prefix <code class="language-plaintext highlighter-rouge">../</code>. To refer to something two levels up in the
directory tree one would use the prefix <code class="language-plaintext highlighter-rouge">../../</code>, and so forth.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat ../raw_data/raw_data.txt
Raw data isn't baked data
</code></pre></div></div>
<p>To summarise:</p>
<ul>
<li>Paths are used to refer to unique files</li>
<li>An absolute path starts at the top of the directory tree and includes all
parent directories separated by slashes as well as the file or directory of
interest. Examples of absolute paths include <code class="language-plaintext highlighter-rouge">/home/olssont</code> (a directory),
<code class="language-plaintext highlighter-rouge">/home/olssont/raw_data/raw_data.txt</code> (a file)</li>
<li>Relative paths are used to refer to files and directories with respect to the
current working directory</li>
<li>In a relative path the prefix <code class="language-plaintext highlighter-rouge">./</code> means the current working directory,
the prefix <code class="language-plaintext highlighter-rouge">../</code> means the parent directory, and the prefix <code class="language-plaintext highlighter-rouge">../../</code>
means the parent’s parent directory, and so forth</li>
</ul>
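<p>The points in the summary can be tried out safely with a sketch like the one below. It assumes nothing beyond standard commands and rebuilds the layout from this post inside a temporary directory; the variable names are hypothetical.</p>

```shell
# Recreate the layout from this post in a temporary directory.
workdir="$(mktemp -d)"
mkdir -p "$workdir/raw_data" "$workdir/scripts"
echo "Raw data isn't baked data" > "$workdir/raw_data/raw_data.txt"

# Absolute path: works regardless of the current working directory.
via_absolute="$(cat "$workdir/raw_data/raw_data.txt")"

# Relative path from the top-level directory: ./ is the working directory.
cd "$workdir"
via_dot="$(cat ./raw_data/raw_data.txt)"

# Relative path from the scripts directory: ../ is the parent directory.
cd "$workdir/scripts"
via_dotdot="$(cat ../raw_data/raw_data.txt)"

echo "$via_absolute"
echo "$via_dot"
echo "$via_dotdot"
```

<p>All three commands read the same file; only the way the path is written changes with the working directory.</p>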
<p>This should be enough to get you started with paths in the Linux command line.
In the <a href="/blog/using-relative-paths-in-linux-scripts/">next post</a> I will show how relative paths can be used to make scripts
more portable, and how they can be used to improve your data management.</p>
Homeworking: opportunities for scientists2020-03-24T00:00:00+00:00http://tjelvarolsson.com/blog/homeworking-opportunities-for-scientists<p>Covid-19 is causing worldwide chaos and it is terrible. However, that is not
what this post is about. This post is about staying positive and finding
opportunities. Why? Amidst this chaos it is important to stay sane. You are
no good to anyone if you become a nervous wreck. You need to stay strong to be
able to help your family and friends.</p>
<p>My work is basically centered around helping people get more scientific value
out of the computational resources we have available to us. Some of it is
technical: working with computers. And some of it is softer: working with
people. Just over a week ago I started working from home. The technical
aspects of my work have been relatively easy to transition. The softer parts
of my work still have some way to go. I’m learning more and more about
different types of video conferencing software. To me this presents an
opportunity because I have wanted to do more home working for a while. The
current situation has accelerated that process and I’m keen to make sure I
learn to work from home more efficiently.</p>
<p>Clearly it is easier to work from home if your research is mainly
computational. However, if you are a bench scientist perhaps this presents new
opportunities for you as well. Perhaps this presents an opportunity to learn more
about computational approaches? Perhaps this is the time to learn R? If you
are looking for something like this I have written a book to help you:
<a href="http://biologistsguide2computing.com/">The Biologist’s Guide to Computing</a>
(it is free).</p>
<p>Or perhaps this presents an opportunity to go over all your old data? (Or,
heaven forbid, an opportunity to do some data management? I have also written
software to make this easier: <a href="https://peerj.com/articles/6562/">dtool</a>. It is
also free.) Going over old data with fresh eyes can sometimes lead to new
insights and generate ideas for new manuscripts. Speaking of manuscripts
perhaps this period of working at home presents an opportunity for you to
finish up and submit those manuscripts that have been weighing on your mind for
the past couple of years.</p>
<p>On a different note, schools closed in the UK on Monday and it is likely to
remain that way for the next four months or so. I’ll therefore be one of the
many parents juggling home schooling and working at the same time.</p>
<p>What opportunities present themselves here?</p>
<p>I’ll be spending much more time with my son. That means that I’ll have an
opportunity to teach him about things where I have specialist knowledge.
In particular I am hoping to teach him Python programming.
There are lots of great books out there for teaching kids how to program.
I’ve invested in a copy of <em>Computer Coding Python Games for Kids</em>, because I
had a good experience with a similar book called <em>Coding Games in Scratch</em> from
the same series.</p>
<p>I do not have the skill to do anything to help stop the spread of Covid-19.
However, I do have the skills to help you develop your computational and
data management expertise. Please get in touch if this is something that you
would be interested in. Stay safe, stay sane and stay positive!</p>
New journal for "Patterns" in data2019-11-14T00:00:00+00:00http://tjelvarolsson.com/blog/new-journal-for-patterns-in-data<p><img src="/images/patterns.jpg" alt="Patterns." /></p>
<p>This summer I went to a Research Data Alliance meeting in London. The meeting
was about persistent identifiers, and my goal was to meet like-minded people
that care about data. During the meeting I got talking to Sarah Callaghan.
Sarah described <em>Patterns</em>, the new data science journal she was setting up.
After the meeting our discussion continued via email and Skype and Sarah asked
me to become a member of the academic advisory board.</p>
<p>Data science is an emerging discipline that is gaining more and more attention
both in academia and industry. It is a multi-disciplinary field and it is not
limited to data analysis. It also includes topics such as data cleaning,
computational infrastructure, as well as legal and policy aspects of data. This
can present problems for academic researchers. How do you publish, and get
credit for, work that you have done on developing data science infrastructure
or policy?</p>
<p>In this interview Sarah describes how Cell Press are creating a new journal,
called <em>Patterns</em>, to try to help alleviate this and other problems with
knowledge sharing in data science. Sarah has a 20-year career in creating,
managing, and analysing scientific data and she is <em>Patterns’</em> Editor-in-Chief.</p>
<p><strong>Tjelvar Olsson:</strong> What prompted the creation of <em>Patterns</em>?</p>
<p><strong>Sarah Callaghan:</strong> Data are everywhere, and we’re producing more and more of it
as time goes by. A lot of the time that data creation is intentional, like when
a researcher designs and runs an experiment and collects the data. Sometimes
that data creation is unintentional, like when a supermarket customer buys one
brand instead of another. But regardless of how the data was created, there are
uses for it, whether that’s in developing new science, or in figuring out how
to market a new type of toothpaste.</p>
<p>One common trend across all the domains that create and manage data is this:
everyone has common problems in dealing with data. Everyone, whether they’re an
astronomer or a zoologist, has problems with data collection, cleaning, sharing
with other researchers, understanding the legal and policy aspects of data,
analysing it, and publishing it. And researchers in different domains have come
up with solutions to those problems that work in their particular domain, but
could also be usefully shared across domains.</p>
<p>That cross-disciplinary knowledge sharing is growing, but it’s not quite there
yet. <em>Patterns</em> is all about providing a forum for researchers to share their
data-related solutions, tools, methods and analyses across multiple domains.
There is a lot of really exciting and innovative work out there that has not
gotten out to the wider world – <em>Patterns</em> is here to help change that!</p>
<p><strong>TO:</strong> That sounds fantastic. I certainly know the feeling of struggling to find
out how others have tackled problems that I’m facing working with scientific
infrastructure and data management. It would be great to be able to read about
solutions and lessons learnt by others.</p>
<p>I’ve never met anyone who has set up a new journal before. What are the
challenges involved in this?</p>
<p><strong>SC:</strong> Lots! First and foremost, the main challenge is getting the word out.
People can’t submit their articles to a journal that they don’t know exists. So
(at the risk of going all marketing-speak) building a journal brand is
important. That includes setting a scope and aims that will suit the journal
audience, and recruiting an advisory board who will promote the journal in
their own networks.</p>
<p>Getting people enthused and interested in the journal is also vital, especially
in the case of <em>Patterns</em>, where we’re bringing together different communities
into a new, more-inclusive group. Data are fundamental to research, regardless
of what your domain is, so <em>Patterns</em> is bringing together computer scientists,
data stewards and engineers, and researchers in data-intensive domains in order
to share solutions and knowledge.</p>
<p>Commissioning papers for a new journal is also a challenge. Because <em>Patterns</em> is
new, there can be a bit of convincing required to get authors to submit their
articles to my journal, rather than a more established one. This is where
<em>Patterns</em>’ cross-disciplinary focus and open access nature have added value –
it allows researchers to reach readers outside their usual domains.</p>
<p>From a personal point of view, setting up a new journal means a lot of
travelling to conferences and meetings, and even more talking to people about
their research in order to commission papers (which to be fair, I do enjoy).
And email. Lots of email!</p>
<p><strong>TO:</strong> It sounds like you get to talk to lots of people about data science. I
think this means that you have your finger on the pulse in this field. How do
you think data science and management will develop over the next ten years?</p>
<p><strong>SC:</strong> I think there’ll be a lot more of it, and there will be different
variations in the roles and job titles associated with data. At the moment, a
“data scientist” role can cover a wide range of skills and talents, and as a
title, it means different things to different people.</p>
<p>I also think that we’re on the cusp of a change in the way that data is
produced and dealt with. The closest analogy I can think of is the industrial
revolution, where goods moved from being produced as piecework, done by
individuals, to being produced in factories. Historically, with data, datasets
have been hand created in their own formats by individuals or small groups. The
landscape has moved to large scale data creation, and to deal with the issues
that come out of that, you need things like infrastructure and standards to
drive tools and services.</p>
<p>Academics aren’t the only ones doing research into data science – there is a
lot of very interesting and exciting work being done in the business domain. I
expect, in the next decade or so, that we will see more of the innovations
developed by business to work with data rolled out more widely across research.
This is already happening with advances in computer vision for example. And
it’s only a little stretch to see how the same artificial intelligence network
that can count people in a crowd could be repurposed to count antelope in a
herd.</p>
<p><strong>TO:</strong> What types of manuscripts should people be submitting to <em>Patterns</em>?</p>
<p><strong>SC:</strong> I am always looking out for exciting, innovative original research
where a data science solution has been applied to a problem in a research
domain, and that solution has the potential to be applied to different domains
too. The solution doesn’t have to be complete, in fact <em>Patterns</em> has developed
a Data Science Maturity Level scale in order to help readers understand what
stage the research is at.</p>
<p><em>Patterns</em> also publishes descriptor articles – which are papers that describe a
data science resource, whether that’s a dataset, piece of software,
infrastructure, workflow, algorithm, even a piece of hardware. As long as the
resource can be uniquely and unambiguously identified and is useful to the
wider community, then an article about it is in scope. This allows the
researchers who spend their time building, for example, infrastructures, to
gain academic credit for their work.</p>
<p>I am also interested in opinion pieces and reviews on topics of interest to the
community. Reviews can be on the literature around a certain topic in data
science (e.g. GANs, blockchain, knowledge graphs, etc.) or can be on types of
software and tools, highlighting their strengths, weaknesses and uses for the
community.</p>
<p>Fundamentally, I want to publish interesting, exciting and innovative work that
people from a wide range of domains want to read!</p>
<p><strong>TO:</strong> Where can people find out more about <em>Patterns</em> and how to submit
manuscripts to it?</p>
<p><strong>SC:</strong> We have a very pretty and informative website at
<a href="http://www.cell.com/patterns">http://www.cell.com/patterns</a> where you can find
all the information needed by authors to write and submit their article. This
includes details of the article types, and the aims and the scope of
<em>Patterns</em>. There’s also the link on that page to the system where you can
submit your manuscript, and also another link so that you can get the journal
e-table of contents delivered free to your inbox when each issue is released.</p>
<p>We’re on Twitter too (<a href="https://twitter.com/Patterns_CP">@Patterns_CP</a>) where we’ll be promoting our content and
also sharing cool and interesting data science things (and pretty pictures of
interesting patterns I come across when out and about).</p>
<p>And of course, if the readers of this interview have any other questions, or
want to discuss whether or not their research is suitable for <em>Patterns</em>, then
I’d be very glad to hear from you! My email is s.callaghan@cell.com</p>
<p>I’d just like to finish up by saying that the future for data science is bright
– let’s make it together!</p>
Packaging data and metadata using dtool2019-03-19T00:00:00+00:00http://tjelvarolsson.com/blog/packaging-data-and-metadata-using-dtool<h2 id="introduction">Introduction</h2>
<p>In the <a href="/blog/data-management-for-biologists/">previous post</a>
I described four principles for effective data management.</p>
<ol>
<li>Make it clear who is responsible for what</li>
<li>Keep raw data safe and separate from derived data</li>
<li>Standardise the location and structure of data</li>
<li>Provide metadata</li>
</ol>
<p>Getting a research group together and discussing these principles can lead to a
more coherent strategy for managing data. However, the fourth principle
presents a challenge in that currently there is no perfect solution for
associating metadata with data.</p>
<p><em>What is metadata anyway?</em></p>
<p>Metadata is data about data. Take, for example, an experiment comparing the
expression profiles of different tissues in mouse.
In this example the species, <em>Mus musculus</em>, is a key piece of metadata that needs
to be recorded and associated with the data.
Another key piece of metadata to make sense of the data is the tissue.
In other words one would need to record the tissue associated with each expression profile.
These types of metadata are called descriptive metadata. Without these
descriptive metadata it would be impossible to draw any conclusions from the
data.</p>
<p>When working with digital files one can also think of sizes and checksums
of the files themselves as metadata. This type of metadata is called structural
metadata. Structural metadata can be useful to ensure that files have not become corrupted.
For example, sequencing companies typically provide MD5 checksums alongside
the raw data files so that the downloaded sequence files can be verified to contain the
expected content.</p>
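<p>As a rough illustration of what structural metadata looks like in practice, the
sketch below computes the size and MD5 checksum of a file using only the Python
standard library. The function name <code class="language-plaintext highlighter-rouge">structural_metadata</code> and the example
file name are made up for this illustration; this is not how any particular
sequencing company or tool does it.</p>

```python
import hashlib
import os
import tempfile


def structural_metadata(path, chunk_size=1024 * 1024):
    """Return the size in bytes and the MD5 hex digest of the file at path."""
    md5 = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so that large sequence files do not have to fit in memory.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            md5.update(chunk)
    return {"size_in_bytes": os.path.getsize(path), "md5": md5.hexdigest()}


if __name__ == "__main__":
    # Hypothetical stand-in for a downloaded sequence file.
    with tempfile.TemporaryDirectory() as tmp:
        example = os.path.join(tmp, "reads.fastq")
        with open(example, "wb") as fh:
            fh.write(b"@read1\nACGT\n+\nIIII\n")
        print(structural_metadata(example))
```

<p>Comparing the computed checksum with the one supplied alongside the download
is then a simple string comparison.</p>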
<p>There is also a third type of metadata called administrative metadata.
Administrative metadata is used to manage data as a resource. For example a
UniProt identifier is a piece of administrative metadata used to manage a protein
in the UniProt database.</p>
<p>Although metadata is essential for making sense of data, finding solutions for
managing metadata can be difficult. In some cases metadata resides inside the heads
of individuals. This is not a sustainable solution!</p>
<p>One strategy for associating metadata with files is to include the metadata in directory
structures and file names. This takes the form of file paths along the lines of
<code class="language-plaintext highlighter-rouge">replicate_1/chitin/col0_leaf_1.tif</code>. Here the file name tells us that the
image is from leaf sample one from the Columbia-0 ecotype of <em>A. thaliana</em>.
The directory structure encodes that this is replicate one and that the sample
has been treated with chitin.</p>
<p>Using file names and directory structures to store metadata is better than
keeping it in one’s head. However, it is also fragile in that the metadata can
easily be lost if one moves or renames the file.</p>
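<p>To make the fragility concrete, here is a minimal sketch that decodes metadata
from a path laid out like the example above. The function
<code class="language-plaintext highlighter-rouge">metadata_from_path</code> and the assumed naming scheme
(<code class="language-plaintext highlighter-rouge">replicate_N/treatment/ecotype_tissue_N.ext</code>) are illustrative assumptions, not
part of any tool.</p>

```python
from pathlib import PurePosixPath


def metadata_from_path(relpath):
    """Decode metadata from replicate_N/treatment/ecotype_tissue_N.ext."""
    path = PurePosixPath(relpath)
    replicate_dir, treatment = path.parts[-3], path.parts[-2]
    # Raises ValueError if the file name does not have exactly three parts.
    ecotype, tissue, sample = path.stem.split("_")
    return {
        "replicate": int(replicate_dir.split("_")[1]),
        "treatment": treatment,
        "ecotype": ecotype,
        "tissue": tissue,
        "sample": int(sample),
    }


if __name__ == "__main__":
    print(metadata_from_path("replicate_1/chitin/col0_leaf_1.tif"))
    # Renaming the file discards part of the metadata and breaks the decoding.
    try:
        metadata_from_path("replicate_1/chitin/image_final.tif")
    except ValueError as err:
        print("Metadata lost:", err)
```

<p>As soon as someone renames the file, the decoding fails and the metadata it
carried is gone.</p>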
<p>In this post I will describe our approach to overcoming this problem.</p>
<h2 id="executive-summary">Executive summary</h2>
<p>Our solution was to develop
<a href="https://dtool.readthedocs.io">dtool</a>, a utility to package metadata with data and treat
the two as a unified whole. In dtool terminology the packaged data and metadata is
referred to as a <em>dataset</em>.</p>
<p>A dataset can be likened to a box with items in it and a label on it describing
its content. The items in the box are the data and the label the metadata.</p>
<p><img src="/images/package_data_and_metadata_into_beautiful_box.png" alt="Packaging data and metadata into a beautiful box." /></p>
<p>There are several benefits to this approach, some of which only become apparent
once one has spent some time using dtool to manage data. However, in brief,
dtool prevents accidental loss of metadata when moving data around. It
also enables researchers to store and work with data in a variety of
storage solutions, and it has built in support for verifying the integrity
of a dataset. In other words dtool automates a lot of tedious work associated with
data management, and gives researchers peace of mind that their data are safe
and secure.</p>
<p>The dtool software was recently published in PeerJ <a href="https://peerj.com/articles/6562/?td=bl">Lightweight data management
with dtool</a>.</p>
<h2 id="the-hairy-details">The hairy details</h2>
<p>At its core dtool is a command line utility (with a Python API) that can be used
to create and interact with datasets.</p>
<p>First of all one needs to install the
software. This can be done using the Python package installer
<a href="https://pip.pypa.io/en/stable/installing/">pip</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pip install dtool
</code></pre></div></div>
<p>dtool can be used to retrieve and display the descriptive metadata of a
dataset. In the example below the URL refers to a dataset hosted in the cloud.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool readme show http://bit.ly/Ecoli-ref-genome
description: U00096.3 genome with Bowtie2 indices
organism: Escherichia coli str. K-12 substr. MG1655
accession_id: U00096.3
link: https://www.ebi.ac.uk/ena/data/view/U00096.3
index_builder: bowtie2-build version 2.3.3
index_build_cmd: bowtie2-build U00096.3.fasta reference
</code></pre></div></div>
<p>From this metadata one can discern that this dataset contains an <em>E. coli</em>
reference genome with Bowtie2 indices.</p>
<p>Using dtool it is possible to list the files in the dataset.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool ls http://bit.ly/Ecoli-ref-genome
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74 reference.2.bt2
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf reference.3.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e reference.1.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc U00096.3.fasta
37e2d68bb38271036d96b6979d24666e0d4fd814 reference.rev.1.bt2
23ebd7cd21a905d5f255919ca1d0491901cb8718 reference.4.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404 reference.rev.2.bt2
</code></pre></div></div>
<p>The output above lists identifiers and the relative paths of all the files in
the dataset. In dtool terminology the files in a dataset are referred to as
<em>items</em>.</p>
<p>It is also possible to get administrative and structural metadata from a dataset.
This can be achieved using the <code class="language-plaintext highlighter-rouge">dtool summary</code> command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool summary http://bit.ly/Ecoli-ref-genome
name: Escherichia-coli-ref-genome
uuid: 8ecd8e05-558a-48e2-b563-0c9ea273e71e
creator_username: olssont
number_of_items: 7
size: 18.8MiB
frozen_at: 2018-09-26
</code></pre></div></div>
<p>From this one can see, amongst other things, that the data is 18.8MiB in
size and that it has been given the Universally Unique Identifier (UUID)
<code class="language-plaintext highlighter-rouge">8ecd8e05-558a-48e2-b563-0c9ea273e71e</code>.</p>
<p>This particular dataset can be useful if one has <em>E. coli</em> RNA sequencing data
that one wants to align using Bowtie2. However, in order to make use of the
dataset one needs to download it from the cloud to the local filesystem. In the
example below a directory for storing datasets is created, and dtool is used to
download the dataset into this directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir datasets
$ dtool cp -q http://bit.ly/Ecoli-ref-genome datasets/
file:///Users/olssont/datasets/Escherichia-coli-ref-genome
</code></pre></div></div>
<p>The command above achieved a lot. It downloaded all the data and metadata from
a dataset stored in the cloud, in an Amazon S3 bucket to be precise, and
reconstructed the dataset on local disk. Note that this involved working with
two different storage technologies, both S3 object storage and filesystem.</p>
<p>All the commands that we have been using on the dataset hosted in the cloud
work the same on the dataset stored on the local filesystem.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool readme show datasets/Escherichia-coli-ref-genome
description: U00096.3 genome with Bowtie2 indices
organism: Escherichia coli str. K-12 substr. MG1655
accession_id: U00096.3
link: https://www.ebi.ac.uk/ena/data/view/U00096.3
index_builder: bowtie2-build version 2.3.3
index_build_cmd: bowtie2-build U00096.3.fasta reference
</code></pre></div></div>
<p>The structure of the dataset on the local filesystem is shown below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tree datasets/Escherichia-coli-ref-genome
datasets/Escherichia-coli-ref-genome
├── README.yml
└── data
    ├── U00096.3.fasta
    ├── reference.1.bt2
    ├── reference.2.bt2
    ├── reference.3.bt2
    ├── reference.4.bt2
    ├── reference.rev.1.bt2
    └── reference.rev.2.bt2
1 directory, 8 files
</code></pre></div></div>
<p>From the above we can see that the data files are stored in a subdirectory
named <code class="language-plaintext highlighter-rouge">data</code>. The descriptive metadata is stored in the <code class="language-plaintext highlighter-rouge">README.yml</code> file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat datasets/Escherichia-coli-ref-genome/README.yml
description: U00096.3 genome with Bowtie2 indices
organism: Escherichia coli str. K-12 substr. MG1655
accession_id: U00096.3
link: https://www.ebi.ac.uk/ena/data/view/U00096.3
index_builder: bowtie2-build version 2.3.3
index_build_cmd: bowtie2-build U00096.3.fasta reference
</code></pre></div></div>
<p>On the filesystem the data and metadata are stored in files. Furthermore, the
metadata files are plain text and make use of open standards. This makes it
possible to read and understand them without the need for specialised tools.</p>
<p>In this section three important features of dtool have been highlighted:</p>
<ol>
<li>
<p>The dtool command line interface can be used to inspect a dataset’s metadata
allowing one to understand the content of the dataset.</p>
</li>
<li>
<p>When copying a dataset with dtool both the data and the metadata are copied
across. This means that it is possible to copy datasets, for example to
long-term storage systems, without fear of losing metadata.</p>
</li>
<li>
<p>dtool supports several storage systems including both filesystem and Amazon
S3 object storage. This makes it possible to copy datasets between different
storage systems without having to learn the specifics (and quirks) of the
various storage systems.</p>
</li>
</ol>
<h2 id="creating-a-dataset">Creating a dataset</h2>
<p>So far the use and benefits of dtool have been illustrated
using an existing dataset. Now we will go through the
process of creating a dataset.</p>
<p>The creation of a dataset happens in three stages:</p>
<ol>
<li>One creates a “proto” dataset that one can add data and metadata to</li>
<li>One adds the data and metadata to the proto dataset</li>
<li>One converts the proto dataset into a dataset by “freezing” it</li>
</ol>
<p>This can be likened to creating an open box (the proto dataset), putting items
(data) into it, sticking a label (metadata) on it, and closing the box
(freezing the dataset).</p>
<p><img src="/images/package_data_and_metadata_into_beautiful_box.png" alt="Packaging data and metadata into a beautiful box." /></p>
<p>Now we will create a minimal dataset containing a single file with the content
<code class="language-plaintext highlighter-rouge">Hola Mundo</code>. The command below creates a proto dataset named <code class="language-plaintext highlighter-rouge">hello</code> in
the <code class="language-plaintext highlighter-rouge">datasets</code> directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool create hello datasets/
Created proto dataset file:///Users/olssont/datasets/hello
Next steps:
1. Add raw data, eg:
dtool add item my_file.txt file:///Users/olssont/datasets/hello
Or use your system commands, e.g:
mv my_data_directory /Users/olssont/datasets/hello/data/
2. Add descriptive metadata, e.g:
dtool readme interactive file:///Users/olssont/datasets/hello
3. Convert the proto dataset into a dataset:
dtool freeze file:///Users/olssont/datasets/hello
</code></pre></div></div>
<p>Now we add a file named <code class="language-plaintext highlighter-rouge">greeting.txt</code> to the proto dataset.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo "Hola Mundo" > datasets/hello/data/greeting.txt
</code></pre></div></div>
<p>There are several ways to add descriptive metadata to a dataset. Below we make
use of dtool’s built-in template to interactively prompt for metadata to
describe the dataset.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool readme interactive datasets/hello
description [Dataset description]: Hello World greeting in Spanish
project [Project name]: dtool demo
confidential [False]:
personally_identifiable_information [False]:
name [Tjelvar Olsson]:
email [tjelvar.olsson@dtool-solutions.com]:
username [olssont]:
Updated readme
To edit the readme using your default editor:
dtool readme edit file:///Users/olssont/datasets/hello
</code></pre></div></div>
<p>Finally, we need to convert the proto dataset into a dataset by freezing it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool freeze datasets/hello
Generating manifest [####################################] 100% greeting.txt
Dataset frozen file:///Users/olssont/datasets/hello
</code></pre></div></div>
<p>Congratulations, you have just created your first dtool dataset!</p>
<h2 id="validating-the-integrity-of-a-dataset">Validating the integrity of a dataset</h2>
<p>The <code class="language-plaintext highlighter-rouge">dtool freeze</code> command generates a manifest containing structural metadata.
In the manifest each file in the data directory is given an identifier that is
the SHA1 checksum of the file’s relative path in the data directory. The identifiers
are used to create one record for each data item containing the file’s relative path,
size, checksum and timestamp. Below is the content of the manifest file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cat datasets/hello/.dtool/manifest.json
{
"dtoolcore_version": "3.8.0",
"hash_function": "md5sum_hexdigest",
"items": {
"0ce56d0a6e9baa0c5d170001592c9b9c65d19276": {
"hash": "b4b9e397fb7e08bfeaa54090d2989e53",
"relpath": "greeting.txt",
"size_in_bytes": 11,
"utc_timestamp": 1551631241.827989
}
}
}
</code></pre></div></div>
<p>This information can be used to verify the integrity of the dataset by checking
that the expected items are present and that they have the correct size and content.</p>
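<p>The way a manifest record is put together can be sketched with the Python
standard library. This is an illustration of the idea described above, not
dtool’s actual implementation: the function name <code class="language-plaintext highlighter-rouge">manifest_item</code> is made up,
and encoding the relative path as UTF-8 before hashing is an assumption.</p>

```python
import hashlib
import os
import tempfile


def manifest_item(data_dir, relpath):
    """Return (identifier, record) for one file in data_dir."""
    # Identifier: SHA1 of the relative path (UTF-8 encoding is an assumption).
    identifier = hashlib.sha1(relpath.encode("utf-8")).hexdigest()
    abspath = os.path.join(data_dir, relpath)
    with open(abspath, "rb") as fh:
        # Content checksum: MD5, matching the hash_function in the manifest.
        content_hash = hashlib.md5(fh.read()).hexdigest()
    record = {
        "hash": content_hash,
        "relpath": relpath,
        "size_in_bytes": os.path.getsize(abspath),
        "utc_timestamp": os.path.getmtime(abspath),
    }
    return identifier, record


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        with open(os.path.join(data_dir, "greeting.txt"), "wb") as fh:
            fh.write(b"Hola Mundo\n")  # same content as written by echo
        identifier, record = manifest_item(data_dir, "greeting.txt")
        print(identifier)
        print(record)
```

<p>Note that the identifier depends only on the relative path, so it stays the
same when the content of the file changes, whereas the size and hash do not.</p>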
<p><img src="/images/verify_items_in_box.png" alt="Verify the items in a box." /></p>
<p>Using dtool this type of integrity check can be performed using the <code class="language-plaintext highlighter-rouge">dtool
verify</code> command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool verify --full datasets/hello
All good :)
</code></pre></div></div>
<p>In the command above we use the <code class="language-plaintext highlighter-rouge">--full</code> flag to include the step to compute and
compare the checksum. Only item identifiers and sizes are verified by default as
computing checksums can be time consuming for datasets that contain lots of large
files.</p>
<p>We can simulate data corruption by editing the <code class="language-plaintext highlighter-rouge">data/greeting.txt</code> file in the dataset.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ echo "Bonjour le Monde" > datasets/hello/data/greeting.txt
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">data/greeting.txt</code> file no longer contains the expected content; it has
been corrupted. Let’s see the output of the <code class="language-plaintext highlighter-rouge">dtool verify</code> command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ dtool verify --full datasets/hello
Altered item size: 0ce56d0a6e9baa0c5d170001592c9b9c65d19276 greeting.txt
Altered item hash: 0ce56d0a6e9baa0c5d170001592c9b9c65d19276 greeting.txt
</code></pre></div></div>
<p>In the above the content of the <code class="language-plaintext highlighter-rouge">hello/data</code> directory is compared against
the expected content stored in the manifest. In this case both the file size and
checksum of the <code class="language-plaintext highlighter-rouge">greeting.txt</code> file are different and this is reported back
to the user.</p>
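<p>The logic of such a check can be sketched in a few lines of Python. This is a
simplified illustration under stated assumptions rather than dtool’s own code:
the function <code class="language-plaintext highlighter-rouge">verify</code> and the “Missing item” message are hypothetical, and
the sketch does not look for unexpected extra files. The <code class="language-plaintext highlighter-rouge">full</code> argument
mirrors the idea of only computing checksums on request.</p>

```python
import hashlib
import os
import tempfile


def verify(data_dir, manifest, full=False):
    """Return a list of problems found when comparing data_dir to manifest."""
    problems = []
    for identifier, item in manifest["items"].items():
        path = os.path.join(data_dir, item["relpath"])
        if not os.path.isfile(path):
            problems.append(f"Missing item: {identifier} {item['relpath']}")
            continue
        if os.path.getsize(path) != item["size_in_bytes"]:
            problems.append(f"Altered item size: {identifier} {item['relpath']}")
        if full:
            # Checksums are only computed on request as they can be slow.
            with open(path, "rb") as fh:
                digest = hashlib.md5(fh.read()).hexdigest()
            if digest != item["hash"]:
                problems.append(f"Altered item hash: {identifier} {item['relpath']}")
    return problems


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as data_dir:
        path = os.path.join(data_dir, "greeting.txt")
        with open(path, "wb") as fh:
            fh.write(b"Hola Mundo\n")
        # Build a one-item manifest describing the intact file.
        manifest = {"items": {
            hashlib.sha1(b"greeting.txt").hexdigest(): {
                "hash": hashlib.md5(b"Hola Mundo\n").hexdigest(),
                "relpath": "greeting.txt",
                "size_in_bytes": 11,
            }
        }}
        print(verify(data_dir, manifest, full=True))
        # Simulate corruption by overwriting the file.
        with open(path, "wb") as fh:
            fh.write(b"Bonjour le Monde\n")
        for problem in verify(data_dir, manifest, full=True):
            print(problem)
```

<p>A change that keeps the file size the same would only be caught by the
checksum comparison, which is why a full check is worth running when time
allows.</p>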
<h2 id="discussion">Discussion</h2>
<p>In this post I have shown how one can use dtool to package data and metadata
into a unified whole. Using dtool to manage data provides several benefits:</p>
<ul>
<li>It prompts people to add metadata to describe their data, making the data
more reusable</li>
<li>It standardises the structure of the metadata, making it easier to access
the metadata</li>
<li>It makes it possible to verify the integrity of a dataset, providing peace of mind that data is intact</li>
<li>It makes it possible to copy a dataset without fear of losing metadata</li>
<li>It makes it possible to copy a dataset between different types of storage
systems, e.g. from filesystem to Amazon S3 object storage</li>
</ul>
<p>There are several aspects of dtool this post did not go into. For example, it
is possible to customise the template used to prompt for descriptive metadata.
This, and other more advanced topics, will be the topics of future blog posts.</p>
<p>If you are keen to find out more about dtool I suggest having a look at the paper
<a href="https://peerj.com/articles/6562/?td=bl">Lightweight data management with dtool</a>
and the <a href="https://dtool.readthedocs.io">dtool documentation</a>.</p>
<p>If you have made it this far you deserve a lollipop!</p>
Data management for biologists2019-02-26T00:00:00+00:00http://tjelvarolsson.com/blog/data-management-for-biologists<h2 id="introduction">Introduction</h2>
<p>Data management is a great challenge in the biological sciences and discussing
it is often difficult because it is a multi-faceted problem and the term “data
management” often means different things to different people.</p>
<p>Over the past couple of years I have become more and more involved in helping
biological research groups manage their data. The first step in this process is
typically a meeting that gathers all the members of the research group or project
with the aim of getting people onto the same page.</p>
<p>During these sessions the participants are often surprised to find out how
different their points of view are from those of other people in the group. There is
typically a split along two axes. The first axis is project leaders vs group members.
The former are more concerned with the long term safety and
viability of the data produced by the group and the latter are more concerned
with the limitations of the tools available for them to do their day to day
work. The second axis is
experimental biologists vs bioinformaticians. The former have pain
points around managing distributed versions of Word and Excel files and the
latter struggle with having enough storage quota on the computer
cluster to analyse the high-throughput sequencing data produced by the group.</p>
<p>Having mediated many such meetings it has become clear that there is not one
solution that fits all. Each research group and project has its own quirks and
the members need to find a solution that works for them. However,
there are some general principles that can help guide a group
towards a more consistent and coherent way of managing their data.</p>
<p>In this post I’d like to share these guiding principles that I use to mediate
these types of group data management sessions.</p>
<h2 id="principle-1-make-it-clear-who-is-responsible-for-what">Principle 1: Make it clear who is responsible for what</h2>
<p>In terms of data management, responsibilities are often implicitly assumed to be
with someone else. Let’s illustrate this with an example.</p>
<p>Ambitious Anna is an established group leader who has started making more and
more use of next generation sequencing. Two of the people in Ambitious Anna’s
group are Fastidious Fatima and Binary Beatrice. Fastidious Fatima, an
experimental biology post doc, prepares a large batch of samples and sends it
off for sequencing with Nebulous New Sequencing Ltd. After a month of waiting
Nebulous New Sequencing sends Fastidious Fatima an email with instructions for
how to download her 100GB of sequencing data. Fastidious Fatima is busy
preparing more samples and she asks Binary Beatrice, the group
bioinformatician, to download the data. Binary Beatrice is happy to help,
particularly as she needs to process the data anyway.</p>
<p>Ambitious Anna, the group leader, has touched neither the experimental samples
nor the raw data produced by Nebulous New Sequencing
Ltd. So Ambitious Anna implicitly assumes that the people in her group are
managing the data.</p>
<p>Fastidious Fatima, the experimental biologist, has a record of her experimental
work and samples in her lab notebook. However, since she did not download or
process the sequencing data produced by Nebulous New Sequencing Ltd she assumes
that Binary Beatrice and Ambitious Anna are managing that data.</p>
<p>Binary Beatrice, the bioinformatician, is overworked. As well as analysing data
produced by Fastidious Fatima she also has another six experimental biologists
to support. On top of this she needs to find another post-doc as her contract
runs out in three months’ time. Binary Beatrice therefore thinks that it is
Ambitious Anna’s job, as the group leader, to ensure that the data is managed
properly.</p>
<p>In this contrived example all the actors implicitly assume that the
management of data is somebody else’s responsibility.</p>
<p>Getting everyone into a room to discuss data management can help improve this
situation. By explicitly stating who is responsible for what, data are less likely
to fall between the cracks.</p>
<p>One may consider using the template below for assigning responsibilities.</p>
<p><em>Ultimately data management is the responsibility of the group leader. However,
in practice the group leader is unlikely to be working with data on a day to
day basis so he or she needs to delegate this responsibility to a data
champion. The data champion then becomes responsible for ensuring that the
existing and new members of the group are aware of the group data management
processes.</em></p>
<h2 id="principle-2-keep-raw-data-safe-and-separate-from-derived-data">Principle 2: Keep raw data safe and separate from derived data</h2>
<p>Most researchers are aware that they should keep their data safe by backing it
up. If possible it is also worth protecting raw data by making it read
only. This means that you cannot accidentally delete or modify it. More good
suggestions on this topic can be found in <a href="https://doi.org/10.1371/journal.pcbi.1005097">Ten Simple Rules for Digital Data
Storage</a>.</p>
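<p>As a minimal sketch of what making a file read only means on Linux, the Python
snippet below removes the write permission bits from a file. The function name
<code class="language-plaintext highlighter-rouge">make_read_only</code> and the file name are made up for this example. Note that
this only guards against accidents: the owner can restore the write permission
at any time, so it is no substitute for a backup.</p>

```python
import os
import stat
import tempfile


def make_read_only(path):
    """Remove the write permission bits for user, group and others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, mode & ~(stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical raw data file.
        raw = os.path.join(tmp, "sample.fastq")
        with open(raw, "w") as fh:
            fh.write("@read1\nACGT\n+\nIIII\n")
        make_read_only(raw)
        # Show the resulting permission bits in octal notation.
        print(oct(stat.S_IMODE(os.stat(raw).st_mode)))
```

<p>The same effect can be achieved for a single file from the command line using
<code class="language-plaintext highlighter-rouge">chmod a-w</code>.</p>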
<p>However, here I would like to emphasize another point, <em>the importance of
keeping <strong>raw</strong> data separate from <strong>derived</strong> data</em>.</p>
<p>Let’s illustrate this with another story. Once upon a time Binary Beatrice was
making the transition from experimental biology to bioinformatics. She had got
her first sequencing data and was eager to analyse it.</p>
<p>Binary Beatrice wanted to run a tool called The Latest & Greatest Aligner,
which after she had spent three weeks installing it, was ready for her to use.
Half a year earlier, as preparation, Binary Beatrice had attended the
institute’s cluster computing course and she had learnt how to write a
batch submission script to submit jobs to the cluster. She therefore wrote such
a batch submission script to run her Latest & Greatest Aligner. The Latest &
Greatest Aligner needed to know where the data was so she put the batch
submission script next to the raw data. That way Binary Beatrice did not have
to worry about file paths (the bane of scientific computing).</p>
<p>To Binary Beatrice’s surprise The Latest & Greatest Aligner worked out of the
box and produced great results. It also produced lots and lots of files.
However, her analysis did not end there; she also had to run The Latest &
Greatest Normaliser and The Greatest and Latest Plotter. These tools produced
even more files.</p>
<p>Then something terrible happened. Binary Beatrice hit her storage quota and
could not write any more files. At this point she had a directory filled with
millions of files. Some of them were raw data, some of them were batch
submission scripts, some of them were intermediate files and some of them were
figures that she wanted to use in her paper.</p>
<p>Because all the derived file names were based on the names of the raw data
files Binary Beatrice did not dare create an expression for deleting files in
bulk. She therefore spent two weeks cleaning up her data.</p>
<p>At this point Binary Beatrice made a promise to herself to always keep raw data
separate from derived data. In fact all her new projects have a structure with
four directories: <code class="language-plaintext highlighter-rouge">raw_data</code>, <code class="language-plaintext highlighter-rouge">scripts</code>, <code class="language-plaintext highlighter-rouge">intermediate_data</code>, and
<code class="language-plaintext highlighter-rouge">final_data</code>. When she hits her quota it is now easy for her to remove the
files in the <code class="language-plaintext highlighter-rouge">intermediate_data</code> directory.</p>
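<p>A project skeleton along these lines could be set up with a short Python
script. The sketch below is one possible way of doing it; the function names
<code class="language-plaintext highlighter-rouge">create_project</code> and <code class="language-plaintext highlighter-rouge">clear_intermediate_data</code> are invented for this
illustration.</p>

```python
from pathlib import Path
import tempfile

PROJECT_DIRS = ("raw_data", "scripts", "intermediate_data", "final_data")


def create_project(root):
    """Create the four standard directories below root."""
    root = Path(root)
    for name in PROJECT_DIRS:
        (root / name).mkdir(parents=True, exist_ok=True)
    return root


def clear_intermediate_data(root):
    """Delete the files in intermediate_data and return how many were removed."""
    removed = 0
    for path in (Path(root) / "intermediate_data").rglob("*"):
        if path.is_file():
            path.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        project = create_project(Path(tmp) / "my_project")
        (project / "raw_data" / "reads.fastq").write_text("@read1\nACGT\n+\nIIII\n")
        (project / "intermediate_data" / "aligned.sam").write_text("temporary")
        print("Removed", clear_intermediate_data(project), "intermediate file(s)")
        print("Raw data still present:", (project / "raw_data" / "reads.fastq").exists())
```

<p>Because the clean-up function only ever looks inside
<code class="language-plaintext highlighter-rouge">intermediate_data</code>, there is no risk of it touching the raw data.</p>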
<p>In the fictional example above Binary Beatrice learnt from her mistake
immediately. This is not always the case. In real life many people ask to get
their storage quota increased and don’t learn the lesson of separating raw data
from derived data. Eventually, when these people leave the group, no one can
work out what their raw/derived data is.</p>
<p>If you are interested in some practical tips on how to do this in Linux have
a look at <a href="/blog/using-relative-paths-in-linux-scripts/">this post</a>.</p>
<h2 id="principle-3-standardise-the-location-and-structure-of-data">Principle 3: Standardise the location and structure of data</h2>
<p>It is natural, and common, for PhD students and post docs to think of the data
that they generate as their own. This tends to lead to a situation where the
data is organised per research group member. For example, Ambitious Anna might
have a shared folder for her group and at the top level are the folders with
names of the group members <code class="language-plaintext highlighter-rouge">Fastidious-Fatima</code>, <code class="language-plaintext highlighter-rouge">Binary-Beatrice</code>, etc.
Fastidious Fatima then organises her work and her data in the
<code class="language-plaintext highlighter-rouge">Fastidious-Fatima</code> folder and Binary Beatrice organises her work and her
data in the <code class="language-plaintext highlighter-rouge">Binary-Beatrice</code> folder.</p>
<p>This is not necessarily a bad way to organise the group’s data.
Group leaders sometimes find it easier to remember data based on who generated it.
However, it is
important to realise that (unless otherwise stated) the data generated when
working in a research group does not belong to the individual generating it.
The data belongs to the group leader. If this is not stated explicitly, and
made clear within the group, it is easy for each member of the group to invent
their own way of structuring the data within their own folder. When this
happens files and data often become incomprehensible once the person who
organised them leaves the group.</p>
<p>It is therefore highly recommended that the location and structure of data is
standardised, and ideally that this standard is recorded in a document that can
be read by everyone at the top level of the shared folder. If a data champion
has been nominated it is his/her responsibility to ensure that this document is
kept up to date and that the other members of the group know that they need to
follow the standards for organising data outlined in this document.</p>
<p>I also highly recommend having a separate folder at the top level of the shared
folder dedicated to storing raw data, see Principle 2 above. Below is an example
that still gives individuals their own working space.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ambitious-Anna
├── GROUP-MEMBERS
│ ├── Binary-Beatrice
│ └── Fastidious-Fatima
├── RAW-DATA
└── README.txt
</code></pre></div></div>
<p>Below is an example that structures work based on projects rather than individuals.
This can be useful if more than one person is working on a project.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Ambitious-Anna
├── PROJECTS
│ ├── Cure-Cancer
│ └── Feed-The-World
├── RAW-DATA
└── README.txt
</code></pre></div></div>
<p>Obviously it is possible to mix and match according to need. However, it is
useful to document the rationale of the structure and how it is intended to be
used. In the examples above this information is recorded in the <code class="language-plaintext highlighter-rouge">README.txt</code> file.</p>
<h2 id="principle-4-provide-metadata">Principle 4: Provide metadata</h2>
<p>Metadata is a fancy term for information that puts data into context. For
example, in a microscopy experiment the pixels captured are data and
information about the experiment such as the magnification and the X/Y scales
are metadata. More formally, metadata is data about data. Without metadata
raw data is meaningless as it cannot be understood.</p>
<p>It is important to think about what metadata one needs to capture. Often this
is closely linked to the design of the experiment. For example, if one is
performing a time series study, it is important to associate the date/time with
each data point.</p>
<p>This type of metadata is called descriptive metadata. Descriptive metadata is
important as it allows data to be put into the context of a scientific
question. For example, if one performs an RNA sequencing experiment to compare
expression profiles in different tissues it is important to record which data
are associated with which tissues.</p>
<p>When thinking about how to organise data it is worth thinking about how
descriptive metadata should be recorded and associated with the data.
This is a non-trivial problem. It is not uncommon for metadata to be stored
in an individual’s memory. This is not a safe strategy! Another common approach
is to store descriptive metadata in file names and directory structures. This is
better, but is also fragile as it is easy to lose metadata when moving and/or
renaming files.</p>
<p>Another type of metadata is structural metadata and includes things such as
sizes and checksums of files. Structural metadata can be used to
verify that the raw data files have not become corrupted. For example,
sequencing companies typically provide MD5 checksums alongside the raw data files
so that one can verify that the downloaded files contain the expected
content.</p>
<p>This fourth principle is more complicated than the previous ones. Although it
is easy to understand that metadata is important, there is currently not an easy
way to bundle arbitrary metadata with files on disk. The poor man’s solution is
to capture this metadata using some sort of directory structure. However, this
is fragile and makes it difficult to add more metadata on an <em>ad-hoc</em> basis.</p>
<p>Furthermore, there is not really a neat solution for capturing structural
metadata such as sizes and checksums of files. It is therefore rarely done
within research groups. Ideally this is something that should be automated
as it is not a productive use of researchers’ time to calculate and record
these types of file properties.</p>
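<p>As a rough illustration of what such automation could look like, the sketch
below uses Python’s standard <code class="language-plaintext highlighter-rouge">hashlib</code> and <code class="language-plaintext highlighter-rouge">os</code> modules to work out
the size and MD5 checksum of a file. The function name <code class="language-plaintext highlighter-rouge">file_properties</code> and
the chunk size are assumptions made for this example, not part of any existing
tool.</p>

```python
import hashlib
import os


def file_properties(path, chunk_size=1024 * 1024):
    """Return the size in bytes and the MD5 checksum of a file.

    Hypothetical helper written for illustration only.
    """
    md5 = hashlib.md5()
    with open(path, "rb") as file_handle:
        # Read in chunks so that large files do not need to fit in memory.
        for chunk in iter(lambda: file_handle.read(chunk_size), b""):
            md5.update(chunk)
    return os.path.getsize(path), md5.hexdigest()
```

<p>The returned checksum can then be compared with the one supplied by the data
provider, and both values can be recorded next to the raw data.</p>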
<h2 id="discussion">Discussion</h2>
<p>Using these principles to mediate discussions about working practices can result in
a much more coherent strategy for managing data.</p>
<p>The first three principles are relatively easy for a research group to get to grips with.
They can be implemented by discussing how the group think things should be done
and by coming to a mutual understanding and agreement on how data should be
structured and organised.</p>
<p>The fourth principle highlights the importance of recording metadata. However,
having metadata separate from data, for example in directory structures and
file names, is fragile. The metadata can easily be lost when moving and
renaming files. In the next post I will describe our solution to this problem.</p>
Python for biologists2016-10-13T00:00:00+00:00http://tjelvarolsson.com/blog/python-for-biologists<p>Python is a high-level scripting language that is growing in popularity in the
scientific community. It uses a syntax that is relatively easy to get to grips
with and that encourages code readability.</p>
<p>This post aims to give you a flavour of what it feels like to work with Python.
We will use Python to calculate the
<a href="https://en.wikipedia.org/wiki/GC-content">guanine-cytosine (GC) content</a> of
a DNA sequence. In the process you will also learn about some key aspects of
programming namely variables, functions and loops.</p>
<h2 id="getting-a-flavour-of-python">Getting a flavour of Python</h2>
<p>The most traditional way of working with Python is to write your code in a script
and run it using the <code class="language-plaintext highlighter-rouge">python</code> command. For example, if you had your code
in a file named <code class="language-plaintext highlighter-rouge">analysis.py</code> you could run it using the command below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python analysis.py
</code></pre></div></div>
<p>However, there are other ways of interacting with Python.</p>
<p>These days so called “notebooks” are becoming more and more popular.
They are used for creating and sharing documents that include explanations of
code as well as code blocks that can be run interactively. Check out the
<a href="http://jupyter.org">Jupyter</a> project for more details.</p>
<p>Python can also be run interactively in your terminal
using its
<a href="https://docs.python.org/2/tutorial/interpreter.html#interactive-mode">interactive mode</a>.</p>
<p>To start Python in its interactive mode simply type <code class="language-plaintext highlighter-rouge">python</code> into your terminal.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python
Python 2.7.10 (default, Jul 14 2015, 19:46:27)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>
</code></pre></div></div>
<p>This prints out information about the version of Python that is being used
and how it was compiled before leaving you at the interactive prompt. In this instance
I am using Python version 2.7.10.</p>
<p>The three greater than signs (<code class="language-plaintext highlighter-rouge">>>></code>) represent the primary prompt into which
commands can be entered.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="mi">1</span> <span class="o">+</span> <span class="mi">2</span>
<span class="mi">3</span>
</code></pre></div></div>
<p>There is also a secondary prompt that is represented by three dots (<code class="language-plaintext highlighter-rouge">...</code>).
It is used as a continuation line.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">line</span> <span class="o">=</span> <span class="s">">myseq1"</span>
<span class="o">>>></span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">">"</span><span class="p">):</span>
<span class="o">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="o">...</span>
<span class="o">></span><span class="n">myseq1</span>
</code></pre></div></div>
<p>The rest of this post will use this “interactive” Python format. You can try to
follow along using Python running interactively in a terminal or using a
Python notebook. You can access a Python notebook from <a href="https://try.jupyter.org/">Try
Jupyter</a>.</p>
<h2 id="variables">Variables</h2>
<p>A variable is a means of storing a piece of information using a
descriptive name. The use of variables is encouraged as it allows us to
avoid having to repeat ourselves.</p>
<p>In Python variables are assigned using the equals sign.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">pi</span> <span class="o">=</span> <span class="mf">3.14</span>
</code></pre></div></div>
<p>When naming variables being explicit is more important than being succinct.
One reason for this is that you will spend more time reading your code than
you will writing it. Avoiding the mental overhead of trying to understand
what all the acronyms mean is a good thing. For example, suppose that we
wanted to create a variable for storing the radius of a circle. Please
avoid the temptation of naming the variable <code class="language-plaintext highlighter-rouge">r</code>, and go for the longer
but more explicit name <code class="language-plaintext highlighter-rouge">radius</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">radius</span> <span class="o">=</span> <span class="mf">1.5</span>
</code></pre></div></div>
<h2 id="determining-the-gc-count-of-a-sequence">Determining the GC count of a sequence</h2>
<p>One feature of interest when examining DNA is the
<a href="https://en.wikipedia.org/wiki/GC-content">guanine-cytosine (GC) content</a>.
DNA with high GC-content is more stable than DNA with low GC-content.</p>
<p>Suppose that we had a string representing a DNA sequence.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">dna_string</span> <span class="o">=</span> <span class="s">"attagcgcaatctaactacactactgccgcgcggcatatatttaaatata"</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">dna_string</span><span class="p">)</span>
<span class="n">attagcgcaatctaactacactactgccgcgcggcatatatttaaatata</span>
</code></pre></div></div>
<p>A string is a data type for representing text. As such it is not ideal for data
processing purposes. In this case the DNA sequence would be better represented
using a “list”, with each item in the list representing a DNA letter. A list,
also known as an array, is a data structure representing a collection of elements
with a specific order.</p>
<p>In Python we can convert a string into a list using the built-in <code class="language-plaintext highlighter-rouge">list()</code>
function.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">dna_list</span> <span class="o">=</span> <span class="nb">list</span><span class="p">(</span><span class="n">dna_string</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">dna_list</span><span class="p">)</span>
<span class="p">[</span><span class="s">'a'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span>
<span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span>
<span class="s">'t'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'g'</span><span class="p">,</span> <span class="s">'c'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span>
<span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span> <span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">,</span>
<span class="s">'t'</span><span class="p">,</span> <span class="s">'a'</span><span class="p">]</span>
</code></pre></div></div>
<p>Python’s list has got a method called <code class="language-plaintext highlighter-rouge">count()</code> that we can use to find out
the counts of particular elements in the list.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">dna_list</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">"a"</span><span class="p">)</span>
<span class="mi">17</span>
</code></pre></div></div>
<p>To find out the total number of items in a list one can use Python’s built-in
<code class="language-plaintext highlighter-rouge">len()</code> function, which returns the length of the list.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="nb">len</span><span class="p">(</span><span class="n">dna_list</span><span class="p">)</span>
<span class="mi">50</span>
</code></pre></div></div>
<p>When using Python you need to be careful when dividing integers, because in
Python 2 the default is to use integer division, i.e. to discard the remainder.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="mi">3</span> <span class="o">/</span> <span class="mi">2</span>
<span class="mi">1</span>
</code></pre></div></div>
<p>One can work around this by ensuring that at least one of the numbers is
represented using floating point.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="mi">3</span> <span class="o">/</span> <span class="mf">2.0</span>
<span class="mf">1.5</span>
</code></pre></div></div>
<p><em>Warning: In Python 3, the behaviour of the division operator has been
changed, and dividing two integers will result in true division, i.e. the
remainder is not discarded.</em></p>
<p>One can convert an integer to a floating point number using Python’s built-in
<code class="language-plaintext highlighter-rouge">float()</code> function.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="nb">float</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
<span class="mf">2.0</span>
</code></pre></div></div>
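<p>If the same code needs to give the same answer under both Python 2 and
Python 3, one option is to be explicit about which kind of division is wanted.
The sketch below is written as a script rather than in the interactive format,
and the variable names are only illustrative.</p>

```python
total = 3
parts = 2

# The // operator performs floor division in both Python 2 and Python 3.
floor_result = total // parts

# Converting one operand to a float gives true division in both versions.
true_result = float(total) / parts

print(floor_result)  # 1
print(true_result)   # 1.5
```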
<p>We now have all the information required to calculate the GC-content of the DNA
sequence.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">gc_count</span> <span class="o">=</span> <span class="n">dna_list</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">"g"</span><span class="p">)</span> <span class="o">+</span> <span class="n">dna_list</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">"c"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">gc_frac</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">gc_count</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">dna_list</span><span class="p">)</span>
<span class="o">>>></span> <span class="mi">100</span> <span class="o">*</span> <span class="n">gc_frac</span>
<span class="mf">38.0</span>
</code></pre></div></div>
<h2 id="creating-reusable-functions">Creating reusable functions</h2>
<p>Suppose that we wanted to calculate the GC-content for several sequences. In
this case it would be very annoying, and error prone, to have to enter the
commands above into the Python shell manually for each sequence. Rather, it
would be advantageous to be able to create a piece of code that could be called
repeatedly to calculate the GC-content. We can achieve this using the concept of
functions. In other words functions are a means for programmers to avoid repeating
themselves.</p>
<p>Let us create a simple function that adds two items together.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">def</span> <span class="nf">add</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
<span class="o">...</span>     <span class="k">return</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
<span class="o">...</span>
<span class="o">>>></span> <span class="n">add</span><span class="p">(</span><span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">)</span>
<span class="mi">5</span>
</code></pre></div></div>
<p>In Python functions are defined using the <code class="language-plaintext highlighter-rouge">def</code> keyword. Note that the
<code class="language-plaintext highlighter-rouge">def</code> keyword is followed by the name of the function. The name of the
function is followed by a parenthesized set of arguments, in this case the
function takes two arguments <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code>. The end of the function
definition is marked using a colon.</p>
<p>The body of the function, in this example the <code class="language-plaintext highlighter-rouge">return</code> statement, needs to be
indented. The standard in Python is to use four white spaces to indent code
blocks. In this case the function body only contains one line of code. However,
a function can include several indented lines of code.</p>
<p><em>Warning: Whitespace really matters in Python! If your code is not correctly
aligned you will see <code class="language-plaintext highlighter-rouge">IndentationError</code> messages telling you
that everything is not as it should be. You will also run into
<code class="language-plaintext highlighter-rouge">IndentationError</code> messages if you mix white spaces and tabs.</em></p>
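<p>To see what this looks like without breaking a real script, one can ask
Python to parse a deliberately broken piece of source code using the built-in
<code class="language-plaintext highlighter-rouge">compile()</code> function. This is only a small demonstration; the
<code class="language-plaintext highlighter-rouge">bad_source</code> string is made up for the example and the exact wording of the
message depends on the Python version.</p>

```python
# The body of this function is deliberately not indented.
bad_source = "def add(a, b):\nreturn a + b\n"

try:
    # compile() parses the source code without running it.
    compile(bad_source, "<example>", "exec")
    caught = False
except IndentationError as error:
    caught = True
    print("IndentationError: {0}".format(error.msg))
```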
<p>Now we can create a function for calculating the GC-content of a sequence.
As with variables explicit trumps succinct in terms of naming.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">def</span> <span class="nf">gc_content</span><span class="p">(</span><span class="n">sequence</span><span class="p">):</span>
<span class="o">...</span>     <span class="n">gc_count</span> <span class="o">=</span> <span class="n">sequence</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">"g"</span><span class="p">)</span> <span class="o">+</span> <span class="n">sequence</span><span class="o">.</span><span class="n">count</span><span class="p">(</span><span class="s">"c"</span><span class="p">)</span>
<span class="o">...</span>     <span class="n">gc_fraction</span> <span class="o">=</span> <span class="nb">float</span><span class="p">(</span><span class="n">gc_count</span><span class="p">)</span> <span class="o">/</span> <span class="nb">len</span><span class="p">(</span><span class="n">sequence</span><span class="p">)</span>
<span class="o">...</span>     <span class="k">return</span> <span class="mi">100</span> <span class="o">*</span> <span class="n">gc_fraction</span>
<span class="o">...</span>
<span class="o">>>></span> <span class="n">gc_content</span><span class="p">(</span><span class="n">dna_list</span><span class="p">)</span>
<span class="mf">38.0</span>
</code></pre></div></div>
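<p>The function above only counts lower case letters and will fail with a
<code class="language-plaintext highlighter-rouge">ZeroDivisionError</code> if it is given an empty sequence. A possible variant
that copes with both situations is sketched below. The name
<code class="language-plaintext highlighter-rouge">gc_content_any_case</code> and the choice to return 0.0 for empty input are
assumptions made for this example.</p>

```python
def gc_content_any_case(sequence):
    """Return the GC-content of a sequence as a percentage.

    Hypothetical variant of gc_content() that accepts a string or a list
    of letters in upper or lower case, and returns 0.0 for empty input.
    """
    letters = [letter.lower() for letter in sequence]
    if len(letters) == 0:
        return 0.0
    gc_count = letters.count("g") + letters.count("c")
    return 100.0 * gc_count / len(letters)


print(gc_content_any_case("ATTAGCGCAA"))  # 40.0
```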
<h2 id="list-slicing">List slicing</h2>
<p>Suppose that we wanted to look at local variability in GC-content. To achieve
this we would like to be able to select segments of our initial list. This is
known as “slicing”, as in slicing up a salami.</p>
<p>In Python slicing uses a <code class="language-plaintext highlighter-rouge">[start:end]</code> syntax that is inclusive for the start
index and exclusive for the end index. To illustrate slicing let us first
create a list to work with.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">zero_to_five</span> <span class="o">=</span> <span class="p">[</span><span class="s">"zero"</span><span class="p">,</span> <span class="s">"one"</span><span class="p">,</span> <span class="s">"two"</span><span class="p">,</span> <span class="s">"three"</span><span class="p">,</span> <span class="s">"four"</span><span class="p">,</span> <span class="s">"five"</span><span class="p">]</span>
</code></pre></div></div>
<p>To get the first two elements we therefore use 0 for the start index, as
Python uses a zero-based indexing system, and 2 for the end index as the
element from the end index is excluded.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">zero_to_five</span><span class="p">[</span><span class="mi">0</span><span class="p">:</span><span class="mi">2</span><span class="p">]</span>
<span class="p">[</span><span class="s">'zero'</span><span class="p">,</span> <span class="s">'one'</span><span class="p">]</span>
</code></pre></div></div>
<p>Note that the start position for the slicing is 0 by default so we could just
as well have written the following.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">zero_to_five</span><span class="p">[:</span><span class="mi">2</span><span class="p">]</span>
<span class="p">[</span><span class="s">'zero'</span><span class="p">,</span> <span class="s">'one'</span><span class="p">]</span>
</code></pre></div></div>
<p>To get the last three elements we leave out the end index, which defaults to
the end of the list.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">zero_to_five</span><span class="p">[</span><span class="mi">3</span><span class="p">:]</span>
<span class="p">[</span><span class="s">'three'</span><span class="p">,</span> <span class="s">'four'</span><span class="p">,</span> <span class="s">'five'</span><span class="p">]</span>
</code></pre></div></div>
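<p>Slices also accept negative indices, which count back from the end of the
list. This is handy when the length of the list is not known in advance. The
short script below, written outside the interactive format, illustrates the
idea using the same list.</p>

```python
zero_to_five = ["zero", "one", "two", "three", "four", "five"]

# A negative start index counts back from the end of the list.
last_three = zero_to_five[-3:]
print(last_three)  # ['three', 'four', 'five']

# A negative end index drops elements from the end instead.
all_but_last = zero_to_five[:-1]
print(all_but_last)  # ['zero', 'one', 'two', 'three', 'four']
```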
<p>We can use list slicing to calculate the local GC-content measurements of
our DNA.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">gc_content</span><span class="p">(</span><span class="n">dna_list</span><span class="p">[:</span><span class="mi">10</span><span class="p">])</span>
<span class="mf">40.0</span>
<span class="o">>>></span> <span class="n">gc_content</span><span class="p">(</span><span class="n">dna_list</span><span class="p">[</span><span class="mi">10</span><span class="p">:</span><span class="mi">20</span><span class="p">])</span>
<span class="mf">30.0</span>
<span class="o">>>></span> <span class="n">gc_content</span><span class="p">(</span><span class="n">dna_list</span><span class="p">[</span><span class="mi">20</span><span class="p">:</span><span class="mi">30</span><span class="p">])</span>
<span class="mf">70.0</span>
<span class="o">>>></span> <span class="n">gc_content</span><span class="p">(</span><span class="n">dna_list</span><span class="p">[</span><span class="mi">30</span><span class="p">:</span><span class="mi">40</span><span class="p">])</span>
<span class="mf">50.0</span>
<span class="o">>>></span> <span class="n">gc_content</span><span class="p">(</span><span class="n">dna_list</span><span class="p">[</span><span class="mi">40</span><span class="p">:</span><span class="mi">50</span><span class="p">])</span>
<span class="mf">0.0</span>
</code></pre></div></div>
<h2 id="loops">Loops</h2>
<p>It can get a bit repetitive, tedious, and error prone specifying all the ranges
manually. A better way to do this is to make use of a loop construct. A loop
allows a program to cycle through the same set of operations a number of times.</p>
<p>In lower level languages <code class="language-plaintext highlighter-rouge">while</code> loops are common because they operate in a way
that closely mimics how the hardware works. The code below illustrates a typical
setup of a while loop.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">cycle</span> <span class="o">=</span> <span class="mi">0</span>
<span class="o">>>></span> <span class="k">while</span> <span class="n">cycle</span> <span class="o"><</span> <span class="mi">5</span><span class="p">:</span>
<span class="o">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">cycle</span><span class="p">)</span>
<span class="o">...</span>     <span class="n">cycle</span> <span class="o">=</span> <span class="n">cycle</span> <span class="o">+</span> <span class="mi">1</span>
<span class="o">...</span>
<span class="mi">0</span>
<span class="mi">1</span>
<span class="mi">2</span>
<span class="mi">3</span>
<span class="mi">4</span>
</code></pre></div></div>
<p>In the code above Python moves through the commands in the while loop executing
them in order, i.e. printing the value of the <code class="language-plaintext highlighter-rouge">cycle</code> variable and then
incrementing it. The logic then moves back to the <code class="language-plaintext highlighter-rouge">while</code> statement and
the conditional (<code class="language-plaintext highlighter-rouge">cycle < 5</code>) is re-evaluated. If true the commands in the
while statement are executed in order again, and so forth until the conditional
is false. In this example the <code class="language-plaintext highlighter-rouge">print(cycle)</code> command was called five times,
i.e. until the <code class="language-plaintext highlighter-rouge">cycle</code> variable incremented to 5 and the <code class="language-plaintext highlighter-rouge">cycle < 5</code>
conditional evaluated to false.</p>
<p>However, when working in Python it is much more common to make use of <code class="language-plaintext highlighter-rouge">for</code>
loops. For loops are used to iterate over elements in data structures such as
lists.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]:</span>
<span class="o">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
<span class="o">...</span>
<span class="mi">0</span>
<span class="mi">1</span>
<span class="mi">2</span>
<span class="mi">3</span>
<span class="mi">4</span>
</code></pre></div></div>
<p>In the above we had to manually write out all the numbers that we wanted. However,
because iterating over a range of integers is such a common task Python has a
built-in function for generating such lists.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">4</span><span class="p">]</span>
</code></pre></div></div>
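<p>The output above is what Python 2 produces. Under Python 3 the
<code class="language-plaintext highlighter-rouge">range()</code> function returns a lazy range object instead of a list, so the
numbers are not displayed when it is typed at the prompt. A small sketch of how
to get an actual list under either version is given below.</p>

```python
numbers = range(5)

# Under Python 3 range() returns a range object rather than a list.
# Wrapping it in list() gives an actual list under both Python 2 and 3.
print(list(numbers))  # [0, 1, 2, 3, 4]
```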
<p>So a typical for loop might look like the below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">for</span> <span class="n">item</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">):</span>
<span class="o">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">item</span><span class="p">)</span>
<span class="o">...</span>
<span class="mi">0</span>
<span class="mi">1</span>
<span class="mi">2</span>
<span class="mi">3</span>
<span class="mi">4</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">range()</code> function can also be told to start at a larger number. Say for
example that we wanted a list including the numbers 5, 6 and 7.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="nb">range</span><span class="p">(</span><span class="mi">5</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
<span class="p">[</span><span class="mi">5</span><span class="p">,</span> <span class="mi">6</span><span class="p">,</span> <span class="mi">7</span><span class="p">]</span>
</code></pre></div></div>
<p>As with slicing the start value is included whereas the end value is excluded.</p>
<p>It is also possible to alter the step size. To do this we must specify the start
and end values explicitly before adding the step size.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">10</span><span class="p">)</span>
<span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">20</span><span class="p">,</span> <span class="mi">30</span><span class="p">,</span> <span class="mi">40</span><span class="p">]</span>
</code></pre></div></div>
<p>We are now in a position where we can create a naive loop for calculating
the local GC-content of our DNA.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">for</span> <span class="n">start</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">50</span><span class="p">,</span> <span class="mi">10</span><span class="p">):</span>
<span class="o">...</span>     <span class="n">end</span> <span class="o">=</span> <span class="n">start</span> <span class="o">+</span> <span class="mi">10</span>
<span class="o">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">gc_content</span><span class="p">(</span><span class="n">dna_list</span><span class="p">[</span><span class="n">start</span><span class="p">:</span><span class="n">end</span><span class="p">]))</span>
<span class="o">...</span>
<span class="mf">40.0</span>
<span class="mf">30.0</span>
<span class="mf">70.0</span>
<span class="mf">50.0</span>
<span class="mf">0.0</span>
</code></pre></div></div>
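<p>The loop above has the sequence length and the window size written directly
into it. One way of making it reusable is to wrap it in a function, as in the
sketch below. The name <code class="language-plaintext highlighter-rouge">local_gc_content</code> and its <code class="language-plaintext highlighter-rouge">window_size</code>
argument are hypothetical, and <code class="language-plaintext highlighter-rouge">gc_content()</code> is restated, in a slightly
different form, so that the example can be run on its own as a script.</p>

```python
def gc_content(sequence):
    """Return the GC-content of a sequence as a percentage."""
    gc_count = sequence.count("g") + sequence.count("c")
    return 100.0 * gc_count / len(sequence)


def local_gc_content(sequence, window_size):
    """Return the GC-content of consecutive windows of the sequence.

    Hypothetical helper; the last window is shorter if the sequence
    length is not a multiple of window_size.
    """
    values = []
    for start in range(0, len(sequence), window_size):
        end = start + window_size
        values.append(gc_content(sequence[start:end]))
    return values


dna_list = list("attagcgcaatctaactacactactgccgcgcggcatatatttaaatata")
print(local_gc_content(dna_list, 10))  # [40.0, 30.0, 70.0, 50.0, 0.0]
```

<p>Returning a list rather than printing inside the loop means that the values
can be used in further calculations.</p>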
<p>Loops are really powerful. They provide a means to iterate over lots of items
and as such to automate repetitive tasks.</p>
<h2 id="summary">Summary</h2>
<p>I hope this post has given you a flavour of what it feels like to work with
Python.</p>
<p>The key take home messages were:</p>
<ul>
<li>You can explore Python’s syntax using its interactive mode</li>
<li>Variables and functions help us avoid having to repeat ourselves</li>
<li>When naming variables and functions explicit trumps succinct</li>
<li>Loops are really powerful; they form the basis of automating repetitive tasks</li>
</ul>
<p>If you enjoyed this post please check out the book that I am working on
<a href="http://biologistsguide2computing.com/">The Biologist’s Guide to Computing</a>!</p>
Biologist's Guide to Python string manipulation2016-10-01T00:00:00+00:00http://tjelvarolsson.com/blog/python-string-manipulation<p>Because information about DNA and proteins is often stored in plain text files,
many aspects of biological data processing involve manipulating text. In
computing text is often referred to as strings of characters. String
manipulation is therefore a common task both for processing biological
sequences and for interpreting sequence identifiers.</p>
<p>This post provides a quick summary of how Python can be used for such string
manipulation, using the <a href="https://en.wikipedia.org/wiki/FASTA_format#Description_line">FASTA description
line</a> as an
example.</p>
<h2 id="the-python-string-object">The Python string object</h2>
<p>When reading in strings from a text file one often has to deal with
lines that have leading and/or trailing white spaces. Commonly one wants
to get rid of them. This can be achieved using the <code class="language-plaintext highlighter-rouge">strip()</code> method
built into the Python string object.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">" text with leading/trailing spaces "</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="s">'text with leading/trailing spaces'</span>
</code></pre></div></div>
<p>Another common use case is to replace a word in a line. For example,
when we strip out the leading and trailing white spaces one might want
to update the word “with” to “without” to make the resulting string
reflect its current state. This can be achieved using the <code class="language-plaintext highlighter-rouge">replace()</code>
method.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">" text with leading/trailing spaces "</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span><span class="o">.</span><span class="n">replace</span><span class="p">(</span><span class="s">"with"</span><span class="p">,</span> <span class="s">"without"</span><span class="p">)</span>
<span class="s">'text without leading/trailing spaces'</span>
</code></pre></div></div>
<p>In the example above we chain the <code class="language-plaintext highlighter-rouge">strip()</code> and <code class="language-plaintext highlighter-rouge">replace()</code> methods together.
In practice this means that the <code class="language-plaintext highlighter-rouge">replace()</code> method acts on the return value of
the <code class="language-plaintext highlighter-rouge">strip()</code> method.</p>
<p>Python’s string object also comes with a <code class="language-plaintext highlighter-rouge">startswith()</code> method. This can,
for example, be used to identify FASTA description lines.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">">MySeq1|description line"</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">">"</span><span class="p">)</span>
<span class="bp">True</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">endswith()</code> method complements the <code class="language-plaintext highlighter-rouge">startswith()</code> method and is
often used to examine file extensions.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">"/home/olsson/images/profile.png"</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">"png"</span><span class="p">)</span>
<span class="bp">True</span>
</code></pre></div></div>
<p>The example above only works if the file extension is in lower case.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">"/home/olsson/images/profile.PNG"</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">"png"</span><span class="p">)</span>
<span class="bp">False</span>
</code></pre></div></div>
<p>However, we can overcome this issue by adding a call to the <code class="language-plaintext highlighter-rouge">lower()</code>
method, which converts the string to lower case.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">"/home/olsson/images/profile.PNG"</span><span class="o">.</span><span class="n">lower</span><span class="p">()</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">"png"</span><span class="p">)</span>
<span class="bp">True</span>
</code></pre></div></div>
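<p>As a small sketch of how the case-insensitive check above could be wrapped up for reuse, the snippet below defines a hypothetical <code class="language-plaintext highlighter-rouge">is_image()</code> helper. The helper name and the list of extensions are my own assumptions for illustration. It relies on the fact that <code class="language-plaintext highlighter-rouge">endswith()</code> also accepts a tuple of suffixes, and it includes the dot in each suffix so that a file simply named <code class="language-plaintext highlighter-rouge">mypng</code> is not mistaken for an image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def is_image(path):
    """Return True if path ends with one of a few image extensions.

    Hypothetical helper; the extension list is an illustrative assumption.
    """
    # lower() makes the comparison case-insensitive and the tuple lets
    # endswith() test several suffixes in a single call.
    return path.lower().endswith((".png", ".jpg", ".jpeg"))


upper_case = is_image("/home/olsson/images/profile.PNG")
no_extension = is_image("/home/olsson/images/mypng")
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">upper_case</code> is <code class="language-plaintext highlighter-rouge">True</code> and <code class="language-plaintext highlighter-rouge">no_extension</code> is <code class="language-plaintext highlighter-rouge">False</code>.</p>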
<p>Another common use case is to search for a particular string
within another string. For example one might want to find out if the
UniProt identifier “Q6GZX4” is present in a FASTA description line. To
achieve this one can use the <code class="language-plaintext highlighter-rouge">find()</code> method, which returns the index
position (zero-based) where the search term was first identified.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">">sp|Q6GZX4|001R_FRG3G"</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"Q6GZX4"</span><span class="p">)</span>
<span class="mi">4</span>
</code></pre></div></div>
<p>If the search term is not identified <code class="language-plaintext highlighter-rouge">find()</code> returns -1.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">">sp|P31946|1433B_HUMAN"</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">"Q6GZX4"</span><span class="p">)</span>
<span class="o">-</span><span class="mi">1</span>
</code></pre></div></div>
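<p>One thing to watch out for is that the return value of <code class="language-plaintext highlighter-rouge">find()</code> should not be used directly as a truth value: -1 counts as true in Python, whereas a match at index 0 counts as false. The sketch below uses a hypothetical <code class="language-plaintext highlighter-rouge">contains_identifier()</code> helper (the name is mine, not part of Python) to make the comparison explicit. If all you need is a yes or no answer, the <code class="language-plaintext highlighter-rouge">in</code> operator gives the same result.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def contains_identifier(description, identifier):
    """Return True if identifier occurs anywhere in description.

    Hypothetical helper written for illustration.
    """
    # Compare against -1 explicitly; "if description.find(identifier):"
    # would be wrong for a match at index 0 and for a missing match.
    return description.find(identifier) != -1


hit = contains_identifier(">sp|Q6GZX4|001R_FRG3G", "Q6GZX4")
miss = contains_identifier(">sp|P31946|1433B_HUMAN", "Q6GZX4")
at_start = contains_identifier("Q6GZX4|001R_FRG3G", "Q6GZX4")

# The "in" operator answers the same question more directly.
same_as_in = hit == ("Q6GZX4" in ">sp|Q6GZX4|001R_FRG3G")
</code></pre></div></div>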
<p>When iterating over lines in a file one often wants to split the line
based on a delimiter. This can be achieved using the <code class="language-plaintext highlighter-rouge">split()</code> method.
By default this splits on white space characters and returns a list of
strings.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">"text without leading/trailing spaces"</span><span class="o">.</span><span class="n">split</span><span class="p">()</span>
<span class="p">[</span><span class="s">'text'</span><span class="p">,</span> <span class="s">'without'</span><span class="p">,</span> <span class="s">'leading/trailing'</span><span class="p">,</span> <span class="s">'spaces'</span><span class="p">]</span>
</code></pre></div></div>
<p>A different delimiter can be used by providing it as an argument to the
<code class="language-plaintext highlighter-rouge">split()</code> method.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">">sp|Q6GZX4|001R_FRG3G"</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">"|"</span><span class="p">)</span>
<span class="p">[</span><span class="s">'>sp'</span><span class="p">,</span> <span class="s">'Q6GZX4'</span><span class="p">,</span> <span class="s">'001R_FRG3G'</span><span class="p">]</span>
</code></pre></div></div>
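<p>To illustrate how these methods can be combined, here is a minimal sketch of a function that pulls the fields out of a description line. The function name <code class="language-plaintext highlighter-rouge">parse_description()</code> is hypothetical, and the code assumes that the line has exactly the three-field <code class="language-plaintext highlighter-rouge">&gt;db|accession|entry_name</code> layout used in the examples above; real FASTA files may well need more care.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def parse_description(line):
    """Split a FASTA description line into its three fields.

    Hypothetical helper; assumes the '>db|accession|entry_name' layout.
    """
    # Remove leading/trailing white space, including the newline.
    line = line.strip()
    if not line.startswith(">"):
        raise ValueError("not a FASTA description line: " + repr(line))
    # Skip the leading '>' and split the remainder on the '|' delimiter.
    database, accession, entry_name = line[1:].split("|")
    return database, accession, entry_name


fields = parse_description("  >sp|Q6GZX4|001R_FRG3G\n")
</code></pre></div></div>
<p>In this case <code class="language-plaintext highlighter-rouge">fields</code> is the tuple <code class="language-plaintext highlighter-rouge">('sp', 'Q6GZX4', '001R_FRG3G')</code>. A sequence line such as <code class="language-plaintext highlighter-rouge">AATCG</code> would raise a <code class="language-plaintext highlighter-rouge">ValueError</code>.</p>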
<p>There are many variations on the string operators described above. It is
useful to familiarise yourself with the <a href="https://docs.python.org/2/library/string.html">Python documentation on
strings</a>.</p>
<h2 id="regular-expressions">Regular expressions</h2>
<p>Regular expressions can be defined as a series of characters that define
a search pattern.</p>
<p>Regular expressions can be very powerful. However, they can be difficult
to build up. Often it is a process of trial and error. This means that
once they have been created, and the trial and error process has been
forgotten, it can be extremely difficult to understand what the regular
expression does and why it is constructed the way it is.</p>
<p><em>Warning: only use regular expressions as a last resort!</em></p>
<p>A good rule of thumb is to always try to use string operations to
implement the desired functionality and only switch to regular expressions when
the code implemented using string operations becomes more difficult to
understand than the equivalent regular expression.</p>
<p>To use regular expressions in Python we need to import the <code class="language-plaintext highlighter-rouge">re</code> module.
The <code class="language-plaintext highlighter-rouge">re</code> module is part of Python’s standard library. Importing modules in
Python is achieved using the <code class="language-plaintext highlighter-rouge">import</code> keyword.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">re</span>
</code></pre></div></div>
<p>Let us store a FASTA description line in a variable.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">fasta_desc</span> <span class="o">=</span> <span class="s">">sp|Q6GZX4|001R_FRG3G"</span>
</code></pre></div></div>
<p>Now, let us search for the UniProt identifier <code class="language-plaintext highlighter-rouge">Q6GZX4</code> within the line.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">r"Q6GZX4"</span><span class="p">,</span> <span class="n">fasta_desc</span><span class="p">)</span> <span class="c1"># doctest: +ELLIPSIS
</span><span class="o"><</span><span class="n">_sre</span><span class="o">.</span><span class="n">SRE_Match</span> <span class="nb">object</span> <span class="n">at</span> <span class="mi">0</span><span class="n">x</span><span class="o">...></span>
</code></pre></div></div>
<p>There are two things to note here:</p>
<ol>
<li>We use a raw string to represent our regular expression, i.e. the
string prefixed with an <code class="language-plaintext highlighter-rouge">r</code></li>
<li>The regular expression <code class="language-plaintext highlighter-rouge">search()</code> method returns a match object (or
None if no match is found)</li>
</ol>
<p><em>What is a “raw” string?</em> In Python “raw” strings differ from regular strings
in that the backslash <code class="language-plaintext highlighter-rouge">\</code> character is interpreted literally. For example the
regular string equivalent of <code class="language-plaintext highlighter-rouge">r"\n"</code> would be <code class="language-plaintext highlighter-rouge">"\\n"</code> where the first backslash
is used to escape the effect of the second (remember that <code class="language-plaintext highlighter-rouge">\n</code> represents a
newline). Raw strings were introduced in Python to make it easier to create
regular expressions that rely heavily on the use of literal backslashes.</p>
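<p>The difference between raw and regular strings can be checked directly by comparing their lengths. The variable names below are just for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>regular = "\n"    # one character: a newline
raw = r"\n"       # two characters: a backslash followed by the letter n
escaped = "\\n"   # the same two characters written as a regular string

lengths = (len(regular), len(raw), len(escaped))
raw_equals_escaped = raw == escaped
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">lengths</code> is <code class="language-plaintext highlighter-rouge">(1, 2, 2)</code> and <code class="language-plaintext highlighter-rouge">raw_equals_escaped</code> is <code class="language-plaintext highlighter-rouge">True</code>.</p>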
<p>The index of the first matched character can be accessed using the match
object’s <code class="language-plaintext highlighter-rouge">start()</code> method. The match object also has an <code class="language-plaintext highlighter-rouge">end()</code> method
that returns the index one past the last matched character.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">match</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">r"Q6GZX4"</span><span class="p">,</span> <span class="n">fasta_desc</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">if</span> <span class="n">match</span><span class="p">:</span>
<span class="o">...</span>     <span class="k">print</span><span class="p">(</span><span class="n">fasta_desc</span><span class="p">[</span><span class="n">match</span><span class="o">.</span><span class="n">start</span><span class="p">():</span><span class="n">match</span><span class="o">.</span><span class="n">end</span><span class="p">()])</span>
<span class="o">...</span>
<span class="n">Q6GZX4</span>
</code></pre></div></div>
<p>In the above we make use of the fact that Python strings support
slicing. Slicing is a means to access a subsection of a sequence. The
<code class="language-plaintext highlighter-rouge">[start:end]</code> syntax is inclusive for the start index and exclusive for
the end index.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="s">"012345"</span><span class="p">[</span><span class="mi">2</span><span class="p">:</span><span class="mi">4</span><span class="p">]</span>
<span class="s">'23'</span>
</code></pre></div></div>
<p>To see the merit of regular expressions we need to create one that
matches more than one thing. For example a regular expression that could
match all the patterns <code class="language-plaintext highlighter-rouge">id0</code>, <code class="language-plaintext highlighter-rouge">id1</code>, …, <code class="language-plaintext highlighter-rouge">id9</code>.</p>
<p>Now suppose that we had a list containing FASTA description lines with
these types of identifiers.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">fasta_desc_list</span> <span class="o">=</span> <span class="p">[</span><span class="s">">id0 match this"</span><span class="p">,</span>
<span class="o">...</span> <span class="s">">id9 and this"</span><span class="p">,</span>
<span class="o">...</span> <span class="s">">id100 but not this (initially)"</span><span class="p">,</span>
<span class="o">...</span> <span class="s">"AATCG"</span><span class="p">]</span>
<span class="o">...</span>
</code></pre></div></div>
<p>Note that the list above also contains a sequence
line that we never want to match.</p>
<p>Let us loop over the items in this list and print out the lines that
match our identifier regular expression.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">fasta_desc_list</span><span class="p">:</span>
<span class="o">...</span>     <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">r">id[0-9]\s"</span><span class="p">,</span> <span class="n">line</span><span class="p">):</span>
<span class="o">...</span>         <span class="k">print</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="o">...</span>
<span class="o">></span><span class="n">id0</span> <span class="n">match</span> <span class="n">this</span>
<span class="o">></span><span class="n">id9</span> <span class="ow">and</span> <span class="n">this</span>
</code></pre></div></div>
<p>There are two noteworthy aspects of the regular expression. Firstly, the
<code class="language-plaintext highlighter-rouge">[0-9]</code> syntax means match any digit. Secondly, the <code class="language-plaintext highlighter-rouge">\s</code> regular expression
meta character means match any white space character.</p>
<p>If one wanted to create a regular expression to match an identifier with
an arbitrary number of digits one can make use of the <code class="language-plaintext highlighter-rouge">*</code> meta
character, which causes the regular expression to match the preceding
expression 0 or more times.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">fasta_desc_list</span><span class="p">:</span>
<span class="o">...</span>     <span class="k">if</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">r">id[0-9]*\s"</span><span class="p">,</span> <span class="n">line</span><span class="p">):</span>
<span class="o">...</span>         <span class="k">print</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="o">...</span>
<span class="o">></span><span class="n">id0</span> <span class="n">match</span> <span class="n">this</span>
<span class="o">></span><span class="n">id9</span> <span class="ow">and</span> <span class="n">this</span>
<span class="o">></span><span class="n">id100</span> <span class="n">but</span> <span class="ow">not</span> <span class="n">this</span> <span class="p">(</span><span class="n">initially</span><span class="p">)</span>
</code></pre></div></div>
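<p>Because <code class="language-plaintext highlighter-rouge">*</code> means zero or more, the regular expression above would also match an identifier with no digits at all. The sketch below contrasts it with the <code class="language-plaintext highlighter-rouge">+</code> meta character, which requires at least one occurrence of the preceding expression. The extra test line without digits is an assumption added purely for illustration.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

lines = [">id0 match this",
         ">id100 and this",
         ">id no digits here",
         "AATCG"]

# "*" allows zero digits, so ">id " with no number still matches.
star_matches = [line for line in lines
                if re.search(r">id[0-9]*\s", line)]

# "+" requires at least one digit after ">id".
plus_matches = [line for line in lines
                if re.search(r">id[0-9]+\s", line)]
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">star_matches</code> contains the first three lines whereas <code class="language-plaintext highlighter-rouge">plus_matches</code> contains only the first two.</p>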
<p>It is possible to extract specific pieces of information from a line
using regular expressions. This uses a concept known as “groups”, which
are indicated using parentheses. Let us try to extract the UniProt
identifier from a FASTA description line.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">fasta_desc</span><span class="p">)</span>
<span class="o">></span><span class="n">sp</span><span class="o">|</span><span class="n">Q6GZX4</span><span class="o">|</span><span class="mi">001</span><span class="n">R_FRG3G</span>
<span class="o">>>></span> <span class="n">match</span> <span class="o">=</span> <span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">r">sp\|([A-Z,0-9]*)\|"</span><span class="p">,</span> <span class="n">fasta_desc</span><span class="p">)</span>
</code></pre></div></div>
<p><em>Note how horrible and incomprehensible the regular expression is!</em></p>
<p>It took me a couple of attempts to get this regular expression right as
I forgot that <code class="language-plaintext highlighter-rouge">|</code> is a regular expression meta character that needs to
be escaped using a backslash <code class="language-plaintext highlighter-rouge">\</code>.</p>
<p>The regular expression representing the UniProt identifier, <code class="language-plaintext highlighter-rouge">[A-Z,0-9]*</code>, means
match capital letters (<code class="language-plaintext highlighter-rouge">A-Z</code>) and digits (<code class="language-plaintext highlighter-rouge">0-9</code>) zero or more times (<code class="language-plaintext highlighter-rouge">*</code>).
Note that the comma inside the square brackets is not a separator: it is matched
literally, so this character class also accepts commas, and <code class="language-plaintext highlighter-rouge">[A-Z0-9]*</code> would be
the stricter choice.
The UniProt regular expression is enclosed in parentheses. The parentheses
denote that the UniProt identifier is a group that we would like access to. In
other words, the purpose of a group is to give the user access to a section of
interest within the regular expression.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">match</span><span class="o">.</span><span class="n">groups</span><span class="p">()</span>
<span class="p">(</span><span class="s">'Q6GZX4'</span><span class="p">,)</span>
<span class="o">>>></span> <span class="n">match</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">0</span><span class="p">)</span> <span class="c1"># Everything matched by the regular expression.
</span><span class="s">'>sp|Q6GZX4|'</span>
<span class="o">>>></span> <span class="n">match</span><span class="o">.</span><span class="n">group</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span>
<span class="s">'Q6GZX4'</span>
</code></pre></div></div>
<p>Note that there is a difference between the <code class="language-plaintext highlighter-rouge">groups()</code> and the <code class="language-plaintext highlighter-rouge">group()</code>
methods. The former returns a tuple containing all the groups defined in the
regular expression. The latter takes an integer as input and returns a specific
group. However, confusingly <code class="language-plaintext highlighter-rouge">group(0)</code> returns everything matched by the
regular expression and <code class="language-plaintext highlighter-rouge">group(1)</code> returns the first group; making the <code class="language-plaintext highlighter-rouge">group()</code>
method appear as if it used a one-based indexing scheme.</p>
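<p>A regular expression can contain more than one group. As a sketch, the pattern below captures both the UniProt identifier and the entry name that follows it. The pattern is my own variation on the one above: it drops the comma from the character class, uses <code class="language-plaintext highlighter-rouge">+</code> to require at least one character, and assumes that entry names only contain capital letters, digits and underscores.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

fasta_desc = ">sp|Q6GZX4|001R_FRG3G"

# Two groups: the identifier and the entry name (illustrative pattern).
pattern = r">sp\|([A-Z0-9]+)\|([A-Z0-9_]+)"
match = re.search(pattern, fasta_desc)

all_groups = match.groups()    # tuple with one item per group
whole_match = match.group(0)   # everything matched by the pattern
identifier = match.group(1)    # first group
entry_name = match.group(2)    # second group
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">all_groups</code> is <code class="language-plaintext highlighter-rouge">('Q6GZX4', '001R_FRG3G')</code>, which shows that <code class="language-plaintext highlighter-rouge">groups()</code> does not include the whole match returned by <code class="language-plaintext highlighter-rouge">group(0)</code>.</p>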
<p>Finally, let us have a look at a common pitfall when using regular
expressions in Python: the difference between the methods <code class="language-plaintext highlighter-rouge">search()</code> and
<code class="language-plaintext highlighter-rouge">match()</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="s">r"cat"</span><span class="p">,</span> <span class="s">"my cat has a hat"</span><span class="p">))</span> <span class="c1"># doctest: +ELLIPSIS
</span><span class="o"><</span><span class="n">_sre</span><span class="o">.</span><span class="n">SRE_Match</span> <span class="nb">object</span> <span class="n">at</span> <span class="mi">0</span><span class="n">x</span><span class="o">...></span>
<span class="o">>>></span> <span class="k">print</span><span class="p">(</span><span class="n">re</span><span class="o">.</span><span class="n">match</span><span class="p">(</span><span class="s">r"cat"</span><span class="p">,</span> <span class="s">"my cat has a hat"</span><span class="p">))</span> <span class="c1"># doctest: +ELLIPSIS
</span><span class="bp">None</span>
</code></pre></div></div>
<p>Basically <code class="language-plaintext highlighter-rouge">match()</code> only looks for a match at the beginning of the
string to be searched. For more information see the <a href="https://docs.python.org/2/library/re.html#search-vs-match">search() vs
match()</a>
section in the Python documentation.</p>
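<p>The behaviour of the two functions can be summarised with a few boolean checks. The sketch below also shows that prefixing the pattern with the <code class="language-plaintext highlighter-rouge">^</code> meta character makes <code class="language-plaintext highlighter-rouge">search()</code> behave like <code class="language-plaintext highlighter-rouge">match()</code> for a single-line string such as this one.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import re

text = "my cat has a hat"

# search() looks for the pattern anywhere in the string.
found_by_search = re.search(r"cat", text) is not None

# match() only looks at the beginning of the string.
found_by_match = re.match(r"cat", text) is not None
match_at_start = re.match(r"my", text) is not None

# "^" anchors search() to the start of the string.
anchored_search = re.search(r"^cat", text) is not None
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">found_by_search</code> and <code class="language-plaintext highlighter-rouge">match_at_start</code> are <code class="language-plaintext highlighter-rouge">True</code>, whereas <code class="language-plaintext highlighter-rouge">found_by_match</code> and <code class="language-plaintext highlighter-rouge">anchored_search</code> are <code class="language-plaintext highlighter-rouge">False</code>.</p>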
<p>There is a lot more to regular expressions, in particular all the meta
characters. For more information have a look at the <a href="https://docs.python.org/2/library/re.html">regular expression
operations</a> section in the
Python documentation.</p>
<p>This blog post was adapted from a section in the book that I am working on:
<a href="http://biologistsguide2computing.com/">The Biologist’s Guide to Computing</a>.
Please check it out if you found this post useful!</p>
Biologist's Guide to Computing - almost there2016-09-24T00:00:00+00:00http://tjelvarolsson.com/blog/biologists-guide-to-computing-almost-there<p>Last week I announced the launch of the
<a href="http://biologistsguide2computing.com/">Biologist’s Guide to Computing</a>
website and the response was tremendous.</p>
<blockquote class="twitter-tweet" data-lang="en"><p lang="en" dir="ltr">I've created a website for the book that I am working on: The biologist's guide to computing. Please spread the word<a href="https://t.co/zfsLikrsTq">https://t.co/zfsLikrsTq</a></p>— Tjelvar Olsson (@tjelvar_olsson) <a href="https://twitter.com/tjelvar_olsson/status/777252254470537216">September 17, 2016</a></blockquote>
<script async="" src="//platform.twitter.com/widgets.js" charset="utf-8"></script>
<p>Thank you all!</p>
<p>I thought I’d take the opportunity to provide a status update.</p>
<p>The first draft of the book is finished. Yay! In its current form
it has 47,000 words, spread over 195 pages, split into 14 chapters.</p>
<p>Although I call it the first draft, each chapter has undergone several
revisions as I have received feedback from many of my colleagues, to whom I am
very grateful. Thanks to Nadia Radzman, Sam Mugford and Anna Stavrinides for
providing feedback on early versions of the initial chapters. Many thanks to
Tyler McCleary for continued in-depth feedback and suggestions for
improvements. Thanks also to Nick Pullen for feedback and discussions on the
data visualisation chapter. Finally, many thanks to Matthew Hartley for
all the discussions and encouragement.</p>
<p><em>So what happens now?</em></p>
<p>I will go over the draft with a red pen and I’m sure that I will find plenty of
things that need fixing.</p>
<p>Then I will make the book available more broadly under a Creative Commons
licence to get more feedback.</p>
<p>I’m currently debating if I should try to get a publisher or if I should
self publish. Does anyone have any thoughts or recommendations with regards
to this?</p>
<p>If I go down the self publishing route I’d like to find a copy editor
that has some grasp of both coding and biology. Does anyone know of such
a person?</p>
<p>Finally, I am pondering how I can publicise the book. Please do help me spread
the word by pointing people at the <a href="http://biologistsguide2computing.com/">Biologist’s Guide to
Computing</a> website. The website has
a newsletter sign up form, please do sign up to it.
By signing up you encourage me to finish the project! Also, if you
are a blogger and would consider writing a review of the book I will give you
early access to it. Please do get in touch if you would be interested in this.</p>
Taking the effort out of server configuration using Ansible2016-03-06T00:00:00+00:00http://tjelvarolsson.com/blog/taking-the-effort-out-of-server-configuration-using-ansible<p><em>This article was originally published in the <a href="http://www.nordevcon.com/">NorDevCon</a> 2016 conference programme.</em></p>
<p>Ansible is an IT automation tool that is growing in popularity. It is ideally
suited for configuration management, i.e. automating the configuration of your
development and production infrastructure.</p>
<p>Ansible is a relatively new addition to the “DevOps” arena (first released in
2012) and it has quite a different philosophy to some of the more well
established players in the field. Most notably, it is “agent-less”; i.e. there
is no need to have an “agent” pulling updates from a “master” configuration
manager.</p>
<p>Ansible has been designed to be easy to use and it achieves this through
two aspects of its architecture:</p>
<ul>
<li>It uses a push based method to interact with the hosts (the machines to be configured)</li>
<li>It uses OpenSSH as its authentication method</li>
</ul>
<p>What this means in practice is that you can install Ansible on your laptop and,
as long as you have set up password-less <code class="language-plaintext highlighter-rouge">ssh</code> to the machines you want
to interact with, you are ready to go. In other words no master, no databases,
no services; no fighting the system that is meant to be making your life
easier!</p>
<h2 id="listing-your-inventory">Listing your inventory</h2>
<p>Ansible has the concept of an inventory where you list all of the hosts that
you want to be able to interact with through Ansible. The inventory is a plain
text file using an INI-like format. The default path to the inventory file is
<code class="language-plaintext highlighter-rouge">/etc/ansible/hosts</code>. However, you can provide an alternative path as a
command line argument using the <code class="language-plaintext highlighter-rouge">-i</code> option.</p>
<p>Below is an example <code class="language-plaintext highlighter-rouge">hosts</code> file that groups three web servers into a
<code class="language-plaintext highlighter-rouge">webservers</code> group.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[webservers]
web1.example.com
web2.example.com
web3.example.com
</code></pre></div></div>
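<p>Because this simple form of the inventory is INI-like, it can be inspected with ordinary tools. The sketch below uses Python’s <code class="language-plaintext highlighter-rouge">configparser</code> module to list the hosts in the <code class="language-plaintext highlighter-rouge">webservers</code> group. To be clear, this is not how Ansible itself reads the inventory, and it is only an illustration that assumes a file containing nothing but groups and bare host names; lines with aliases or host variables would need Ansible’s own parser.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import configparser

# The example inventory from above, inlined as a string for illustration.
inventory_text = """\
[webservers]
web1.example.com
web2.example.com
web3.example.com
"""

# allow_no_value=True lets configparser accept bare host names,
# i.e. keys that have no "= value" part.
parser = configparser.ConfigParser(allow_no_value=True)
parser.read_string(inventory_text)

webservers = list(parser["webservers"])
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">webservers</code> is a list of the three host names in the group.</p>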
<p>It is also possible to create aliases and specify host specific variables.
Below is an example of an alias to enable Ansible to communicate with a Vagrant
generated virtual machine.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>testserver ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/default/virtualbox/private_key ansible_sudo=yes
</code></pre></div></div>
<h2 id="interacting-with-ansible">Interacting with Ansible</h2>
<p>There are two different ways of interacting with Ansible: ad-hoc commands and playbooks.
Ad-hoc commands are useful for quick tasks that you don’t want to save for later, whereas
playbooks provide a means to specify reproducible configuration recipes.</p>
<p>Ad-hoc commands are accessed using the <code class="language-plaintext highlighter-rouge">ansible</code> program and can be useful if
you want to do something quickly. For example, suppose that there was a
critical security patch for bash that you needed to apply to all your
webservers rapidly.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ansible webservers -m apt -a "name=bash state=latest update_cache=yes"
</code></pre></div></div>
<p>There are a few things to note in the command above. The first argument
(<code class="language-plaintext highlighter-rouge">webservers</code>) is the host group defined in the inventory. The <code class="language-plaintext highlighter-rouge">-m apt</code>
option states that we want to use Ansible’s <code class="language-plaintext highlighter-rouge">apt</code> module, in other words our
servers are running on a Debian based OS such as Ubuntu. Finally, the <code class="language-plaintext highlighter-rouge">-a ...</code>
option specifies the arguments to supply, i.e. we want to install the latest
version of <code class="language-plaintext highlighter-rouge">bash</code>.</p>
<p>It is worth noting that Ansible does not try to abstract away the layer of
package management. As such there are separate modules for <code class="language-plaintext highlighter-rouge">apt</code>, <code class="language-plaintext highlighter-rouge">yum</code>,
<code class="language-plaintext highlighter-rouge">homebrew</code>, etc. This is useful in that the functionality of Ansible modules
is not limited by the constraint of a set of common features. For example in
this case we make use of the <code class="language-plaintext highlighter-rouge">update_cache</code> option to run the equivalent of
<code class="language-plaintext highlighter-rouge">apt-get update</code> before we try to install the latest version of bash.</p>
<h2 id="batteries-included-as-ansible-modules">Batteries included as Ansible modules</h2>
<p>Ansible comes with a large number of built-in modules for configuring your
systems. These include modules for package management, performing systems
administration tasks, working with files, interacting with source control
programs, and much more.</p>
<h2 id="reproducible-configuration-scripts-using-playbooks">Reproducible configuration scripts using playbooks</h2>
<p>Ansible playbooks allow you to create reproducible recipes for configuring your
servers. Playbooks are written in YAML and are meant to be human readable. As
YAML is simply a data serialisation language, a playbook can be thought of as
“infrastructure as data”.</p>
<p>Below is a basic playbook for configuring a firewall using <code class="language-plaintext highlighter-rouge">firewalld</code>. In
real life a playbook would be written to configure the entire system.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">configure a web server firewall using firewalld</span>
  <span class="na">hosts</span><span class="pi">:</span> <span class="s">testserver</span>
  <span class="na">tasks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install firewalld</span>
      <span class="na">apt</span><span class="pi">:</span> <span class="s">name=firewalld state=present</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">ensure that firewalld is started and enabled at boot</span>
      <span class="na">service</span><span class="pi">:</span> <span class="s">name=firewalld enabled=yes state=started</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">open up port 80 for tcp</span>
      <span class="na">firewalld</span><span class="pi">:</span> <span class="s">port=80/tcp permanent=yes state=enabled</span>
      <span class="na">notify</span><span class="pi">:</span> <span class="s">restart firewalld</span>
  <span class="na">handlers</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">restart firewalld</span>
      <span class="na">service</span><span class="pi">:</span> <span class="s">name=firewalld state=restarted</span>
</code></pre></div></div>
<p>There are a few things to note in the playbook above. The host(s) that the
playbook is applied to are defined in the <code class="language-plaintext highlighter-rouge">hosts</code> entry. In
this case it is only run against the Vagrant <code class="language-plaintext highlighter-rouge">testserver</code> alias we defined in
the inventory earlier. The <code class="language-plaintext highlighter-rouge">apt</code> module is used to install <code class="language-plaintext highlighter-rouge">firewalld</code> and
the <code class="language-plaintext highlighter-rouge">service</code> module is used to ensure that <code class="language-plaintext highlighter-rouge">firewalld</code> is started and
enabled at boot. Finally, we use Ansible’s <code class="language-plaintext highlighter-rouge">firewalld</code> module to open up port 80.
Note that this task makes use of the <code class="language-plaintext highlighter-rouge">notify</code> action to trigger the
<code class="language-plaintext highlighter-rouge">restart firewalld</code> handler, which we define at the end of the playbook.</p>
<h2 id="what-now">What now?</h2>
<p>There is much more to Ansible than what has been outlined here. The key is to
build things up bit by bit. You don’t need to use every feature of Ansible to
get a job done.</p>
<p>For more information about Ansible have a look at the
<a href="http://docs.ansible.com/">Ansible documentation</a>.
If you liked this article you may also be interested in the other Ansible
tutorials on this site, which illustrate the use of Ansible as a tool to install
scientific software.</p>
<ul>
<li><a href="/blog/how-to-create-automated-and-reproducible-work-flows-for-installing-scientific-software/">How to create automated and reproducible work flows for installing scientific software</a></li>
<li><a href="/blog/how-to-create-reusable-ansible-components/">How to create reusable Ansible components</a></li>
<li><a href="/blog/how-to-manage-firewalls-using-ferm-and-ansible/">How to manage firewalls using ferm and Ansible</a></li>
<li><a href="/blog/ansible-playbook-for-installing-the-gbrowse-genome-browser/">Ansible playbook for installing the Gbrowse genome browser</a></li>
</ul>
Biologist's Guide to Computing - a work in progress2015-12-05T00:00:00+00:00http://tjelvarolsson.com/blog/biologists-guide-to-computing<figure>
<img src="/images/biologists-guide-to-computing.png" alt="Binary tree." />
</figure>
<p>The reason there has been a bit of a radio silence on this blog for the past
couple of months is that I have been spending most of my spare time working on
a booklet about computing.</p>
<p>The booklet is intended for biologists that want to learn more about data
analysis. It will provide an introduction to some fundamental aspects of
computing required for learning scripting and programming. Furthermore, as well
as outlining basic principles of programming it will introduce some best
practices for keeping track of work and collaborating on projects.</p>
<p>These days many parts of the biological sciences are becoming more and more data
driven. Technological advancements have led to a huge increase in the
generation of biological data. Data analysis is required to extract biological
insights from this data. To a large extent the rate limiting factor in
generating biological insight is the lack of appropriate data analysis tools.</p>
<p>In these instances computers can be powerful allies. They are ideal for
automating repetitive tasks. Furthermore, they can perform calculations and
analysis that would be infeasible for the human brain alone.</p>
<p>The purpose of this booklet is not to provide a bundle of useful scripts and
regular expressions. Rather, its purpose is to outline a more productive way
of working that will make you a better scientist.</p>
<p>If this sounds interesting please encourage me to spend more time writing by
<a href="https://twitter.com/intent/tweet?url=http://tjelvarolsson.com/blog/biologists-guide-to-computing/&text=Biologist's Guide to Computing - a work in progress&via=tjelvar_olsson" target="_blank">
spreading the word</a> on Twitter
and <a href="https://tinyletter.com/tjelvarolsson" target="_blank">signing up</a> for the monthly newsletter.</p>
How to build a basic image viewer using FreeImage and SDL22015-10-10T00:00:00+00:00http://tjelvarolsson.com/blog/how-to-build-a-basic-image-viewer-using-freeimage-and-sdl2<p>In this blog post we will use FreeImage and SDL2 to create a basic image viewer
in C. <a href="http://freeimage.sourceforge.net/">FreeImage</a> is an open source library
for working with image files. It supports over 30 file formats, gives
access to meta-data and provides basic image manipulation routines.
<a href="https://www.libsdl.org/">SDL</a> (Simple DirectMedia Layer) is a cross-platform
library which provides low level access to things like the keyboard and mouse
as well as graphics hardware using OpenGL and Direct3D. SDL officially
supports Windows, Mac, Linux, iOS and Android.</p>
<p>By the end of this post we will have created a C program that can be used to
view RGB and grayscale images from the command line.</p>
<h2 id="argument-parsing">Argument parsing</h2>
<p>Let us start by adding some basic argument parsing. Add the code below to a
file named <code class="language-plaintext highlighter-rouge">see.c</code>.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <stdlib.h>
#include <stdio.h>
</span>
<span class="kt">char</span> <span class="o">*</span><span class="nf">parse_args_get_filename</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">argc</span> <span class="o">!=</span> <span class="mi">2</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Usage: %s FILENAME</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">argv</span><span class="p">[</span><span class="mi">0</span><span class="p">]);</span>
<span class="n">exit</span><span class="p">(</span><span class="mi">2</span><span class="p">);</span>
<span class="p">}</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">filename</span> <span class="o">=</span> <span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">];</span>
<span class="k">return</span> <span class="n">filename</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>There are a few things going on in the above. We include the <code class="language-plaintext highlighter-rouge"><stdlib.h></code> and
<code class="language-plaintext highlighter-rouge"><stdio.h></code> header files. The former provides the <code class="language-plaintext highlighter-rouge">exit()</code> function and the
latter the <code class="language-plaintext highlighter-rouge">fprintf()</code> function.</p>
<p>We also define the function <code class="language-plaintext highlighter-rouge">parse_args_get_filename()</code> that returns a
pointer to the <code class="language-plaintext highlighter-rouge">char</code> array holding our input file
name. Note that the <code class="language-plaintext highlighter-rouge">argc</code> variable is an integer that holds the number of
arguments supplied from the command line and <code class="language-plaintext highlighter-rouge">argv</code> is an array of pointers to
<code class="language-plaintext highlighter-rouge">char</code> arrays holding the values of the strings supplied from the command
line. In our program <code class="language-plaintext highlighter-rouge">argv</code> will contain two such pointers, one to the name
of the program and one to the filename of the image we want to view.</p>
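<p>To make the layout of the argument vector concrete, here is a minimal sketch of a
variant that returns <code class="language-plaintext highlighter-rouge">NULL</code> on bad usage instead of calling <code class="language-plaintext highlighter-rouge">exit()</code>.
The name <code class="language-plaintext highlighter-rouge">filename_or_null()</code> is hypothetical and is not part of <code class="language-plaintext highlighter-rouge">see.c</code>; it only
illustrates that the program name sits in slot 0 and the filename in slot 1.</p>

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant of parse_args_get_filename(): returns NULL on bad
 * usage rather than calling exit(), leaving the decision to the caller. */
char *filename_or_null(int argc, char *argv[]) {
    assert(argv != NULL);
    /* argv[0] is the program name, so exactly one extra argument means
     * argc == 2 and the filename lives in argv[1]. */
    if (argc != 2) {
        return NULL;
    }
    return argv[1];
}
```

<p>A caller would check the result for <code class="language-plaintext highlighter-rouge">NULL</code> and print the usage message itself,
which can be convenient if the parsing logic needs to be unit tested.</p>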
<h2 id="reading-in-the-image-using-freeimage">Reading in the image using FreeImage</h2>
<p>We will now use FreeImage to read in an image. Let us start by including the FreeImage header.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <stdlib.h>
#include <stdio.h>
</span>
<span class="cp">#include <FreeImage.h>
</span></code></pre></div></div>
<p>Now let us add a function for creating a FreeImage bitmap from an image file.
The function will return a pointer to the bitmap.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/** Initialise a FreeImage bitmap and return a pointer to it. */</span>
<span class="n">FIBITMAP</span> <span class="o">*</span><span class="nf">get_freeimage_bitmap</span><span class="p">(</span><span class="kt">char</span> <span class="o">*</span><span class="n">filename</span><span class="p">)</span> <span class="p">{</span>
<span class="n">FREE_IMAGE_FORMAT</span> <span class="n">filetype</span> <span class="o">=</span> <span class="n">FreeImage_GetFileType</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="n">FIBITMAP</span> <span class="o">*</span><span class="n">freeimage_bitmap</span> <span class="o">=</span> <span class="n">FreeImage_Load</span><span class="p">(</span><span class="n">filetype</span><span class="p">,</span> <span class="n">filename</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">return</span> <span class="n">freeimage_bitmap</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Here we use the <code class="language-plaintext highlighter-rouge">FreeImage_GetFileType()</code> function to determine the image
file type by analysing the bitmap signature. According to the
<a href="http://freeimage.sourceforge.net/documentation.html">FreeImage documentation</a>
the second parameter (<code class="language-plaintext highlighter-rouge">size</code>) is currently not in use and can
be set to <code class="language-plaintext highlighter-rouge">0</code>.</p>
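<p>To give a feel for what “analysing the signature” means, the sketch below checks a
buffer for the eight bytes that open every PNG file. This is only an illustration
of the idea for a single format, not FreeImage’s implementation, and the helper
name <code class="language-plaintext highlighter-rouge">looks_like_png()</code> is made up for this example.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* The eight bytes that open every PNG file. */
static const unsigned char PNG_SIGNATURE[8] = {
    0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A
};

/* Hypothetical helper: return 1 if the buffer starts with the PNG
 * signature, 0 otherwise. */
int looks_like_png(const unsigned char *header, size_t length) {
    assert(header != NULL);
    if (length < sizeof PNG_SIGNATURE) {
        return 0; /* Too short to hold a full signature. */
    }
    return memcmp(header, PNG_SIGNATURE, sizeof PNG_SIGNATURE) == 0;
}
```

<p>A library that supports many formats needs one such check per format, which is
why it is convenient to let FreeImage do the detection for us.</p>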
<p>We then use the <code class="language-plaintext highlighter-rouge">FreeImage_Load()</code> function to initialise the bitmap from the
input file name. The third parameter (<code class="language-plaintext highlighter-rouge">flags</code>) can be used to change the
loading behaviour for certain file types. As we are not interested in using
this functionality we can set it to <code class="language-plaintext highlighter-rouge">0</code>.</p>
<p>Note that the FreeImage bitmap is flipped vertically with respect to the
coordinate system used by SDL. We will deal with this in the next section.</p>
<h2 id="creating-a-sdl-surface-from-a-freeimage-bitmap">Creating a SDL surface from a FreeImage bitmap</h2>
<p>It is time to start using SDL. Let us therefore include the SDL header.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include <stdlib.h>
#include <stdio.h>
</span>
<span class="cp">#include <FreeImage.h>
</span>
<span class="cp">#include <SDL2/SDL.h>
</span></code></pre></div></div>
<p>Now we will create a function that takes our FreeImage bitmap and returns a
<a href="https://wiki.libsdl.org/SDL_Surface"><code class="language-plaintext highlighter-rouge">SDL_Surface</code></a> (a structure containing
pixel information). To achieve this we will make use of the
<a href="https://wiki.libsdl.org/SDL_CreateRGBSurfaceFrom"><code class="language-plaintext highlighter-rouge">SDL_CreateRGBSurfaceFrom()</code></a>
function.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/** Initialise a SDL surface and return a pointer to it.
*
* This function flips the FreeImage bitmap vertically to make it compatible
* with SDL's coordinate system.
*
* If the input image is in grayscale a custom palette is created for the
* surface.
*/</span>
<span class="n">SDL_Surface</span> <span class="o">*</span><span class="nf">get_sdl_surface</span><span class="p">(</span><span class="n">FIBITMAP</span> <span class="o">*</span><span class="n">freeimage_bitmap</span><span class="p">,</span> <span class="kt">int</span> <span class="n">is_grayscale</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Loaded image is upside down, so flip it.</span>
<span class="n">FreeImage_FlipVertical</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">);</span>
<span class="n">SDL_Surface</span> <span class="o">*</span><span class="n">sdl_surface</span> <span class="o">=</span> <span class="n">SDL_CreateRGBSurfaceFrom</span><span class="p">(</span>
<span class="n">FreeImage_GetBits</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">),</span>
<span class="n">FreeImage_GetWidth</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">),</span>
<span class="n">FreeImage_GetHeight</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">),</span>
<span class="n">FreeImage_GetBPP</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">),</span>
<span class="n">FreeImage_GetPitch</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">),</span>
<span class="n">FreeImage_GetRedMask</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">),</span>
<span class="n">FreeImage_GetGreenMask</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">),</span>
<span class="n">FreeImage_GetBlueMask</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">),</span>
<span class="mi">0</span>
<span class="p">);</span>
<span class="k">if</span> <span class="p">(</span><span class="n">sdl_surface</span> <span class="o">==</span> <span class="nb">NULL</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Failed to create surface: %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">());</span>
<span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">if</span> <span class="p">(</span><span class="n">is_grayscale</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// To display a grayscale image we need to create a custom palette.</span>
<span class="n">SDL_Color</span> <span class="n">colors</span><span class="p">[</span><span class="mi">256</span><span class="p">];</span>
<span class="kt">int</span> <span class="n">i</span><span class="p">;</span>
<span class="k">for</span><span class="p">(</span><span class="n">i</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span> <span class="n">i</span> <span class="o"><</span> <span class="mi">256</span><span class="p">;</span> <span class="n">i</span><span class="o">++</span><span class="p">)</span> <span class="p">{</span>
<span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">r</span> <span class="o">=</span> <span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">g</span> <span class="o">=</span> <span class="n">colors</span><span class="p">[</span><span class="n">i</span><span class="p">].</span><span class="n">b</span> <span class="o">=</span> <span class="n">i</span><span class="p">;</span>
<span class="p">}</span>
<span class="n">SDL_SetPaletteColors</span><span class="p">(</span><span class="n">sdl_surface</span><span class="o">-></span><span class="n">format</span><span class="o">-></span><span class="n">palette</span><span class="p">,</span> <span class="n">colors</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">256</span><span class="p">);</span>
<span class="p">}</span>
<span class="k">return</span> <span class="n">sdl_surface</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>We start off by flipping the bitmap vertically. Because the flip modifies the
bitmap in place, it is a side effect of the function: the caller’s bitmap
remains flipped after the function returns.</p>
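<p>Conceptually, a vertical flip swaps the first row of pixels with the last, the
second with the second to last, and so on. The sketch below does this on a raw
buffer, assuming rows are stored one after another and each occupies
<code class="language-plaintext highlighter-rouge">pitch</code> bytes. The function <code class="language-plaintext highlighter-rouge">flip_rows_in_place()</code> is hypothetical and is not how
FreeImage necessarily implements its own flip.</p>

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: reverse the order of the rows of a pixel buffer in
 * place. `pitch` is the number of bytes per row. Returns 0 on success and
 * -1 if the temporary row buffer cannot be allocated. */
int flip_rows_in_place(unsigned char *bits, size_t height, size_t pitch) {
    assert(bits != NULL);
    if (height < 2 || pitch == 0) {
        return 0; /* Nothing to swap. */
    }
    unsigned char *tmp = malloc(pitch);
    if (tmp == NULL) {
        return -1;
    }
    size_t top = 0;
    size_t bottom = height - 1;
    while (top < bottom) {
        unsigned char *top_row = bits + top * pitch;
        unsigned char *bottom_row = bits + bottom * pitch;
        /* Swap the two rows via the temporary buffer. */
        memcpy(tmp, top_row, pitch);
        memcpy(top_row, bottom_row, pitch);
        memcpy(bottom_row, tmp, pitch);
        top++;
        bottom--;
    }
    free(tmp);
    return 0;
}
```

<p>Because the buffer itself is rewritten, anything else holding a pointer to it sees
the flipped rows as well.</p>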
<p>We then create the SDL surface using the
<a href="https://wiki.libsdl.org/SDL_CreateRGBSurfaceFrom"><code class="language-plaintext highlighter-rouge">SDL_CreateRGBSurfaceFrom()</code></a>
function, which (amongst others) takes as input the red, green and blue masks
of the FreeImage bitmap. The functions for accessing these masks
(<code class="language-plaintext highlighter-rouge">FreeImage_GetRedMask()</code>, etc) work even if the FreeImage bitmap comes from a
single channel input image (gray scale). If the input image is in gray scale we
therefore need to create a custom palette for it and associate this with the
SDL surface that we have created.</p>
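<p>The role of the palette can be sketched without SDL. In the example below each
8-bit pixel value is treated as an index into a table of 256 colours, and a
linear gray ramp makes index <em>i</em> display as the gray level <em>i</em>. The type
<code class="language-plaintext highlighter-rouge">struct rgb_entry</code> is a hypothetical stand-in for <code class="language-plaintext highlighter-rouge">SDL_Color</code>, and both function
names are made up for illustration.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for SDL_Color, so the sketch needs no SDL headers. */
struct rgb_entry {
    uint8_t r, g, b;
};

/* Fill a 256-entry palette with a linear gray ramp. */
void fill_gray_palette(struct rgb_entry palette[256]) {
    assert(palette != NULL);
    for (int i = 0; i < 256; i++) {
        palette[i].r = palette[i].g = palette[i].b = (uint8_t)i;
    }
}

/* Resolve one 8-bit pixel value to the colour it is displayed as. */
struct rgb_entry lookup_pixel(const struct rgb_entry palette[256], uint8_t pixel) {
    assert(palette != NULL);
    return palette[pixel];
}
```

<p>Without such a ramp the indices have no defined colour, which is why the palette
has to be set explicitly for gray scale input.</p>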
<h2 id="creating-a-sdl-window">Creating a SDL window</h2>
<p>Our image viewer will display the image in a SDL window. For the sake of
minimalism (and simplicity) this will be a border-less window displayed in the
centre of the screen.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/** Initialise a SDL window and return a pointer to it. */</span>
<span class="n">SDL_Window</span> <span class="o">*</span><span class="nf">get_sdl_window</span><span class="p">(</span><span class="kt">int</span> <span class="n">width</span><span class="p">,</span> <span class="kt">int</span> <span class="n">height</span><span class="p">)</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">SDL_Init</span><span class="p">(</span><span class="n">SDL_INIT_VIDEO</span><span class="p">)</span> <span class="o"><</span> <span class="mi">0</span><span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"SDL couldn't initialise: %s.</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">());</span>
<span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">SDL_Window</span> <span class="o">*</span><span class="n">sdl_window</span><span class="p">;</span>
<span class="n">sdl_window</span> <span class="o">=</span> <span class="n">SDL_CreateWindow</span><span class="p">(</span> <span class="s">"Image"</span><span class="p">,</span>
<span class="n">SDL_WINDOWPOS_CENTERED</span><span class="p">,</span>
<span class="n">SDL_WINDOWPOS_CENTERED</span><span class="p">,</span>
<span class="n">width</span><span class="p">,</span>
<span class="n">height</span><span class="p">,</span>
<span class="n">SDL_WINDOW_BORDERLESS</span><span class="p">);</span>
<span class="k">return</span> <span class="n">sdl_window</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="rendering-the-surface-as-a-texture-in-the-window-aka-displaying-the-image">Rendering the surface as a texture in the window (a.k.a. displaying the image)</h2>
<p>We now need some code to render the surface as a texture in the window. In the
code we will do this back to front, by using the window to generate a renderer
and using the renderer to generate a texture. Finally, the renderer is cleared
before adding the texture and presenting it.</p>
<p>It is worth noting that a
<a href="https://wiki.libsdl.org/SDL_Texture"><code class="language-plaintext highlighter-rouge">SDL_Texture</code></a> is <q>a structure that
contains an efficient, driver-specific representation of pixel data</q>.
Which means that, unlike a
<a href="https://wiki.libsdl.org/SDL_Surface"><code class="language-plaintext highlighter-rouge">SDL_Surface</code></a>,
it can be processed by the GPU.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/** Display the image by rendering the surface as a texture in the window. */</span>
<span class="kt">void</span> <span class="nf">render_image</span><span class="p">(</span><span class="n">SDL_Window</span> <span class="o">*</span><span class="n">window</span><span class="p">,</span> <span class="n">SDL_Surface</span> <span class="o">*</span><span class="n">surface</span><span class="p">)</span> <span class="p">{</span>
<span class="n">SDL_Renderer</span><span class="o">*</span> <span class="n">renderer</span> <span class="o">=</span> <span class="n">SDL_CreateRenderer</span><span class="p">(</span><span class="n">window</span><span class="p">,</span> <span class="o">-</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span> <span class="n">renderer</span> <span class="o">==</span> <span class="nb">NULL</span> <span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Failed to create renderer: %s</span><span class="se">\n</span><span class="s">"</span><span class="p">,</span> <span class="n">SDL_GetError</span><span class="p">());</span>
<span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">SDL_Texture</span><span class="o">*</span> <span class="n">texture</span> <span class="o">=</span> <span class="n">SDL_CreateTextureFromSurface</span><span class="p">(</span><span class="n">renderer</span><span class="p">,</span> <span class="n">surface</span><span class="p">);</span>
<span class="k">if</span> <span class="p">(</span> <span class="n">texture</span> <span class="o">==</span> <span class="nb">NULL</span> <span class="p">)</span> <span class="p">{</span>
<span class="n">fprintf</span><span class="p">(</span><span class="n">stderr</span><span class="p">,</span> <span class="s">"Failed to load image as texture</span><span class="se">\n</span><span class="s">"</span><span class="p">);</span>
<span class="n">exit</span><span class="p">(</span><span class="mi">1</span><span class="p">);</span>
<span class="p">}</span>
<span class="n">SDL_RenderClear</span><span class="p">(</span><span class="n">renderer</span><span class="p">);</span>
<span class="n">SDL_RenderCopy</span><span class="p">(</span><span class="n">renderer</span><span class="p">,</span> <span class="n">texture</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">,</span> <span class="nb">NULL</span><span class="p">);</span>
<span class="n">SDL_RenderPresent</span><span class="p">(</span><span class="n">renderer</span><span class="p">);</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note that the third parameter of the
<a href="https://wiki.libsdl.org/SDL_RenderCopy"><code class="language-plaintext highlighter-rouge">SDL_RenderCopy()</code></a>
function (<code class="language-plaintext highlighter-rouge">srcrect</code>) is a pointer to the source rectangle and can be used to
implement zooming. However, here we set it to <code class="language-plaintext highlighter-rouge">NULL</code> to display the entire
texture.</p>
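<p>As a rough sketch of how zooming could work, the function below computes the
centred region of a texture that would fill the window at a given integer zoom
factor. The type <code class="language-plaintext highlighter-rouge">struct rect</code> is a hypothetical stand-in with the same fields as
<code class="language-plaintext highlighter-rouge">SDL_Rect</code>, and <code class="language-plaintext highlighter-rouge">centred_zoom_rect()</code> is an assumption for illustration; the viewer
in this post does not implement zooming.</p>

```c
#include <assert.h>

/* Hypothetical stand-in with the same fields as SDL_Rect. */
struct rect {
    int x, y, w, h;
};

/* Return the centred region of a tex_w x tex_h texture that fills the
 * window when magnified by an integer factor `zoom` (1 = whole texture). */
struct rect centred_zoom_rect(int tex_w, int tex_h, int zoom) {
    assert(tex_w > 0 && tex_h > 0 && zoom >= 1);
    struct rect src;
    /* A larger zoom factor means a smaller source region. */
    src.w = tex_w / zoom;
    src.h = tex_h / zoom;
    /* Centre the region within the texture. */
    src.x = (tex_w - src.w) / 2;
    src.y = (tex_h - src.h) / 2;
    return src;
}
```

<p>With SDL the same four values would go into an <code class="language-plaintext highlighter-rouge">SDL_Rect</code> whose address is passed
as the third argument of <code class="language-plaintext highlighter-rouge">SDL_RenderCopy()</code>; leaving the fourth argument as
<code class="language-plaintext highlighter-rouge">NULL</code> stretches that region over the whole window.</p>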
<h2 id="giving-the-user-the-chance-to-view-the-image">Giving the user the chance to view the image</h2>
<p>At this point we need some sort of event loop to make sure that the image does
not vanish instantaneously after having been rendered in the window. Below is
a simple event loop that ends when the user presses a key on the keyboard.</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cm">/** Loop until a key is pressed. */</span>
<span class="kt">void</span> <span class="nf">event_loop</span><span class="p">()</span> <span class="p">{</span>
<span class="kt">int</span> <span class="n">done</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="n">SDL_Event</span> <span class="n">e</span><span class="p">;</span>
<span class="k">while</span> <span class="p">(</span><span class="o">!</span><span class="n">done</span><span class="p">)</span> <span class="p">{</span>
<span class="k">while</span> <span class="p">(</span><span class="n">SDL_PollEvent</span><span class="p">(</span><span class="o">&</span><span class="n">e</span><span class="p">))</span> <span class="p">{</span>
<span class="k">if</span> <span class="p">(</span><span class="n">e</span><span class="p">.</span><span class="n">type</span> <span class="o">==</span> <span class="n">SDL_KEYDOWN</span><span class="p">)</span> <span class="p">{</span>
<span class="n">done</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
<span class="p">}</span>
</code></pre></div></div>
<h2 id="putting-it-all-together">Putting it all together</h2>
<p>Finally we add the main logic of the code. This includes some functionality for
checking if the input image is gray scale. We achieve this by checking if the
FreeImage colour type is <code class="language-plaintext highlighter-rouge">FIC_MINISBLACK</code>. Other colour types include
<code class="language-plaintext highlighter-rouge">FIC_RGB</code> and <code class="language-plaintext highlighter-rouge">FIC_CMYK</code>.</p>
<p>As gray scale images can have more than 8 bits per pixel (quite common when dealing with
microscopy images) we make sure that we compress the data to 8 bits using the
<code class="language-plaintext highlighter-rouge">FreeImage_ConvertToGreyscale()</code> function.</p>
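<p>One simple way to compress 16-bit gray values to 8 bits is to keep only the most
significant byte of each sample. The sketch below shows that approach on a plain
array; it is an assumption made for illustration, the helper name
<code class="language-plaintext highlighter-rouge">gray16_to_gray8()</code> is hypothetical, and FreeImage may well use a different
scaling internally.</p>

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: reduce 16-bit gray values to 8 bits by keeping the
 * most significant byte of each sample (equivalent to dividing by 256). */
void gray16_to_gray8(const uint16_t *src, uint8_t *dst, size_t count) {
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < count; i++) {
        dst[i] = (uint8_t)(src[i] >> 8);
    }
}
```

<p>Note that with this scheme an image that only uses the lower part of the 16-bit
range will look very dark, so a more careful viewer might rescale using the
minimum and maximum intensities actually present in the image.</p>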
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">int</span> <span class="nf">main</span><span class="p">(</span><span class="kt">int</span> <span class="n">argc</span><span class="p">,</span> <span class="kt">char</span> <span class="o">*</span><span class="n">argv</span><span class="p">[])</span> <span class="p">{</span>
<span class="kt">char</span> <span class="o">*</span><span class="n">filename</span> <span class="o">=</span> <span class="n">parse_args_get_filename</span><span class="p">(</span><span class="n">argc</span><span class="p">,</span> <span class="n">argv</span><span class="p">);</span>
<span class="n">FIBITMAP</span> <span class="o">*</span><span class="n">freeimage_bitmap</span> <span class="o">=</span> <span class="n">get_freeimage_bitmap</span><span class="p">(</span><span class="n">filename</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">is_grayscale</span> <span class="o">=</span> <span class="mi">0</span><span class="p">;</span>
<span class="k">if</span> <span class="p">(</span><span class="n">FreeImage_GetColorType</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">)</span> <span class="o">==</span> <span class="n">FIC_MINISBLACK</span><span class="p">)</span> <span class="p">{</span>
<span class="c1">// Single channel so ensure image is compressed to 8-bit.</span>
<span class="n">is_grayscale</span> <span class="o">=</span> <span class="mi">1</span><span class="p">;</span>
<span class="n">FIBITMAP</span> <span class="o">*</span><span class="n">tmp_bitmap</span> <span class="o">=</span> <span class="n">FreeImage_ConvertToGreyscale</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">);</span>
<span class="n">FreeImage_Unload</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">);</span>
<span class="n">freeimage_bitmap</span> <span class="o">=</span> <span class="n">tmp_bitmap</span><span class="p">;</span>
<span class="p">}</span>
<span class="kt">int</span> <span class="n">width</span> <span class="o">=</span> <span class="n">FreeImage_GetWidth</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">);</span>
<span class="kt">int</span> <span class="n">height</span> <span class="o">=</span> <span class="n">FreeImage_GetHeight</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">);</span>
<span class="n">SDL_Window</span> <span class="o">*</span><span class="n">sdl_window</span> <span class="o">=</span> <span class="n">get_sdl_window</span><span class="p">(</span><span class="n">width</span><span class="p">,</span> <span class="n">height</span><span class="p">);</span>
<span class="n">SDL_Surface</span> <span class="o">*</span><span class="n">sdl_surface</span> <span class="o">=</span> <span class="n">get_sdl_surface</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">,</span> <span class="n">is_grayscale</span><span class="p">);</span>
<span class="n">render_image</span><span class="p">(</span><span class="n">sdl_window</span><span class="p">,</span> <span class="n">sdl_surface</span><span class="p">);</span>
<span class="n">event_loop</span><span class="p">();</span>
<span class="n">FreeImage_Unload</span><span class="p">(</span><span class="n">freeimage_bitmap</span><span class="p">);</span>
<span class="n">SDL_FreeSurface</span><span class="p">(</span><span class="n">sdl_surface</span><span class="p">);</span>
<span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note that we free up the dynamically allocated bitmap and surface memory using
<code class="language-plaintext highlighter-rouge">FreeImage_Unload()</code> and
<a href="https://wiki.libsdl.org/SDL_FreeSurface"><code class="language-plaintext highlighter-rouge">SDL_FreeSurface()</code></a>
before we exit the program.</p>
<h2 id="compiling-and-linking">Compiling and linking</h2>
<p>Now we can compile the code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -c see.c
</code></pre></div></div>
<p>This creates the object file <code class="language-plaintext highlighter-rouge">see.o</code> which contains machine code as well as
information that allows a linker to find out which symbols (global objects,
functions, etc) it requires in order to work.</p>
<p>Let us link our object file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ gcc -o see see.o -lfreeimage -lSDL2
</code></pre></div></div>
<p>This produces the executable file <code class="language-plaintext highlighter-rouge">see</code>, which we can test using the command
below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./see image.png
</code></pre></div></div>
<h2 id="conclusion">Conclusion</h2>
<p>FreeImage and SDL are useful C libraries for working with images and graphical
user interfaces, respectively. In this post we have used the two in combination
to create a basic image viewer that can parse over 30 image file formats and
display both RGB and gray scale images correctly.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>This blog post was inspired by and based on some of the code in the
<a href="https://github.com/JIC-CSB/eye">github.com/JIC-CSB/eye</a> project.</p>
How to continuously test your Python code on Windows using AppVeyor2015-09-04T00:00:00+00:00http://tjelvarolsson.com/blog/how-to-continuously-test-your-python-code-on-windows-using-appveyor<p>In the
<a href="/blog/five-steps-to-add-the-bling-factor-to-your-python-package/">previous post</a>
I illustrated how to set up continuous integration testing of your Python code
using <a href="https://travis-ci.org">Travis CI</a>. Travis CI is great when working on
Linux. However, what can you do if you want to set up automated continuous
integration testing on Windows?</p>
<p><em>To me, a Linux enthusiast, this problem sounded almost insurmountable…</em></p>
<h2 id="appveyor-to-the-rescue">AppVeyor to the rescue</h2>
<p>However, it turns out that
<a href="http://www.appveyor.com">AppVeyor</a>
has provided a service for solving this problem.</p>
<p>One simply needs to create an <code class="language-plaintext highlighter-rouge">appveyor.yml</code> file to configure the running of
the test suite. The code below creates a testing matrix for running the test
suite on 32-bit Python 2.7, 3.3 and 3.4 using the <code class="language-plaintext highlighter-rouge">nosetests</code> test runner.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">build</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">environment</span><span class="pi">:</span>
  <span class="na">matrix</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">PYTHON</span><span class="pi">:</span> <span class="s2">"</span><span class="s">C:</span><span class="se">\\</span><span class="s">Python27"</span>
      <span class="na">PYTHON_VERSION</span><span class="pi">:</span> <span class="s2">"</span><span class="s">2.7.8"</span>
      <span class="na">PYTHON_ARCH</span><span class="pi">:</span> <span class="s2">"</span><span class="s">32"</span>
    <span class="pi">-</span> <span class="na">PYTHON</span><span class="pi">:</span> <span class="s2">"</span><span class="s">C:</span><span class="se">\\</span><span class="s">Python33"</span>
      <span class="na">PYTHON_VERSION</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.3.5"</span>
      <span class="na">PYTHON_ARCH</span><span class="pi">:</span> <span class="s2">"</span><span class="s">32"</span>
    <span class="pi">-</span> <span class="na">PYTHON</span><span class="pi">:</span> <span class="s2">"</span><span class="s">C:</span><span class="se">\\</span><span class="s">Python34"</span>
      <span class="na">PYTHON_VERSION</span><span class="pi">:</span> <span class="s2">"</span><span class="s">3.4.1"</span>
      <span class="na">PYTHON_ARCH</span><span class="pi">:</span> <span class="s2">"</span><span class="s">32"</span>
<span class="na">init</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">ECHO</span><span class="nv"> </span><span class="s">%PYTHON%</span><span class="nv"> </span><span class="s">%PYTHON_VERSION%</span><span class="nv"> </span><span class="s">%PYTHON_ARCH%"</span>
<span class="na">install</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">%PYTHON%/Scripts/pip.exe</span><span class="nv"> </span><span class="s">install</span><span class="nv"> </span><span class="s">nose"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">%PYTHON%/Scripts/pip.exe</span><span class="nv"> </span><span class="s">install</span><span class="nv"> </span><span class="s">coverage"</span>
<span class="na">test_script</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">%PYTHON%/Scripts/nosetests"</span>
</code></pre></div></div>
<p>Note that we use <code class="language-plaintext highlighter-rouge">pip</code> to install the <code class="language-plaintext highlighter-rouge">nose</code> and <code class="language-plaintext highlighter-rouge">coverage</code> packages
before we run the test suite.</p>
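<p>When one job in the matrix misbehaves it can help to confirm which interpreter
actually ran. The snippet below is a minimal sketch of such a check, not part of
the AppVeyor configuration above; the function name <code class="language-plaintext highlighter-rouge">interpreter_summary</code> is
an assumption made for illustration.</p>

```python
import platform
import struct
import sys


def interpreter_summary():
    """Return (version string, pointer width in bits) for the running Python."""
    version = "{}.{}.{}".format(*sys.version_info[:3])
    # The size of a C pointer distinguishes 32-bit from 64-bit builds.
    bits = struct.calcsize("P") * 8
    return version, bits


if __name__ == "__main__":
    version, bits = interpreter_summary()
    print("Python {} ({}-bit) on {}".format(version, bits, platform.system()))
```

<p>Run as an extra command in a build, the printed line can be compared against the
<code class="language-plaintext highlighter-rouge">PYTHON_VERSION</code> and <code class="language-plaintext highlighter-rouge">PYTHON_ARCH</code> values echoed in the <code class="language-plaintext highlighter-rouge">init</code> step.</p>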
<p>Commit and push this file and login to AppVeyor using your GitHub account. Sync
your GitHub repositories and then select the project you want AppVeyor to run
continuous integration testing on.</p>
<p><em>Job done!</em></p>
<h2 id="using-minconda-to-test-projects-that-depend-on-the-numpyscipy-stack">Using Miniconda to test projects that depend on the <code class="language-plaintext highlighter-rouge">numpy</code>/<code class="language-plaintext highlighter-rouge">scipy</code> stack</h2>
<p>Again, testing projects that depend on <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">scipy</code> presents problems
in that these packages take too long to build from scratch. However, just like
in the
<a href="/blog/five-steps-to-add-the-bling-factor-to-your-python-package/">previous post</a>
we can make use of <a href="http://conda.pydata.org/docs/index.html">Miniconda</a>.</p>
<p>In fact the kind people at AppVeyor have already deployed Miniconda to their
build workers
(<a href="https://github.com/appveyor/ci/issues/359">github.com/appveyor/ci/issues/359</a>).</p>
<p>So to test a project that depends on <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">scipy</code> one can simply
use the <code class="language-plaintext highlighter-rouge">appveyor.yml</code> file below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">build</span><span class="pi">:</span> <span class="no">false</span>
<span class="na">environment</span><span class="pi">:</span>
  <span class="na">matrix</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">PYTHON_VERSION</span><span class="pi">:</span> <span class="m">2.7</span>
      <span class="na">MINICONDA</span><span class="pi">:</span> <span class="s">C:\Miniconda</span>
    <span class="pi">-</span> <span class="na">PYTHON_VERSION</span><span class="pi">:</span> <span class="m">3.4</span>
      <span class="na">MINICONDA</span><span class="pi">:</span> <span class="s">C:\Miniconda3</span>
<span class="na">init</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">ECHO</span><span class="nv"> </span><span class="s">%PYTHON_VERSION%</span><span class="nv"> </span><span class="s">%MINICONDA%"</span>
<span class="na">install</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">set</span><span class="nv"> </span><span class="s">PATH=%MINICONDA%;%MINICONDA%</span><span class="se">\\</span><span class="s">Scripts;%PATH%"</span>
  <span class="pi">-</span> <span class="s">conda config --set always_yes yes --set changeps1 no</span>
  <span class="pi">-</span> <span class="s">conda update -q conda</span>
  <span class="pi">-</span> <span class="s">conda info -a</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">conda</span><span class="nv"> </span><span class="s">create</span><span class="nv"> </span><span class="s">-q</span><span class="nv"> </span><span class="s">-n</span><span class="nv"> </span><span class="s">test-environment</span><span class="nv"> </span><span class="s">python=%PYTHON_VERSION%</span><span class="nv"> </span><span class="s">numpy</span><span class="nv"> </span><span class="s">scipy</span><span class="nv"> </span><span class="s">nose"</span>
  <span class="pi">-</span> <span class="s">activate test-environment</span>
  <span class="pi">-</span> <span class="s">pip install coverage</span>
<span class="na">test_script</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">nosetests</span>
</code></pre></div></div>
<p>The script above installs <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">scipy</code> and <code class="language-plaintext highlighter-rouge">nose</code> using the
Conda package manager. However, the Conda package manager does not contain
the <code class="language-plaintext highlighter-rouge">coverage</code> package. We therefore install that using <code class="language-plaintext highlighter-rouge">pip</code> instead after
the virtual environment has been activated.</p>
<p>The fact that Miniconda is included on the AppVeyor build workers makes it trivial to test
Python code with scientific dependencies.</p>
<p><em>Great stuff!</em></p>
<h2 id="see-also">See also</h2>
<ul>
<li>Olivier Grisel and Kyle Kastner’s tutorial
<a href="https://packaging.python.org/en/latest/appveyor.html">Building Binary Wheels for Windows using Appveyor</a></li>
<li>Robert T. McGibbon’s
<a href="https://github.com/rmcgibbo/python-appveyor-conda-example">Python Appveyor Conda example</a></li>
</ul>
Five steps to add the 'bling' factor to your Python package2015-08-30T00:00:00+00:00http://tjelvarolsson.com/blog/five-steps-to-add-the-bling-factor-to-your-python-package<h2 id="introduction">Introduction</h2>
<p>In previous posts I have shown how to create a Python package.</p>
<p>We started by
<a href="/blog/using-cookiecutter-a-passive-code-generator/">using Cookiecutter to generate a basic structure for our project</a>.
We then looked at
<a href="/blog/begginers-guide-creating-clean-python-development-environments/">how to set up and use clean development environments</a>.
This was followed by an
<a href="/blog/four-tools-for-testing-your-python-code/">outline of Python tools for testing</a>
and the implementation of the Python package using
<a href="/blog/test-driven-develpment-for-scientists/">test-driven development</a>.
Finally we looked at
<a href="/blog/how-to-generate-beautiful-technical-documentation/">how to generate beautiful technical documentation using Sphinx</a>.</p>
<p>Now it is time to show off our hard work. In this post I will show you how to
make use of cloud services to host your documentation, run continuous
integration tests and distribute your package. Furthermore, we will add
neat looking badges to the README file.</p>
<h2 id="step-1-host-the-documentation-on-readthedocs">Step 1: Host the documentation on readthedocs</h2>
<p>You have spent hours documenting your package using Sphinx. It is time to share
it with the world. Register with <a href="https://readthedocs.org">readthedocs</a> and
sync your GitHub account with it. Then you can simply select the project that
you want readthedocs to host documentation for.</p>
<p>If you are using Sphinx’s <a href="http://sphinx-doc.org/ext/autodoc.html">autodoc</a>
functionality and your package depends on <code class="language-plaintext highlighter-rouge">numpy</code>/<code class="language-plaintext highlighter-rouge">scipy</code>/<code class="language-plaintext highlighter-rouge">matplotlib</code>
you may run into trouble as Readthedocs’ server may not be able to compile the
C extensions. The first thing to try is to go into the advanced settings
section of your project in Readthedocs’ web interface and make sure that the
project is set to install into a virtual environment and that the option to
“Give the virtual environment access to the global site-packages dir” is
selected. The system packages now appear to include <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">scipy</code>, and
<code class="language-plaintext highlighter-rouge">matplotlib</code> so this should go a long way. However, if you are still running
into trouble you may need to
<a href="https://read-the-docs.readthedocs.org/en/latest/faq.html#i-get-import-errors-on-libraries-that-depend-on-c-modules">mock out the dependencies</a>.</p>
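<p>Mocking out the dependencies can be sketched as follows. This is a minimal
illustration intended for the top of a Sphinx <code class="language-plaintext highlighter-rouge">conf.py</code>, assuming Python 3.3 or later
(where <code class="language-plaintext highlighter-rouge">unittest.mock</code> is in the standard library; on Python 2 the separate <code class="language-plaintext highlighter-rouge">mock</code>
package would be needed). The module list is hypothetical and should be replaced
with whatever your package imports.</p>

```python
import sys
from unittest import mock

# Hypothetical list: the C-backed modules that the documented package imports.
MOCK_MODULES = [
    "numpy",
    "scipy",
    "scipy.ndimage",
    "matplotlib",
    "matplotlib.pyplot",
]

for name in MOCK_MODULES:
    # Register a stand-in so that "import numpy" succeeds during the docs
    # build even though the real library cannot be compiled on the server.
    sys.modules[name] = mock.MagicMock()
```

<p>With the stand-ins registered before autodoc imports your package, the import
statements succeed and the docstrings can be extracted. Anything evaluated at
import time that relies on real numerical results will of course not behave
normally.</p>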
<h2 id="step-2-set-up-continuous-integration-testing-on-travis-ci">Step 2: Set up continuous integration testing on Travis CI</h2>
<p>You have spent hours using test-driven development to create a solid Python
package. It is time to automate the running of the test suite and to get
automatic testing of the code on different versions of Python.</p>
<p>Sign into <a href="https://travis-ci.org">Travis CI</a> using your GitHub account. Select
the project that you want to test and add a <code class="language-plaintext highlighter-rouge">.travis.yml</code> file to the root
of your code repository.</p>
<p>Below is a simple setup for testing a Python package with no dependencies on
Python versions 2.7, 3.2, 3.3 and 3.4 using the <code class="language-plaintext highlighter-rouge">nose</code> test runner.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">language</span><span class="pi">:</span> <span class="s">python</span>
<span class="na">python</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">2.7"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">3.2"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">3.3"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">3.4"</span>
<span class="na">script</span><span class="pi">:</span> <span class="s">nosetests</span>
</code></pre></div></div>
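<p>For reference, the test runner collects functions whose names start with
<code class="language-plaintext highlighter-rouge">test_</code> from modules whose names start with <code class="language-plaintext highlighter-rouge">test</code>. The sketch below shows the
shape of such a module. The function <code class="language-plaintext highlighter-rouge">reverse_complement</code> is a hypothetical
example defined inline to keep the snippet self-contained; in a real project it
would be imported from the package under test.</p>

```python
"""test_sequence.py: a hypothetical test module for illustration."""


def reverse_complement(sequence):
    """Return the reverse complement of an upper-case DNA string."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(sequence))


def test_reverse_complement():
    assert reverse_complement("ATGC") == "GCAT"


def test_reverse_complement_of_empty_string():
    assert reverse_complement("") == ""
```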
<p>If your code includes dependencies on <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">scipy</code> things get a
little bit trickier as Travis CI can time out trying to install these from source.
The solution is to make use of <a href="http://conda.pydata.org/docs/index.html">Miniconda</a>.</p>
<p>The <code class="language-plaintext highlighter-rouge">.travis.yml</code> file below is based on the template from the
<a href="http://conda.pydata.org/docs/travis.html#using-conda-with-travis-ci">conda documentation</a>
and Dan Blanchard’s post
<a href="https://gist.github.com/dan-blanchard/7045057">Quicker Travis builds that rely on numpy and scipy using Miniconda</a>.
It installs Miniconda with <code class="language-plaintext highlighter-rouge">numpy</code>, <code class="language-plaintext highlighter-rouge">scipy</code> and <code class="language-plaintext highlighter-rouge">nose</code> and runs the test
suite on Python 2.7, 3.3 and 3.4.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">language</span><span class="pi">:</span> <span class="s">python</span>
<span class="na">python</span><span class="pi">:</span>
  <span class="c1"># We don't actually use the Travis Python, but this keeps it organized.</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">2.7"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">3.3"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">3.4"</span>
<span class="na">install</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s">sudo apt-get update</span>
  <span class="c1"># We do this conditionally because it saves us some downloading if the</span>
  <span class="c1"># version is the same.</span>
  <span class="pi">-</span> <span class="s">if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then</span>
      <span class="s">wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh;</span>
    <span class="s">else</span>
      <span class="s">wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;</span>
    <span class="s">fi</span>
  <span class="pi">-</span> <span class="s">bash miniconda.sh -b -p $HOME/miniconda</span>
  <span class="pi">-</span> <span class="s">export PATH="$HOME/miniconda/bin:$PATH"</span>
  <span class="pi">-</span> <span class="s">hash -r</span>
  <span class="pi">-</span> <span class="s">conda config --set always_yes yes --set changeps1 no</span>
  <span class="pi">-</span> <span class="s">conda update -q conda</span>
  <span class="c1"># Useful for debugging any issues with conda</span>
  <span class="pi">-</span> <span class="s">conda info -a</span>
  <span class="pi">-</span> <span class="s">conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION numpy scipy nose</span>
  <span class="pi">-</span> <span class="s">source activate test-environment</span>
  <span class="pi">-</span> <span class="s">python setup.py install</span>
<span class="c1"># command to run tests</span>
<span class="na">script</span><span class="pi">:</span> <span class="s">nosetests</span>
</code></pre></div></div>
<h2 id="step-3-calculate-your-code-coverage-using-codecov">Step 3: Calculate your code coverage using Codecov</h2>
<p>As you have developed your code using test-driven development you have a high
degree of code coverage. It is time to integrate the code coverage calculation
into the Travis CI testing. We will use <a href="https://codecov.io">Codecov</a> to do
this.</p>
<p>Sign in using your GitHub account, sync your repos and add the project that you
want to measure the code coverage for. Then edit the <code class="language-plaintext highlighter-rouge">.travis.yml</code> file to
look like the below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="na">language</span><span class="pi">:</span> <span class="s">python</span>
<span class="na">python</span><span class="pi">:</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">2.7"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">3.2"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">3.3"</span>
  <span class="pi">-</span> <span class="s2">"</span><span class="s">3.4"</span>
<span class="na">script</span><span class="pi">:</span> <span class="s">nosetests</span>
<span class="na">before_install</span><span class="pi">:</span>
  <span class="s">pip install codecov</span>
<span class="na">after_success</span><span class="pi">:</span>
  <span class="s">codecov</span>
</code></pre></div></div>
<h2 id="step-4-upload-your-package-to-pypi">Step 4: Upload your package to PyPI</h2>
<p>You have developed a great Python package. It is time to share it with the
world. This is done, most effectively, by uploading it to
<a href="https://pypi.python.org/pypi">PyPI</a>.</p>
<p>Peter Downs has written a great post explaining
<a href="http://peterdowns.com/posts/first-time-with-pypi.html">how to submit a package to PyPI</a>.</p>
<p>Hosting your package on PyPI makes it easy for people to install using <code class="language-plaintext highlighter-rouge">pip</code>.</p>
<h2 id="step-5-add-badges-to-your-projects-readme-file">Step 5: Add badges to your project’s README file</h2>
<p>Finally the part that we have all been waiting for: cool looking badges!</p>
<p>Readthedocs, Travis CI and Codecov all provide badges as part of their service. For the PyPI
package we will make use of <a href="http://badge.fury.io">Version Badge</a>.</p>
<p>Below is part of the reStructuredText markup I use for my <code class="language-plaintext highlighter-rouge">tinyfasta</code> package.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.. image:: https://badge.fury.io/py/tinyfasta.svg
    :target: http://badge.fury.io/py/tinyfasta
    :alt: PyPI package
.. image:: https://travis-ci.org/tjelvar-olsson/tinyfasta.svg?branch=master
    :target: https://travis-ci.org/tjelvar-olsson/tinyfasta
    :alt: Travis CI build status (Linux)
.. image:: https://codecov.io/github/tjelvar-olsson/tinyfasta/coverage.svg?branch=master
    :target: https://codecov.io/github/tjelvar-olsson/tinyfasta?branch=master
    :alt: Code Coverage
.. image:: https://readthedocs.org/projects/tinyfasta/badge/?version=latest
    :target: https://readthedocs.org/projects/tinyfasta/?badge=latest
    :alt: Documentation Status
</code></pre></div></div>
<p>The images in the <a href="https://github.com/tjelvar-olsson/tinyfasta/blob/master/README.rst">README.rst</a>
file get rendered by GitHub into a neat looking header with the badges below.</p>
<p><img src="http://badge.fury.io/py/tinyfasta.svg" alt="PyPI package" />
<img src="https://travis-ci.org/tjelvar-olsson/tinyfasta.svg?branch=master" alt="Travis CI build status (Linux)" />
<img src="https://codecov.io/github/tjelvar-olsson/tinyfasta/coverage.svg?branch=master" alt="Code Coverage" />
<img src="https://readthedocs.org/projects/tinyfasta/badge/?version=latest" alt="Documentation Status" /></p>
<h2 id="conclusion">Conclusion</h2>
<p>You should now have a Python package that looks loved and cared for.</p>
<ul>
<li>It is easy to install using <code class="language-plaintext highlighter-rouge">pip</code></li>
<li>It has online documentation</li>
<li>It gets tested every time code is pushed to GitHub</li>
<li>It has its code coverage measured</li>
</ul>
Day 12: Multi-level modelling in morphogenesis2015-07-24T00:00:00+00:00http://tjelvarolsson.com/blog/day12-multi-level-modelling-in-morphogenesis<p>The twelfth day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course was started by
<a href="http://pages.igc.gulbenkian.pt/SCF/christen/Eco-Evo-Devo.html">Dr Christen Mirth</a>
outlining the methods and theory of
<a href="https://en.wikipedia.org/wiki/Evolutionary_developmental_biology">evolutionary-developmental biology</a>
(evo-devo).</p>
<p>In nature (and the lab) one can observe
<a href="https://en.wikipedia.org/wiki/Polymorphism_(biology)">polymorphism</a>. Sources
of these polymorphic variations include both genetic and environmental factors,
meaning that the phenotype arises from interactions between the
genotype and the environment. This results in
<a href="https://en.wikipedia.org/wiki/Phenotypic_plasticity">phenotypic plasticity</a>,
the ability of an organism to react to an environmental input with a change
in form, state, or behaviour.</p>
<p>Many different organisms show striking examples of phenotypic plasticity.
Famous examples include:</p>
<ul>
<li><em>Daphnia cucullata</em></li>
<li><em>Nemoria arizonaria</em></li>
<li><em>Precis coenia</em></li>
</ul>
<p>Furthermore, there are many different inducers of phenotypic plasticity:
predators, food, and temperature together with day length, respectively, for the examples
listed above.</p>
<p>Looking at different phenotypes one can observe that they can have differing
degrees of plasticity.</p>
<p>In order to go from an environmental cue to a difference in phenotype
several mechanistic steps are required. The cue must be sensed by the
organism. Some sort of signal must then be sent to the relevant tissue.
The tissue must then interpret the signal and respond accordingly.</p>
<p>Dr Mirth then used the example of horn development in male dung beetles
as a case study to illustrate how these processes may occur. Male
dung beetles can either develop horns or become hornless. This
dimorphism depends on larval nutrition provided by the mother.</p>
<p>One of the central problems for this particular example is that the
horned and hornless dung beetles grow to the same size given the same
amount of nutrition. So how does the beetle regulate the size of the horn independently
of the body size?</p>
<p>This is where the degrees of plasticity come into play: the
initial levels of nutrients during development affect the plasticity
of the relative growth of the horn.</p>
<p>In the last section of her talk Dr Mirth discussed the patterning
of Drosophila wings, focussing in particular on the question of
pattern coordination. From an impressive set of experiments looking
at the expression levels of different hormones during development
Dr Mirth managed to show, by perturbing the system, that the organ
patterning is coordinated by a set of specific milestones.</p>
<p>The participants were then invited to experiment with computational
models linking network evolutions to cellular behaviour.</p>
<p>After lunch Dr Mirth gave a keynote lecture expanding on the concept outlined
in the morning describing phenotypic plasticity and the evolution of
polyphenisms.</p>
<p>During her talk Dr Mirth described how nutrition can affect two different
aspects of plasticity in Drosophila:</p>
<ul>
<li>Body size</li>
<li>Ovarian size</li>
</ul>
<p>The latter example was analogous to the male dung beetle horn development
where the size of the tissue is reprogrammed in a fashion that is independent of
the whole body size.</p>
<p>Through a set of beautiful experiments Dr Mirth was able to show that
<strong>different processes during different stages can be used to reprogram tissue growth</strong>.</p>
<p>After the keynote the course was wrapped up by the participants being handed their
certificates and everyone breathing a sigh of relief before saying goodbye to their
new found friends.</p>
<p><em>Stay in touch!</em></p>
Day 11: Multi-level modelling in morphogenesis2015-07-23T00:00:00+00:00http://tjelvarolsson.com/blog/day11-multi-level-modelling-in-morphogenesis<p>The eleventh day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course was started by
<a href="http://lander-office.bio.uci.edu/landerfacts.html">Dr Arthur D. Lander</a>
giving a pedagogical lecture on morphogen gradients from an engineering
viewpoint.</p>
<p>Dr Lander started off by showing a painting by Hanabusa Itcho representing the
allegory of
<a href="https://en.wikipedia.org/wiki/Blind_men_and_an_elephant">blind monks examining an elephant</a>,
in which each monk, holding onto a different part of the elephant, is convinced that
his “view” of the elephant is the truth.</p>
<p>Many people working on biological problems approach them using a physics mindset.
In physics the goal is to understand the phenomena. Organisation arises by emergence.
Understanding therefore means grasping how causal laws produce complex phenomena.
In essence the question under investigation is: <em>how does it work?</em></p>
<p>An alternative approach is that used in engineering. In engineering the goal is to
understand performance. Organisation arises by selection for performance.
Understanding therefore means grasping how complex systems achieve specific goals.
In essence the question under investigation is: <em>why is it built that way?</em></p>
<p>These different approaches have led to contrasting views of how biological complexity
has arisen. The physics mindset can lead to thinking that complexity has arisen
by chance as “frozen accidents”, whereas complexity using the engineering mindset appears
to be something that has arisen out of necessity. These two contrasting views are
very similar to the blind monks examining the elephant.</p>
<p>Dr Lander then turned his attention to the vein patterning of Drosophila wings,
which is controlled by two morphogen gradients.
The “textbook” representation of the gene regulatory network involved in this
wing patterning is deceptively simple. But in fact when one looks at all the
data available for the network it balloons into a complex “hairball” diagram.</p>
<p>Creating a morphogen gradient is not inherently difficult. All you need is:</p>
<ul>
<li>Localised production</li>
<li>Random diffusion</li>
<li>Any sort of decay process</li>
</ul>
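<p>The three ingredients above can be sketched in a few lines of code. The
simulation below is an illustration only, not a model presented in the lecture:
it assumes a one-dimensional row of cells, production in the first cell,
first-order decay, and arbitrary parameter values. All names are hypothetical.</p>

```python
import math


def simulate_gradient(n_cells=80, dx=0.5, dt=0.1, steps=2000,
                      diffusion=1.0, decay=0.1, production=1.0):
    """Integrate dC/dt = D * d2C/dx2 - k * C with a source in cell 0."""
    c = [0.0] * n_cells
    for _ in range(steps):
        new = c[:]
        for i in range(n_cells):
            # Reflecting (no-flux) boundaries at both ends of the tissue.
            left = c[i - 1] if i > 0 else c[i]
            right = c[i + 1] if i < n_cells - 1 else c[i]
            laplacian = (left - 2.0 * c[i] + right) / dx ** 2
            source = production if i == 0 else 0.0  # localised production
            new[i] = c[i] + dt * (diffusion * laplacian - decay * c[i] + source)
        c = new
    return c


def decay_length(profile, dx, a=4, b=20):
    """Estimate the exponential decay length from two points of the profile."""
    return (b - a) * dx / math.log(profile[a] / profile[b])


if __name__ == "__main__":
    profile = simulate_gradient()
    print("estimated decay length:", decay_length(profile, dx=0.5))
    print("sqrt(D / k):           ", math.sqrt(1.0 / 0.1))
```

<p>Under these assumptions the steady-state profile is close to an exponential
whose decay length is approximately the square root of the diffusion coefficient
divided by the decay rate, so the two printed numbers should be close to each other.</p>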
<p>So in this instance the question is not <em>how</em>, but <em>why</em>. Why is the gene
regulatory network for patterning the Drosophila wing so complex if it is easy
to create a morphogen gradient?</p>
<p>To answer this one needs to consider what the performance objectives might be.
Reasonable biological performance objectives include:</p>
<ul>
<li>Stability</li>
<li>Efficiency (especially important for bacteria)</li>
<li>Timing</li>
<li>Evolvability</li>
<li>Robustness</li>
</ul>
<p>The reliability of morphogenesis suggests it has high robustness.
Examples include the formation of complex tissues such as the heart and the
striking similarities of monozygotic twins.</p>
<p>Robustness can be defined as being able to achieve a desired goal in face of
perturbations.</p>
<p>So what might the goals be? Some reasonable goals include:</p>
<ul>
<li>Constancy (homeostasis)</li>
<li>Accurately reaching an endpoint</li>
<li>Adaptation</li>
</ul>
<p>Perturbations include:</p>
<ul>
<li>Altered system parameters (e.g. temperature)</li>
<li>Altered initial conditions</li>
<li>Noise (extrinsic or intrinsic)</li>
</ul>
<p>Dr Lander then went into some detail on how one can use the engineering concept
of sensitivity coefficients to quantify robustness in a unitless way. This looks
at what the fold change in the output is in response to any fold change of the input.
In engineering, a process is considered reasonably robust if the sensitivity coefficient is
lower than 0.3.</p>
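<p>A sensitivity coefficient of this kind can be estimated numerically as the
slope of log(output) against log(input). The sketch below is an illustration
under stated assumptions rather than the analysis from the lecture: it assumes an
exponential gradient whose amplitude is proportional to the production rate, and
asks how the position of a concentration threshold shifts when production
changes. Function names and parameter values are hypothetical.</p>

```python
import math


def sensitivity(output, value, rel_step=1e-4):
    """Estimate d ln(output) / d ln(input) at `value` by central differences.

    Both the input value and the outputs must be positive.
    """
    up = output(value * (1.0 + rel_step))
    down = output(value * (1.0 - rel_step))
    d_log_input = math.log(1.0 + rel_step) - math.log(1.0 - rel_step)
    return (math.log(up) - math.log(down)) / d_log_input


def threshold_position(production, length=3.0, threshold=0.01):
    """Position at which production * exp(-x / length) equals the threshold.

    Assumes the gradient amplitude is proportional to the production rate.
    """
    return length * math.log(production / threshold)


if __name__ == "__main__":
    s = sensitivity(threshold_position, 1.0)
    print("sensitivity of threshold position to production:", s)
    print("below the 0.3 rule of thumb:", s < 0.3)
```

<p>For this toy model the coefficient equals 1 / ln(production / threshold), so a
threshold read far down the gradient is less sensitive to production than one
read near the source.</p>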
<p>Dr Lander then illustrated how one could analytically evaluate the robustness of
different models for generating morphogen gradients using sensitivity coefficients.
This concept was expanded on, after some coffee, when the participants of the
course were invited to experiment with the concept analytically using
<a href="http://www.wolfram.com/mathematica/">Mathematica</a>.</p>
<p>After lunch
<a href="http://nick-monk.staff.shef.ac.uk">Dr Nick Monk</a>
gave a talk outlining some of the concepts from his paper
<a href="http://onlinelibrary.wiley.com/doi/10.1002/jez.b.22468/abstract">The Inheritance of Process: A Dynamical Systems Approach</a>.</p>
<p>As scientists we want to understand evolution. So far most of our understanding
has been based on the assumption that the map relating
<a href="https://en.wikipedia.org/wiki/Genotype">genotype</a> to
<a href="https://en.wikipedia.org/wiki/Phenotype">phenotype</a>
is a one-to-one
(<a href="https://en.wikipedia.org/wiki/Bijection">bijective</a>) mapping.</p>
<p>However, the “traditional” view of the genotype-phenotype map as a one-to-one
relationship is probably too simplistic. Take for example the case of
<a href="https://en.wikipedia.org/wiki/Polyphenism">polyphenism</a>, an
ecologically important trait that helps organisms adapt to variable
environment. Polyphenism is widespread and occurs in many plants and animals.</p>
<p>This means that one needs to take the environment into account when thinking
about the genotype-phenotype map: a genotype actually encodes a set of
potentialities. Which potentiality is realised is influenced by the environment
and can be stochastic. It is not a simple bijective mapping.</p>
<p>Dr Monk continued to illustrate how the traditional view of the
genotype-phenotype map falls short in terms of helping us understand
evolution as a process before presenting a new formalism
for genotype-phenotype maps by representing them as dynamical systems.</p>
<p>In this formalism a genotype is specified by:</p>
<ol>
<li>A network topology</li>
<li>A set of interaction parameters</li>
</ol>
<p>These can be written down using a network model (e.g. a Boolean network or a set of ODEs).
The genotype-phenotype map can then be represented using
<a href="https://en.wikipedia.org/wiki/Phase_space">phase space</a>, which has a number of
descriptors that can be useful for understanding the mapping.</p>
<ol>
<li>Attractors (fixed points, cycles, chaotic)</li>
<li>Basins of attraction</li>
<li>Separatrices</li>
<li>Trajectories</li>
<li>Initial conditions</li>
</ol>
<p>Thinking of a genotype as a network means that one has the possibility to trigger
a number of different phenotypes. In other words a single genotype (network) can
have multiple attractors (end points) and phenotypes correspond to trajectories
(not just to attractors). The latter is useful as phenotypes are plastic.</p>
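<p>The idea that a single network can have multiple attractors can be illustrated
with a small two-gene mutual-repression network. This is a generic textbook-style
sketch, not a model taken from Dr Monk’s paper; the equations, parameter values
and function name are assumptions chosen for illustration.</p>

```python
def toggle_switch(u, v, alpha=4.0, hill=2.0, dt=0.01, steps=5000):
    """Integrate two mutually repressing genes with forward Euler.

    du/dt = alpha / (1 + v**hill) - u
    dv/dt = alpha / (1 + u**hill) - v
    """
    for _ in range(steps):
        du = alpha / (1.0 + v ** hill) - u
        dv = alpha / (1.0 + u ** hill) - v
        u, v = u + dt * du, v + dt * dv
    return u, v


if __name__ == "__main__":
    # Same network (same "genotype"), two different initial conditions.
    print("start with u ahead:", toggle_switch(2.0, 1.0))
    print("start with v ahead:", toggle_switch(1.0, 2.0))
```

<p>With these parameter values the two runs settle into different end points, one
with <code class="language-plaintext highlighter-rouge">u</code> high and <code class="language-plaintext highlighter-rouge">v</code> low and one with the roles reversed. The line <code class="language-plaintext highlighter-rouge">u = v</code> acts
as the separatrix between the two basins of attraction, and the path taken from
each starting point is the trajectory.</p>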
<p>Dr Monk finished by stating that rather than focussing on how evolution can
change the genotype-phenotype map, perhaps we should be thinking about how evolution
can change phase space. Perhaps we should think of evolution as an inheritance
of process where discrete changes in genome sequence and allele frequency
result in a change to the underlying continuous dynamics.</p>
<p>Dr Monk’s talk was followed by a keynote talk by Dr Lander which expanded on some
of the concepts outlined in the morning’s lecture.</p>
<p>Biology is driven by performance objectives. The mechanisms that exist to achieve
these kinds of goals are collectively referred to as <strong>control</strong>.</p>
<p>What are the problems that you get into when you want to achieve “control”?</p>
<p>Basically control makes things more complex. In fact there is a very strong
relationship between the two and in engineering this is referred to as the
“no-free-lunch principle”. In other words, achieving good performance in one
arena often comes at the expense of good performance in another.</p>
<p>In fact curious things happen when you try to achieve control over a great many
things at the same time. The possible solutions start diverging and scatter
exponentially (for more detail see
<a href="http://journals.aps.org/pre/abstract/10.1103/PhysRevE.76.021122">Landscape analysis of constraint satisfaction problems; Krzakala and Kurchan; 2007</a>).</p>
<p>Dr Lander then illustrated the interplay between control, complexity and
tradeoffs using the example of Drosophila wing patterning. Using modelling, Dr
Lander showed many examples of how introducing a process for controlling a
particular aspect of the morphogen gradient also resulted in loss of control
for a different aspect.</p>
<p>A fundamental issue, in terms of Drosophila wing patterning, may be that there
is not enough information in a single morphogen gradient. Dr Lander then
illustrated how more control can be realised by using two morphogen gradients,
particularly in conjunction with the toggle-switch architecture.</p>
<p>Dr Lander concluded by stating that if we wish to understand not just what
happens in biology, but why biological systems are built the way they are, we
need to interpret biological organisation in light of principles of control,
and the constraints imposed by selection for control. Trade-offs (the
no-free-lunch principle) are likely to drive the evolution of complexity. By
focussing on performance, trade-offs and control, we can find <em>potential</em>
explanations for at least some of the intricate feedback and feed-forward
interactions that we observe in patterning systems.</p>
Day 10: Multi-level modelling in morphogenesis2015-07-22T00:00:00+00:00http://tjelvarolsson.com/blog/day10-multi-level-modelling-in-morphogenesis<p>The tenth day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course was started by
<a href="https://www.jic.ac.uk/directory/enrico-coen/">Professor Enrico Coen</a>
giving a pedagogical lecture on modelling growth and deformation.</p>
<p>Most of us are used to thinking about geometrical deformations. However, the
mathematical principles behind geometrical deformations are quite different
from the principles behind biological deformation. For example, transforming a
square into a trapezoid, as if viewing the box from a perspective, is
easy using a geometrical deformation/transformation. However, achieving this
type of deformation in a biological system using growth is non-trivial. One
of the complicating factors is that the tissue in a biological system is
an interconnected material.</p>
<p>What is growth anyway? Does it depend on the number of cells? Is it something
continuous or something discrete? Is the process of growth absolute (accretion)
or relative to the size of the tissue? Much time was spent discussing the
implications of these questions with regards to coming up with a definition of
growth.</p>
<p>For the purposes of his talk Professor Coen defined growth as a tissue getting bigger,
independently of the number of cells. And on the flip-side he defined a tissue
reducing its size as “shrinkage”.</p>
<p>Professor Coen then expanded on the concept of growth rates, which he defined
to be relative and continuous in the context of the growth of a tissue. He then
explained how the growth rates can be inferred from velocities. If a velocity
is changing one is observing either growth or shrinkage. Similar to Dr Kabla,
the previous day, Professor Coen took the derivative of the velocity field to
get a growth tensor. The growth tensor can be represented as:</p>
<ul>
<li>Growth rate</li>
<li>Anisotropy</li>
<li>Direction</li>
<li>Rotation</li>
</ul>
<p>The growth tensor can be estimated from microscopy time-lapse movies, where
one can track cell vertices over time.</p>
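<p>As a rough sketch of how the four components can be read off a velocity
gradient in two dimensions: the symmetric part of the gradient gives the growth
rates along the principal axes, and the antisymmetric part gives the rotation
rate. The function name, the argument names and the particular definitions of
anisotropy and direction below are assumptions made for illustration, not
Professor Coen’s exact formulation.</p>

```python
import math


def decompose_growth_tensor(dvx_dx, dvx_dy, dvy_dx, dvy_dy):
    """Split a 2D velocity gradient into growth rate, anisotropy,
    direction and rotation (illustrative definitions)."""
    # Symmetric part of the gradient: the strain (growth) rate.
    sxx = dvx_dx
    syy = dvy_dy
    sxy = 0.5 * (dvx_dy + dvy_dx)
    # Antisymmetric part of the gradient: the rotation rate.
    rotation = 0.5 * (dvy_dx - dvx_dy)
    # Eigenvalues of the symmetric 2x2 matrix are mean +/- radius.
    mean = 0.5 * (sxx + syy)
    radius = math.hypot(0.5 * (sxx - syy), sxy)
    k_max = mean + radius
    k_min = mean - radius
    # Angle of the axis of maximal growth, measured from the x axis.
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return {
        "areal_growth_rate": k_max + k_min,
        "anisotropy": k_max - k_min,
        "direction_degrees": math.degrees(angle),
        "rotation_rate": rotation,
    }


# A tissue growing three times faster along x than along y, without rotating.
stretch = decompose_growth_tensor(0.03, 0.0, 0.0, 0.01)
# A rigid rotation: no growth at all, only a rotation rate.
spin = decompose_growth_tensor(0.0, -0.1, 0.1, 0.0)
print(stretch)
print(spin)
```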
<p>Furthermore, the growth tensor concept can be used to model tissue growth.
By specifying the growth rate, anisotropy and direction one can get deformations,
conflicts and rotations as emergent features.</p>
<p>The participants were then invited to experiment with these concepts by modelling
tissue growth in three dimensions.</p>
Day 9: Multi-level modelling in morphogenesis2015-07-21T00:00:00+00:00http://tjelvarolsson.com/blog/day9-multi-level-modelling-in-morphogenesis<p>The ninth day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course was started by
<a href="https://www.jic.ac.uk/directory/enrico-coen/">Professor Enrico Coen</a>
who continued on the theme of polarity.</p>
<p>Professor Coen did not set out to study polarity. He was interested in how
tissues grew. However, after some time he came to the conclusion that to
understand tissue growth he would have to understand polarity.</p>
<p>One of the main players of polarity in plants is the hormone auxin.
In fact some of the main markers for polarity in plants are the PIN proteins,
which actively transport auxin across the plasma membrane. By coordinating
the location of PIN proteins to one side of all cells a tissue
can create a polarity field. Furthermore, auxin gradients have been shown
to regulate tissue polarity.</p>
<p>At the time there were two main models describing auxin regulated tissue cell
polarity:</p>
<ul>
<li>Cell-cell comparison, a.k.a. “up-the-gradient” hypothesis</li>
<li>Flux or gradients at the interface, a.k.a. “with-the-flow” hypothesis</li>
</ul>
<p>The former could explain PIN locations and the latter veins in leaves.
However, both models also had issues. How could a cell, in a cell-cell comparison
scenario, know the concentration of its neighbors? On the flip-side, in the
“with-the-flow” hypothesis, it was unclear how the cell would be able to
measure the flux.</p>
<p>Professor Coen then turned the attention to animal models that
had been proposed to coordinate the hair orientation in Drosophila wings.
Again there were two major classes of models:</p>
<ul>
<li>Cell-cell comparison models</li>
<li>Interface models, i.e. receptors interacting at interfaces between cells</li>
</ul>
<p>These were very similar to the plant models. The main difference being that
plant cells could not interact directly as they are separated by a cell wall.</p>
<p>However, all the plant and animal models had an assumption in common. Namely
that cells are unpolarised in the absence of an asymmetric ligand distribution
or polarised neighbours.</p>
<p>But we know of at least some systems where this assumption is not true.
For example budding yeast and migrating neutrophils.</p>
<p>Professor Coen then showed that by assuming that cells have an intrinsic polarity one
can create a model where the polarised cells arrange themselves through local
interactions with their neighbours. For example, in an animal model, one can
imagine a scenario where the cells’ “front” and “back” factors directly bind
with their neighbour’s “front” and “back” factors. This leads to coordination.
However, the emerging pattern is a bunch of spirals. He then showed that very
strict, orientated tissue polarity can be established if organizers are located
somewhere on the tissue, or the polarities interact with some kind of
concentration gradient.</p>
<p>One way to use this to model how a plant organises polarity is to assume high
auxin efflux at one end of the tissue and no export of auxin at the other end.
By reading out the same concentrations between the cells, both cells tend to
align. Thus, using this indirect-signalling mode, one gets a similar result to
the cell-cell comparison models. This is a bit counterintuitive. The
patterning is working to remove the signal that is causing the pattern.</p>
<p>However, one can use the same model to create a different emergent tissue polarity.
A model with high auxin production at one end of the tissue and low
auxin degradation at the other end. In this case one ends up with
results similar to those from the “with-the-flow” hypothesis.</p>
<p>So in essence we have a model that produces two different behaviours,
previously thought to be two different processes altogether, and consolidates
them in a parsimonious and locally-based manner.</p>
<p>Professor Coen’s talk was followed by a presentation by
<a href="http://kalab.emma.cam.ac.uk/index.php">Dr Alexandre J. Kabla</a>
on the mechanobiology of cell migration and cell rearrangements.</p>
<p>Dr Kabla started off by illustrating that there is a massive amount
of motion during development. In fact most shapes are created by
cell migration.</p>
<p>This motion arises from several different processes:</p>
<ul>
<li>Sheet bending/folding</li>
<li>Convergence extension</li>
<li>Collective migration</li>
</ul>
<p>Dr Kabla then described a methodology for understanding some of these
processes, in particular the latter two.</p>
<p>Using microscopy one can record time-lapse movies of developing embryos.
These images can then be segmented into individual cells, which can be
tracked over time. By looking at the motions of individual cells one can
calculate velocities. All of the velocities can then be used to
create a velocity field. By differentiating the velocity field one
obtains a deformation field, which is a useful representation for
trying to understand tissue formation by motion during development.
The deformation field can, in fact, be used to identify separate
tissues from a blob of cells.</p>
<p>Dr Kabla then went on to describe how one could analyse cell intercalation
(convergence extension) in more detail using the deformation field
representation.</p>
<p>The talk was followed by lunch, which was followed by another talk by Dr Kabla
describing how modelling can be used to study collective migration. After
the talk the participants of the course were invited to try out some of
these analyses using data simulated with the cellular Potts model programs used
earlier in the course.</p>
Day 8: Multi-level modelling in morphogenesis2015-07-20T00:00:00+00:00http://tjelvarolsson.com/blog/day8-multi-level-modelling-in-morphogenesis<p>The eighth day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course was started by
<a href="http://www.crg.eu/en/johannes_jaeger">Dr Yogi Jaeger</a>
giving an introductory lecture to
parameter estimation using reverse-engineering.
The talk was illustrated using a case study of
<a href="https://en.wikipedia.org/wiki/Segmentation_(biology)">segmentation</a>
during
<a href="https://en.wikipedia.org/wiki/Drosophila_embryogenesis">fly embryogenesis</a>.</p>
<p>Along the way Dr Jaeger highlighted many of the pitfalls that one needs to be
aware of when modelling biological systems. Particular emphasis was put on the
model being a tool, not reality! An outcome of this is that one needs to
pay attention to the model and understand its limitations.</p>
<p>In its most simple form reverse-engineering can be split into three stages:</p>
<ol>
<li>Creating a dynamical model</li>
<li>Obtaining quantitative measurements of data</li>
<li>Fitting the model to the data</li>
</ol>
<p>When fitting the model to the data there are two main questions to consider.</p>
<ol>
<li>How do you measure the similarity between the data and the model?</li>
<li>Which algorithm are you going to use to fit the data?</li>
</ol>
<p>One of the simplest ways of measuring the similarity between the model and the
data is to calculate the root mean square residual. However, other
measures are available and the selection of one over another is context
dependent. It is therefore something that one needs to pay attention to.</p>
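<p>For reference, the root mean square residual can be sketched in a few lines.
The function name and the example numbers are made up for illustration.</p>

```python
import math


def root_mean_square_residual(observed, predicted):
    """Square root of the mean squared difference between data and model."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical measurements and the corresponding model output.
observed = [1.0, 2.0, 4.0]
predicted = [1.0, 2.0, 2.0]
score = root_mean_square_residual(observed, predicted)
print(score)
```

<p>A score of zero means the model reproduces the data exactly; larger values
mean a worse fit, in the same units as the measurements.</p>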
<p>Fitting the model to the data, i.e. estimating the parameters, is a global
optimisation problem and there are a number of algorithms available to tackle
it. Traditionally people have been using evolutionary strategy and simulated
annealing algorithms for these types of problems. Evolutionary strategy algorithms are
relatively quick. However, when using them one suffers from not knowing whether
or not the solution identified is the real global minimum. Simulated annealing
algorithms can be more robust, but they are also slower.</p>
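<p>To give a flavour of what such a fit looks like, below is a minimal simulated
annealing sketch that estimates a single decay rate by minimising the root mean
square residual. The exponential model, the linear cooling schedule, the step
size and all names are assumptions chosen to keep the example short; real
reverse-engineering problems have many parameters and need far more care.</p>

```python
import math
import random


def model(decay_rate, times):
    """Hypothetical one-parameter model: exponential decay."""
    return [math.exp(-decay_rate * t) for t in times]


def cost(decay_rate, times, data):
    """Root mean square residual between the data and the model."""
    predicted = model(decay_rate, times)
    squared = [(d - p) ** 2 for d, p in zip(data, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def simulated_annealing(times, data, start=2.0, steps=5000, seed=0):
    rng = random.Random(seed)  # seeded so that runs are reproducible
    current = start
    current_cost = cost(current, times, data)
    best, best_cost = current, current_cost
    for i in range(steps):
        # Linear cooling: accept fewer uphill moves as the run progresses.
        temperature = 0.1 * (1.0 - i / steps) + 1e-9
        candidate = current + rng.gauss(0.0, 0.1)
        candidate_cost = cost(candidate, times, data)
        delta = candidate_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = current, current_cost
    return best, best_cost


times = [0.0, 0.5, 1.0, 1.5, 2.0]
data = model(0.7, times)  # synthetic, noise-free data with a known answer
estimate, residual = simulated_annealing(times, data)
print(estimate, residual)
```

<p>Accepting some uphill moves early on is what lets the search escape local
minima, at the price of needing many more evaluations of the model than a
purely greedy search.</p>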
<p>Dr Jaeger then mentioned that his lab has had great success with the
<a href="http://www.cleveralgorithms.com/nature-inspired/stochastic/scatter_search.html">scatter search algorithm</a>.
In his hands it can be up to ten times quicker than simulated annealing.</p>
<p>Once one has found a solution one needs to ask whether or not
it is appropriate. This can be achieved by parameter identifiability analysis through bootstrapping, i.e.
fitting the model to noisy data. The results can be projected
onto the parameter landscape as ellipsoid confidence regions.
However, this can be slow. A quicker way to estimate these confidence
regions is to calculate the Hessian matrix of the system using linear
approximation.</p>
<p>After lunch
<a href="https://www.jic.ac.uk/directory/veronica-grieneisen/">Dr Veronica Grieneisen</a>
gave a talk about cell polarity and how one can understand it through breaks of
symmetry.</p>
<p>If one considers a morphogen gradient, how can it be “read” by cells? Further,
how can this lead to coordinated cell orientations? Any solution will require
some process of comparison.</p>
<p>The talk then took a slight detour into physics.</p>
<p>How can you make a compass? You can take a needle and magnetize it with an
external field. If you have many needles they will all align in the field.
Importantly each subunit (needle) will have a “north-south” polarity in the
magnetic field.</p>
<p>Without going too far with the analogy Dr Grieneisen noted that by giving a
cell the concept of polarity it is given a mechanism for aligning within
a larger polarity such as a chemical gradient or a tissue polarity.</p>
<p>Dr Grieneisen then presented work using the cellular Potts model illustrating
how small G-proteins, which can act as molecular switches, can give rise to
cell polarity. However, the modelling found that there was an additional
requirement. The inactive form had to be able to diffuse on a quicker
time-scale than the active form. In this particular case this was achieved by
the active form being constitutively membrane bound, whereas the inactive form
was able to diffuse freely in the cytosol.</p>
<p>Dr Grieneisen’s pedagogical lecture was followed by a keynote lecture by Dr
Jaeger giving a more detailed exposition of his top-down approach to extracting
structures of networks from gene expression data and his analysis of the model
by looking at phase space and the attractors within it.</p>
<p>By using this reverse-engineering approach Dr Jaeger managed to establish
that the
<a href="http://rsif.royalsocietypublishing.org/content/10/79/20120826">AC/DC circuit</a>
is a recurring motif in the
<a href="https://en.wikipedia.org/wiki/Gap_gene">gap gene</a> network. The AC/DC circuit
is interesting in that it can act both as a positive and a negative feedback
loop. By analysing the phase space and attractors in the AC/DC circuit one
finds that this simple network can give rise to different functions.
Specifically it can act as a:</p>
<ul>
<li>switch</li>
<li>oscillator</li>
<li>damped oscillator</li>
</ul>
<p>Dr Jaeger then showed that the AC/DC circuit in the gap system could be used to create:</p>
<ul>
<li>Stable boundaries in the anterior of the fly embryo, set by attractors</li>
<li>Moving boundaries in the posterior of the fly embryo, governed by a damped oscillator</li>
</ul>
<p>The participants of the course were then invited to an open-panel session to discuss
“what models are for”. This was followed by more hands-on computational exercises
simulating cell polarity in animal and plant cells.</p>
Day 5: Multi-level modelling in morphogenesis2015-07-17T00:00:00+00:00http://tjelvarolsson.com/blog/day5-multi-level-modelling-in-morphogenesis<p>The fifth day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course was started by
<a href="http://www.cpsc.ucalgary.ca/~pwp">Professor Przemyslaw Prusinkiewicz</a>
giving an introductory lecture to
<a href="https://en.wikipedia.org/wiki/L-system">L-systems</a>.</p>
<p>However, before going into L-systems Professor Prusinkiewicz introduced the
topic of computational modelling more generally. In particular he highlighted
<a href="https://en.wikipedia.org/wiki/J._C._R._Licklider">J. C. R. Licklider</a>, who amongst
other things was one of the founders of the internet. In his essay
<a href="http://groups.csail.mit.edu/medg/people/psz/Licklider.html">Man-Computer Symbiosis</a>
Licklider had the vision that man will start interacting with computers
in the same way as they would interact with a colleague.</p>
<p>Professor Prusinkiewicz also highlighted the work of
<a href="https://en.wikipedia.org/wiki/Alan_Kay">Alan Kay</a> who had the vision of
<a href="http://www.mprove.de/diplom/gui/Kay72a.pdf">A Personal Computer for Children of All Ages</a> in 1972.</p>
<p>At the time the computational resources were not available for either of these
visions. However, now they are! Which means that this is a great time to be a
modeller. <em>You can now treat your laptop as a colleague whose skills
supplement your own.</em></p>
<p>Professor Prusinkiewicz then went on to discuss some of the issues of modelling
in developmental biology. There are two main issues. First of all development is
a spatio-temporal process. Secondly, a developing organism is a dynamical
<strong>system</strong> with a dynamical <strong>structure</strong>. For example, one could look at a plant
as a system of components that are all developing over time.</p>
<p>These issues were dealt with by
<a href="https://en.wikipedia.org/wiki/Aristid_Lindenmayer">Aristid Lindenmayer</a> in
<a href="http://www.sciencedirect.com/science/article/pii/0022519368900799">Mathematical models for cellular interactions in development</a>.
The formalisms developed by Lindenmayer are now often referred to as
L-systems or Lindenmayer systems.</p>
<p>An L-system basically consists of an alphabet, a set of productions (rules for
converting letters of the alphabet into new strings of letters from the
alphabet), and an axiom (the starting string). Using a recursive algorithm a
string, starting from the axiom, is continually transformed by the production
rules.</p>
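<p>The rewriting step is simple enough to sketch in a few lines. The example
below uses the well known “algae” L-system (A becomes AB, B becomes A); the
function name is my own, and a loop is used instead of recursion for
simplicity.</p>

```python
def rewrite(axiom, productions, iterations):
    """Apply the production rules to every symbol, `iterations` times."""
    string = axiom
    for _ in range(iterations):
        # Symbols without a production are copied unchanged.
        string = "".join(productions.get(symbol, symbol) for symbol in string)
    return string


# Alphabet: A and B. Productions: A -> AB, B -> A. Axiom: A.
algae = {"A": "AB", "B": "A"}
generations = [rewrite("A", algae, n) for n in range(5)]
for n, string in enumerate(generations):
    print(n, string)
```

<p>Note that all symbols are rewritten in parallel in each iteration, which is
what distinguishes an L-system from a sequential string rewriting grammar.</p>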
<p>Using
<a href="https://en.wikipedia.org/wiki/Turtle_Geometry">turtle geometry</a> L-systems can
be used to create fractal structures using very simple rules.</p>
<p>Professor Prusinkiewicz then went on to describe how the L-system can be used
to model the development of plants by having an alphabet of basic plant modules
such as stems, branches, flowers, etc.</p>
<p>This was followed by a practical session where people got to experiment with
L-system modelling using the
<a href="http://algorithmicbotany.org/virtual_laboratory/">Vlab</a> software.</p>
<p>The practical session was followed by a pedagogical lecture on phyllotaxis
by
<a href="http://www.msc.univ-paris-diderot.fr/spip.php?rubrique140&lang=en">Dr Yves Couder</a>.</p>
<p><a href="https://en.wikipedia.org/wiki/Phyllotaxis">Phyllotaxis</a> is the arrangement of
leaves on a plant stem. There are a very small number of archetypes of organisations.
Leaves can be organised into spiral nodes or whorled modes.</p>
<p>Spiral patterns are also important in parastichy, which can be observed for example
in pine cones and the organisation of sunflower seeds.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/a/aa/Pflanze-Sonnenblume1-Asio.JPG" alt="Parastichy pattern" /></p>
<p>Interestingly the spiral node pattern of leaves can be related to parastichy by
compressing the stem.</p>
<p>Previously many people thought that these types of complex patterns could only
be the result of complex biological processes. Particularly as these patterns
were only observed in botany.</p>
<p>However, Dr Couder illustrated how he could reproduce these patterns using a
physical system consisting of a ferrous material dropped into oil on a petri
dish. The ferrous drops were pushed towards the edge of the petri dish by a
magnetic field. As more ferrous drops were added the parastichy pattern emerged
by virtue of the repulsive dipole interactions with the nearest neighbor drops
that had already been deposited in the oil.</p>
<p>Dr Couder then went into a more mathematical
description of these patterns and how they relate to
<a href="https://en.wikipedia.org/wiki/Fibonacci_number">Fibonacci numbers</a>
and the
<a href="https://en.wikipedia.org/wiki/Golden_ratio">golden ratio</a>.
For more details on this fascinating work relating botany to mathematics see
the excellent Science News article
<a href="https://www.sciencenews.org/article/mathematical-lives-plants">The Mathematical Lives of Plants</a>.</p>
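<p>The link between Fibonacci numbers, the golden ratio and spiral patterns can
be illustrated with a short sketch. The placement rule used here (element n at
an angle of n times the golden angle and a radius proportional to the square
root of n) is Vogel’s simple geometric construction for a sunflower head. It is
swapped in for illustration and is not the physical model from Dr Couder’s
experiment.</p>

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2
# The golden angle divides the circle in the golden ratio (about 137.5 degrees).
GOLDEN_ANGLE = 360.0 * (1 - 1 / GOLDEN_RATIO)


def fibonacci(count):
    """Return the first `count` Fibonacci numbers, starting 1, 1."""
    numbers = [1, 1]
    while len(numbers) < count:
        numbers.append(numbers[-1] + numbers[-2])
    return numbers[:count]


def spiral_positions(count):
    """Place element n at angle n * golden angle and radius sqrt(n)."""
    positions = []
    for n in range(count):
        theta = math.radians(n * GOLDEN_ANGLE)
        r = math.sqrt(n)
        positions.append((r * math.cos(theta), r * math.sin(theta)))
    return positions


fib = fibonacci(15)
# Ratios of successive Fibonacci numbers approach the golden ratio.
ratios = [b / a for a, b in zip(fib, fib[1:])]
seeds = spiral_positions(200)
print(GOLDEN_ANGLE, ratios[-1], GOLDEN_RATIO)
```

<p>Plotting the points in <code>seeds</code> with any plotting tool should show
the familiar interlocking spirals.</p>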
<p>After lunch the participants were invited to continue thinking about modelling
by going out into the field and taking photographs of interesting plant
patterns. This was followed by a show and tell session where people were
encouraged to think about and discuss possible mechanisms that could be used to
produce the pattern in question. This was followed by more computer modelling
using L-systems.</p>
Day 4: Multi-level modelling in morphogenesis2015-07-16T00:00:00+00:00http://tjelvarolsson.com/blog/day4-multi-level-modelling-in-morphogenesis<p>The fourth day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course was started by
<a href="https://www.jic.ac.uk/directory/stan-maree/">Dr Stan Maree</a>
giving a talk outlining how modelling has helped us understand the life-cycle of
<a href="https://en.wikipedia.org/wiki/Slime_mold">cellular slime mold</a>.</p>
<p>Under normal conditions cellular slime molds act as individual cells feeding on
bacteria. However, during starvation hundreds of thousands of these cells come
together to form a slug-like creature that can migrate across a surface guided
by, amongst other things, thermotaxis. Finally the slug culminates by transforming
itself into a fruiting body consisting of a spore head on a small, tapering
stalk.</p>
<p>Dr Maree then illustrated, through modelling, that all of the aspects of the
life-cycle unfold by combining only a few processes. The main processes being:</p>
<ul>
<li>excitable media through cAMP</li>
<li>chemotaxis towards cAMP</li>
<li>differential cell adhesion</li>
<li>cell differentiation</li>
</ul>
<p>The fact that this parsimonious description can account for the complex
behaviour of development is mainly due to the fact that one can get differing
behaviours by operating on different levels, e.g. individual cells versus
clusters of cells versus a slug of cells. For example pressure waves emerge and
guide the culmination stage solely due to cell adhesion and excitable media.
Adhesion, meanwhile, is a property linked to cell membranes which ensures that cells
adopt different shapes and that clusters of cells also develop certain
topologies. At the highest level the Dictyostelium slug can even act as a lens
and use this effect to navigate up light gradients, which could be
understood through modelling. This combination of modelling at many different
scales is a core aspect of this course.</p>
<p>The participants were then invited to work through workshop material on
aggregation and cAMP waves, aggregation and slug formation, thermotaxis
and culmination.</p>
<p>In the afternoon there was a keynote lecture by
<a href="http://web.stanford.edu/group/bergmann/cgi-bin/bergmannlab/">Dominique Bergmann</a>.</p>
<p>Dr Bergmann’s group is interested in the
<a href="http://web.stanford.edu/group/bergmann/cgi-bin/bergmannlab/research">development of stomata</a>,
in particular the spatial organization of the stomatal lineage.
Dr Bergmann’s group is largely experimental. However, she is very keen to
interact with modellers and it was great to see her inviting the participants
of the course to tackle questions that her group are currently battling with.
Questions which could potentially be answered by modelling.</p>
<p>Dr Bergmann’s talk gave fascinating insight into the mechanisms by which stomata
are patterned across leaves, starting from the simple rule that no stomata may
touch each other. The talk revealed that flexible patterned development can
arise by regulating the expression of key transcription factors through
positive and negative feedback loops. Furthermore by studying the differences
between the regulatory networks in grass and Arabidopsis Dr Bergmann managed to
refine important features of the mechanism of patterning, highlighting the
value of studying different organisms.</p>
Day 3: Multi-level modelling in morphogenesis2015-07-15T00:00:00+00:00http://tjelvarolsson.com/blog/day3-multi-level-modelling-in-morphogenesis<p>The third day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course started off with a lecture by
<a href="https://www.jic.ac.uk/directory/veronica-grieneisen/">Dr Veronica Grieneisen</a>,
giving an introduction to the
<a href="https://en.wikipedia.org/wiki/Cellular_Potts_model">cellular Potts model</a>.</p>
<p>The talk started off with a statement that biophysics, or simply physics,
constrains and drives tissue development. And that (embryonic) tissues
share properties with fluids.</p>
<p>So why do clusters of cells form similar structures to froths and bubbles?
Basically some of the principles are the same: the tendency to minimise
area and conserve (topological) constraints gives rise to a frustration which
generates characteristic configurations.</p>
<p>In 1964 Steinberg formulated the
<a href="https://en.wikipedia.org/wiki/Differential_adhesion_hypothesis">differential adhesion hypothesis</a>
in which he made the comparison between cells and immiscible fluids. He used the idea
that cell types present different adhesive and cohesive interactions to postulate
that the final configurations are established by obtaining a minimal interface
free energy through successive changes in cell contacts.</p>
<p>This means that differential inter-cellular adhesion is one of the most
important factors in cell sorting.</p>
<p>It did however take some time before this could be modelled. People tried doing
it using cellular automata. However, this turned out to be (close to)
impossible.</p>
<p>Some time later people started experimenting with the cellular Potts model.
A significant difference between the cellular Potts model and cellular automata is
the representation of a biological cell. In cellular automata a cell is
represented by a single pixel, whereas in the cellular Potts model a biological
cell is represented by lots and lots of pixels. The latter is therefore able to
represent cell shape.</p>
<p>The
<a href="https://en.wikipedia.org/wiki/Cellular_Potts_model">cellular Potts model</a>
is simply a Hamiltonian consisting of terms for adhesion, volume conservation
and cortical tension. The latter being a more recent addition.</p>
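<p>As a minimal sketch of what evaluating such a Hamiltonian involves, the
function below sums an adhesion term over neighbouring pixel pairs that belong
to different cells and a quadratic volume term per cell. A single adhesion
constant is used for all interfaces and the cortical tension term is left out
to keep the example short; the grid, parameter names and values are all
hypothetical.</p>

```python
def hamiltonian(grid, adhesion, target_volume, volume_strength):
    """Energy of a 2D grid of cell identifiers (0 is the medium)."""
    rows, cols = len(grid), len(grid[0])
    energy = 0.0
    volumes = {}
    for r in range(rows):
        for c in range(cols):
            cell = grid[r][c]
            volumes[cell] = volumes.get(cell, 0) + 1
            # Look right and down only, so each pixel pair is counted once.
            for dr, dc in ((0, 1), (1, 0)):
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols and grid[rr][cc] != cell:
                    energy += adhesion
    for cell, volume in volumes.items():
        if cell != 0:  # the medium has no volume constraint
            energy += volume_strength * (volume - target_volume) ** 2
    return energy


# One biological cell (id 1) made of four pixels, surrounded by medium (id 0).
grid = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
at_target = hamiltonian(grid, adhesion=2.0, target_volume=4, volume_strength=1.0)
too_small = hamiltonian(grid, adhesion=2.0, target_volume=5, volume_strength=1.0)
print(at_target, too_small)
```

<p>Here the cell has eight pixel edges in contact with the medium, so the
adhesion term contributes 8 times the adhesion constant, and the volume term
only contributes when the cell deviates from its target volume.</p>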
<p>The cellular Potts model is driven by Monte Carlo sampling where the edges of the
cells are allowed to change state into neighboring cells. The energy of each pixel
is then evaluated and if the energy is lowered the change is accepted. If a
change increases the energy the change can still be accepted. The likelihood of
accepting an energetically unfavourable change is evaluated using a probability
function that makes it more unlikely as the energy difference increases.
The reason for accepting some energetically unfavoured changes is so that the
system can be driven towards the global minimum over time.</p>
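<p>The acceptance rule can be sketched as follows. The talk summary above does
not specify the probability function; the Boltzmann factor used here is the
usual choice in Metropolis-style Monte Carlo sampling and should be read as an
assumption, as should the function names and the temperature value.</p>

```python
import math
import random


def acceptance_probability(energy_change, temperature):
    """Always accept a change that lowers the energy; otherwise accept
    with a probability that decays with the size of the increase."""
    if energy_change <= 0:
        return 1.0
    return math.exp(-energy_change / temperature)


def accept(energy_change, temperature, rng):
    """Draw a random number to decide whether a proposed change is kept."""
    return rng.random() < acceptance_probability(energy_change, temperature)


for delta in (-1.0, 0.5, 2.0, 5.0):
    print(delta, acceptance_probability(delta, temperature=1.0))

# Fraction of accepted moves for an unfavourable change of 2.0 energy units.
rng = random.Random(1)  # seeded so that runs are reproducible
trials = 10000
accepted = sum(accept(2.0, 1.0, rng) for _ in range(trials))
print(accepted / trials)
```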
<p>As it turns out the cellular Potts model can capture both the stochastic nature
of cell dynamics and cell sorting behaviour.</p>
<p>The formalisms of the cellular Potts model were then described in some detail
before the talk was concluded by stating that the cellular Potts model is an
energy-based model that describes surface mechanics (adhesion, membrane
fluctuations, internal pressures, cortical tension). As a result one can use it
to talk about macroscopic phenomena such as cell shape, cell sorting, tissue
surface tension, cell movement, stresses and shape changes, as well as
stresses and strains through a tissue.</p>
<p>It is also possible to extend the cellular Potts model to specific biological problems
such as chemotaxis and cell differentiation. Furthermore it can be used in conjunction with
other models to look at sub-cellular details such as gene regulatory networks. <strong>It therefore
serves as a great tool for cell-based modelling of morphogenesis.</strong></p>
<p>The lecture was followed by a practical session where the participants got
hands-on experience of using the cellular Potts model for studying cell
sorting.</p>
<p>After lunch
<a href="https://www.jic.ac.uk/directory/stan-maree/">Dr Stan Maree</a>
took over where the morning’s lecture had ended by illustrating how
the cellular Potts model can be used to model cellular movement and
morphogenesis.</p>
<p>The first example illustrated how the cellular Potts model could be extended to
model movements of cells in lymph nodes. The movement of T-cells in the lymph
node can be described as random persistent motion. The reason for this type of
movement is that T-cells want to move past as many dendritic cells as possible
(and vice versa) in as short a time as possible.</p>
<p>The model was created from:</p>
<ul>
<li>T-cells set to be persistently moving</li>
<li>Dendritic cells (including extensions)</li>
<li>Reticular network (undeformable)</li>
<li>Correct sizes, densities, and shapes of cells</li>
<li>Fitting speed and motility</li>
</ul>
<p>Where the persistent T-cell movement was created from:</p>
<ul>
<li>Continuously adjusting the target direction</li>
<li>Continuously adjusting the directional persistence</li>
<li>Adjustment according to the reticular network</li>
</ul>
<p>The model created managed to reproduce: short term persistent motion,
long term random motion, and the experimentally observed “stop-and-go”
behaviour. The latter had not been incorporated into the model and was
previously thought to occur from a synchronised clock in the T-cells. These and
other simulations then prompted further and longer time-lapse experiments from
the experimental biologists that disproved the internal clock hypothesis.</p>
<p>The second example illustrated how the cellular Potts model could be combined
with chemotaxis, cell differentiation and gene regulatory networks to model
complex developmental changes during
<a href="https://en.wikipedia.org/wiki/Gastrulation">gastrulation</a>.</p>
<p>Dr Maree’s lecture was followed by a keynote talk by
<a href="http://www.fbs.osaka-u.ac.jp/labs/skondo/indexE.html">Professor Shigeru Kondo</a>.</p>
<p>Professor Kondo gave a fascinating talk describing his quest to find evidence of
reaction-diffusion systems in animals.</p>
<p>He did this by turning the traditional work flow of a molecular biologist on its head.
Usually a molecular biologist would:</p>
<ol>
<li>Find mutants</li>
<li>Identify all the genes involved in the phenomenon</li>
<li>Identify the functions of all those genes</li>
<li>Clarify the whole interaction network</li>
<li>Do calculations to make sure the identified system can reproduce the phenomenon of interest</li>
</ol>
<p>However, Professor Kondo decided that if he was to find evidence for the reaction-diffusion
system he would need to use the theory before doing the experiments. As such he set out to:</p>
<ol>
<li>Do many computer simulations</li>
<li>Extract important characteristics</li>
<li>Predict something unexpected</li>
<li>Show that it can happen!</li>
</ol>
<p>He then illustrated how his group had applied this methodology of modelling
first and experimenting second to find extraordinary evidence of the
reaction-diffusion system in fish.</p>
<p>For example, one prediction made from the Turing reaction-diffusion system
was that fish stripes should be able to migrate, specifically stripes
that bifurcate. At the time Professor Kondo asked leading experts of
fish developmental biology if this had ever been observed. However, they
replied that it had not. Undeterred, Professor Kondo started looking
for evidence of this behaviour and was able to find it in the skin of
marine angelfish, see
<a href="http://www.fbs.osaka-u.ac.jp/labs/skondo/paper/kondo%20Nature%201995.pdf">Kondo and Asai; Nature (1995)</a>.</p>
<p>After several other striking examples of prediction followed by experimental
evidence Professor Kondo concluded by stating that Turing systems had proved an
effective tool for understanding patterning. However, he made the point clear
that the underlying mechanism is probably encoded in cell motility rules rather
than necessarily in an activator and inhibitor.</p>
Day 2: Multi-level modelling in morphogenesis2015-07-14T00:00:00+00:00http://tjelvarolsson.com/blog/day2-multi-level-modelling-in-morphogenesis<p>The theme of the second day of the
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>
course was
<em>emergent patterns and morphogens</em>.</p>
<p>The day started with a presentation by
<a href="https://www.jic.ac.uk/directory/stan-maree/">Dr Stan Maree</a>
who posed the question:</p>
<blockquote>
Can we get spontaneous formation of patterns out of nothing?
</blockquote>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/a/af/ZebraLudolphus.jpg" alt="Zebra" /></p>
<p>During the
<a href="/blog/day1-multi-level-modelling-in-morphogenesis/">previous day</a>
we learnt that equilibria can be stable or unstable. Now imagine a reaction with a
stable equilibrium and combine it with diffusion. Can we get patterns from that
combination?</p>
<p>In 1952 Alan Turing showed, theoretically, that a stable equilibrium could become
unstable solely due to the diffusion of the chemicals involved.</p>
<p>Dr Maree then worked through the logic of Turing’s original paper
<a href="http://rstb.royalsocietypublishing.org/content/237/641/37">The Chemical Basis of Morphogenesis</a>.
In the end Turing showed that in a system where <code class="language-plaintext highlighter-rouge">A</code> and <code class="language-plaintext highlighter-rouge">I</code> activate
and inactivate each other (or vice versa) an instability will occur if <code class="language-plaintext highlighter-rouge">A</code>
also activates its own production, while <code class="language-plaintext highlighter-rouge">I</code> also inhibits its own production
and the diffusion of <code class="language-plaintext highlighter-rouge">I</code> is sufficiently faster than the diffusion of <code class="language-plaintext highlighter-rouge">A</code>.</p>
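<p>As a rough illustration of these conditions, here is a minimal Python sketch that checks whether a two-component system is stable without diffusion yet unstable once diffusion is added. It assumes the reactions have been linearised around the equilibrium. The function name <code class="language-plaintext highlighter-rouge">is_turing_unstable</code>, the Jacobian entries and the diffusion constants are made up for illustration and are not taken from the lecture.</p>

```python
def is_turing_unstable(jacobian, diff_a, diff_i):
    """Return True if diffusion destabilises an otherwise stable equilibrium.

    jacobian is ((f_a, f_i), (g_a, g_i)): the partial derivatives of the
    reaction terms for A and I, evaluated at the equilibrium.
    """
    (f_a, f_i), (g_a, g_i) = jacobian
    trace = f_a + g_i
    determinant = f_a * g_i - f_i * g_a

    # Without diffusion the equilibrium must be stable.
    if not (trace < 0 and determinant > 0):
        return False

    # With diffusion, a spatial mode with wavenumber k grows when
    # diff_a*diff_i*k^4 - (diff_a*g_i + diff_i*f_a)*k^2 + determinant
    # becomes negative for some k^2 > 0.
    weighted = diff_a * g_i + diff_i * f_a
    return weighted > 0 and weighted ** 2 > 4 * diff_a * diff_i * determinant


# Hypothetical example: A activates itself and I; I inhibits itself and A.
example = ((1.0, -1.0), (2.0, -1.5))
print(is_turing_unstable(example, diff_a=1.0, diff_i=1.0))   # equal diffusion
print(is_turing_unstable(example, diff_a=1.0, diff_i=10.0))  # I diffuses faster
```

<p>With these made-up numbers the equilibrium stays stable when both chemicals diffuse equally fast, and only becomes unstable when <code class="language-plaintext highlighter-rouge">I</code> diffuses sufficiently faster than <code class="language-plaintext highlighter-rouge">A</code>.</p>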
<p>The patterns that are formed by the
<a href="https://en.wikipedia.org/wiki/Reaction–diffusion_system">reaction-diffusion system</a> are
known as Turing patterns.</p>
<p>Dr Stan Maree’s talk was followed by a more sociological talk by
<a href="http://nick-monk.staff.shef.ac.uk">Dr Nick Monk</a>
who examined the interactions between experimental and theoretical biologists
in the context of reaction-diffusion systems.</p>
<p>In the early days the reaction-diffusion system did not really gain traction
with embryologists, who did not like it as a model for development.
They thought it was too sensitive to environmental conditions and as such
too messy for development.</p>
<p>In the 1970s Turing patterns started to gain traction with experimental biologists
primarily through the work of Hans Meinhardt. These models did not have a
molecular basis that could be linked to experimental data - in the spirit of
the time, they were posed in terms of “activity”. A significant problem was
that there was no easy/obvious way of linking experimental data to the
models.</p>
<p>Unfortunately things took a turn for the worse in the 1980s, when modellers
perhaps tried to overreach. At the time molecular genetics really got into
its stride. For a few systems, intense effort brought into focus the molecular
complexity of processes such as Drosophila segmentation and limb development.
At the same time, some reaction-diffusion modellers became bold about the role
of their models. Unfortunately there was a lack of iterative modelling and
experimentation. Some modellers also failed to realise that a similarity
between one’s model and an experimental system does not mean that the
model is correct.</p>
<p>At the time George Oster provided a voice of reason:</p>
<blockquote cite="http://www.sciencedirect.com/science/article/pii/0025556488900703">
However, many developmental biologists are now taking a hard look at
the actual contributions pattern formation models have made to their field,
and I sense some disillusionment.
</blockquote>
<p>He used lots of data and examined lots of models to find that:</p>
<blockquote cite="http://www.sciencedirect.com/science/article/pii/0025556488900703">
physical and chemical mechanisms hypothesized by the models may be quite different, they
all predict very similar kinds of spatial patterns. Therefore, since the underlying mechanism
cannot in general be deduced from the pattern itself, other criteria must be applied in
evaluating the usefulness of pattern formation models.
</blockquote>
<p>Unfortunately the paper was published in
<a href="http://www.sciencedirect.com/science/article/pii/0025556488900703">Mathematical Biosciences 90: 265-286 1988</a>
and as such received little attention from experimental biologists.
Instead a paper by Michael Akam
<a href="http://www.nature.com/nature/journal/v341/n6240/abs/341282a0.html">Making stripes inelegantly</a>
made the headlines. The paper, which appeared as a News and Views article in
Nature, claimed that patterns are just made by messy specific systems and that
reaction-diffusion systems were of little importance. This paper
had a big negative impact on the willingness of experimentalists to listen to
modellers for most of the 1990s.</p>
<p>However, things are getting much better now. Experimentalists are starting to
take reaction-diffusion systems seriously again and the interpretation of
the models is becoming more nuanced. Some experimentalists are even
being inspired by Turing to measure diffusion rates in order to be able
to create better models. Furthermore, people are realising that
reaction-diffusion systems act in concert with other mechanisms such as
gene regulatory networks. The latter point was something that Turing pointed
out in his original paper:</p>
<blockquote cite="http://rstb.royalsocietypublishing.org/content/237/641/37">
most of an organism, most of the time, is developing from one pattern into
another, rather than from homogeneity into a pattern
</blockquote>
<p>Dr Monk finished off by reiterating the key message of his talk, a message
that was made by Oster in 1988: it is the type of instability that is
important. That is what will give you insight. Then you can try to understand
what the underlying players are that allow the mechanism to give rise to the
right pattern.</p>
<p>The participants were then invited to work through a number of exercises
and play around with various programs illustrating various aspects of
Turing patterns.</p>
<p>After lunch
<a href="https://www.jic.ac.uk/directory/veronica-grieneisen/">Dr Veronica Grieneisen</a>
gave a lecture on morphogen gradients and plant development.</p>
<p>The lecture started off by outlining the
<a href="https://en.wikipedia.org/wiki/French_flag_model">French Flag Theory</a>
created by
<a href="http://www.sciencedirect.com/science/article/pii/S0022519369800160">Wolpert in 1969</a>,
noting in particular that the original paper was not concerned with
morphogen gradients per se, but with how morphogen gradients can be used to
tackle the issue of scaling.</p>
<p>Important considerations were then outlined in terms of:</p>
<ul>
<li>Spatial scales: what are the characteristic length and relevant tissue growth?</li>
<li>Temporal scales: what is the time required to spread a signal growth?</li>
<li>Robustness: how sensitive is the system’s behaviour to perturbations?</li>
</ul>
<p>The latter point of robustness is multifaceted. It covers parametric
robustness (dosage of genes, levels or rates of enzymes) as well as the
precision of gradients, which needs to be considered in terms of natural
variation among individuals vs. stochasticity within an individual’s growth.</p>
<p>These topics were then considered in the context of a mesoscopic modelling
exercise of the auxin levels at the quiescent centre of the root tip during
root growth. As it turned out, the high concentration of auxin at the quiescent
centre was achieved by a reflux-driven maximum. Interestingly, this type of
maximum turned out to have isomorphic counterparts in volcanic micro currents
as well as in counter current transport in kidneys.</p>
<p>Dr Grieneisen’s lecture was followed by a keynote lecture by
<a href="http://www.msc.univ-paris-diderot.fr/spip.php?rubrique140&lang=en">Dr Yves Couder</a>
in which he outlined how the patterns formed by
<a href="https://en.wikipedia.org/wiki/Leaf#Veins">leaf venation</a> are similar
to the cracks observed in old porcelain, as well as dried mud. In all cases
the cracks join up with each other, as opposed to the patterns observed
during crystal growth, which do not join up.</p>
<p>These cracks observed in old porcelain can be explained by growth in a tensor
field.
Dr Couder then went on to illustrate several beautiful examples where they managed
to replicate various leaf vein patterns by drying gels on glass plates (where
the static glass plate provided the stress for the tensor field).</p>
<p>The talk then went into more theory and experiments looking at the role of
mechanical stress in growth, concluding that:</p>
<ul>
<li>Externally applied mechanical stress generates an orientation of the
microtubules (and cell divisions) along the direction of main stress</li>
<li>In a normal meristem the microtubules become oriented along the direction of
the main stresses induced in the L1 layer by turgor pressure</li>
<li>The deposited microfibrils and the new cell walls strengthen the tissue
along the direction of largest stress</li>
</ul>
<p>After all the talks the participants of the course were invited back to
experiment with workshop material on morphogen gradients, diffusion and
permeability, source-decay models, directed transport and the reflux
model.</p>
Day 1: Multi-level modelling in morphogenesis2015-07-13T00:00:00+00:00http://tjelvarolsson.com/blog/day1-multi-level-modelling-in-morphogenesis<p>Today was the first day of a two week course on
<a href="https://www.jic.ac.uk/whats-on/events/2015/07/embo-practical-course-2015/">multi-level modelling in morphogenesis</a>.</p>
<p>During the introductory lecture, given by
<a href="https://www.jic.ac.uk/directory/veronica-grieneisen/">Dr Veronica Grieneisen</a>,
the goals of the course were outlined.</p>
<p>A goal of the course is to spread a broader understanding of
developmental biology and biological modelling. More specifically to:</p>
<ul>
<li>Define and understand generic principles guiding developmental biology</li>
<li>Learn how to identify and unravel processes</li>
<li>Understand and discuss at what level one should “model” a phenomenon</li>
<li>Get exposed to different biological paradigm systems, as well as modelling formalisms</li>
<li>How to obtain models <strong>with predictive value</strong> and <strong>explanatory power</strong> that <strong>create isomorphisms</strong></li>
</ul>
<p>In this context isomorphisms are corresponding abstractions and conceptual
models that can be applied to different phenomena.</p>
<p>At another level a goal of the course is to discuss practical aspects of
modelling. As scientists we need techniques to be able to express ourselves. As
such the course aims to open the black boxes that biologists often make use of.</p>
<p>At yet another level the course is about communication. In particular enabling
communication with a common language. The participants of the course are
intentionally a mix of experimental and computational biologists from many
different fields of biology. Bridging the gap between experimental and
computational biologists is a central theme of the course. As is learning how
to express oneself to people from different fields.</p>
<p>The section describing the goals was followed by a brief introduction
to partial differential equations starting from the conservation equation (the
differential form of the
<a href="https://en.wikipedia.org/wiki/Continuity_equation">continuity equation</a>),
leading into a discussion about
<a href="https://en.wikipedia.org/wiki/Flux">flux</a>, Fick’s first law and
<a href="https://en.wikipedia.org/wiki/Fick%27s_laws_of_diffusion">its relation to diffusion</a>.</p>
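<p>To make the link between flux, Fick’s first law and the conservation equation concrete, here is a minimal Python sketch of one-dimensional diffusion on a grid. It is not the course material: the function name <code class="language-plaintext highlighter-rouge">diffusion_step</code>, the closed (zero-flux) boundaries and the parameter values are assumptions chosen for illustration.</p>

```python
def diffusion_step(conc, diff_coeff, dx, dt):
    """Advance a 1D concentration profile by one explicit time step."""
    # Fick's first law: the flux across each interior cell interface is
    # proportional to minus the concentration gradient.
    interior = [-diff_coeff * (conc[i + 1] - conc[i]) / dx
                for i in range(len(conc) - 1)]
    # Assumed closed boundaries: nothing flows in or out at the ends.
    flux = [0.0] + interior + [0.0]
    # Conservation equation: the change in a cell equals the flux entering
    # through its left interface minus the flux leaving through its right.
    return [c - dt * (flux[i + 1] - flux[i]) / dx
            for i, c in enumerate(conc)]


# A single peak spreads out while the total amount stays the same.
profile = [0.0, 0.0, 1.0, 0.0, 0.0]
for _ in range(10):
    profile = diffusion_step(profile, diff_coeff=1.0, dx=1.0, dt=0.1)
print(profile)
print(sum(profile))
```

<p>This explicit scheme is only stable when <code class="language-plaintext highlighter-rouge">diff_coeff * dt / dx**2</code> is at most 0.5, which is one example of how the choice of modelling technique affects what you see.</p>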
<p>The overview of partial differential equations was followed by a discussion on
<a href="https://en.wikipedia.org/wiki/Cellular_automaton">cellular automata</a>,
“to have an object or not to have an object”. Three examples were described:</p>
<ol>
<li><a href="http://demonstrations.wolfram.com/CellularAutomataWithMajorityRule/">Majority voting rule</a></li>
<li><a href="https://en.wikipedia.org/wiki/Conway%27s_Game_of_Life">Conway’s game of life</a></li>
<li><a href="http://cell-auto.com/neighbourhood/margolus/">Margolus diffusion/alternation</a></li>
</ol>
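<p>For readers who have not met cellular automata before, the sketch below shows one generation of Conway’s game of life in Python. It is my own minimal illustration rather than the workshop code; storing the live cells as a set of <code class="language-plaintext highlighter-rouge">(row, col)</code> tuples on an unbounded grid is an assumption made to keep the example short.</p>

```python
from collections import Counter


def life_step(live_cells):
    """Return the next generation given a set of live (row, col) cells."""
    # Count, for every cell on the grid, how many live neighbours it has.
    neighbour_counts = Counter(
        (row + d_row, col + d_col)
        for row, col in live_cells
        for d_row in (-1, 0, 1)
        for d_col in (-1, 0, 1)
        if (d_row, d_col) != (0, 0)
    )
    # A cell is alive in the next generation if it has exactly three live
    # neighbours, or if it is alive now and has exactly two.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }


# A "blinker": three cells in a row flip between horizontal and vertical.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(life_step(blinker)))
```

<p>Each cell is either occupied or not, and the whole rule fits in a few lines. That is the sense in which cellular automata are about “to have an object or not to have an object”.</p>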
<p>The participants were then immersed in a hands-on workshop exploring majority
voting and Conway’s game of life. This was followed by an exploration of diffusion simulated by
partial differential equations and Margolus alternation. The latter was accomplished
by an exercise exploring
<a href="https://en.wikipedia.org/wiki/Diffusion-limited_aggregation">diffusion-limited aggregation</a>.
The purpose of these exercises was to make biologists more familiar with
thinking algorithmically and for everyone to think critically about the impact
of the choice of modelling technique. What can be seen as a feature in one
instance can be an artifact in another. It all depends on the phenomenon that
one is trying to model.</p>
<p>Then it was time for lunch and socialising.</p>
<p>After lunch
<a href="https://www.jic.ac.uk/directory/stan-maree/">Dr Stan Maree</a>
introduced three seemingly different phenomena: cellular slime mold chemotaxis,
the Belousov-Zhabotinsky reaction (chemistry) and action potentials in neurophysiology.</p>
<p>The
<a href="https://en.wikipedia.org/wiki/Hodgkin–Huxley_model">Hodgkin-Huxley model</a>
was described in all its complexity, followed by a statement that it can be
described as “unpleasantly complex” and a quote from FitzHugh that
“the usefulness of an equation to an experimental physiologist (…) depends
on his understanding of how it works”. The
<a href="https://en.wikipedia.org/wiki/FitzHugh–Nagumo_model">FitzHugh-Nagumo model</a>
was then briefly introduced. However, the details of it and the implications
of the model were not described as it was to be explored during the
afternoon’s practical session.</p>
<p>Instead the focus shifted to how one can gain an understanding of systems of
linear ordinary differential equations. Time plots were contrasted with
<a href="https://en.wikipedia.org/wiki/Phase_plane">phase plane plots</a>, and the importance
of visualising
<a href="https://en.wikipedia.org/wiki/Nullcline">nullclines</a> as lines of zero change
for a particular variable was highlighted, in particular the fact that
one can identify all equilibria from the intersections of nullclines in
a phase plane plot.</p>
<p>Stability of equilibria was then discussed and simple rules for quickly analysing
the stability of equilibria were derived from the fact that:</p>
<ol>
<li>For an equilibrium to be stable its eigenvalues need to have negative real parts</li>
<li>Summing two eigenvalues results in the trace</li>
<li>Multiplying two eigenvalues results in the determinant</li>
</ol>
<p>So by plotting the trace vs the determinant we can get a plot illustrating
different types of equilibria,
<a href="https://en.wikipedia.org/wiki/Phase_plane#Eigenvectors_and_nodes">see also</a>.</p>
<p><img src="https://upload.wikimedia.org/wikipedia/commons/3/35/Phase_plane_nodes.svg" alt="trace-determinant plot" /></p>
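<p>The trace and determinant rules can be turned into a few lines of Python. The sketch below is my own illustration for a system with two variables; the function name <code class="language-plaintext highlighter-rouge">classify_equilibrium</code> and the labels it returns are assumptions, not something presented in the lecture.</p>

```python
def classify_equilibrium(jacobian):
    """Classify the equilibrium of a 2x2 linear system from its Jacobian."""
    (a, b), (c, d) = jacobian
    trace = a + d                  # sum of the two eigenvalues
    determinant = a * d - b * c    # product of the two eigenvalues

    if determinant < 0:
        # One positive and one negative eigenvalue.
        return "saddle"
    if determinant == 0:
        return "degenerate"
    if trace == 0:
        # Purely imaginary eigenvalues.
        return "centre"
    # Real eigenvalues give a node, complex eigenvalues give a spiral.
    kind = "node" if trace ** 2 >= 4 * determinant else "spiral"
    stability = "stable" if trace < 0 else "unstable"
    return "{} {}".format(stability, kind)


print(classify_equilibrium(((-1.0, 0.0), (0.0, -2.0))))
print(classify_equilibrium(((0.0, 1.0), (-1.0, -0.5))))
print(classify_equilibrium(((1.0, 0.0), (0.0, -1.0))))
```

<p>The function never computes the eigenvalues themselves. The sign of the determinant, the sign of the trace and the sign of <code class="language-plaintext highlighter-rouge">trace ** 2 - 4 * determinant</code> are enough to place the equilibrium on the plot above.</p>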
<p>The
<a href="https://en.wikipedia.org/wiki/Jacobian_matrix_and_determinant">Jacobian matrix</a>
was then introduced and the idea that the Jacobian can be approximated by
plotting the nullclines on the phase plane plot and making small
perturbations around the equilibria.</p>
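<p>The graphical approach has a simple numerical counterpart: nudge each variable a little and see how the rates of change respond. The Python sketch below does this for the FitzHugh-Nagumo model. It is an assumption-laden illustration and not the course code: the equations use the commonly quoted form of the model, the parameter values are typical textbook choices, and the function names are my own.</p>

```python
def fitzhugh_nagumo(v, w, current=0.0, a=0.7, b=0.8, tau=12.5):
    """Return (dv/dt, dw/dt) for the FitzHugh-Nagumo model."""
    dv = v - v ** 3 / 3.0 - w + current
    dw = (v + a - b * w) / tau
    return dv, dw


def numerical_jacobian(func, v, w, step=1e-6):
    """Approximate the 2x2 Jacobian of func at (v, w) by central differences."""
    v_plus, v_minus = func(v + step, w), func(v - step, w)
    w_plus, w_minus = func(v, w + step), func(v, w - step)
    # Row i holds the derivatives of equation i with respect to v and w.
    return [
        [(v_plus[i] - v_minus[i]) / (2 * step),
         (w_plus[i] - w_minus[i]) / (2 * step)]
        for i in range(2)
    ]


# Evaluate at an arbitrary point; for a stability analysis this point
# would be an equilibrium found from the intersection of the nullclines.
for row in numerical_jacobian(fitzhugh_nagumo, 0.5, 0.2):
    print(row)
```

<p>The trace and determinant of the resulting matrix can then be used with the stability rules described earlier.</p>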
<p>The participants were then invited to explore the temporal dynamics of the FitzHugh-Nagumo
model using the software
<a href="http://www-binf.bio.uu.nl/rdb/grind.html">grind</a>. This was followed by exercises looking
at the spatio-temporal dynamics of the same model using partial differential equations.
This led on to looking at spirals formed when introducing a temporary barrier
and it was highlighted that these spirals could never have been identified if
one did not take the spatial regime into account. Finally the link to the
other phenomena outlined at the beginning of the afternoon session, slime mold
chemotaxis and the
Belousov-Zhabotinsky reaction, was pointed out and the isomorphic nature of these
phenomena was highlighted.</p>
How to generate beautiful technical documentation2015-07-11T00:00:00+00:00http://tjelvarolsson.com/blog/how-to-generate-beautiful-technical-documentation<p>In the
<a href="/blog/five-tips-to-help-you-document-your-coding-project/">previous post</a>
I gave some motivational tips to inspire you to document your coding
project. In this post I will illustrate how you can convert documentation
written as plain text files into beautiful HTML documentation using a
tool called
<a href="http://sphinx-doc.org">Sphinx</a>.</p>
<h2 id="installing-sphinx">Installing Sphinx</h2>
<p>Sphinx is a documentation generation tool written in Python and it can be
installed using <code class="language-plaintext highlighter-rouge">pip</code>. If you do not yet have <code class="language-plaintext highlighter-rouge">pip</code> installed on your
system please have a look at the
<a href="https://pip.pypa.io/en/stable/installing.html">pip installation notes</a>.</p>
<p>Let us install Sphinx.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo pip install -U Sphinx
</code></pre></div></div>
<h2 id="generating-boilerplate-files-for-the-documentation">Generating boilerplate files for the documentation</h2>
<p>Suppose that we are at the early stages of our project. All we have is
a <code class="language-plaintext highlighter-rouge">README</code> file with the content below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>README
======
This project aims to inspire people to write more and better documentation.
</code></pre></div></div>
<p>However, we know that we want to store more extensive documentation in
a subdirectory named <code class="language-plaintext highlighter-rouge">docs</code>. Let us create that directory and add
some Sphinx boilerplate files to it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir docs
$ cd docs
$ sphinx-quickstart
</code></pre></div></div>
<p>The last command will prompt you for answers to a bunch of questions on how you
want to set up your documentation and what extensions you want to enable. I tend
to accept the defaults for everything except the question on whether or not I
want to separate the source and build directories.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> Separate source and build directories (y/n) [n]: y
</code></pre></div></div>
<p>The input fields for project name, author name(s) and project version require
you to provide some information. Below are the answers that I gave to these
questions in this instance.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> Project name: Better documentation
> Author name(s): Tjelvar Olsson
> Project version: 0.0.1
</code></pre></div></div>
<p>Let’s see what was generated.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tree
.
├── Makefile
├── build
├── make.bat
└── source
    ├── _static
    ├── _templates
    ├── conf.py
    └── index.rst

4 directories, 4 files
</code></pre></div></div>
<p>Let us go through the files one by one. The <code class="language-plaintext highlighter-rouge">Makefile</code> allows us to build the
documentation using <code class="language-plaintext highlighter-rouge">make</code>. The <code class="language-plaintext highlighter-rouge">make.bat</code> file allows us to build the documentation
on Windows based systems. The <code class="language-plaintext highlighter-rouge">source/conf.py</code> file contains configurations for building
the documentation (we will edit this later). The <code class="language-plaintext highlighter-rouge">index.rst</code> file is the root file of
the documentation we are about to write.</p>
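<p>To give a feel for what lives in <code class="language-plaintext highlighter-rouge">source/conf.py</code>, here is a heavily trimmed sketch of the kind of settings it holds. The generated file contains many more options and comments, and its exact content depends on your Sphinx version, so treat the values below as an illustration based on the answers given above rather than a copy of the real file.</p>

```python
# Trimmed, illustrative sketch of source/conf.py (not the generated file).

# Sphinx extensions to enable; empty when the defaults were accepted.
extensions = []

# Directories, relative to this file, for custom templates and static files.
templates_path = ["_templates"]
html_static_path = ["_static"]

# Answers given to sphinx-quickstart.
project = "Better documentation"
author = "Tjelvar Olsson"

# Short version and full release string shown in the generated pages.
version = "0.0.1"
release = "0.0.1"
```

<p>Because the configuration is an ordinary Python file you can use regular Python code in it, for example to compute a value rather than hard coding it.</p>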
<h2 id="lets-build-some-documentation">Let’s build some documentation</h2>
<p>Before we do anything else let us see what we get when we build the documentation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ make html
</code></pre></div></div>
<p>This will create output in the directory <code class="language-plaintext highlighter-rouge">build/html</code>. Open the <code class="language-plaintext highlighter-rouge">build/html/index.html</code> file
in your browser of choice. You should see something along the lines of the below.</p>
<p><img src="/images/sphinx_default_look.jpg" alt="Sphinx default look" /></p>
<p>Now have a look at the content of the <code class="language-plaintext highlighter-rouge">source/index.rst</code> file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.. Better documentation documentation master file, created by
   sphinx-quickstart on Mon Jun 29 11:00:21 2015.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Better documentation's documentation!
================================================

Contents:

.. toctree::
   :maxdepth: 2



Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
</code></pre></div></div>
<p>The first section is a comment (the section starting with <code class="language-plaintext highlighter-rouge">..</code>). This is
followed by a header (denoted by the <code class="language-plaintext highlighter-rouge">=</code> underline). The <code class="language-plaintext highlighter-rouge">.. toctree::</code>
section is Sphinx’s way of denoting that a list of other files should be included
(at the moment we have none). Finally, in the <code class="language-plaintext highlighter-rouge">Indices and tables</code> section
there are links to index, module and search pages. If you are documenting a
Python package the module page will contain links to the modules in your
package.</p>
<h2 id="adding-some-more-content">Adding some more content</h2>
<p>Let us add some more content. Create the file <code class="language-plaintext highlighter-rouge">source/intro.rst</code> and copy and
paste the text below into it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Introduction
============
The purpose of this project is to help scientists write better documentation.
</code></pre></div></div>
<p>Now add a link to it in <code class="language-plaintext highlighter-rouge">source/index.rst</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Contents:

.. toctree::
   :maxdepth: 2

   intro
</code></pre></div></div>
<p>Note that the reference of the file to be included does not need the <code class="language-plaintext highlighter-rouge">.rst</code> extension.
Furthermore it needs to be indented to the same level as <code class="language-plaintext highlighter-rouge">:maxdepth:</code> (by
default this is three spaces). The latter has caught me out many times as I tend to
indent four spaces.</p>
<p>If you rebuild the documentation using <code class="language-plaintext highlighter-rouge">make html</code> you will see the content
of the <code class="language-plaintext highlighter-rouge">source/intro.rst</code> file included in the documentation.</p>
<h2 id="restructuredtext-markup">reStructuredText markup</h2>
<p>You may have noticed that we can create headers by underlining them with
special characters. Sphinx uses reStructuredText as a markup language.
For a quick introduction to the reStructuredText syntax have a look at
<a href="http://docutils.sourceforge.net/docs/user/rst/quickstart.html">A ReStructuredText Primer</a>
followed by
<a href="http://docutils.sourceforge.net/docs/user/rst/quickref.html">Quick reStructuredText</a>.
Another good source is Sphinx’s
<a href="http://sphinx-doc.org/rest.html">ReStructuredText Primer</a>.</p>
<h2 id="including-code-snippets-in-the-documentation">Including code snippets in the documentation</h2>
<p>Sphinx has taken advantage of the fact that reStructuredText is extensible and
has added directives of its own. We have already seen one of these: the <code class="language-plaintext highlighter-rouge">toctree</code>
directive.</p>
<p>Let us have a look at the <code class="language-plaintext highlighter-rouge">code-block</code> directive, which can be used to include
code snippets. Create the file <code class="language-plaintext highlighter-rouge">source/code_example.rst</code> and add the text
below to it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Code example
============

Here is a Python function.

.. code-block:: python

    def greet(name):
        print("Hello {}".format(name))

Here is a C function.

.. code-block:: C

    int add(int a, int b) {
        return a + b;
    }
</code></pre></div></div>
<p>Remember to include the file into the <code class="language-plaintext highlighter-rouge">toctree</code> of <code class="language-plaintext highlighter-rouge">source/index.rst</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.. toctree::
   :maxdepth: 2

   intro
   code_example
</code></pre></div></div>
<p>Use <code class="language-plaintext highlighter-rouge">make html</code> to build the documentation and behold the beautifully
generated code snippets included in your documentation.</p>
<p>It is also possible to include whole files of source code in your
documentation. Copy and paste the text below into a file named
<code class="language-plaintext highlighter-rouge">source/example_script.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""This is an example script."""</span>

<span class="kn">import</span> <span class="nn">sys</span>


<span class="k">def</span> <span class="nf">greet</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
    <span class="s">"""Return greeting."""</span>
    <span class="k">return</span> <span class="s">"Hello {}!"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">name</span><span class="p">)</span>


<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="n">name</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
    <span class="k">print</span><span class="p">(</span><span class="n">greet</span><span class="p">(</span><span class="n">name</span><span class="p">))</span>
</code></pre></div></div>
<p>Now we will use Sphinx’s <code class="language-plaintext highlighter-rouge">literalinclude</code> directive to include the content of this script
into the “Code example” page. Add the lines below to the end of the
<code class="language-plaintext highlighter-rouge">source/code_example.rst</code> file.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Below is the content of a Python sample script.

.. literalinclude:: example_script.py
   :language: python
</code></pre></div></div>
<p>Sphinx also has several options for styling the display of your code snippets.
For example you can add line numbers and emphasize particular lines. For more
inspiration on how to include code snippets in your documentation have a look
at
<a href="http://sphinx-doc.org/markup/code.html#includes">Showing code examples</a>
in the Sphinx documentation.</p>
<h2 id="generating-api-documentaiton-for-python-projects">Generating API documentation for Python projects</h2>
<p>Sphinx has particularly good support for documenting Python projects.</p>
<p>Let us create a module named <code class="language-plaintext highlighter-rouge">chemistry</code> for us to document at the root level
of the project.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd ../
$ mkdir chemistry
$ ls
README
docs
chemistry
</code></pre></div></div>
<p>Create the file <code class="language-plaintext highlighter-rouge">chemistry/__init__.py</code> and add the code below to it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""Basic chemistry module.
The :mod:`chemistry` module contains three classes:
- :class:`chemistry.Atom`
- :class:`chemistry.Bond`
- :class:`chemistry.Molecule`
One can use the :func:`chemistry.Molecule.add_atom` and
:func:`chemistry.Molecule.add_bond` functions to build up a molecule.
Example illustrating how to create a methane molecule.
>>> from chemistry import Molecule
>>> mol = Molecule('Methane')
>>> carbon_index = mol.add_atom(atomic_number=6)
>>> hydrogen1_index = mol.add_atom(atomic_number=1)
>>> hydrogen2_index = mol.add_atom(atomic_number=1)
>>> hydrogen3_index = mol.add_atom(atomic_number=1)
>>> hydrogen4_index = mol.add_atom(atomic_number=1)
>>> bond1_index = mol.add_bond(carbon_index, hydrogen1_index)
>>> bond2_index = mol.add_bond(carbon_index, hydrogen2_index)
>>> bond3_index = mol.add_bond(carbon_index, hydrogen3_index)
>>> bond4_index = mol.add_bond(carbon_index, hydrogen4_index)
"""</span>

<span class="k">class</span> <span class="nc">Atom</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Class representing an atom."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">atomic_number</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">atomic_number</span> <span class="o">=</span> <span class="n">atomic_number</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">bonds</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">bond_to</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other_atom</span><span class="p">):</span>
        <span class="s">"""Return the :class:`chemistry.Bond` formed between the two atoms.

        :param other_atom: :class:`chemistry.Atom` to form :class:`chemistry.Bond` to
        :returns: :class:`chemistry.Bond`
        """</span>
        <span class="n">bond</span> <span class="o">=</span> <span class="n">Bond</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">other_atom</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">bonds</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">bond</span><span class="p">)</span>
        <span class="n">other_atom</span><span class="o">.</span><span class="n">bonds</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">bond</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">bond</span>


<span class="k">class</span> <span class="nc">Bond</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Class representing a bond between two atoms."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">atom1</span><span class="p">,</span> <span class="n">atom2</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">atoms</span> <span class="o">=</span> <span class="p">(</span><span class="n">atom1</span><span class="p">,</span> <span class="n">atom2</span><span class="p">)</span>


<span class="k">class</span> <span class="nc">Molecule</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Class representing a molecule consisting of atoms and bonds."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">identifier</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">identifier</span> <span class="o">=</span> <span class="n">identifier</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">atoms</span> <span class="o">=</span> <span class="p">[]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">bonds</span> <span class="o">=</span> <span class="p">[]</span>

    <span class="k">def</span> <span class="nf">add_atom</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">atomic_number</span><span class="p">):</span>
        <span class="s">"""Return the list index of the atom added to the molecule.

        :param atomic_number: atomic number of the atom to be added
        :returns: index of the atom in the molecule
        """</span>
        <span class="n">atom</span> <span class="o">=</span> <span class="n">Atom</span><span class="p">(</span><span class="n">atomic_number</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">atoms</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">atom</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">atoms</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span>

    <span class="k">def</span> <span class="nf">add_bond</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">atom1_index</span><span class="p">,</span> <span class="n">atom2_index</span><span class="p">):</span>
        <span class="s">"""Return the list index of the bond added to the molecule.

        :param atom1_index: atom's index in molecule
        :param atom2_index: atom's index in molecule
        :returns: index of the bond in the molecule
        """</span>
        <span class="n">atom1</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">atoms</span><span class="p">[</span><span class="n">atom1_index</span><span class="p">]</span>
        <span class="n">atom2</span> <span class="o">=</span> <span class="bp">self</span><span class="o">.</span><span class="n">atoms</span><span class="p">[</span><span class="n">atom2_index</span><span class="p">]</span>
        <span class="n">bond</span> <span class="o">=</span> <span class="n">atom1</span><span class="o">.</span><span class="n">bond_to</span><span class="p">(</span><span class="n">atom2</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">bonds</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">bond</span><span class="p">)</span>
        <span class="k">return</span> <span class="nb">len</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">bonds</span><span class="p">)</span> <span class="o">-</span> <span class="mi">1</span>
</code></pre></div></div>
<p>We will now use Sphinx’s <code class="language-plaintext highlighter-rouge">autodoc</code> functionality to generate API
documentation for this module. First of all we need to add the
<code class="language-plaintext highlighter-rouge">sphinx.ext.autodoc</code> extension to the <code class="language-plaintext highlighter-rouge">docs/source/conf.py</code> file.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
</span><span class="n">extensions</span> <span class="o">=</span> <span class="p">[</span><span class="s">'sphinx.ext.autodoc'</span><span class="p">]</span>
</code></pre></div></div>
<p>In the same file we also need to specify the path to the module that we want to
generate documentation for.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
</span><span class="n">sys</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">insert</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">abspath</span><span class="p">(</span><span class="s">'../../'</span><span class="p">))</span>
</code></pre></div></div>
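<p>To see why <code class="language-plaintext highlighter-rouge">'../../'</code> is the right value here, it helps to know that Sphinx executes <code class="language-plaintext highlighter-rouge">conf.py</code> with the working directory set to the directory that contains it, so the relative path is resolved from <code class="language-plaintext highlighter-rouge">docs/source</code>. The sketch below mimics that resolution. The helper name <code class="language-plaintext highlighter-rouge">project_root_from</code> and the example path are made up for illustration only.</p>

```python
import posixpath  # the implementation behind os.path on Linux and macOS


def project_root_from(conf_dir):
    """Resolve '../../' relative to conf_dir, roughly as conf.py does.

    The helper name and the example path below are hypothetical.
    """
    # abspath('../../') joins the relative path onto the working directory
    # and then collapses the '..' components.
    return posixpath.normpath(posixpath.join(conf_dir, "..", ".."))


print(project_root_from("/home/user/myproject/docs/source"))
```

<p>The resulting directory is the one that contains the <code class="language-plaintext highlighter-rouge">chemistry</code> package, which is what <code class="language-plaintext highlighter-rouge">autodoc</code> needs in order to import it. If your <code class="language-plaintext highlighter-rouge">conf.py</code> does not already import <code class="language-plaintext highlighter-rouge">os</code> and <code class="language-plaintext highlighter-rouge">sys</code>, add those imports at the top of the file.</p>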
<p>Now create the file <code class="language-plaintext highlighter-rouge">docs/source/api.rst</code> and copy and paste the text below
into it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>API documentation
=================

.. automodule:: chemistry
    :members:
</code></pre></div></div>
<p>We also need to remember to include the <code class="language-plaintext highlighter-rouge">api.rst</code> file in the <code class="language-plaintext highlighter-rouge">toctree</code>.
Edit the <code class="language-plaintext highlighter-rouge">docs/source/index.rst</code> file to match the below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.. toctree::
   :maxdepth: 2

   intro
   code_example
   api
</code></pre></div></div>
<p>Finally, regenerate the documentation by running <code class="language-plaintext highlighter-rouge">make html</code> in the <code class="language-plaintext highlighter-rouge">docs</code>
directory and behold the beautifully generated API documentation.</p>
<p>If you interact with the generated HTML documentation you will note that the
constructs following the pattern below have been converted into hyperlinks.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:mod:`chemistry`
:class:`chemistry.Molecule`
:func:`chemistry.Molecule.add_atom`
</code></pre></div></div>
<p>These roles can be used anywhere in your documentation to link to the
relevant section in the API documentation. Having descriptive documentation
that contains links to the more technical API documentation is very
pleasant and these roles make it very easy to do so.</p>
<p>It is also worth commenting on the <code class="language-plaintext highlighter-rouge">:param:</code> and <code class="language-plaintext highlighter-rouge">:returns:</code> fields
used in the docstrings. These are part of a larger set of info fields
that are formatted nicely by Sphinx. For more information have a look at the
<a href="http://sphinx-doc.org/domains.html#info-field-lists">info field list section</a>
in the Sphinx documentation.</p>
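<p>As an illustration of a few more of these info fields, the sketch below documents a small function using <code class="language-plaintext highlighter-rouge">:type:</code>, <code class="language-plaintext highlighter-rouge">:rtype:</code> and <code class="language-plaintext highlighter-rouge">:raises:</code> in addition to <code class="language-plaintext highlighter-rouge">:param:</code> and <code class="language-plaintext highlighter-rouge">:returns:</code>. The function <code class="language-plaintext highlighter-rouge">count_atoms</code> is hypothetical and not part of the <code class="language-plaintext highlighter-rouge">chemistry</code> module above.</p>

```python
def count_atoms(atomic_numbers, atomic_number):
    """Return how many atoms in a sequence have the given atomic number.

    This function is a made-up example for showing info fields.

    :param atomic_numbers: atomic numbers of the atoms in a molecule
    :type atomic_numbers: list of int
    :param atomic_number: atomic number to count
    :type atomic_number: int
    :returns: number of atoms with the given atomic number
    :rtype: int
    :raises ValueError: if ``atomic_number`` is not a positive integer
    """
    if atomic_number < 1:
        raise ValueError("atomic_number must be a positive integer")
    # Count the entries that match the requested atomic number.
    return sum(1 for number in atomic_numbers if number == atomic_number)


# Methane: one carbon (6) and four hydrogens (1).
print(count_atoms([6, 1, 1, 1, 1], atomic_number=1))
```

<p>When a function documented like this is picked up by <code class="language-plaintext highlighter-rouge">autodoc</code>, Sphinx groups the fields into parameter, return and exception sections.</p>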
<h2 id="what-about-the-original-readme-file">What about the original README file?</h2>
<p>Let us finish off by including the content of the original <code class="language-plaintext highlighter-rouge">README</code> file
into the generated HTML documentation.</p>
<p>Create the file <code class="language-plaintext highlighter-rouge">docs/source/README.rst</code> and copy and paste the text below into it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.. include:: ../../README
</code></pre></div></div>
<p>This will include the content of the top level <code class="language-plaintext highlighter-rouge">README</code> file into the documentation.</p>
<h2 id="styling-the-documentaiton">Styling the documentation</h2>
<p>The default theme of Sphinx is currently
<a href="https://github.com/bitprophet/alabaster">Alabaster</a>. It is very beautiful. However,
personally I prefer the
<a href="https://github.com/snide/sphinx_rtd_theme">Sphinx ReadTheDocs theme</a>,
in particular because of its left-hand side navigation bar. Let’s check it out.</p>
<p>First of all we install the theme using <code class="language-plaintext highlighter-rouge">pip</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pip install sphinx_rtd_theme
</code></pre></div></div>
<p>Now we need to edit the theme section in <code class="language-plaintext highlighter-rouge">docs/source/conf.py</code> to look like the
below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
#html_theme = 'alabaster'
</span>
<span class="c1"># on_rtd is whether we are on readthedocs.org
</span><span class="n">on_rtd</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">environ</span><span class="o">.</span><span class="n">get</span><span class="p">(</span><span class="s">'READTHEDOCS'</span><span class="p">,</span> <span class="bp">None</span><span class="p">)</span> <span class="o">==</span> <span class="s">'True'</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">on_rtd</span><span class="p">:</span>  <span class="c1"># only import and set the theme if we're building docs locally
</span>    <span class="kn">import</span> <span class="nn">sphinx_rtd_theme</span>
    <span class="n">html_theme</span> <span class="o">=</span> <span class="s">'sphinx_rtd_theme'</span>
    <span class="n">html_theme_path</span> <span class="o">=</span> <span class="p">[</span><span class="n">sphinx_rtd_theme</span><span class="o">.</span><span class="n">get_html_theme_path</span><span class="p">()]</span>
<span class="c1"># otherwise, readthedocs.org uses their theme by default, so no need to specify it
</span></code></pre></div></div>
<p>Note that the code above includes some logic for handling the cases where one
hosts the documentation on <a href="https://readthedocs.org">readthedocs</a>.</p>
<p>Regenerate the documentation by running <code class="language-plaintext highlighter-rouge">make html</code> in the <code class="language-plaintext highlighter-rouge">docs</code> directory
and explore the look and feel of this new theme. Note in particular the
behaviour of the left hand side navigation bar and the clear “next” and
“previous” buttons at the bottom of each page.</p>
<p><img src="/images/rtd_theme.jpg" alt="Sphinx RTD theme" /></p>
<p>For more information on the ReadTheDocs theme have a look
<a href="https://read-the-docs.readthedocs.org/en/latest/theme.html">here</a>.</p>
<h2 id="readthedocs">ReadTheDocs</h2>
<p>Whilst on the subject it is worth mentioning the ability to host your documentation
on <a href="https://readthedocs.org">readthedocs</a>. Simply sign-up for an account, link your
GitHub/BitBucket account and then you can select the projects that you want to host
on <a href="https://readthedocs.org">readthedocs</a>. It is great!</p>
<p>It is worth noting that if your project documentation includes links to packages
such as <code class="language-plaintext highlighter-rouge">numpy</code> and <code class="language-plaintext highlighter-rouge">scipy</code> you will need to mock these out in the
<code class="language-plaintext highlighter-rouge">conf.py</code> file. For more information have a look at this
<a href="https://read-the-docs.readthedocs.org/en/latest/faq.html#i-get-import-errors-on-libraries-that-depend-on-c-modules">readthedocs faq</a>.
For a real life example have a look at
<a href="https://github.com/JIC-CSB/jicimagelib/blob/master/docs/source/conf.py">this conf.py</a>.</p>
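<p>The general idea behind mocking can be sketched as follows: before the documented package is imported, stand-in objects are registered under the names of the modules that cannot be installed on the build server. This is a minimal sketch assuming Python 3 (for <code class="language-plaintext highlighter-rouge">unittest.mock</code>); the list name <code class="language-plaintext highlighter-rouge">MOCK_MODULES</code> is just a convention chosen here, and the linked FAQ remains the authoritative description.</p>

```python
import sys
from unittest.mock import MagicMock

# Modules assumed to be unavailable on the documentation build server.
MOCK_MODULES = ["numpy", "scipy"]

for module_name in MOCK_MODULES:
    # Any later import of these names receives the stand-in object.
    sys.modules[module_name] = MagicMock()

import numpy  # resolves to the mock registered above

print(type(numpy).__name__)
```

<p>In a real project the loop would live in <code class="language-plaintext highlighter-rouge">conf.py</code>, above the point where <code class="language-plaintext highlighter-rouge">autodoc</code> imports your package. The <code class="language-plaintext highlighter-rouge">autodoc</code> extension also provides an <code class="language-plaintext highlighter-rouge">autodoc_mock_imports</code> setting that can be used for a similar purpose.</p>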
<h2 id="further-reading">Further reading</h2>
<p>I hope this post has inspired you to try out Sphinx. It is a wonderful tool for
generating beautiful documentation!</p>
<p>Below are a couple of links to other resources on how to use Sphinx.</p>
<ul>
<li><a href="http://sphinx-doc.org/index.html">Official Sphinx documentation</a></li>
<li><a href="https://pythonhosted.org/an_example_pypi_project/sphinx.html">Andrew Carter’s: Documenting Your Project Using Sphinx</a></li>
</ul>
Five tips to help you document your coding project2015-06-28T00:00:00+00:00http://tjelvarolsson.com/blog/five-tips-to-help-you-document-your-coding-project<p><em>Do you want other people to make use of the project that you are working on?</em></p>
<p>If you answered <strong>yes</strong> to the question above you need to write some form of
documentation outlining how to make use of it.</p>
<p><em>Do you enjoy writing documentation?</em></p>
<p>If you answered <strong>no</strong> to the question above please read on to find some tips
to make the experience more enjoyable.</p>
<h2 id="tip-1-start-early-and-start-small">Tip 1: Start early and start small</h2>
<p>A common scenario is to treat documentation as an afterthought. However, if a
project is nearing completion and one has no documentation, the thought of
writing it can be daunting. As a result one never starts working on the BIG
documentation task, but rather spends time on smaller and more satisfying tasks
such as adding nice-to-have features.</p>
<p>The solution is to start early and to start small. Before you write any code
create a <code class="language-plaintext highlighter-rouge">README</code> file and include a sentence stating what problem the project
solves.</p>
<p>Once you have some code add some basic instructions on how to run it to the
<code class="language-plaintext highlighter-rouge">README</code> file.</p>
<h2 id="tip-2-include-documentation-in-your-definition-of-done">Tip 2: Include documentation in your definition of done</h2>
<p>Suppose that you have implemented a new feature. You are proud of it. You have
even written tests for it! Don’t stop there! Complete the task by writing some
descriptive documentation outlining how to make use of the feature.
Furthermore, if you have release notes add a bullet point with a link to the
section that you have just written.</p>
<h2 id="tip-3-reap-the-benefits-of-explaining-your-code-to-someone-else">Tip 3: Reap the benefits of explaining your code to someone else</h2>
<p>Writing documentation is an act of trying to explain something to someone else.
What often happens when one tries to explain a solution to someone else is that
one finds the solution lacking or suboptimal. I often find that the
act of documenting a feature results in me realising that the feature is not
actually fit for purpose in its current state - giving me the opportunity to fix
it before it is released.</p>
<p>Discovering improvements by writing documentation is similar to
<a href="https://en.wikipedia.org/wiki/Rubber_duck_debugging">rubber duck debugging</a>,
where one tries to discover the source of a bug by explaining code line by line
to a rubber duck.</p>
<h2 id="tip-4-store-your-documentation-alongside-your-code-in-version-control-as-plain-text-files">Tip 4: Store your documentation alongside your code in version control as plain text files</h2>
<p>Documentation should be stored alongside your code in version control as plain
text files.</p>
<p>Storing your code and documentation in the same repository allows them to be kept
in line with each other.</p>
<p>The benefits of plain text files are outlined in
<a href="https://pragprog.com/book/tpp/the-pragmatic-programmer">The Pragmatic Programmer</a>.
In fact the book has an entire chapter devoted to it.</p>
<p><em>What is so special about plain text files?</em></p>
<p>In short: they are portable, easy to use and there is no lock-in. For a more
extensive answer have a look at CM Smith’s Lifehack post
<a href="http://www.lifehack.org/articles/technology/why-geeks-love-plain-text-and-why-you-should-too.html">Why Geeks Love Plain Text (And Why You Should Too)</a>.</p>
<h2 id="tip-5-make-use-of-tools-that-can-convert-your-plain-text-files-to-beautifully-formatted-documents">Tip 5: Make use of tools that can convert your plain text files to beautifully formatted documents</h2>
<p>Although text files have many advantages they are not ideal for consuming
(reading) documentation. When reading documentation you want it to be pleasant
on the eye and easy to navigate.</p>
<p>I highly recommend using
<a href="http://sphinx-doc.org">Sphinx</a>;
it is a great tool for writing technical documentation. It can produce a range
of output formats including HTML and PDF. It has great support for
cross-referencing and the HTML output has built-in support for searching.
Furthermore, if you use Sphinx you can host your documentation on
<a href="https://readthedocs.org">Read the Docs</a>. I will explain how to use Sphinx
in my
<a href="/blog/how-to-generate-beautiful-technical-documentation/">next post</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Documentation is a sign that someone cares about a project. This makes it
easier for other people to care about it too.</p>
<p>I hope that you found this post useful. If nothing else I hope that it has
given you the motivation to add a <code class="language-plaintext highlighter-rouge">README</code> file to your current project with
a line explaining what problem the project solves.</p>
Test-driven development for scientists2015-06-13T00:00:00+00:00http://tjelvarolsson.com/blog/test-driven-develpment-for-scientists<figure>
<img src="/images/tdd_cycle.jpg" alt="Test-driven development cycle." />
<figcaption>
The test-driven development cycle.
</figcaption>
</figure>
<h2 id="introduction">Introduction</h2>
<p>In
<a href="/blog/three-essential-tips-for-improving-your-scientific-code/">Three essential tips for improving your scientific code</a>
I talked about the importance of writing tests for your scientific code base.
Tests provide a means to verify that new code does what it is intended to do
and a means to alert you if you inadvertently break an existing piece of
functionality when modifying the code base.</p>
<p>Furthermore, if you have a well tested code base you feel less scared of making
changes to it. Whilst coding have you ever thought to yourself:</p>
<blockquote>
I could really do with re-writing this to make it simpler, but I'm not sure what
else I would break....
</blockquote>
<p>If your code base had better test coverage you would not feel this way.
Having tests gives you the ability to make sweeping changes to the code
whilst retaining confidence that you have not broken any vital piece of
functionality.</p>
<p>Tests also provide a type of living documentation of your code base, a
specification of how the code is intended to work.</p>
<p>In fact tests are so important that some people write them before they write
any code in a method known as test-driven development.</p>
<p>In this post we will make use of the skills we built up in
<a href="/blog/four-tools-for-testing-your-python-code/">Four tools for testing your Python code</a>
to explore test-driven development. We will use test-driven development to
create a Python FASTA parser package.</p>
<h2 id="what-is-test-driven-development">What is test-driven development?</h2>
<p>Test-driven development, often abbreviated as TDD, can be thought of as a three
step process.</p>
<ol>
<li>Write a test for the functionality that you have in mind and watch it fail</li>
<li>Write minimal code to make the test pass</li>
<li>Refactor the code if required</li>
</ol>
<p>Don’t worry if the above sounds a bit abstract. The purpose of the rest of this
post is to illustrate how this works in practice.</p>
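<p>As a small preview, the sketch below shows the end state of one pass through the cycle. The test was written first and failed because the function did not exist; the function body is the minimal code needed to make it pass. Both <code class="language-plaintext highlighter-rouge">reverse_complement</code> and its test are hypothetical examples and not part of the FASTA parser developed later in this post.</p>

```python
import unittest


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence (step 2: minimal code)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(sequence))


class ReverseComplementTests(unittest.TestCase):
    """Step 1: this test existed, and failed, before the function was written."""

    def test_reverse_complement(self):
        self.assertEqual(reverse_complement("ATGC"), "GCAT")


# Run the test case explicitly rather than relying on unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(ReverseComplementTests)
result = unittest.TextTestRunner().run(suite)
```

<p>With the test passing, step 3 would be to tidy up the implementation if needed, re-running the test after each change to make sure it still passes.</p>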
<h2 id="what-are-the-benefits-of-test-driven-development">What are the benefits of test-driven development?</h2>
<p>The three main reasons I love test-driven development are:</p>
<ul>
<li>It makes me think about how I want my code to behave up front</li>
<li>It makes me write tests</li>
<li>It is fun</li>
</ul>
<p>Of course I could write tests after having implemented a piece of code.
However, in practice when I code first and test later, the “test later” rarely
happens.</p>
<p>This may sound silly, but it is not much fun writing a test for something that
already works. It feels like a menial task. On the other hand, writing a test
before an implementation exists stimulates my brain; I have to think about how
I want my code to behave.</p>
<p>Furthermore, a failing test is like a challenge. In writing a failing test I am
giving myself a tiny puzzle to solve. The test-driven development cycle
essentially gamifies my working day, with the positive side-effect of producing
an extensive test suite.</p>
<p>For a more exhaustive list of benefits of test-driven development have a look at
Mark Levison’s post:
<a href="http://agilepainrelief.com/notesfromatooluser/2008/10/advantages-of-tdd.html">Advantages of TDD</a>.</p>
<p>If you are interested in this topic I also recommend reading Kane Mar’s three part post:
<a href="http://scrumology.com/the-benefits-of-tdd-are-neither-clear-nor-are-they-immediately-apparent/">The benefits of TDD are neither clear nor are they immediately apparent</a>.</p>
<h2 id="spiking">Spiking</h2>
<p>It is not wrong to develop code without tests. However, if you are doing
test-driven development you should treat such exploratory code as “throw away”
and use it as a guide to write tests when doing things properly. In this
context “properly” means writing the tests first. People who practise
test-driven development refer to such exploratory coding as a
<a href="http://stackoverflow.com/questions/249969/why-are-tdd-spikes-called-spikes">spike</a>.
Here we will treat the exploration from the
<a href="2015-03-22-object-oriented-programming-for-scientists">previous FASTA post</a> as
a spike.</p>
<h2 id="creating-a-project-template">Creating a project template</h2>
<p>We will start by creating a project template using
<a href="/blog/using-cookiecutter-a-passive-code-generator/">cookiecutter</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
...
repo_name (default is "mypackage")? tinyfasta
version (default is "0.0.1")?
authors (default is "Tjelvar Olsson")?
...
$ cd tinyfasta
</code></pre></div></div>
<p>We then set up a
<a href="/blog/begginers-guide-creating-clean-python-development-environments/">clean Python development environment</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ virtualenv ~/virtualenvs/tinyfasta
$ source ~/virtualenvs/tinyfasta/bin/activate
(tinyfasta)$ python setup.py develop
</code></pre></div></div>
<p>Note that you can view this project and its progression on
<a href="https://github.com/tjelvar-olsson/tinyfasta">GitHub</a>.</p>
<h2 id="start-with-a-functional-test">Start with a functional test</h2>
<p>When practising test-driven development it is often useful to start with a
functional test. A functional test differs from a unit test in that it tests
a slice of functionality in the system as opposed to an individual unit.
The rationale for starting with a functional test is that it allows us to take a
step back and think about the larger picture.</p>
<p>We can translate the learning from our spike into a functional test. The code
below parses FASTA records from the <code class="language-plaintext highlighter-rouge">dummy.fasta</code> file and writes the records
to another file <code class="language-plaintext highlighter-rouge">tmp.fasta</code>. The test then ensures that the contents of the
two files are identical.</p>
<p><a href="https://github.com/tjelvar-olsson/tinyfasta/blob/ebcc524fed6f596cc749e9bbb439ab47f4398aeb/tests/tests.py">ebcc524 <code class="language-plaintext highlighter-rouge">tests/tests.py</code></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">test_output_is_consistent_with_input</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="kn">from</span> <span class="nn">tinyfasta</span> <span class="kn">import</span> <span class="n">FastaParser</span>
        <span class="n">input_fasta</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">DATA_DIR</span><span class="p">,</span> <span class="s">"dummy.fasta"</span><span class="p">)</span>
        <span class="n">output_fasta</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">TMP_DIR</span><span class="p">,</span> <span class="s">"tmp.fasta"</span><span class="p">)</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_fasta</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
            <span class="k">for</span> <span class="n">fasta_record</span> <span class="ow">in</span> <span class="n">FastaParser</span><span class="p">(</span><span class="n">input_fasta</span><span class="p">):</span>
                <span class="n">fh</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">"{}</span><span class="se">\n</span><span class="s">"</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">fasta_record</span><span class="p">))</span>
        <span class="n">input_data</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_fasta</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
        <span class="n">output_data</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_fasta</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">input_data</span><span class="p">,</span> <span class="n">output_data</span><span class="p">)</span>
</code></pre></div></div>
<p>Here is a link to the input FASTA file
<a href="https://github.com/tjelvar-olsson/tinyfasta/blob/ebcc524fed6f596cc749e9bbb439ab47f4398aeb/tests/data/dummy.fasta">tests/data/dummy.fasta</a>.</p>
<h2 id="start-building-up-functionality-using-unit-tests">Start building up functionality using unit tests</h2>
<p>Another reason for starting with a functional test is that it can act as a
guide for what to implement. When we run the functional test we immediately
find out that we need a class named <code class="language-plaintext highlighter-rouge">FastaParser</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
  File "/Users/olssont/junk/tinyfasta/tests/tests.py", line 31, in test_output_is_consistent_with_input
    from tinyfasta import FastaParser
ImportError: cannot import name FastaParser
</code></pre></div></div>
<p>At this point we add a unit test for initialising a <code class="language-plaintext highlighter-rouge">FastaParser</code> instance.</p>
<p><a href="https://github.com/tjelvar-olsson/tinyfasta/blob/a6d2253090fc56b7bd78b146ccf13dc00374fd03/tests/tests.py">a6d2253 <code class="language-plaintext highlighter-rouge">tests/tests.py</code></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">test_FastaParser_initialisation</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="kn">from</span> <span class="nn">tinyfasta</span> <span class="kn">import</span> <span class="n">FastaParser</span>
        <span class="n">fasta_parser</span> <span class="o">=</span> <span class="n">FastaParser</span><span class="p">(</span><span class="s">'test.fasta'</span><span class="p">)</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">fasta_parser</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'test.fasta'</span><span class="p">)</span>
</code></pre></div></div>
<p>After having run the test and watched it fail we add minimal code to make the
unit test pass.</p>
<p><a href="https://github.com/tjelvar-olsson/tinyfasta/blob/a6d2253090fc56b7bd78b146ccf13dc00374fd03/tinyfasta/__init__.py">a6d2253 <code class="language-plaintext highlighter-rouge">tinyfasta/__init__.py</code></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">FastaParser</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Class for parsing FASTA files."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">fpath</span><span class="p">):</span>
        <span class="s">"""Initialise an instance of the FastaParser."""</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">fpath</span>
</code></pre></div></div>
<p>The implementation makes the unit test pass. So we continue by running the
functional test again.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
  File "/Users/olssont/junk/tinyfasta/tests/tests.py", line 40, in test_output_is_consistent_with_input
    for fasta_record in FastaParser(input_fasta):
TypeError: 'FastaParser' object is not iterable
</code></pre></div></div>
<p>Okay, so we need a test to make sure that the class is iterable.</p>
<p><a href="https://github.com/tjelvar-olsson/tinyfasta/blob/abfdeeea35fe5c5143a1a4831a1ce4d7523b3515/tests/tests.py">abfdeee <code class="language-plaintext highlighter-rouge">tests/tests.py</code></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">test_FastaParser_is_iterable</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="kn">from</span> <span class="nn">tinyfasta</span> <span class="kn">import</span> <span class="n">FastaParser</span>
<span class="n">fasta_parser</span> <span class="o">=</span> <span class="n">FastaParser</span><span class="p">(</span><span class="s">'test.fasta'</span><span class="p">)</span>
<span class="bp">self</span><span class="o">.</span><span class="n">assertTrue</span><span class="p">(</span><span class="nb">hasattr</span><span class="p">(</span><span class="n">fasta_parser</span><span class="p">,</span> <span class="s">'__iter__'</span><span class="p">))</span>
</code></pre></div></div>
<p>At this point it may be worth reflecting on how we should make this test pass.
In test-driven development we want to add minimal implementation to get the
tests to pass. The code below is pretty minimal and it makes the test pass.</p>
<p><a href="https://github.com/tjelvar-olsson/tinyfasta/blob/abfdeeea35fe5c5143a1a4831a1ce4d7523b3515/tinyfasta/__init__.py">abfdeee <code class="language-plaintext highlighter-rouge">tinyfasta/__init__.py</code></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""Yield FastaRecord instances."""</span>
<span class="k">yield</span> <span class="bp">None</span>
</code></pre></div></div>
<p>As the docstring above suggests we want the <code class="language-plaintext highlighter-rouge">FastaParser</code> to yield
<code class="language-plaintext highlighter-rouge">FastaRecord</code> instances. So at this point we can start building up the
<code class="language-plaintext highlighter-rouge">FastaRecord</code> class using small incremental steps of test and code. To get a
feel for this have a look at the commits:</p>
<ul>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/cd34f8b862974abbe1fad096a89ffc34b537c22b">cd34f8b Added FastaRecord class.</a></li>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/ef95643fa53e60569a6f48b3275d57b56b5400a0">ef95643 Added sequence logic to FastaRecord.</a></li>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/873798b7c98606285ea809f9ea87f849bffd59e4">873798b Added string representation of FastaRecord class.</a></li>
</ul>
<p>At this point we have all the functionality we need to add a proper
implementation of the <code class="language-plaintext highlighter-rouge">FastaParser.__iter__()</code> method, which we hope will
make the functional test pass.</p>
<p><a href="https://github.com/tjelvar-olsson/tinyfasta/blob/75e32726b83a674b00bc2eee70d8f7ea3f9906c4/tinyfasta/__init__.py">75e3272 <code class="language-plaintext highlighter-rouge">tinyfasta/__init__.py</code></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""Yield FastaRecord instances."""</span>
<span class="n">fasta_record</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">fh</span><span class="p">:</span>
<span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'>'</span><span class="p">):</span>
<span class="k">if</span> <span class="n">fasta_record</span><span class="p">:</span>
<span class="k">yield</span> <span class="n">fasta_record</span>
<span class="n">fasta_record</span> <span class="o">=</span> <span class="n">FastaRecord</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">fasta_record</span><span class="o">.</span><span class="n">add_sequence_line</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">yield</span> <span class="n">fasta_record</span>
</code></pre></div></div>
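<p>To see the parsing logic in isolation, here is a minimal, self-contained sketch. The <code class="language-plaintext highlighter-rouge">FastaRecord</code> class below is a simplified stand-in written for this example (the real one is built up in the commits linked above), and the <code class="language-plaintext highlighter-rouge">parse_fasta</code> function name and the use of <code class="language-plaintext highlighter-rouge">io.StringIO</code> in place of a file on disk are assumptions made for illustration.</p>

```python
import io


class FastaRecord(object):
    """Simplified stand-in for the tinyfasta FastaRecord (illustration only)."""

    def __init__(self, description):
        self.description = description.strip()
        self.sequence_lines = []

    def add_sequence_line(self, line):
        self.sequence_lines.append(line.strip())

    @property
    def sequence(self):
        return "".join(self.sequence_lines)


def parse_fasta(file_handle):
    """Yield FastaRecord instances from an open file handle."""
    fasta_record = None
    for line in file_handle:
        if line.startswith(">"):
            # A new description line marks the end of the previous record.
            if fasta_record is not None:
                yield fasta_record
            fasta_record = FastaRecord(line)
        elif fasta_record is not None:
            fasta_record.add_sequence_line(line)
    # Yield the final record, unless the input had no records at all.
    if fasta_record is not None:
        yield fasta_record


fasta_text = ">seq1\nACGT\nTTAA\n>seq2\nGGCC\n"
records = list(parse_fasta(io.StringIO(fasta_text)))
```

<p>Unlike the method above, this sketch also guards against empty input and against sequence lines that appear before the first description line. Whether the real package should do the same is a design decision that a new test would have to drive.</p>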
<p>Let us make sure that all the tests pass.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nosetests
........
Name Stmts Miss Cover Missing
-----------------------------------------
tinyfasta 26 0 100%
----------------------------------------------------------------------
Ran 8 tests in 0.027s
OK
</code></pre></div></div>
<p>Great, we have a basic working implementation of our <code class="language-plaintext highlighter-rouge">tinyfasta</code> package.</p>
<h2 id="and-iterate">And iterate</h2>
<p>Now that we have the basics implemented we want to add more functionality, and
by now you know what that means: another test. As we want to add new
functionality we start all over again with another functional test.</p>
<p>In the commit history of the <code class="language-plaintext highlighter-rouge">tinyfasta</code> project one can see how
functionality for searching the FASTA description line was added.</p>
<ul>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/fb685cb82aa947c8f1249120f093ecfb27cf3c50">fb685cb Added functional test for FastaRecord.description_matches search.</a></li>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/cd78c020da21a0a07f06313eca891bac5418e6e6">cd78c02 Added empty description_matches function.</a></li>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/1b89336c960bc46e88593599bd24314c12801d1b">1b89336 Added description_matches implementation.</a></li>
</ul>
<p>Followed by functionality for searching the biological sequence.</p>
<ul>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/5fe617d0016af4ef5c6ecbca8f8670abe57b6f54">5fe617d Added functional test for FastaRecord.sequence_matches function.</a></li>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/e97e623f5d4749b95d794c18e192eca366c4636b">e97e623 Added empty sequence_matches function.</a></li>
<li><a href="https://github.com/tjelvar-olsson/tinyfasta/commit/4f27704d0ca7e85948ae901de5d9d0db37a2d871">4f27704 Added sequence_matches implementation.</a></li>
</ul>
<h2 id="refactoring">Refactoring</h2>
<p>Up until this point we have followed the workflow below:</p>
<ol>
<li>Write a test</li>
<li>Write minimal code to make the test pass</li>
</ol>
<p>However, this is not the whole story as it leaves out an important aspect of
test-driven development: refactoring.</p>
<p>Let us start with a simple example of factoring out code duplication. After
having added functionality for using either strings or compiled regular
expressions to search the description and sequence we notice that there is a
lot of code duplication.</p>
<p><a href="https://github.com/tjelvar-olsson/tinyfasta/blob/e748ac3f3d50a43dcef23f81c8bddb49ed556917/tinyfasta/__init__.py">e748ac3 <code class="language-plaintext highlighter-rouge">tinyfasta/__init__.py</code></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="k">def</span> <span class="nf">description_matches</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">search_term</span><span class="p">):</span>
<span class="s">"""Return True if the search_term is in the description."""</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">search_term</span><span class="p">,</span> <span class="s">"search"</span><span class="p">):</span>
<span class="k">return</span> <span class="n">search_term</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">description</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">description</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">search_term</span><span class="p">)</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">def</span> <span class="nf">sequence_matches</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">search_motif</span><span class="p">):</span>
<span class="s">"""Return True if the motif is in the sequence.
:param search_motif: string or compiled regex
:returns: bool
"""</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">search_motif</span><span class="p">,</span> <span class="s">"search"</span><span class="p">):</span>
<span class="k">return</span> <span class="n">search_motif</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sequence</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">sequence</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">search_motif</span><span class="p">)</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span>
</code></pre></div></div>
<p>As we have been using test-driven development we have tests for all the
functionality of interest. We can therefore refactor the code to the below.</p>
<p><a href="https://github.com/tjelvar-olsson/tinyfasta/blob/2b988b9d8b309ae4de6ae1a953078e834ead724c/tinyfasta/__init__.py">2b988b9 <code class="language-plaintext highlighter-rouge">tinyfasta/__init__.py</code></a></p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code> <span class="o">@</span><span class="nb">staticmethod</span>
<span class="k">def</span> <span class="nf">_match</span><span class="p">(</span><span class="n">string</span><span class="p">,</span> <span class="n">search_term</span><span class="p">):</span>
<span class="s">"""Return True if the search_term is in the string.
:param string: string to be searched
:param search_term: string or compiled regex
:returns: bool
"""</span>
<span class="k">if</span> <span class="nb">hasattr</span><span class="p">(</span><span class="n">search_term</span><span class="p">,</span> <span class="s">"search"</span><span class="p">):</span>
<span class="k">return</span> <span class="n">search_term</span><span class="o">.</span><span class="n">search</span><span class="p">(</span><span class="n">string</span><span class="p">)</span> <span class="ow">is</span> <span class="ow">not</span> <span class="bp">None</span>
<span class="k">return</span> <span class="n">string</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">search_term</span><span class="p">)</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">def</span> <span class="nf">description_matches</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">search_term</span><span class="p">):</span>
<span class="s">"""Return True if the search_term is in the description.
:param search_term: string or compiled regex
:returns: bool
"""</span>
<span class="k">return</span> <span class="n">FastaRecord</span><span class="o">.</span><span class="n">_match</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">description</span><span class="p">,</span> <span class="n">search_term</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">sequence_matches</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">search_motif</span><span class="p">):</span>
<span class="s">"""Return True if the motif is in the sequence.
:param search_motif: string or compiled regex
:returns: bool
"""</span>
<span class="k">return</span> <span class="n">FastaRecord</span><span class="o">.</span><span class="n">_match</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sequence</span><span class="p">,</span> <span class="n">search_motif</span><span class="p">)</span>
</code></pre></div></div>
<p>And run the tests.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ nosetests
......................
Name Stmts Miss Cover Missing
-----------------------------------------
tinyfasta 43 0 100%
----------------------------------------------------------------------
Ran 22 tests in 0.032s
OK
</code></pre></div></div>
<p>As all the tests pass we can have some level of confidence that everything is
still working as intended.</p>
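<p>The behaviour that the tests protect can also be tried out by hand. The sketch below copies the logic of <code class="language-plaintext highlighter-rouge">_match</code> into a standalone function, here called <code class="language-plaintext highlighter-rouge">match</code> purely for illustration, and calls it with both a plain string and compiled regular expressions. The sample description string is made up for this example.</p>

```python
import re


def match(string, search_term):
    """Return True if the search_term is in the string.

    search_term may be a plain string or a compiled regular expression.
    """
    # Compiled regular expressions have a ``search`` method; strings do not.
    if hasattr(search_term, "search"):
        return search_term.search(string) is not None
    return string.find(search_term) != -1


description = ">seq1|contains 2x78 A's"
plain_hit = match(description, "seq1")
regex_hit = match(description, re.compile(r"seq\d+"))
# The description starts with ">", so a pattern anchored at "seq" fails.
regex_miss = match(description, re.compile(r"^seq"))
```

<p>Checking for a <code class="language-plaintext highlighter-rouge">search</code> attribute, rather than for a specific type, means that any object behaving like a compiled regular expression will be accepted.</p>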
<h2 id="improving-the-design-of-the-code">Improving the design of the code</h2>
<p>At some point whilst documenting how to use the <code class="language-plaintext highlighter-rouge">tinyfasta</code> package I realised that
the function names <code class="language-plaintext highlighter-rouge">description_matches</code> and <code class="language-plaintext highlighter-rouge">sequence_matches</code> were a little bit
misleading and that the names <code class="language-plaintext highlighter-rouge">description_contains</code> and <code class="language-plaintext highlighter-rouge">sequence_contains</code> would
be more appropriate. This was a relatively simple change to make, see
<a href="https://github.com/tjelvar-olsson/tinyfasta/commit/0496373038942e31a7afb53549ee5d4e371c0312">commit 0496373</a>.</p>
<p>However, some time later I realised that it would be much nicer if the API of the
<code class="language-plaintext highlighter-rouge">tinyfasta</code> package would allow code that looked like the below. Note that the
<code class="language-plaintext highlighter-rouge">description</code> is no longer a function, but an instance of some sort which has
a <code class="language-plaintext highlighter-rouge">contains</code> function.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">tinyfasta</span> <span class="kn">import</span> <span class="n">FastaParser</span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">fasta_record</span> <span class="ow">in</span> <span class="n">FastaParser</span><span class="p">(</span><span class="s">"tests/data/dummy.fasta"</span><span class="p">):</span>
<span class="o">...</span> <span class="k">if</span> <span class="n">fasta_record</span><span class="o">.</span><span class="n">description</span><span class="o">.</span><span class="n">contains</span><span class="p">(</span><span class="s">'seq1'</span><span class="p">):</span>
<span class="o">...</span> <span class="k">print</span><span class="p">(</span><span class="n">fasta_record</span><span class="p">)</span>
<span class="o">...</span>
<span class="o">></span><span class="n">seq1</span><span class="o">|</span><span class="n">contains</span> <span class="mi">2</span><span class="n">x78</span> <span class="n">A</span><span class="s">'s</span><span class="err">
</span><span class="s">AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA</span><span class="err">
</span><span class="s">AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA</span><span class="err">
</span></code></pre></div></div>
<p>Although the change to the feel of the API is minor (an underscore swapped for
a full stop), the change to the underlying behaviour of the <code class="language-plaintext highlighter-rouge">tinyfasta</code>
package is major.</p>
<p>However, because of all the tests the change was not so hard to implement.
First I went into the tests and changed all the calls to the
<code class="language-plaintext highlighter-rouge">description_contains</code> and <code class="language-plaintext highlighter-rouge">sequence_contains</code> to <code class="language-plaintext highlighter-rouge">description.contains</code>
and <code class="language-plaintext highlighter-rouge">sequence.contains</code>. Then I simply “listened to my tests” as they guided
me through all the changes that needed to be made for the package to become
functional again. Have a look at
<a href="https://github.com/tjelvar-olsson/tinyfasta/commit/7fb248f7ce3029bd517abe887623ecbe5b68c23e">commit 7fb248f</a>
to see the resulting changes to the code base.</p>
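<p>To give a feel for what such a change involves, here is one way an attribute with a <code class="language-plaintext highlighter-rouge">contains</code> method could be built. This is a sketch under my own assumptions and not necessarily how the commit above does it: the <code class="language-plaintext highlighter-rouge">Searchable</code> class is a hypothetical helper, and the <code class="language-plaintext highlighter-rouge">FastaRecord</code> shown is a simplified stand-in.</p>

```python
import re


class Searchable(str):
    """A string with a ``contains`` method (hypothetical helper)."""

    def contains(self, search_term):
        """Return True if the search_term is found in the string.

        :param search_term: string or compiled regex
        :returns: bool
        """
        if hasattr(search_term, "search"):
            return search_term.search(self) is not None
        return self.find(search_term) != -1


class FastaRecord(object):
    """Simplified stand-in for the tinyfasta FastaRecord."""

    def __init__(self, description, sequence):
        # Both attributes are instances, so each carries its own contains().
        self.description = Searchable(description)
        self.sequence = Searchable(sequence)


record = FastaRecord(">seq1|contains 2x78 A's", "A" * 156)
found_in_description = record.description.contains("seq1")
found_in_sequence = record.sequence.contains(re.compile(r"A{100}"))
```

<p>Subclassing <code class="language-plaintext highlighter-rouge">str</code> keeps all the usual string behaviour while adding the new method. A separate wrapper class holding the string would work too.</p>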
<h2 id="conclusion">Conclusion</h2>
<p>I hope this post inspires you to try out test-driven development. However,
don’t be surprised if you find that it is harder than it looks. Like everything,
it requires practice. If you feel really stuck, try using a spike to get you
going and then use the resulting code to inspire a functional test.</p>
<p>I can also highly recommend Harry Percival’s book
<a href="http://chimera.labs.oreilly.com/books/1234000000754">Test-Driven Development with Python</a>.
It is what inspired me to start using test-driven development.</p>
<p>Happy coding!</p>
Four tools for testing your Python code2015-05-30T00:00:00+00:00http://tjelvarolsson.com/blog/four-tools-for-testing-your-python-code<h2 id="introduction">Introduction</h2>
<p>It is important to test your code. Tests provide a means to verify that code
does what it is intended to do. However, repeated manual testing is tedious and
error prone.</p>
<p>In this post I will highlight four tools for helping you automate the testing
of your code base.</p>
<h2 id="background">Background</h2>
<p>In a
<a href="/blog/begginers-guide-creating-clean-python-development-environments/">previous post</a>
we discussed how to set up clean Python development environments using
<code class="language-plaintext highlighter-rouge">virtualenv</code> and
<a href="/blog/using-cookiecutter-a-passive-code-generator/">cookiecutter</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
...
repo_name (default is "mypackage")? awesome
...
$ cd awesome
$ virtualenv ~/virtualenvs/awesome
$ source ~/virtualenvs/awesome/bin/activate
(awesome)$ python setup.py develop
</code></pre></div></div>
<p>In this post we will make use of some of the files generated using this setup.</p>
<h2 id="1-unittest---a-python-module-for-creating-tests">1. Unittest - a Python module for creating tests</h2>
<p>Python comes with batteries included and built into the standard library is a
module named <code class="language-plaintext highlighter-rouge">unittest</code>, which can be used to write tests.</p>
<p>As a side note: tests can be classified into many different types: unit tests,
integration tests, functional tests, acceptance tests. Mark Simpson has written
a nice overview of the different types of tests on
<a href="http://stackoverflow.com/a/4904533">stackoverflow</a>. As the post implies the
subject of classifying tests is rather subjective and you get different answers
depending on where you look. Personally, I simply use two broad categories:
unit tests and functional tests, where the latter incorporates both acceptance
and integration tests.</p>
<p>No matter how you classify your tests you can use Python’s <code class="language-plaintext highlighter-rouge">unittest</code> module
to write them.</p>
<p>Below is a bare bones skeleton for writing a test using the <code class="language-plaintext highlighter-rouge">unittest</code> module.
To write a test we create a subclass of the <code class="language-plaintext highlighter-rouge">unittest.TestCase</code> base class.
Now any functions in our test class that start with <code class="language-plaintext highlighter-rouge">test_</code> will be run
when we call the <code class="language-plaintext highlighter-rouge">unittest.main()</code> function. Copy and paste the code below
into a file named <code class="language-plaintext highlighter-rouge">basic_unittest.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">unittest</span>
<span class="k">class</span> <span class="nc">MyTest</span><span class="p">(</span><span class="n">unittest</span><span class="o">.</span><span class="n">TestCase</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">test_something</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">pass</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">unittest</span><span class="o">.</span><span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
<p>Let’s see what happens when we run this code.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ python basic_unittest.py
.
----------------------------------------------------------------------
Ran 1 test in 0.000s
OK
</code></pre></div></div>
<p>Okay, now let us have a look at the <code class="language-plaintext highlighter-rouge">tests/tests.py</code> file generated earlier
by our cookiecutter template.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">unittest</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">os.path</span>
<span class="kn">import</span> <span class="nn">shutil</span>
<span class="n">HERE</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">dirname</span><span class="p">(</span><span class="n">__file__</span><span class="p">)</span>
<span class="n">DATA_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">HERE</span><span class="p">,</span> <span class="s">'data'</span><span class="p">)</span>
<span class="n">TMP_DIR</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">HERE</span><span class="p">,</span> <span class="s">'tmp'</span><span class="p">)</span>
<span class="k">class</span> <span class="nc">UnitTests</span><span class="p">(</span><span class="n">unittest</span><span class="o">.</span><span class="n">TestCase</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">test_can_import_package</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="c1"># Raises import error if the package cannot be imported.
</span> <span class="kn">import</span> <span class="nn">awesome</span>
<span class="k">def</span> <span class="nf">test_package_has_version_string</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="kn">import</span> <span class="nn">awesome</span>
<span class="bp">self</span><span class="o">.</span><span class="n">assertTrue</span><span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">awesome</span><span class="o">.</span><span class="n">__version__</span><span class="p">,</span> <span class="nb">str</span><span class="p">))</span>
<span class="k">class</span> <span class="nc">FunctionalTests</span><span class="p">(</span><span class="n">unittest</span><span class="o">.</span><span class="n">TestCase</span><span class="p">):</span>
<span class="k">def</span> <span class="nf">setUp</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="k">if</span> <span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isdir</span><span class="p">(</span><span class="n">TMP_DIR</span><span class="p">):</span>
<span class="n">os</span><span class="o">.</span><span class="n">mkdir</span><span class="p">(</span><span class="n">TMP_DIR</span><span class="p">)</span>
<span class="k">def</span> <span class="nf">tearDown</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="n">shutil</span><span class="o">.</span><span class="n">rmtree</span><span class="p">(</span><span class="n">TMP_DIR</span><span class="p">)</span>
<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
<span class="n">unittest</span><span class="o">.</span><span class="n">main</span><span class="p">()</span>
</code></pre></div></div>
<p>There are several things to note here.</p>
<p>Let us start by looking at the <code class="language-plaintext highlighter-rouge">test_package_has_version_string()</code> function.
It makes use of <code class="language-plaintext highlighter-rouge">unittest.TestCase.assertTrue()</code> to check that the version
number of the <code class="language-plaintext highlighter-rouge">awesome</code> package we are developing is a string. There are
many other useful “assert” functions built into the <code class="language-plaintext highlighter-rouge">unittest.TestCase</code> base
class, one of the most used ones being <code class="language-plaintext highlighter-rouge">unittest.TestCase.assertEqual()</code>.</p>
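<p>As a quick illustration of a few of these assert functions, the sketch below exercises <code class="language-plaintext highlighter-rouge">assertEqual()</code>, <code class="language-plaintext highlighter-rouge">assertTrue()</code> and <code class="language-plaintext highlighter-rouge">assertRaises()</code>. The class name and the values being tested are made up for this example, and the suite is run programmatically so that the result can be inspected.</p>

```python
import unittest


class AssertExamples(unittest.TestCase):

    def test_assert_equal(self):
        self.assertEqual("ACGT".lower(), "acgt")

    def test_assert_true(self):
        self.assertTrue(isinstance("0.0.1", str))

    def test_assert_raises(self):
        # The test passes only if the block raises the named exception.
        with self.assertRaises(ValueError):
            int("not a number")


# Load and run the tests without calling unittest.main().
suite = unittest.TestLoader().loadTestsFromTestCase(AssertExamples)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

<p>When an <code class="language-plaintext highlighter-rouge">assertEqual()</code> check fails, the report shows both values, which tends to be more informative than a failing <code class="language-plaintext highlighter-rouge">assertTrue()</code>.</p>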
<p>At the top of the file we import several additional modules: <code class="language-plaintext highlighter-rouge">os</code>,
<code class="language-plaintext highlighter-rouge">os.path</code>, <code class="language-plaintext highlighter-rouge">shutil</code>. The <code class="language-plaintext highlighter-rouge">os.path</code> module is used to create some
variables for defining input and output directories for our functional
tests.</p>
<p>The <code class="language-plaintext highlighter-rouge">unittest.TestCase.setUp()</code> and <code class="language-plaintext highlighter-rouge">unittest.TestCase.tearDown()</code>
functions provide a way to ensure test isolation. They are run before and after
each individual test function in a test class. The <code class="language-plaintext highlighter-rouge">os</code> module is used to
create the <code class="language-plaintext highlighter-rouge">tests/tmp</code> directory during the set up of a functional test and
similarly the <code class="language-plaintext highlighter-rouge">shutil</code> module is used to remove the <code class="language-plaintext highlighter-rouge">tests/tmp</code> directory
when a functional test is finished.</p>
<p>Hopefully this quick overview has provided enough detail for you to get
started writing your own tests. For more information have a look at the
<a href="https://docs.python.org/2/library/unittest.html">unittest documentation</a>.</p>
<h2 id="2-nose---a-test-runner-for-your-tests">2. Nose - a test runner for your tests</h2>
<p>As you build up more and more tests you want to have a way of running them all
automatically. One way to do this is to use
<a href="https://nose.readthedocs.org/en/latest/">nose</a>.</p>
<p>Let us install it using <code class="language-plaintext highlighter-rouge">pip</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ pip install nose
</code></pre></div></div>
<p>Now we can run the test suite using the <code class="language-plaintext highlighter-rouge">nosetests</code> command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ nosetests
nose.plugins.cover: ERROR: Coverage not available: unable to import coverage module
..
----------------------------------------------------------------------
Ran 2 tests in 0.004s
OK
</code></pre></div></div>
<p>There are two things to note in the above. First of all, the <code class="language-plaintext highlighter-rouge">nosetests</code> command
automatically found and ran our tests. Yay!</p>
<p>Secondly, it complained about not being able to import the <code class="language-plaintext highlighter-rouge">coverage</code> module.
There are two reasons for this:</p>
<ol>
<li>We have not installed the <code class="language-plaintext highlighter-rouge">coverage</code> module yet</li>
<li>The <code class="language-plaintext highlighter-rouge">awesome/setup.cfg</code> file specifies that it should be used</li>
</ol>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[nosetests]
detailed-errors=1
with-coverage=1
cover-package=awesome
cover-erase=1
verbosity=1
</code></pre></div></div>
<p><em>What is coverage all about anyway?</em></p>
<h2 id="3-coverage---measuring-your-code-coverage">3. Coverage - measuring your code coverage</h2>
<p>The <code class="language-plaintext highlighter-rouge">coverage</code> module measures code coverage. Code coverage is a measure of
how many lines of code are being exercised by your tests. It is
particularly useful for identifying areas of the code-base that need more
tests.</p>
<p>Let us install it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ pip install coverage
</code></pre></div></div>
<p>Now let us run the tests again.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ nosetests
..
Name Stmts Miss Cover Missing
--------------------------------------
awesome 1 0 100%
----------------------------------------------------------------------
Ran 2 tests in 0.009s
OK
</code></pre></div></div>
<p>Awesome, we have 100% test coverage!</p>
<p>Let us add some more functionality to see what happens when we have code that
is not tested. Add the <code class="language-plaintext highlighter-rouge">fpaths_in_dir()</code> function to the
<code class="language-plaintext highlighter-rouge">awesome/__init__.py</code> file.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="s">"""awesome package."""</span>
<span class="kn">import</span> <span class="nn">os</span>

<span class="n">__version__</span> <span class="o">=</span> <span class="s">"0.0.1"</span>

<span class="k">def</span> <span class="nf">fpaths_in_dir</span><span class="p">(</span><span class="n">directory</span><span class="p">):</span>
    <span class="s">"""Return the paths to the files in the directory."""</span>
    <span class="n">fpaths</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">fname</span> <span class="ow">in</span> <span class="n">os</span><span class="o">.</span><span class="n">listdir</span><span class="p">(</span><span class="n">directory</span><span class="p">):</span>
        <span class="n">fpaths</span><span class="o">.</span><span class="n">append</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">directory</span><span class="p">,</span> <span class="n">fname</span><span class="p">))</span>
    <span class="k">return</span> <span class="n">fpaths</span>
</code></pre></div></div>
<p>If we run the tests again we find out that lines 8-11 have not been covered
by the tests.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ nosetests
..
Name Stmts Miss Cover Missing
---------------------------------------
awesome 7 4 43% 8-11
----------------------------------------------------------------------
Ran 2 tests in 0.010s
OK
</code></pre></div></div>
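<p>The 43% figure follows directly from the statement counts in the report. The sketch below reproduces the arithmetic; the helper name <code class="language-plaintext highlighter-rouge">coverage_percent()</code> is made up for illustration and is not part of the <code class="language-plaintext highlighter-rouge">coverage</code> module, which may round slightly differently in edge cases.</p>

```python
def coverage_percent(statements, missed):
    """Return the percentage of statements exercised, rounded to a whole number."""
    covered = statements - missed
    return round(100 * covered / statements)


# 7 statements with 4 missed leaves 3 covered: 3 / 7 is roughly 43%.
print(coverage_percent(7, 4))  # 43
print(coverage_percent(7, 0))  # 100
```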
<p>Let’s add a test for them! But wait… Errr…</p>
<p><em>How do we add a reliable test for something that wants to read information
from the file system?</em></p>
<h2 id="4-mock---faking-objects-for-unit-tests">4. Mock - faking objects for unit tests</h2>
<p>We can make use of mock objects to solve these types of problems. Mock objects
mimic the behaviour of real objects in controllable ways. For more background
have a look at the
<a href="http://en.wikipedia.org/wiki/Mock_object">Mock object wikipedia page</a>.</p>
<p>As of Python 3.3 <code class="language-plaintext highlighter-rouge">mock</code> is part of the standard library (as <code class="language-plaintext highlighter-rouge">unittest.mock</code>). However, users of older
versions of Python can install it using <code class="language-plaintext highlighter-rouge">pip</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ pip install mock
</code></pre></div></div>
<p>Now we can write a test for our function. Add the test function below to the
<code class="language-plaintext highlighter-rouge">UnitTests</code> class in the <code class="language-plaintext highlighter-rouge">awesome/tests/tests.py</code> file.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>    <span class="k">def</span> <span class="nf">test_fpaths_in_dir</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="kn">import</span> <span class="nn">os</span>
        <span class="kn">from</span> <span class="nn">mock</span> <span class="kn">import</span> <span class="n">MagicMock</span>
        <span class="kn">from</span> <span class="nn">awesome</span> <span class="kn">import</span> <span class="n">fpaths_in_dir</span>
        <span class="n">os</span><span class="o">.</span><span class="n">listdir</span> <span class="o">=</span> <span class="n">MagicMock</span><span class="p">(</span><span class="n">return_value</span><span class="o">=</span><span class="p">[</span><span class="s">'test1.txt'</span><span class="p">,</span> <span class="s">'test2.txt'</span><span class="p">])</span>
        <span class="n">fpaths</span> <span class="o">=</span> <span class="n">fpaths_in_dir</span><span class="p">(</span><span class="s">'some/dir'</span><span class="p">)</span>
        <span class="n">expected</span> <span class="o">=</span> <span class="p">[</span><span class="s">'some/dir/test1.txt'</span><span class="p">,</span> <span class="s">'some/dir/test2.txt'</span><span class="p">]</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">assertEqual</span><span class="p">(</span><span class="n">fpaths</span><span class="p">,</span> <span class="n">expected</span><span class="p">)</span>
</code></pre></div></div>
<p>Let us run the tests again.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ nosetests
...
Name Stmts Miss Cover Missing
---------------------------------------
awesome 7 0 100%
----------------------------------------------------------------------
Ran 3 tests in 0.043s
OK
</code></pre></div></div>
<p>Great, all the tests are passing! Now we can relax again.</p>
<p>The <code class="language-plaintext highlighter-rouge">mock</code> module can do much more than what I have shown above. Have a look
at the <a href="https://pypi.python.org/pypi/mock">mock documentation</a> for some more
inspiration.</p>
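<p>One feature worth knowing about is <code class="language-plaintext highlighter-rouge">patch</code>, which replaces an object only for the duration of a <code class="language-plaintext highlighter-rouge">with</code> block and then restores the original. The sketch below assumes Python 3.3 or later (so that <code class="language-plaintext highlighter-rouge">unittest.mock</code> is available) and redefines <code class="language-plaintext highlighter-rouge">fpaths_in_dir()</code> inline so that it can be run on its own; it is an illustration rather than a replacement for the test above.</p>

```python
import os
from unittest import mock


def fpaths_in_dir(directory):
    """Return the paths to the files in the directory."""
    return [os.path.join(directory, fname) for fname in os.listdir(directory)]


# Inside the "with" block os.listdir is replaced by a mock object.
# The real function is restored as soon as the block exits.
with mock.patch("os.listdir", return_value=["test1.txt", "test2.txt"]) as fake_listdir:
    fpaths = fpaths_in_dir("some/dir")

expected = [
    os.path.join("some/dir", "test1.txt"),
    os.path.join("some/dir", "test2.txt"),
]
assert fpaths == expected

# The mock records how it was called, which can be checked as well.
fake_listdir.assert_called_once_with("some/dir")
```

<p>Because the patch is undone automatically, other tests in the same run still see the real <code class="language-plaintext highlighter-rouge">os.listdir()</code>.</p>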
<h2 id="conclusion">Conclusion</h2>
<p>Python comes with lots of useful tools for helping you test your code base. In
this post I have described some of the most established ones. However, there are
others around. Experiment and find out what works for you.</p>
<p>In the
<a href="/blog/test-driven-develpment-for-scientists/">next post</a>
I will continue the theme of testing by illustrating some aspects of
test-driven development.</p>
Five exercises to master the Python debugger
2015-05-15T00:00:00+00:00
http://tjelvarolsson.com/blog/five-exercises-to-master-the-python-debugger
<figure>
<img src="/images/debugging.jpg" alt="Debugging." />
<figcaption>
How do you do your debugging?
</figcaption>
</figure>
<h2 id="introduction">Introduction</h2>
<p>When programming (in Python) it is common to find oneself inserting <code class="language-plaintext highlighter-rouge">print</code>
statements all over the code when trying to find out why things are not working
as expected. This can often be a quick way of working out what is going on.</p>
<p>However, it can become tedious whenever the problem is not resolved by the
first <code class="language-plaintext highlighter-rouge">print</code> statement. I have often found myself spending significant
amounts of time scattering <code class="language-plaintext highlighter-rouge">print</code> statements all over my code to work out
what is going on. Usually this is followed by me spending time hunting through
my code for the <code class="language-plaintext highlighter-rouge">print</code> statements so that I can delete them. After which I
often realise that I still needed them.</p>
<p>There is a more powerful way of finding out what a program is doing: using a
debugger. However, people often shy away from debuggers because of their arcane
interfaces. This post contains five exercises to help you master the Python
debugger.</p>
<p>By the end of this post I hope that you will be substituting your <code class="language-plaintext highlighter-rouge">print</code>
statements with <code class="language-plaintext highlighter-rouge">import pdb; pdb.set_trace()</code>.</p>
<h2 id="exercise-1-stepping-through-a-program">Exercise 1: stepping through a program</h2>
<p>Let us start by stepping through a simple program. Copy and paste the code
snippet below into a file named <code class="language-plaintext highlighter-rouge">pdb_exercise_1.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">name</span> <span class="o">=</span> <span class="s">'alice'</span>
<span class="n">greeting</span> <span class="o">=</span> <span class="s">'hello '</span> <span class="o">+</span> <span class="n">name</span>
<span class="k">print</span><span class="p">(</span><span class="n">greeting</span><span class="p">)</span>
</code></pre></div></div>
<p>Now invoke the script using the Python debugger via the command below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -m pdb pdb_exercise_1.py
</code></pre></div></div>
<p>In the above <code class="language-plaintext highlighter-rouge">pdb</code> is the three letter acronym for the Python Debugger. You
should be greeted by the prompt below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> pdb_exercise_1.py(1)<module>()
-> name = 'alice'
(Pdb)
</code></pre></div></div>
<p>The debugger shows the next line to be executed (<code class="language-plaintext highlighter-rouge">-> name = 'alice'</code>) as well
as the prompt for interacting with the debugger (<code class="language-plaintext highlighter-rouge">(Pdb)</code>).</p>
<p>Type in <code class="language-plaintext highlighter-rouge">n</code>, short for <code class="language-plaintext highlighter-rouge">next</code>, to execute the line displayed. You should
now see the output below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) n
> pdb_exercise_1.py(2)<module>()
-> greeting = 'hello ' + name
</code></pre></div></div>
<p>Let us check the value of the newly assigned <code class="language-plaintext highlighter-rouge">name</code> variable. Type in <code class="language-plaintext highlighter-rouge">p
name</code> (<code class="language-plaintext highlighter-rouge">p</code> as in “print”). It should tell you that the name is “alice”.
Type in <code class="language-plaintext highlighter-rouge">n</code> again to execute the next command. The <code class="language-plaintext highlighter-rouge">greeting</code> variable
should now have been assigned the string “hello alice”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) p name
'alice'
(Pdb) n
> pdb_exercise_1.py(3)<module>()
-> print(greeting)
(Pdb) p greeting
'hello alice'
</code></pre></div></div>
<p>When debugging it is quite easy to lose the frame of reference as to where one
is in the code. To put things into context type in <code class="language-plaintext highlighter-rouge">l</code> as in <code class="language-plaintext highlighter-rouge">list</code> (the
source code for the current file). You should see the output below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) l
1 name = 'alice'
2 greeting = 'hello ' + name
3 -> print(greeting)
[EOF]
</code></pre></div></div>
<p>Okay, so we are almost at the end. Type in <code class="language-plaintext highlighter-rouge">n</code> again to execute the last command.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) n
hello alice
</code></pre></div></div>
<p>Finally, type in <code class="language-plaintext highlighter-rouge">q</code> to <code class="language-plaintext highlighter-rouge">quit</code> the debugger.</p>
<p>Well done! You have just used the Python debugger to step through a program.</p>
<h2 id="exercise-2-stepping-into-functions">Exercise 2: stepping into functions</h2>
<p>Let us create a script with a function. Copy and paste the code snippet below
into a file named <code class="language-plaintext highlighter-rouge">pdb_exercise_2.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">greet</span><span class="p">(</span><span class="n">name</span><span class="p">):</span>
    <span class="n">greeting</span> <span class="o">=</span> <span class="s">'hello '</span> <span class="o">+</span> <span class="n">name</span>
    <span class="k">return</span> <span class="n">greeting</span>

<span class="n">greeting</span> <span class="o">=</span> <span class="n">greet</span><span class="p">(</span><span class="s">'alice'</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">greeting</span><span class="p">)</span>
</code></pre></div></div>
<p>Start the debugger.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -m pdb pdb_exercise_2.py
</code></pre></div></div>
<p>This time, rather than stepping through the program, press <code class="language-plaintext highlighter-rouge">c</code> (which stands
for “continue execution”). You should see the output below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) c
hello alice
The program finished and will be restarted
> pdb_exercise_2.py(1)<module>()
-> def greet(name):
(Pdb)
</code></pre></div></div>
<p>Basically the program ran from beginning to end, printing out the greeting, and
then it restarted itself leaving us at the <code class="language-plaintext highlighter-rouge">(Pdb)</code> prompt.</p>
<p>This time use <code class="language-plaintext highlighter-rouge">n</code> to walk through the script. Note that you only need to
enter <code class="language-plaintext highlighter-rouge">n</code> three times to get to the end of the program and that the debugger
does not step into the <code class="language-plaintext highlighter-rouge">greet()</code> function. You should see the output below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> pdb_exercise_2.py(1)<module>()
-> def greet(name):
(Pdb) n
> pdb_exercise_2.py(5)<module>()
-> greeting = greet('alice')
(Pdb) n
> pdb_exercise_2.py(6)<module>()
-> print(greeting)
(Pdb) n
hello alice
</code></pre></div></div>
<p>In other words <code class="language-plaintext highlighter-rouge">n</code> continues execution until the next line in the current
function is reached or it returns.</p>
<p>Press <code class="language-plaintext highlighter-rouge">c</code> to restart the program and press <code class="language-plaintext highlighter-rouge">n</code> once to get to the line
where the greet function is about to be called.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> pdb_exercise_2.py(1)<module>()
-> def greet(name):
(Pdb) n
> pdb_exercise_2.py(5)<module>()
-> greeting = greet('alice')
</code></pre></div></div>
<p>This time we will use <code class="language-plaintext highlighter-rouge">s</code> to <code class="language-plaintext highlighter-rouge">step</code> into the <code class="language-plaintext highlighter-rouge">greet()</code> function, then we will
continue walking through the program using <code class="language-plaintext highlighter-rouge">n</code>. Note the difference now that
you have stepped into the <code class="language-plaintext highlighter-rouge">greet()</code> function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) s
--Call--
> pdb_exercise_2.py(1)greet()
-> def greet(name):
(Pdb) n
> pdb_exercise_2.py(2)greet()
-> greeting = 'hello ' + name
(Pdb) n
> pdb_exercise_2.py(3)greet()
-> return greeting
(Pdb) n
--Return--
> pdb_exercise_2.py(3)greet()->'hello alice'
-> return greeting
(Pdb) n
> pdb_exercise_2.py(6)<module>()
-> print(greeting)
(Pdb) n
hello alice
--Return--
> pdb_exercise_2.py(6)<module>()->None
-> print(greeting)
(Pdb)
</code></pre></div></div>
<p>Finally let us have a look at the <code class="language-plaintext highlighter-rouge">r</code> command, which stands for <code class="language-plaintext highlighter-rouge">return</code>.
This is similar to the <code class="language-plaintext highlighter-rouge">c</code> command, but rather than continuing to the end of
the program <code class="language-plaintext highlighter-rouge">r</code> runs to the end of the function.</p>
<p>Let us try it out. Start off by entering <code class="language-plaintext highlighter-rouge">c</code> to restart the program, then
enter <code class="language-plaintext highlighter-rouge">n</code> and <code class="language-plaintext highlighter-rouge">s</code>. You should now be in the <code class="language-plaintext highlighter-rouge">greet()</code> function.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) c
The program finished and will be restarted
> pdb_exercise_2.py(1)<module>()
-> def greet(name):
(Pdb) n
> pdb_exercise_2.py(5)<module>()
-> greeting = greet('alice')
(Pdb) s
--Call--
> pdb_exercise_2.py(1)greet()
-> def greet(name):
(Pdb)
</code></pre></div></div>
<p>As a sanity check, use <code class="language-plaintext highlighter-rouge">l</code> to list where you are in the code. You should see the output below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) l
1 -> def greet(name):
2 greeting = 'hello ' + name
3 return greeting
4
5 greeting = greet('alice')
6 print(greeting)
[EOF]
(Pdb)
</code></pre></div></div>
<p>Now press <code class="language-plaintext highlighter-rouge">r</code> as in “return”.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) r
--Return--
> pdb_exercise_2.py(3)greet()->'hello alice'
-> return greeting
(Pdb)
</code></pre></div></div>
<p>Note that we are immediately placed at the end of the function where it is
about to deliver its return value.</p>
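<p>The same commands can also be fed to the debugger from a script, which can be handy for replaying a session. The sketch below is only an illustration: it assumes Python 3.6 or later and relies on the <code class="language-plaintext highlighter-rouge">stdin</code> and <code class="language-plaintext highlighter-rouge">stdout</code> arguments of <code class="language-plaintext highlighter-rouge">pdb.Pdb</code> together with its <code class="language-plaintext highlighter-rouge">runcall()</code> method, which starts debugging inside the called function. The variable names are made up for this example.</p>

```python
import io
import pdb


def greet(name):
    greeting = 'hello ' + name
    return greeting


# The commands that would otherwise be typed at the (Pdb) prompt:
# n (next), p greeting (print the variable) and c (continue).
commands = io.StringIO("n\np greeting\nc\n")
session_output = io.StringIO()

# readrc=False skips any personal .pdbrc file and nosigint=True leaves
# the Ctrl-C handler untouched, so the run is self-contained.
debugger = pdb.Pdb(stdin=commands, stdout=session_output,
                   nosigint=True, readrc=False)
result = debugger.runcall(greet, 'alice')

# Everything the debugger would have shown at the prompt.
print(session_output.getvalue())
print(result)
```

<p>The captured output contains the value reported by <code class="language-plaintext highlighter-rouge">p greeting</code>, just as in the interactive session.</p>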
<h2 id="exercise-3-getting-help">Exercise 3: getting help</h2>
<p>When using a tool infrequently it is easy to forget what the commands are named
and what they do. However, using the <code class="language-plaintext highlighter-rouge">help</code> command it is easy to refresh
your memory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) help
Documented commands (type help <topic>):
========================================
EOF bt cont enable jump pp run unt
a c continue exit l q s until
alias cl d h list quit step up
args clear debug help n r tbreak w
b commands disable ignore next restart u whatis
break condition down j p return unalias where
Miscellaneous help topics:
==========================
exec pdb
Undocumented commands:
======================
retval rv
</code></pre></div></div>
<p>Let us have a look at the <code class="language-plaintext highlighter-rouge">help</code> descriptions of the commands that
we have been using so far.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) help n
n(ext)
Continue execution until the next line in the current function
is reached or it returns.
(Pdb) help s
s(tep)
Execute the current line, stop at the first possible occasion
(either in a function that is called or in the current function).
(Pdb) help c
c(ont(inue))
Continue execution, only stop when a breakpoint is encountered.
(Pdb) help r
r(eturn)
Continue execution until the current function returns.
(Pdb) help l
l(ist) [first [,last]]
List source code for the current file.
Without arguments, list 11 lines around the current line
or continue the previous listing.
With one argument, list 11 lines starting at that line.
With two arguments, list the given range;
if the second argument is less than the first, it is a count.
(Pdb) help help
h(elp)
Without argument, print the list of available commands.
With a command name as argument, print help about that command
"help pdb" pipes the full documentation file to the $PAGER
"help exec" gives help on the ! command
(Pdb)
</code></pre></div></div>
<h2 id="exercise-4-interacting-with-the-program-under-inspection">Exercise 4: interacting with the program under inspection</h2>
<p>Up until this point we have not actually had any errors in our scripts to
correct. Let us change that. Copy and paste the code below into a file named
<code class="language-plaintext highlighter-rouge">pdb_exercise_4.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">sys</span>

<span class="k">def</span> <span class="nf">magic</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">):</span>
    <span class="k">return</span> <span class="n">x</span> <span class="o">+</span> <span class="n">y</span> <span class="o">*</span> <span class="mi">2</span>

<span class="n">x</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">sys</span><span class="o">.</span><span class="n">argv</span><span class="p">[</span><span class="mi">1</span><span class="p">]</span>

<span class="n">answer</span> <span class="o">=</span> <span class="n">magic</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="s">'The answer is: {}'</span><span class="o">.</span><span class="nb">format</span><span class="p">(</span><span class="n">answer</span><span class="p">))</span>
</code></pre></div></div>
<p>Suppose that we run this script with the inputs 1 and 50 expecting the result 101.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python pdb_exercise_4.py 1 50
The answer is: 111
</code></pre></div></div>
<p>What is going on?</p>
<p>Now, rather than inserting <code class="language-plaintext highlighter-rouge">print</code> statements all over the code to work it out,
let us examine the code in the debugger.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python -m pdb pdb_exercise_4.py 1 50
</code></pre></div></div>
<p>Let us get to the point where we have access to the variables <code class="language-plaintext highlighter-rouge">x</code> and <code class="language-plaintext highlighter-rouge">y</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> pdb_exercise_4.py(1)<module>()
-> import sys
(Pdb) n
> pdb_exercise_4.py(3)<module>()
-> def magic(x, y):
(Pdb) n
> pdb_exercise_4.py(6)<module>()
-> x = sys.argv[1]
(Pdb) n
> pdb_exercise_4.py(7)<module>()
-> y = sys.argv[1]
(Pdb) n
> pdb_exercise_4.py(9)<module>()
-> answer = magic(x, y)
(Pdb)
</code></pre></div></div>
<p>First of all let us see what attributes are available in the scope of the
program. We can do this using <code class="language-plaintext highlighter-rouge">p</code> for print.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) p dir()
['__builtins__', '__file__', '__name__', '__package__', 'magic', 'sys', 'x', 'y']
</code></pre></div></div>
<p>There is also <code class="language-plaintext highlighter-rouge">pp</code> for pretty print.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) pp dir()
['__builtins__',
'__file__',
'__name__',
'__package__',
'magic',
'sys',
'x',
'y']
</code></pre></div></div>
<p>So what is <code class="language-plaintext highlighter-rouge">x</code>?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) p x
'1'
</code></pre></div></div>
<p>Hey, that looks suspiciously like a string. Note that we can use raw Python
within the debugger. Let us find out what type <code class="language-plaintext highlighter-rouge">x</code> is.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) type(x)
<type 'str'>
</code></pre></div></div>
<p>The fact that we can execute Python within the debugger means that we can
change the input variables dynamically.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) x = int(x)
(Pdb) y = int(y)
</code></pre></div></div>
<p>Let us just check the values before we run the program.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) p x, y
(1, 1)
</code></pre></div></div>
<p>What? <code class="language-plaintext highlighter-rouge">y</code> is 1, not 50?</p>
<p>Inspecting the code we find that I forgot to update the index when I copied the
input parsing line (note line 7 in the code listing below).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) l
4 return x + y * 2
5
6 x = sys.argv[1]
7 y = sys.argv[1]
8
9 -> answer = magic(x, y)
10 print('The answer is: {}'.format(answer))
[EOF]
(Pdb)
</code></pre></div></div>
<p>Ok, let us just change the value of <code class="language-plaintext highlighter-rouge">y</code> to 50 in the debugger before checking
if the code works as expected by letting it run to completion.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(Pdb) y = 50
(Pdb) c
The answer is: 101
The program finished and will be restarted
> pdb_exercise_4.py(1)<module>()
-> import sys
(Pdb)
</code></pre></div></div>
<p>Ok, so the example is a little bit naff. However, I hope it illustrates the
power of working with the debugger, particularly if you are working on a more
complicated code base.</p>
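<p>For completeness, here is a sketch of what a corrected version of the script could look like once the two problems found in the debugger are fixed: the inputs are converted to integers and <code class="language-plaintext highlighter-rouge">y</code> is read from index 2. The <code class="language-plaintext highlighter-rouge">parse_inputs()</code> helper is a hypothetical name introduced here so that the example can be run without command line arguments.</p>

```python
def magic(x, y):
    return x + y * 2


def parse_inputs(argv):
    """Return the first two command line arguments as integers."""
    x = int(argv[1])
    y = int(argv[2])  # index 2, not 1
    return x, y


# With strings, "+" concatenates and "*" repeats, which explains the 111.
print(magic('1', '1'))  # 111

# With the inputs parsed as integers the expected result appears.
x, y = parse_inputs(['pdb_exercise_4.py', '1', '50'])
print(magic(x, y))  # 101
```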
<h2 id="exercise-5-using-breakpoints">Exercise 5: using breakpoints</h2>
<p>So far we have been stepping through the scripts from beginning to end. However,
when working on larger programs this is often not practical. To simulate such a
situation, copy and paste the code below into a file named
<code class="language-plaintext highlighter-rouge">pdb_exercise_5.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span>

<span class="k">def</span> <span class="nf">slow_subtractor</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
    <span class="s">"""Return a minus b."""</span>
    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">-</span> <span class="n">b</span>

<span class="n">some</span> <span class="o">=</span> <span class="n">slow_subtractor</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
<span class="n">crazy</span> <span class="o">=</span> <span class="n">slow_subtractor</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">78</span><span class="p">)</span>
<span class="n">scientific</span> <span class="o">=</span> <span class="n">slow_subtractor</span><span class="p">(</span><span class="mi">56</span><span class="p">,</span> <span class="mi">31</span><span class="p">)</span>
<span class="n">experiment</span> <span class="o">=</span> <span class="n">slow_subtractor</span><span class="p">(</span><span class="mi">101</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>

<span class="n">total</span> <span class="o">=</span> <span class="n">some</span> <span class="o">+</span> <span class="n">crazy</span> <span class="o">+</span> <span class="n">scientific</span> <span class="o">+</span> <span class="n">experiment</span>

<span class="n">experimental_fraction</span> <span class="o">=</span> <span class="n">experiment</span> <span class="o">/</span> <span class="n">total</span>
</code></pre></div></div>
<p>When we run this code we get a <code class="language-plaintext highlighter-rouge">ZeroDivisionError</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python pdb_exercise_5.py
Traceback (most recent call last):
File "pdb_exercise_5.py", line 15, in <module>
experimental_fraction = experiment / total
ZeroDivisionError: integer division or modulo by zero
</code></pre></div></div>
<p>Stepping through the code in the debugger would be annoying as you would have
to press <code class="language-plaintext highlighter-rouge">n</code> every time the <code class="language-plaintext highlighter-rouge">slow_subtractor()</code> function was called. Let
us instead insert a breakpoint before the line that generates the error. This
is achieved by importing the <code class="language-plaintext highlighter-rouge">pdb</code> module and using the <code class="language-plaintext highlighter-rouge">pdb.set_trace()</code>
function.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span>

<span class="k">def</span> <span class="nf">slow_subtractor</span><span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">):</span>
    <span class="s">"""Return a minus b."""</span>
    <span class="n">time</span><span class="o">.</span><span class="n">sleep</span><span class="p">(</span><span class="mi">5</span><span class="p">)</span>
    <span class="k">return</span> <span class="n">a</span> <span class="o">-</span> <span class="n">b</span>

<span class="n">some</span> <span class="o">=</span> <span class="n">slow_subtractor</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">8</span><span class="p">)</span>
<span class="n">crazy</span> <span class="o">=</span> <span class="n">slow_subtractor</span><span class="p">(</span><span class="mi">12</span><span class="p">,</span> <span class="mi">78</span><span class="p">)</span>
<span class="n">scientific</span> <span class="o">=</span> <span class="n">slow_subtractor</span><span class="p">(</span><span class="mi">56</span><span class="p">,</span> <span class="mi">31</span><span class="p">)</span>
<span class="n">experiment</span> <span class="o">=</span> <span class="n">slow_subtractor</span><span class="p">(</span><span class="mi">101</span><span class="p">,</span> <span class="mi">64</span><span class="p">)</span>

<span class="n">total</span> <span class="o">=</span> <span class="n">some</span> <span class="o">+</span> <span class="n">crazy</span> <span class="o">+</span> <span class="n">scientific</span> <span class="o">+</span> <span class="n">experiment</span>

<span class="kn">import</span> <span class="nn">pdb</span><span class="p">;</span> <span class="n">pdb</span><span class="o">.</span><span class="n">set_trace</span><span class="p">()</span>

<span class="n">experimental_fraction</span> <span class="o">=</span> <span class="n">experiment</span> <span class="o">/</span> <span class="n">total</span>
</code></pre></div></div>
<p>If we run the code now we get dumped into a debugger session before the
offending line is executed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python pdb_exercise_5.py
> pdb_exercise_5.py(17)<module>()
-> experimental_fraction = experiment / total
(Pdb) p total
0
(Pdb) p some, crazy, scientific, experiment
(4, -66, 25, 37)
</code></pre></div></div>
<p>Ok, so it looks like there is something funny going on with the <code class="language-plaintext highlighter-rouge">crazy</code>
variable. Perhaps the input arguments were given the wrong way around.</p>
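<p>The numbers reported by the debugger can be checked with a little arithmetic. The sketch below drops the five second sleep and assumes, as speculated above, that the intended call was <code class="language-plaintext highlighter-rouge">slow_subtractor(78, 12)</code>; whether that really is the intended fix is a guess made for illustration.</p>

```python
def subtractor(a, b):
    """Return a minus b (without the five second sleep)."""
    return a - b


# The values as computed by the original script: they cancel out exactly.
values = [subtractor(12, 8), subtractor(12, 78),
          subtractor(56, 31), subtractor(101, 64)]
print(values, sum(values))  # [4, -66, 25, 37] 0

# Assumed fix: swap the arguments of the second call.
swapped = [subtractor(12, 8), subtractor(78, 12),
           subtractor(56, 31), subtractor(101, 64)]
print(swapped, sum(swapped))  # [4, 66, 25, 37] 132
```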
<p>The take home message is that setting breakpoints is a powerful way of getting
to the point of interest in your code when you want to examine what is going
on.</p>
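<p>A breakpoint does not have to trigger on every run. The sketch below guards <code class="language-plaintext highlighter-rouge">pdb.set_trace()</code> with a condition so that the debugger only opens in the problematic case. The <code class="language-plaintext highlighter-rouge">fraction()</code> function and its <code class="language-plaintext highlighter-rouge">debug</code> flag are hypothetical names used for illustration. On Python 3.7 and later the built-in <code class="language-plaintext highlighter-rouge">breakpoint()</code> function calls <code class="language-plaintext highlighter-rouge">pdb.set_trace()</code> by default and can be used in the same way.</p>

```python
import pdb


def fraction(part, total, debug=False):
    """Return part / total, or None if total is zero."""
    if total == 0:
        if debug:
            # Conditional breakpoint: only stop when something looks wrong.
            pdb.set_trace()
        return None
    return part / total


# With debug left at False the snippet runs without opening a (Pdb) prompt.
print(fraction(37, 132))
print(fraction(37, 0))  # None
```

<p>Calling <code class="language-plaintext highlighter-rouge">fraction(37, 0, debug=True)</code> would instead drop you into the debugger just before the function returns.</p>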
<h2 id="conclusion">Conclusion</h2>
<p>In this post we have worked our way through some rather academic exercises to
get ourselves familiar with the Python debugger and how to interact with it.
Hopefully you now feel that you have the skill to step through and query the
state of your program from within the debugger.</p>
<p>However, if you only take one thing away from this post please let it be the
commitment to insert the line <code class="language-plaintext highlighter-rouge">import pdb; pdb.set_trace()</code> just above your
code of interest the next time you feel tempted to <code class="language-plaintext highlighter-rouge">print</code> the value of a
variable in a program that is not behaving as expected.</p>
<h2 id="further-reading">Further reading</h2>
<ul>
<li><a href="https://pythonconquerstheuniverse.wordpress.com/2009/09/10/debugging-in-python/">Debugging in Python | Python Conquers The
Universe</a></li>
<li><a href="https://zapier.com/engineering/debugging-python-boss/">Debugging Python Like a Boss - The Zapier Engineering
Blog</a></li>
</ul>
Beginner's Guide: creating clean Python development environments
2015-05-09T00:00:00+00:00
http://tjelvarolsson.com/blog/begginers-guide-creating-clean-python-development-environments
<h2 id="introduction">Introduction</h2>
<p>Code interacts with its environment. For example, you can only run a Python
script if you have Python installed on the system. Furthermore, a Python
script will only run without raising <code class="language-plaintext highlighter-rouge">ImportError</code> exceptions if all the
required packages are installed.</p>
<p>It therefore becomes important for you as a developer / computational scientist
to understand and control the environment in which your code operates.</p>
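<p>One small way to inspect that environment is to ask Python which packages it can actually find. The sketch below uses <code class="language-plaintext highlighter-rouge">importlib.util.find_spec</code> from the standard library; the helper name <code class="language-plaintext highlighter-rouge">missing_packages</code> and the made-up package name are assumptions for illustration only.</p>

```python
import importlib.util


def missing_packages(names):
    """Return the top-level package names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# "yaml" is the import name of PyYAML; the last name is deliberately made up.
print(missing_packages(["json", "yaml", "not_a_real_package_xyz"]))
```

<p>Any name that appears in the printed list would raise an <code class="language-plaintext highlighter-rouge">ImportError</code> if a script tried to import it in the current environment.</p>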
<p>In this post I will illustrate a work flow for creating clean Python
development environments.</p>
<h2 id="example-developing-a-python-package">Example: developing a Python package</h2>
<p>In the
<a href="/blog/using-cookiecutter-a-passive-code-generator/">previous post</a>
I illustrated how you could use a static code generator (<code class="language-plaintext highlighter-rouge">cookiecutter</code>) to
create a basic template to develop a Python package.</p>
<p>Now suppose that we wanted to develop a Python package named “awesome”. Let us
use a GitHub hosted Cookiecutter template to create a basic project layout.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
Cloning into 'cookiecutter-pypackage'...
remote: Counting objects: 48, done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 48 (delta 13), reused 37 (delta 8), pack-reused 0
Unpacking objects: 100% (48/48), done.
Checking connectivity... done.
repo_name (default is "mypackage")? awesome
version (default is "0.0.1")?
authors (default is "Tjelvar Olsson")?
</code></pre></div></div>
<p>This creates the directory <code class="language-plaintext highlighter-rouge">awesome</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd awesome/
</code></pre></div></div>
<p>It contains a number of files and directories.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ tree
.
├── README.rst
├── awesome
│ └── __init__.py
├── docs
│ ├── Makefile
│ ├── make.bat
│ └── source
│ ├── README.rst
│ ├── conf.py
│ └── index.rst
├── setup.cfg
├── setup.py
└── tests
├── __init__.py
└── tests.py
4 directories, 11 files
</code></pre></div></div>
<p>You may notice that there are some tests included by default. Let us try to run
them.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python tests/tests.py
EE
======================================================================
ERROR: test_can_import_package (__main__.UnitTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/tests.py", line 15, in test_can_import_package
import awesome
ImportError: No module named awesome
======================================================================
ERROR: test_package_has_version_string (__main__.UnitTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "tests/tests.py", line 18, in test_package_has_version_string
import awesome
ImportError: No module named awesome
----------------------------------------------------------------------
Ran 2 tests in 0.001s
FAILED (errors=2)
</code></pre></div></div>
<p>That’s not very good! What is going on? It seems that we cannot import the
<code class="language-plaintext highlighter-rouge">awesome</code> module.</p>
<p>Depending on your level of familiarity with Python the problem may be obvious
to you. However, when I started out with Python this caused me a lot of
confusion. I clearly could import the <code class="language-plaintext highlighter-rouge">awesome</code> module!</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python -c "import awesome; print(awesome.__version__)"
0.0.1
</code></pre></div></div>
<p>One of the places where Python looks for modules is the directory of the
calling script or, when using <code class="language-plaintext highlighter-rouge">python -c</code>, the current working directory,
which is why the command above works. However, when we run the
<code class="language-plaintext highlighter-rouge">tests/tests.py</code> script Python searches the <code class="language-plaintext highlighter-rouge">tests</code> directory, and there is
no <code class="language-plaintext highlighter-rouge">awesome</code> package to be found there, as illustrated below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd tests/
$ python -c "import awesome; print(awesome.__version__)"
Traceback (most recent call last):
File "<string>", line 1, in <module>
ImportError: No module named awesome
$ cd ../
</code></pre></div></div>
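<p>The same behaviour can be reproduced in a throwaway directory. This is only a sketch: the temporary layout stands in for the project above, the helper name <code class="language-plaintext highlighter-rouge">can_find</code> is hypothetical, and the standard library's <code class="language-plaintext highlighter-rouge">PathFinder</code> is used to search a single directory instead of the whole of <code class="language-plaintext highlighter-rouge">sys.path</code>.</p>

```python
import importlib.machinery
import os
import tempfile


def can_find(package, search_dir):
    """Return True if `package` can be located by searching only `search_dir`."""
    spec = importlib.machinery.PathFinder.find_spec(package, [search_dir])
    return spec is not None


with tempfile.TemporaryDirectory() as project:
    # Recreate the relevant part of the layout: awesome/__init__.py and tests/.
    os.makedirs(os.path.join(project, "awesome"))
    os.makedirs(os.path.join(project, "tests"))
    with open(os.path.join(project, "awesome", "__init__.py"), "w") as fh:
        fh.write('__version__ = "0.0.1"\n')

    # Searching from the project root finds the package ...
    found_from_root = can_find("awesome", project)
    # ... whereas searching from tests/ does not.
    found_from_tests = can_find("awesome", os.path.join(project, "tests"))

print(found_from_root, found_from_tests)
```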
<p>At this point one could start manually configuring the <code class="language-plaintext highlighter-rouge">PYTHONPATH</code>
environment variable. However, let us look at a more elegant solution.</p>
<h2 id="making-use-of-setuptools">Making use of <code class="language-plaintext highlighter-rouge">setuptools</code></h2>
<p>In the
<a href="/blog/using-cookiecutter-a-passive-code-generator/">previous post</a>
we started building up a basic <code class="language-plaintext highlighter-rouge">setup.py</code> file, which made use of the
<code class="language-plaintext highlighter-rouge">setuptools</code> module.</p>
<p>You are probably already familiar with <code class="language-plaintext highlighter-rouge">setuptools</code> from installing other Python
packages using the command <code class="language-plaintext highlighter-rouge">python setup.py install</code>. This installs the
package of interest into your Python distribution’s <code class="language-plaintext highlighter-rouge">site-packages</code>
directory.</p>
<p>However, this is not what we want to do because the package would be copied
there and any changes that we made to our local development files would not take
effect until we reinstalled the package. We want to be able to edit our local
development files and see the effects take place immediately.</p>
<p>The solution to this problem is to use <code class="language-plaintext highlighter-rouge">python setup.py develop</code> which
creates an <code class="language-plaintext highlighter-rouge">.egg-link</code> to our local development directory in the
<code class="language-plaintext highlighter-rouge">site-packages</code> directory.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo python setup.py develop
Password:
running develop
...
Processing dependencies for awesome==0.0.1
Finished processing dependencies for awesome==0.0.1
</code></pre></div></div>
<p>Let us re-run the tests now that <code class="language-plaintext highlighter-rouge">site-packages</code> contains an <code class="language-plaintext highlighter-rouge">awesome.egg-link</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python tests/tests.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.000s
OK
</code></pre></div></div>
<p>Great, we have a working development environment!</p>
<p>Before continuing let us tidy up by removing the development package
we just installed.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo python setup.py develop --uninstall
Password:
running develop
...
Removing awesome 0.0.1 from easy-install.pth file
</code></pre></div></div>
<p>So far so good, but there are two issues with what we are currently doing.
First of all we need to have root permissions to run <code class="language-plaintext highlighter-rouge">python setup.py
develop</code> and <code class="language-plaintext highlighter-rouge">python setup.py develop --uninstall</code> when using the system’s
Python. Secondly, when using the system’s Python we are using a (potentially)
polluted environment.</p>
<p>Let me expand on the second issue. Suppose that you had the <code class="language-plaintext highlighter-rouge">PyYAML</code> package
installed in your system’s Python. It is a very useful package, but it is not
part of Python’s standard library. Suppose further that your package needed to
be able to parse YAML files. You therefore start using <code class="language-plaintext highlighter-rouge">PyYAML</code>. Some time
later you want to share your code with your friend Alice. You run your tests,
<em>of course you are writing tests as you go along</em>, and they all pass. You feel
happy and send the package to Alice. However, Alice has not yet installed the
<code class="language-plaintext highlighter-rouge">PyYAML</code> package and consequently her first experience of your code is an
<code class="language-plaintext highlighter-rouge">ImportError</code>.</p>
<p>This <code class="language-plaintext highlighter-rouge">ImportError</code> could have been avoided by adding <code class="language-plaintext highlighter-rouge">pyyaml</code> as a
requirement to our <code class="language-plaintext highlighter-rouge">setup.py</code> file. For more details see the “Specifying
Dependencies” section in Scott Torborg’s <a href="http://www.scotttorborg.com/python-packaging/dependencies.html">How To Package Your Python
Code</a>.</p>
<p>However, the question is <em>how could we have detected this issue before sending
our code to Alice?</em></p>
<h2 id="creating-a-virtual-python-development-environment">Creating a virtual Python development environment</h2>
<p>There is a way to avoid making use of the system’s “polluted” Python, which
also lets us work without requiring root privileges. When I first heard about
this it sounded like magic.</p>
<p>The solution is to make use of
<a href="https://virtualenv.pypa.io/en/latest/">virtualenv</a>.
From the virtualenv website:</p>
<blockquote><code>virtualenv</code> is a tool for creating isolated Python environments.</blockquote>
<p>Let us install <code class="language-plaintext highlighter-rouge">virtualenv</code> using <code class="language-plaintext highlighter-rouge">pip</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ pip install virtualenv
</code></pre></div></div>
<p>Now we can create a virtual environment for our project. However, before we do
that let me give you a tip: create a separate directory for storing all
your virtual environments and give each virtual environment a descriptive name.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir ~/virtualenvs
</code></pre></div></div>
<p>If you are anything like me you will end up having at least one virtual
environment for each project you are working on.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ virtualenv ~/virtualenvs/awesome
New python executable in /home/tjelvar/virtualenvs/awesome/bin/python
Installing setuptools, pip...done.
</code></pre></div></div>
<p>Note that this creates a directory named <code class="language-plaintext highlighter-rouge">awesome</code> in the <code class="language-plaintext highlighter-rouge">~/virtualenvs</code>
directory. You could have named it anything, but I like to use the same name as
the project for which I intend to use the virtual environment. The
<code class="language-plaintext highlighter-rouge">~/virtualenvs/awesome</code> directory contains the virtual environment.</p>
<p>To make use of a virtual environment we need to “activate” it. This is done by
sourcing the <code class="language-plaintext highlighter-rouge">activate</code> script in the <code class="language-plaintext highlighter-rouge">bin</code> directory of the virtual
environment.</p>
<p>To get a feel for the effect of activating the virtual environment let
us use <code class="language-plaintext highlighter-rouge">which</code> to find the path to <code class="language-plaintext highlighter-rouge">python</code> before and after we
activate the virtual environment.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ which python
/usr/bin/python
</code></pre></div></div>
<p>Now let us activate the virtual environment we just created.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ source ~/virtualenvs/awesome/bin/activate
(awesome)$ which python
/home/tjelvar/virtualenvs/awesome/bin/python
</code></pre></div></div>
<p>When we source the <code class="language-plaintext highlighter-rouge">activate</code> script above it basically alters the <code class="language-plaintext highlighter-rouge">PATH</code>
and <code class="language-plaintext highlighter-rouge">PS1</code> environment variables. It also defines a <code class="language-plaintext highlighter-rouge">deactivate</code> function
that one can use to reset the environment variables to their original state.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ deactivate
$ which python
/usr/bin/python
</code></pre></div></div>
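<p>Why does prepending a directory to <code class="language-plaintext highlighter-rouge">PATH</code> change which <code class="language-plaintext highlighter-rouge">python</code> is found? The sketch below imitates the lookup with <code class="language-plaintext highlighter-rouge">shutil.which</code>. It assumes a POSIX system, and the two temporary directories with their fake <code class="language-plaintext highlighter-rouge">python</code> files are stand-ins for <code class="language-plaintext highlighter-rouge">/usr/bin</code> and the virtual environment's <code class="language-plaintext highlighter-rouge">bin</code> directory.</p>

```python
import os
import shutil
import tempfile


def make_fake_python(directory):
    """Create an executable placeholder file named 'python' in `directory`."""
    path = os.path.join(directory, "python")
    with open(path, "w") as fh:
        fh.write("#!/bin/sh\n")
    os.chmod(path, 0o755)
    return path


with tempfile.TemporaryDirectory() as system_bin, \
        tempfile.TemporaryDirectory() as venv_bin:
    system_python = make_fake_python(system_bin)
    venv_python = make_fake_python(venv_bin)

    # Before activation only the "system" directory is searched.
    before = shutil.which("python", path=system_bin)
    # Activation puts the virtual environment's bin directory first.
    after = shutil.which("python", path=os.pathsep.join([venv_bin, system_bin]))

print(before == system_python, after == venv_python)
```

<p>The first directory on the search path that contains a matching executable wins, which is all that activation relies on.</p>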
<h2 id="tying-it-all-together">Tying it all together</h2>
<p>That was a lot of preamble to be able to show a simple and effective work flow
for setting up clean Python development environments.</p>
<p>Generate a new Python project template.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
...
repo_name (default is "mypackage")? awesome
...
$ cd awesome
</code></pre></div></div>
<p>Create a virtual environment for the project.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ virtualenv ~/virtualenvs/awesome
</code></pre></div></div>
<p>Activate the virtual environment and use <code class="language-plaintext highlighter-rouge">setuptools</code> to create a development
environment.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ source ~/virtualenvs/awesome/bin/activate
(awesome)$ python setup.py develop
</code></pre></div></div>
<h2 id="run-the-tests">Run the tests!</h2>
<p>Tests are great, they let us know that things are working as intended. Let us
make sure that our setup is sound.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>(awesome)$ python tests/tests.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.000s
OK
</code></pre></div></div>
<h2 id="discussion">Discussion</h2>
<p>In this post I have shown you how to use <code class="language-plaintext highlighter-rouge">setuptools</code> and <code class="language-plaintext highlighter-rouge">virtualenv</code> to
create reproducible, clean and isolated Python development environments.</p>
<p>However, the work flow is not limited to development environments. It is just
as applicable to production environments and it is extensively used in the
Python web development community. In fact, having the same work flow for
setting up your development and production environments is a great bonus as it
gives you more confidence in the end product.</p>
Using Cookiecutter - a passive code generator2015-05-04T00:00:00+00:00http://tjelvarolsson.com/blog/using-cookiecutter-a-passive-code-generator<figure>
<img src="/images/cookiecutter.jpg" alt="Cookiecutters." />
<figcaption>
Using a tool to generate a templated result.
</figcaption>
</figure>
<p>In <a href="https://pragprog.com/book/tpp/the-pragmatic-programmer">The Pragmatic
Programmer</a> Andrew Hunt
and David Thomas talk about the importance of code generators when faced with
the task of producing the same thing over and over. They further separate code
generators into two types: passive and active.</p>
<p>A passive code generator is one that saves on typing. It is run once, the
result is placed into version control and then the code is built upon by hand.</p>
<p>An active code generator, by contrast, is used to produce complete code by converting
a source of meta-data into language(s) of interest. Active code generators are
run frequently and as the resulting code is reproducible it is also disposable,
hence it does not need to be tracked in version control.</p>
<p>In this post I will show you how you can use a passive code generator to create
a basic layout for a Python package.</p>
<h2 id="cookiecutter-a-passive-code-generator">Cookiecutter: a passive code generator</h2>
<p>A classic example where passive code generators are useful is in setting up an
initial project structure. Let us take the example of creating a Python
package. In the simplest case you will want to create a <code class="language-plaintext highlighter-rouge">setup.py</code> file and a
directory with the desired package name containing an <code class="language-plaintext highlighter-rouge">__init__.py</code> file.
Scott Torborg has created a great tutorial on
<a href="http://www.scotttorborg.com/python-packaging/">How To Package Your Python Code</a>.</p>
<p>Several tools exist to deal with this type of scenario. However, I quite like
Audrey Roy’s <a href="https://github.com/audreyr/cookiecutter">Cookiecutter</a>. Let us
illustrate its use by creating a minimal template for a Python package.</p>
<p>First of all we install it using <code class="language-plaintext highlighter-rouge">pip</code>.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">sudo </span>pip <span class="nb">install </span>cookiecutter
</code></pre></div></div>
<p>Now we will create a funny looking directory structure. It is funny looking because it uses the
<a href="http://jinja.pocoo.org">Jinja2</a> templating syntax.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">mkdir</span> <span class="nt">-p</span> mypyproject/<span class="o">{{</span>cookiecutter.repo_name<span class="o">}}</span>/<span class="o">{{</span>cookiecutter.repo_name<span class="o">}}</span>
</code></pre></div></div>
<p>Now create the file <code class="language-plaintext highlighter-rouge">mypyproject/cookiecutter.json</code> and add the code below to it.</p>
<div class="language-json highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="p">{</span><span class="w">
</span><span class="nl">"repo_name"</span><span class="p">:</span><span class="w"> </span><span class="s2">"mypackage"</span><span class="p">,</span><span class="w">
</span><span class="nl">"version"</span><span class="p">:</span><span class="w"> </span><span class="s2">"0.0.1"</span><span class="p">,</span><span class="w">
</span><span class="nl">"author"</span><span class="p">:</span><span class="w"> </span><span class="s2">"Your Name"</span><span class="w">
</span><span class="p">}</span><span class="w">
</span></code></pre></div></div>
<p>Let us have a look at the directory structure we have created.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree mypyproject/
mypyproject/
├── cookiecutter.json
└── <span class="o">{{</span>cookiecutter.repo_name<span class="o">}}</span>
└── <span class="o">{{</span>cookiecutter.repo_name<span class="o">}}</span>
2 directories, 1 file
</code></pre></div></div>
<p>We now have enough boilerplate to run cookiecutter. Actually, we have more
than enough: at this point we do not need the <code class="language-plaintext highlighter-rouge">version</code> and <code class="language-plaintext highlighter-rouge">author</code>
variables.</p>
<p>Let us create an “awesome” Python package to see it in action.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>cookiecutter mypyproject/
repo_name <span class="o">(</span>default is <span class="s2">"mypackage"</span><span class="o">)</span>? awesome
version <span class="o">(</span>default is <span class="s2">"0.0.1"</span><span class="o">)</span>?
author <span class="o">(</span>default is <span class="s2">"Your Name"</span><span class="o">)</span>? Tjelvar Olsson
</code></pre></div></div>
<p>Note that the prompts and default values are the key/value pairs specified
in the <code class="language-plaintext highlighter-rouge">cookiecutter.json</code> file.</p>
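<p>The relationship between the JSON file and the prompts can be sketched in a few lines of Python. This is not Cookiecutter's own code; the helper <code class="language-plaintext highlighter-rouge">resolve_answers</code> is a hypothetical illustration of the idea that a blank answer falls back to the default from the file.</p>

```python
import json

# Same content as the cookiecutter.json file created above.
COOKIECUTTER_JSON = """
{
    "repo_name": "mypackage",
    "version": "0.0.1",
    "author": "Your Name"
}
"""


def resolve_answers(defaults, answers):
    """Use the typed answer for each key, or the default if it was left blank."""
    return {key: answers.get(key) or default for key, default in defaults.items()}


defaults = json.loads(COOKIECUTTER_JSON)
# Mirrors the session above: the version prompt was answered with just Enter.
context = resolve_answers(
    defaults,
    {"repo_name": "awesome", "version": "", "author": "Tjelvar Olsson"},
)
print(context)
```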
<p>Let us have a look at what was produced.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree awesome/
awesome/
└── awesome
1 directory, 0 files
</code></pre></div></div>
<p>Ok, great - let us add an <code class="language-plaintext highlighter-rouge">__init__.py</code> file to the leaf
<code class="language-plaintext highlighter-rouge">mypyproject/{{cookiecutter.repo_name}}/{{cookiecutter.repo_name}}</code> directory.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">touch </span>mypyproject/<span class="se">\{\{</span>cookiecutter.repo_name<span class="se">\}\}</span>/<span class="se">\{\{</span>cookiecutter.repo_name<span class="se">\}\}</span>/__init__.py
</code></pre></div></div>
<p>In the above we need to escape the <code class="language-plaintext highlighter-rouge">{</code> and <code class="language-plaintext highlighter-rouge">}</code> characters when using bash.
If you are not already using tab completion when using bash this may be a good
point to try it out (just start typing the name of the file/directory of
interest and then press the tab key).</p>
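<p>If the escaping feels error prone, the same file can be created from Python, where the braces are ordinary characters in a string. The sketch below works inside a temporary directory so that it does not touch a real template; the function name <code class="language-plaintext highlighter-rouge">create_template_skeleton</code> is an assumption for illustration.</p>

```python
import tempfile
from pathlib import Path


def create_template_skeleton(root):
    """Create the templated package directory and its __init__.py under `root`."""
    placeholder = "{{cookiecutter.repo_name}}"
    package_dir = Path(root) / "mypyproject" / placeholder / placeholder
    package_dir.mkdir(parents=True, exist_ok=True)
    init_file = package_dir / "__init__.py"
    init_file.touch()
    return init_file


with tempfile.TemporaryDirectory() as tmp:
    init_file = create_template_skeleton(tmp)
    created = init_file.is_file()
    relative = init_file.relative_to(tmp).as_posix()

print(relative)
```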
<p>Let’s run <code class="language-plaintext highlighter-rouge">cookiecutter</code> again to see what we get now that we have added the
<code class="language-plaintext highlighter-rouge">__init__.py</code> file.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>cookiecutter mypyproject/
repo_name <span class="o">(</span>default is <span class="s2">"mypackage"</span><span class="o">)</span>? awesome
version <span class="o">(</span>default is <span class="s2">"0.0.1"</span><span class="o">)</span>?
author <span class="o">(</span>default is <span class="s2">"Your Name"</span><span class="o">)</span>? Tjelvar Olsson
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree awesome/
awesome/
└── awesome
└── __init__.py
1 directory, 1 file
</code></pre></div></div>
<p>Great, we now automatically get an <code class="language-plaintext highlighter-rouge">__init__.py</code> file added to our project
when we create it. Now let us add a basic, but all the same templated,
<code class="language-plaintext highlighter-rouge">setup.py</code> file to our project layout. Create the file
<code class="language-plaintext highlighter-rouge">mypyproject/{{cookiecutter.repo_name}}/setup.py</code> and copy and paste the code
below into it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">setuptools</span> <span class="kn">import</span> <span class="n">setup</span>
<span class="n">setup</span><span class="p">(</span><span class="n">name</span><span class="o">=</span><span class="s">"{{ cookiecutter.repo_name }}"</span><span class="p">,</span>
<span class="n">version</span><span class="o">=</span><span class="s">"{{ cookiecutter.version }}"</span><span class="p">,</span>
<span class="n">author</span><span class="o">=</span><span class="s">"{{ cookiecutter.author }}"</span>
<span class="p">)</span>
</code></pre></div></div>
<p>Let us try this out.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cookiecutter mypyproject/
repo_name (default is "mypackage")? awesome
version (default is "0.0.1")?
author (default is "Your Name")? Tjelvar Olsson
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>tree awesome/
awesome/
├── awesome
│ └── __init__.py
└── setup.py
1 directory, 2 files
</code></pre></div></div>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span><span class="nb">cat </span>awesome/setup.py
from setuptools import setup
setup<span class="o">(</span><span class="nv">name</span><span class="o">=</span><span class="s2">"awesome"</span>,
<span class="nv">version</span><span class="o">=</span><span class="s2">"0.0.1"</span>,
<span class="nv">author</span><span class="o">=</span><span class="s2">"Tjelvar Olsson"</span>
<span class="o">)</span>
</code></pre></div></div>
<p>Great, we now have a basic layout for building up a Python project!</p>
<p>Now that you know the principles you can use them to automate the generation of
your boilerplate code.</p>
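<p>To make the substitution step concrete, here is a deliberately simplified stand-in for what the templating engine does. Cookiecutter relies on Jinja2, which is far more capable; the regular expression and the <code class="language-plaintext highlighter-rouge">render</code> helper below are assumptions that only handle plain <code class="language-plaintext highlighter-rouge">{{ cookiecutter.name }}</code> placeholders.</p>

```python
import re

# Matches {{ cookiecutter.some_key }} with optional spaces inside the braces.
PLACEHOLDER = re.compile(r"\{\{\s*cookiecutter\.(\w+)\s*\}\}")


def render(template, context):
    """Replace each cookiecutter placeholder with its value from `context`."""
    return PLACEHOLDER.sub(lambda match: str(context[match.group(1)]), template)


context = {"repo_name": "awesome", "version": "0.0.1", "author": "Tjelvar Olsson"}
template = 'setup(name="{{ cookiecutter.repo_name }}", version="{{ cookiecutter.version }}")'
rendered = render(template, context)
print(rendered)
```

<p>The same function works on directory names such as <code class="language-plaintext highlighter-rouge">{{cookiecutter.repo_name}}</code>, which is why the template directories end up named after the project.</p>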
<h2 id="making-use-of-github">Making use of GitHub</h2>
<p>Once you start building up your template make sure that you save it on GitHub
or Bitbucket. <em>You are already using version control, right?</em></p>
<p>A nice feature of Cookiecutter is that it has built in functionality for making
use of templates stored in GitHub/Bitbucket. For example to make use of my
default Python package layout, which includes:</p>
<ul>
<li>setup.py</li>
<li>test suite layout using nose and coverage</li>
<li>sphinx docs layout using read the docs theme</li>
</ul>
<p>You can simply use the command below.</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nv">$ </span>cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
Cloning into <span class="s1">'cookiecutter-pypackage'</span>...
remote: Counting objects: 48, <span class="k">done</span><span class="nb">.</span>
remote: Compressing objects: 100% <span class="o">(</span>37/37<span class="o">)</span>, <span class="k">done</span><span class="nb">.</span>
remote: Total 48 <span class="o">(</span>delta 13<span class="o">)</span>, reused 37 <span class="o">(</span>delta 8<span class="o">)</span>, pack-reused 0
Unpacking objects: 100% <span class="o">(</span>48/48<span class="o">)</span>, <span class="k">done</span><span class="nb">.</span>
Checking connectivity... <span class="k">done</span><span class="nb">.</span>
repo_name <span class="o">(</span>default is <span class="s2">"mypackage"</span><span class="o">)</span>? awesome
version <span class="o">(</span>default is <span class="s2">"0.0.1"</span><span class="o">)</span>?
authors <span class="o">(</span>default is <span class="s2">"Tjelvar Olsson"</span><span class="o">)</span>?
</code></pre></div></div>
<p>Alternatively, for an even more extensive setup have a look at <a href="https://github.com/audreyr/cookiecutter-pypackage">Audrey Roy’s
ultimate python package
template</a>.</p>
<h2 id="summary">Summary</h2>
<p>When you find yourself repeatedly doing the same thing it may be time to start
thinking about using a code generator. In this post I have shown you how to
use <code class="language-plaintext highlighter-rouge">cookiecutter</code> to produce a basic Python package template.</p>
<p>However, it is not limited to Python package projects. You could use
it to automate the setup of CMake / HTML / LaTeX files; the world is your
oyster.</p>
<p>Happy code generating!</p>
How to manage firewalls using ferm and Ansible2015-04-24T00:00:00+00:00http://tjelvarolsson.com/blog/how-to-manage-firewalls-using-ferm-and-ansible<p><img src="/images/firewall.jpg" alt="Firewall." /></p>
<p>In the
<a href="/blog/ansible-playbook-for-installing-the-gbrowse-genome-browser/">previous post</a>
we created an Ansible playbook for installing the
GBrowse genome browser. As the name implies GBrowse is a browser based
application and it serves web pages over HTTP using Apache. If one is
installing this software as a service to be made more widely accessible one
needs to start thinking about security. In this post we will therefore
configure a firewall for our machine.</p>
<h2 id="iptables">iptables</h2>
<p>The standard tool for setting up firewalls on Linux is <code class="language-plaintext highlighter-rouge">iptables</code>. It is a
way to set up policy chains to allow or block traffic to, from and through the
machine of interest. If you have not come across or managed <code class="language-plaintext highlighter-rouge">iptables</code> before
I recommend that you have a look at howtogeek’s <a href="http://www.howtogeek.com/177621/the-beginners-guide-to-iptables-the-linux-firewall/">Beginner’s Guide to iptables,
the Linux
Firewall</a>
and Major Hayden’s <a href="https://major.io/2010/04/12/best-practices-iptables/">Best practices:
iptables</a>.</p>
<h2 id="ferm">ferm</h2>
<p>However, managing firewalls using <code class="language-plaintext highlighter-rouge">iptables</code> can be a pain. Several tools have
therefore evolved to make things easier. In this post we will be using a
program called <a href="http://ferm.foo-projects.org">ferm</a> (for Easy Rule Making).</p>
<p>When configuring a firewall it is easy to lock oneself out of the machine one
is configuring. The most common scenario for this is setting the default policy
to drop incoming connections and then accidentally flushing the connection
rules, including the rule to accept <code class="language-plaintext highlighter-rouge">ssh</code> connections, leaving the server
inaccessible. To avoid this scenario we will configure the default policy to
accept incoming connections and to secure the server we will include a rule to
drop any incoming connections that do not match any other rules.</p>
<p>Below is a list stating the behaviour that we want from the <code class="language-plaintext highlighter-rouge">INPUT</code> chain of
our firewall.</p>
<ul>
<li>We want the default policy to accept incoming connections</li>
<li>We want to enable connection tracking</li>
<li>We want to be able to <code class="language-plaintext highlighter-rouge">ping</code> the machine</li>
<li>We want to be able to <code class="language-plaintext highlighter-rouge">ssh</code> into the machine</li>
<li>We want to be able to add custom rules using Ansible</li>
<li>Finally, we want to drop any incoming connections that do not match any rules</li>
</ul>
<p>The behaviours that we want from the <code class="language-plaintext highlighter-rouge">OUTPUT</code> and <code class="language-plaintext highlighter-rouge">FORWARD</code> chains are
simpler. We do not want to limit any outgoing connections so we will set the
output policy to accept all connections and because we are not configuring a
router we will set the policy of the forward chain to drop all connections.</p>
<p>We can configure the behaviour above using the <code class="language-plaintext highlighter-rouge">ferm.conf</code> file below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Ferm script for configuring iptables.
table filter {
    chain INPUT {
        # Set the default policy to ACCEPT to avoid getting
        # accidentally locked out.
        policy ACCEPT;
        # Connection tracking.
        mod state state INVALID DROP;
        mod state state (ESTABLISHED RELATED) ACCEPT;
        # Allow local connections.
        interface lo ACCEPT;
        # Respond to ping.
        proto icmp icmp-type echo-request ACCEPT;
        # Allow ssh connections.
        proto tcp dport ssh ACCEPT;
        # Ansible specified rules.
        # Because the default policy is to ACCEPT we DROP
        # everything that comes through to this stage.
        DROP;
    }
    # Outgoing connections are not limited.
    chain OUTPUT policy ACCEPT;
    # This is not a router.
    chain FORWARD policy DROP;
}
</code></pre></div></div>
<p>If you have <code class="language-plaintext highlighter-rouge">ferm</code> installed you can apply the firewall above using the command below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ sudo ferm ferm.conf
</code></pre></div></div>
<p>Note that in the <code class="language-plaintext highlighter-rouge">ferm.conf</code> file above we have an empty section marked by the comment
<code class="language-plaintext highlighter-rouge"># Ansible specified rules.</code>. We will use this to dynamically alter the
firewall rules during the running of our Ansible playbook.</p>
<h2 id="integrating-ferm-with-ansible">Integrating ferm with Ansible</h2>
<p>Let us create an Ansible role for installing and configuring <code class="language-plaintext highlighter-rouge">ferm</code>. Copy
and paste the code below into a file named <code class="language-plaintext highlighter-rouge">roles/ferm/tasks/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># Install and configure ferm.</span>
<span class="c1">#</span>
<span class="c1"># The ferm program is in the epel repository so we need</span>
<span class="c1"># to enable it. This could be a separate role, but this</span>
<span class="c1"># is left as an exercise for the reader.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">enable the epel repo</span>
  <span class="na">yum</span><span class="pi">:</span> <span class="s">name=epel-release</span>
       <span class="s">state=present</span>
<span class="c1"># We need to install libselinux-python on the target</span>
<span class="c1"># machine to be able to use Ansible to copy the ferm.conf</span>
<span class="c1"># file to the /etc/ferm/ directory. It would be reasonable</span>
<span class="c1"># to move this task into a separate role for installing common</span>
<span class="c1"># software, again this is left as an exercise for the reader.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install libselinux-python</span>
  <span class="na">yum</span><span class="pi">:</span> <span class="s">name=libselinux-python</span>
       <span class="s">state=present</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install ferm</span>
  <span class="na">yum</span><span class="pi">:</span> <span class="s">name=ferm</span>
       <span class="s">state=present</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">add /etc/ferm directory</span>
  <span class="na">file</span><span class="pi">:</span> <span class="s">path=/etc/ferm</span>
        <span class="s">mode=0700</span>
        <span class="s">state=directory</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">add the ferm.conf file to /etc/ferm</span>
  <span class="na">copy</span><span class="pi">:</span> <span class="s">src=ferm.conf</span>
        <span class="s">dest=/etc/ferm/ferm.conf</span>
  <span class="na">notify</span><span class="pi">:</span> <span class="s">run ferm</span>
</code></pre></div></div>
<p>Note that the last task copies the <code class="language-plaintext highlighter-rouge">ferm.conf</code> file we created above to the
target machine. However, for this to work Ansible expects the <code class="language-plaintext highlighter-rouge">ferm.conf</code>
file to be located in the directory named <code class="language-plaintext highlighter-rouge">roles/ferm/files/</code>. Let us
therefore create this directory and move the file there.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir roles/ferm/files
$ mv ferm.conf roles/ferm/files/
</code></pre></div></div>
<p>In the
<a href="/blog/ansible-playbook-for-installing-the-gbrowse-genome-browser/">previous post</a>
I introduced the concept of handlers that could be
notified by other tasks. Let us create a handler for applying the <code class="language-plaintext highlighter-rouge">ferm</code>
rules. Copy and paste the code below into a file named
<code class="language-plaintext highlighter-rouge">roles/ferm/handlers/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">run ferm</span>
  <span class="na">command</span><span class="pi">:</span> <span class="s">ferm /etc/ferm/ferm.conf</span>
  <span class="na">notify</span><span class="pi">:</span> <span class="s">save iptables</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">save iptables</span>
  <span class="na">command</span><span class="pi">:</span> <span class="s">service iptables save</span>
</code></pre></div></div>
<p>Now we have a handler named <code class="language-plaintext highlighter-rouge">run ferm</code>, which when notified will run the
command <code class="language-plaintext highlighter-rouge">ferm /etc/ferm/ferm.conf</code> and in turn notify the <code class="language-plaintext highlighter-rouge">save iptables</code>
handler, which makes sure that the firewall rules persist if the machine is
rebooted.</p>
<p>Let us add this role to our playbook. Update the <code class="language-plaintext highlighter-rouge">gbrowse.yml</code> file so that
it looks like the below (we have only added the <code class="language-plaintext highlighter-rouge">ferm</code> role).</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
  <span class="na">sudo</span><span class="pi">:</span> <span class="s">True</span>
  <span class="na">roles</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">ferm</span>
    <span class="pi">-</span> <span class="s">gbrowse</span>
</code></pre></div></div>
<p>However, if you run the <code class="language-plaintext highlighter-rouge">gbrowse.yml</code> playbook at this point the GBrowse
application will stop working as port 80 will be closed. Let us therefore add a
task to open up ports 80 (http) and 443 (https) to the <code class="language-plaintext highlighter-rouge">apache</code> role. Edit
the file <code class="language-plaintext highlighter-rouge">roles/apache/tasks/main.yml</code> to look like the below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># Install and configure Apache.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install apache</span>
  <span class="na">yum</span><span class="pi">:</span> <span class="s">name=httpd</span>
       <span class="s">state=present</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">start apache and enable at boot</span>
  <span class="na">service</span><span class="pi">:</span> <span class="s">name=httpd</span>
           <span class="s">enabled=yes</span>
           <span class="s">state=started</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">open up the http and https ports</span>
  <span class="na">lineinfile</span><span class="pi">:</span> <span class="s">dest=/etc/ferm/ferm.conf</span>
              <span class="s">line='proto tcp dport (http https) ACCEPT;'</span>
              <span class="s">insertafter='# Ansible specified rules.'</span>
  <span class="na">notify</span><span class="pi">:</span> <span class="s">run ferm</span>
</code></pre></div></div>
<p>In the above we make use of Ansible’s <a href="http://docs.ansible.com/lineinfile_module.html">lineinfile
module</a> to insert a new rule to
the <code class="language-plaintext highlighter-rouge">ferm.conf</code> file.</p>
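<p>The same pattern should work for any other role that needs a port opened. As
a minimal sketch, a hypothetical <code class="language-plaintext highlighter-rouge">postgresql</code> role (not part of this post)
could open port 5432 with a task like the one below. The role name and the port
number are assumptions chosen purely for illustration, and the <code class="language-plaintext highlighter-rouge">ferm</code> role
needs to be part of the same play for the <code class="language-plaintext highlighter-rouge">run ferm</code> handler to be found.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical task for roles/postgresql/tasks/main.yml (illustration only).
- name: open up the postgresql port
  lineinfile:
    dest: /etc/ferm/ferm.conf
    line: 'proto tcp dport 5432 ACCEPT;'
    insertafter: '# Ansible specified rules.'
  notify: run ferm
</code></pre></div></div>
<p>Because the rule is inserted after the marker comment it ends up before the
final <code class="language-plaintext highlighter-rouge">DROP;</code> statement, which is what allows the connection through.</p>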
<h2 id="results">Results</h2>
<p>Let us run the playbook and find out what the resulting <code class="language-plaintext highlighter-rouge">iptables</code> firewall
looks like. Here I am using the same Vagrant/Ansible setup as described in <a href="/blog/how-to-create-automated-and-reproducible-work-flows-for-installing-scientific-software/">how
to create automated and reproducible work flows for installing scientific
software</a>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ansible-playbook -i hosts gbrowse.yml
...
$ vagrant ssh
Last login: Thu Apr 16 02:03:00 2015 from 192.168.33.1
[vagrant@localhost ~]$ sudo iptables -nL
Chain INPUT (policy ACCEPT)
target prot opt source destination
DROP all -- 0.0.0.0/0 0.0.0.0/0 state INVALID
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED
ACCEPT all -- 0.0.0.0/0 0.0.0.0/0
ACCEPT icmp -- 0.0.0.0/0 0.0.0.0/0 icmp type 8
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:22
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:80
ACCEPT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:443
DROP all -- 0.0.0.0/0 0.0.0.0/0
Chain FORWARD (policy DROP)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
</code></pre></div></div>
<h2 id="discussion">Discussion</h2>
<p>If you have public-facing machines you need to think about security. However,
managing firewalls using <code class="language-plaintext highlighter-rouge">iptables</code> directly can be a pain.</p>
<p>In this post I have outlined how you can integrate <code class="language-plaintext highlighter-rouge">ferm</code> and Ansible to
manage your firewall. The cool thing about this approach is that the role of
interest, in this case <code class="language-plaintext highlighter-rouge">apache</code>, is responsible for opening up the relevant
ports.</p>
<p>Furthermore, as the <code class="language-plaintext highlighter-rouge">/etc/ferm/ferm.conf</code> file is re-written every time
you run the playbook, your rules are updated whenever you add or remove roles
from the playbook. In other words, if you removed the <code class="language-plaintext highlighter-rouge">apache</code> role and ran
the playbook, ports 80 and 443 would be closed at the end when the handlers were
executed (handlers notified during a playbook are executed at the end of it).</p>
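<p>If you need the firewall rules to be applied before the end of the play,
Ansible’s <code class="language-plaintext highlighter-rouge">meta: flush_handlers</code> task runs any handlers that have been notified
so far. The snippet below is only a sketch: it assumes that you place the
<code class="language-plaintext highlighter-rouge">meta</code> task straight after the task that notifies <code class="language-plaintext highlighter-rouge">run ferm</code>, which is not
something the roles in this post do.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>- name: open up the http and https ports
  lineinfile:
    dest: /etc/ferm/ferm.conf
    line: 'proto tcp dport (http https) ACCEPT;'
    insertafter: '# Ansible specified rules.'
  notify: run ferm

# Assumption: we want the new rules applied now rather than at the
# end of the play, so we run the handlers notified so far.
- meta: flush_handlers
</code></pre></div></div>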
<p>Finally, note that security is a complex topic and that the reading of this
post should not be taken as a substitute for a proper understanding of how to
manage firewalls. That is a roundabout way of stating that I do not take
responsibility for any security breaches that you encounter.</p>
Ansible playbook for installing the GBrowse genome browser2015-04-18T00:00:00+00:00http://tjelvarolsson.com/blog/ansible-playbook-for-installing-the-gbrowse-genome-browser<p><img src="/images/gbrowse-screenshot.png" alt="GBrowse screenshot" /></p>
<p>In previous posts I have described <a href="/blog/how-to-create-automated-and-reproducible-work-flows-for-installing-scientific-software/">how to use ansible to create automated and
reproducible work flows for installing scientific
software</a>
and <a href="/blog/how-to-create-reusable-ansible-components/">how to create reusable Ansible
components</a>.
In this post we will create a playbook for installing the genome browser
<a href="http://gbrowse.org/index.html">GBrowse</a> and in the process we will learn how
to install and manage services, such as Apache, using Ansible.</p>
<h2 id="adding-biographics-to-the-bio_perl-role">Adding <code class="language-plaintext highlighter-rouge">Bio::Graphics</code> to the <code class="language-plaintext highlighter-rouge">bio_perl</code> role</h2>
<p>GBrowse depends not only on <code class="language-plaintext highlighter-rouge">Bio::Perl</code> but also on
<code class="language-plaintext highlighter-rouge">Bio::Graphics</code>. At this point we could add a role for installing
<code class="language-plaintext highlighter-rouge">Bio::Graphics</code>. However, I prefer to add the installation of it to the
existing <code class="language-plaintext highlighter-rouge">bio_perl</code> role.</p>
<p>It turns out that <code class="language-plaintext highlighter-rouge">Bio::Graphics</code> depends on <code class="language-plaintext highlighter-rouge">GD</code>, which I struggled to
install using <code class="language-plaintext highlighter-rouge">cpanm</code>. However, it is available in a pre-compiled form from
the CentOS repositories so we can install <code class="language-plaintext highlighter-rouge">perl-GD</code> from there using <code class="language-plaintext highlighter-rouge">yum</code>.</p>
<p>Furthermore, it turned out that <code class="language-plaintext highlighter-rouge">Bio::Graphics</code> has an implicit
dependency on the <code class="language-plaintext highlighter-rouge">CGI</code> module.</p>
<p>Please update the <code class="language-plaintext highlighter-rouge">roles/bio_perl/tasks/main.yml</code> file to look like the below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># Install and configure Bio::Perl and Bio::Graphics.</span>
<span class="c1"># Bio::Graphics requires GD.</span>
<span class="c1"># However, I cannot work out how to install GD using cpanm,</span>
<span class="c1"># so installing it using yum instead.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install perl-GD</span>
  <span class="na">yum</span><span class="pi">:</span> <span class="s">name=perl-GD</span>
       <span class="s">state=present</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install implicit Bio::Perl dependencies</span>
  <span class="na">cpanm</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
  <span class="na">with_items</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">Time::HiRes</span>
    <span class="pi">-</span> <span class="s">LWP::UserAgent</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install implicit Bio::Graphics dependencies</span>
  <span class="na">cpanm</span><span class="pi">:</span> <span class="s">name=CGI</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install BioPerl</span>
  <span class="na">cpanm</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
  <span class="na">with_items</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">Bio::Perl</span>
    <span class="pi">-</span> <span class="s">Bio::Graphics</span>
</code></pre></div></div>
<h2 id="installing-and-configuring-apache">Installing and configuring Apache</h2>
<p>As the name implies, GBrowse is a web-based tool, so to serve it we need to
install Apache. Let us create a new role for this. Copy and paste the
text below into the file <code class="language-plaintext highlighter-rouge">roles/apache/tasks/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># Install and configure Apache.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install apache</span>
  <span class="na">yum</span><span class="pi">:</span> <span class="s">name=httpd</span>
       <span class="s">state=present</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">start apache and enable at boot</span>
  <span class="na">service</span><span class="pi">:</span> <span class="s">name=httpd</span>
           <span class="s">enabled=yes</span>
           <span class="s">state=started</span>
</code></pre></div></div>
<p>The code above introduces us to the Ansible <a href="http://docs.ansible.com/service_module.html">service
module</a>. The service module is
used to interact with services managed by <code class="language-plaintext highlighter-rouge">init</code> (or <code class="language-plaintext highlighter-rouge">systemd</code> on CentOS
7). In the <code class="language-plaintext highlighter-rouge">service</code> task above we ask for the service to be started and for
it to be enabled at boot.</p>
<p>Now suppose that we wanted to restart Apache at some point in our Ansible
script, for example after having installed another piece of software that is
served by Apache, such as GBrowse. This can be achieved using Ansible’s concept
of handlers. Let us therefore add a handler for restarting apache. Copy and
paste the code below into the file <code class="language-plaintext highlighter-rouge">roles/apache/handlers/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">restart apache</span>
  <span class="na">service</span><span class="pi">:</span> <span class="s">name=httpd</span>
           <span class="s">state=restarted</span>
</code></pre></div></div>
<p>Now any task in a playbook that makes use of the <code class="language-plaintext highlighter-rouge">apache</code> role can restart
Apache by adding the directive <code class="language-plaintext highlighter-rouge">notify: restart apache</code>. We will see an
example of this later on in the post towards the end of the <code class="language-plaintext highlighter-rouge">gbrowse</code> role.</p>
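<p>A handler does not have to do a full restart. As a sketch, the additional
handler below uses the <code class="language-plaintext highlighter-rouge">reloaded</code> state of the <code class="language-plaintext highlighter-rouge">service</code> module to ask Apache to
re-read its configuration. The handler name <code class="language-plaintext highlighter-rouge">reload apache</code> is an assumption
made for illustration and is not used elsewhere in this post.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Hypothetical extra handler for roles/apache/handlers/main.yml.
- name: reload apache
  service: name=httpd
           state=reloaded
</code></pre></div></div>
<p>A task would trigger it with the directive <code class="language-plaintext highlighter-rouge">notify: reload apache</code>.</p>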
<h2 id="creating-the-gbrowse-role">Creating the <code class="language-plaintext highlighter-rouge">gbrowse</code> role</h2>
<p>We are now in a position to create the <code class="language-plaintext highlighter-rouge">gbrowse</code> role for configuring and
installing the GBrowse software. Let us start by defining the Ansible roles it
depends on. Copy and paste the code below into the file
<code class="language-plaintext highlighter-rouge">roles/gbrowse/meta/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">dependencies</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span> <span class="nv">role</span><span class="pi">:</span> <span class="nv">apache</span> <span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span> <span class="nv">role</span><span class="pi">:</span> <span class="nv">bio_perl</span> <span class="pi">}</span>
</code></pre></div></div>
<p>GBrowse has got pretty good <a href="http://search.cpan.org/src/LDS/GBrowse-2.54/README">installation
notes</a> and following them
we only need to deal with a couple of issues: some undocumented Perl module
dependencies and the fact that the resulting <code class="language-plaintext highlighter-rouge">Build</code> script requires
interactive answers. The former is easy to deal with: we simply install the
missing Perl modules using <code class="language-plaintext highlighter-rouge">cpanm</code>. However, the latter is trickier.</p>
<p>Ansible is not really meant to deal with interactive tasks. This means that
installers that ask a lot of questions pose a problem. Fortunately, in
this case the <code class="language-plaintext highlighter-rouge">./Build config</code> command provides sensible defaults that we can
accept and we can simply answer no to all the questions posed by <code class="language-plaintext highlighter-rouge">./Build
install</code>. This means that we can use a workaround outlined in a <a href="http://marvelley.com/blog/2014/04/23/handling-interactive-ansible-tasks/">post by Craig
Marvelley</a>.</p>
<p>Copy and paste the code below into the file <code class="language-plaintext highlighter-rouge">roles/gbrowse/tasks/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># Install and configure the gbrowse genome browser.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install undocumented dependencies</span>
  <span class="na">cpanm</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
  <span class="na">with_items</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">Date::Parse</span>
    <span class="pi">-</span> <span class="s">Term::ReadKey</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install remaining perl module dependencies</span>
  <span class="na">cpanm</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
  <span class="na">with_items</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">CGI::Session</span>
    <span class="pi">-</span> <span class="s">Digest::MD5</span>
    <span class="pi">-</span> <span class="s">File::Temp</span>
    <span class="pi">-</span> <span class="s">IO::String</span>
    <span class="pi">-</span> <span class="s">JSON</span>
    <span class="pi">-</span> <span class="s">Storable</span>
    <span class="pi">-</span> <span class="s">Statistics::Descriptive</span>
    <span class="pi">-</span> <span class="s">DBI</span>
    <span class="pi">-</span> <span class="s">Net::SMTP</span>
    <span class="pi">-</span> <span class="s">DBD::SQLite</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">download the gbrowse tarball</span>
  <span class="na">get_url</span><span class="pi">:</span> <span class="s">url=http://search.cpan.org/CPAN/authors/id/L/LD/LDS/GBrowse-2.54.tar.gz</span>
           <span class="s">dest=/tmp/</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">unpack the gbrowse tarball</span>
  <span class="na">command</span><span class="pi">:</span> <span class="s">tar -zxf GBrowse-2.54.tar.gz</span>
  <span class="na">args</span><span class="pi">:</span>
    <span class="na">chdir</span><span class="pi">:</span> <span class="s">/tmp/</span>
    <span class="na">creates</span><span class="pi">:</span> <span class="s">/tmp/GBrowse-2.54/LICENSE</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">build the installer</span>
  <span class="na">command</span><span class="pi">:</span> <span class="s">perl Build.PL</span>
  <span class="na">args</span><span class="pi">:</span>
    <span class="na">chdir</span><span class="pi">:</span> <span class="s">/tmp/GBrowse-2.54/</span>
    <span class="na">creates</span><span class="pi">:</span> <span class="s">/tmp/GBrowse-2.54/Build</span>
<span class="c1"># For more detail on ``yes ' ' |`` syntax for accepting default values see:</span>
<span class="c1"># http://marvelley.com/blog/2014/04/23/handling-interactive-ansible-tasks/</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">configure the install accepting all default values</span>
  <span class="na">shell</span><span class="pi">:</span> <span class="s">yes '' | ./Build config</span>
  <span class="na">args</span><span class="pi">:</span>
    <span class="na">chdir</span><span class="pi">:</span> <span class="s">/tmp/GBrowse-2.54/</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install gbrowse answering no to all interactive questions</span>
  <span class="na">shell</span><span class="pi">:</span> <span class="s">yes 'n' | ./Build install</span>
  <span class="na">args</span><span class="pi">:</span>
    <span class="na">chdir</span><span class="pi">:</span> <span class="s">/tmp/GBrowse-2.54/</span>
    <span class="na">creates</span><span class="pi">:</span> <span class="s">/etc/httpd/conf.d/gbrowse2.conf</span>
  <span class="na">notify</span><span class="pi">:</span> <span class="s">restart apache</span>
</code></pre></div></div>
<p>Note the <code class="language-plaintext highlighter-rouge">notify: restart apache</code> directive added to the final task above.
This will ensure that Apache is restarted after GBrowse has been installed.</p>
<p>One of the questions we answer “no” to in the interactive installer is whether to
register our use of GBrowse. If you find this tool useful, its developers
would appreciate it if you registered. You can do this at any point by running the
command <code class="language-plaintext highlighter-rouge">./Build register</code>.</p>
<h2 id="creating-the-playbook">Creating the playbook</h2>
<p>Now create a playbook named <code class="language-plaintext highlighter-rouge">gbrowse.yml</code> at the same level as your
<code class="language-plaintext highlighter-rouge">roles</code> directory with the code below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
  <span class="na">sudo</span><span class="pi">:</span> <span class="s">True</span>
  <span class="na">roles</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="s">gbrowse</span>
</code></pre></div></div>
<p>I am using the same Vagrant setup as outlined in the post on <a href="/blog/how-to-create-automated-and-reproducible-work-flows-for-installing-scientific-software/">how to create
automated and reproducible work flows for installing scientific
software</a>.
So to run the playbook I simply use the command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ansible-playbook -i hosts gbrowse.yml
</code></pre></div></div>
<p>When the playbook had finished running I could view the GBrowse application
in my browser by going to the URL <code class="language-plaintext highlighter-rouge">http://192.168.33.10/gbrowse2/</code>
(<code class="language-plaintext highlighter-rouge">192.168.33.10</code> being the private network IP address specified in the Vagrant file from
the previous post).</p>
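<p>If you would rather have Ansible confirm that GBrowse is being served, one
option is a check using the <code class="language-plaintext highlighter-rouge">uri</code> module. The version of the playbook below is
a sketch and not part of the original setup: it assumes that the check runs on
the target machine itself (hence <code class="language-plaintext highlighter-rouge">localhost</code>) and that the <code class="language-plaintext highlighter-rouge">uri</code> module’s
requirements are met there, which depending on your Ansible version may mean
installing an extra Python library. The check is placed in <code class="language-plaintext highlighter-rouge">post_tasks</code> so
that it runs after the handlers notified by the roles, including
<code class="language-plaintext highlighter-rouge">restart apache</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code>---
- hosts: all
  sudo: True
  roles:
    - gbrowse
  post_tasks:
    # Hypothetical check; fails the play if the page does not return 200.
    - name: check that gbrowse responds on the target machine
      uri:
        url: http://localhost/gbrowse2/
        status_code: 200
</code></pre></div></div>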
<h2 id="conclusion">Conclusion</h2>
<p>In this post I have shown you how to create a reproducible and automated work
flow for installing the GBrowse genome browser using Ansible.</p>
<p>We created a role for installing and managing Apache. This introduced us to
Ansible’s <code class="language-plaintext highlighter-rouge">service</code> module and the concept of “handlers” that can be
“notified” by other tasks in a playbook.</p>
<p>In the
<a href="/blog/how-to-manage-firewalls-using-ferm-and-ansible/">next post</a>
we will look into how we can manage the firewall of our
machine using Ansible and ferm.</p>
How to create reusable Ansible components2015-04-11T00:00:00+00:00http://tjelvarolsson.com/blog/how-to-create-reusable-ansible-components<p><img src="/images/reusable-components-pop-art.jpg" alt="Tools pop art." /></p>
<p>In the <a href="/blog/how-to-create-automated-and-reproducible-work-flows-for-installing-scientific-software/">previous post</a>
I described how to create reproducible and automated work flows for installing
scientific software using <a href="http://www.ansible.com/home">Ansible</a>. In the end we
had an Ansible playbook for installing <code class="language-plaintext highlighter-rouge">Bio::Perl</code>. The playbook did
many things. It installed <code class="language-plaintext highlighter-rouge">gcc</code> and <code class="language-plaintext highlighter-rouge">cpanm</code> as well as <code class="language-plaintext highlighter-rouge">Bio::Perl</code>. In
this post I will show how we can split these tasks out into reusable components
using Ansible’s concept of “roles”.</p>
<p>Let us have a look at the Ansible playbook from the end of the previous post.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
  <span class="na">sudo</span><span class="pi">:</span> <span class="s">True</span>
  <span class="na">tasks</span><span class="pi">:</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install gcc required to build some Perl modules</span>
      <span class="na">yum</span><span class="pi">:</span> <span class="s">name=gcc</span>
           <span class="s">state=present</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install cpan and perl-devel</span>
      <span class="na">yum</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
           <span class="s">state=present</span>
      <span class="na">with_items</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s">perl-devel</span>
        <span class="pi">-</span> <span class="s">perl-CPAN</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">download cpanm</span>
      <span class="na">get_url</span><span class="pi">:</span> <span class="s">url=https://cpanmin.us/</span>
               <span class="s">dest=/tmp/cpanm.pl</span>
               <span class="s">mode=755</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install cpanm so that we can use the ansible cpanm module</span>
      <span class="na">command</span><span class="pi">:</span> <span class="s">perl cpanm.pl App::cpanminus</span>
      <span class="na">args</span><span class="pi">:</span>
        <span class="na">chdir</span><span class="pi">:</span> <span class="s">/tmp/</span>
        <span class="na">creates</span><span class="pi">:</span> <span class="s">/usr/local/bin/cpanm</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">add cpanm symbolic link to /usr/bin/</span>
      <span class="na">file</span><span class="pi">:</span> <span class="s">src=/usr/local/bin/cpanm</span>
            <span class="s">dest=/usr/bin/cpanm</span>
            <span class="s">state=link</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install implicit Bio::Perl dependencies</span>
      <span class="na">cpanm</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
      <span class="na">with_items</span><span class="pi">:</span>
        <span class="pi">-</span> <span class="s">Time::HiRes</span>
        <span class="pi">-</span> <span class="s">LWP::UserAgent</span>
    <span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install BioPerl</span>
      <span class="na">cpanm</span><span class="pi">:</span> <span class="s">name=Bio::Perl</span>
</code></pre></div></div>
<p>Looking at the above there are at least three reusable roles: <code class="language-plaintext highlighter-rouge">build_tools</code>
(for installing <code class="language-plaintext highlighter-rouge">gcc</code>; this role could grow to include more build tools in the
future), <code class="language-plaintext highlighter-rouge">cpanm</code> (for installing and configuring <code class="language-plaintext highlighter-rouge">cpanm</code>), and <code class="language-plaintext highlighter-rouge">bio_perl</code>
(for installing <code class="language-plaintext highlighter-rouge">Bio::Perl</code> and its implicit dependencies). I guess one could
argue that the implicit dependencies of <code class="language-plaintext highlighter-rouge">Bio::Perl</code> could be split out into
individual roles, but for now I think that would be too granular.</p>
<p>To create Ansible roles we need a directory named <code class="language-plaintext highlighter-rouge">roles</code>. Let us create it
along with the directories required for the <code class="language-plaintext highlighter-rouge">build_tools</code> role.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir -p roles/build_tools/tasks
</code></pre></div></div>
<p>Now we move the task of installing <code class="language-plaintext highlighter-rouge">gcc</code> into the <code class="language-plaintext highlighter-rouge">build_tools</code> role by
copying and pasting the text below into the file
<code class="language-plaintext highlighter-rouge">roles/build_tools/tasks/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># Install and configure build tools.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install gcc</span>
<span class="na">yum</span><span class="pi">:</span> <span class="s">name=gcc</span>
<span class="s">state=present</span>
</code></pre></div></div>
<p>We now need to remove the <code class="language-plaintext highlighter-rouge">gcc</code> task from the playbook and add the
<code class="language-plaintext highlighter-rouge">build_tools</code> role. Modify the <code class="language-plaintext highlighter-rouge">playbook.yml</code> file to look like the
below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
<span class="na">sudo</span><span class="pi">:</span> <span class="s">True</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">build_tools</span>
<span class="na">tasks</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install cpan and perl-devel</span>
<span class="na">yum</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
<span class="s">state=present</span>
<span class="na">with_items</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">perl-devel</span>
<span class="pi">-</span> <span class="s">perl-CPAN</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">download cpanm</span>
<span class="na">get_url</span><span class="pi">:</span> <span class="s">url=https://cpanmin.us/</span>
<span class="s">dest=/tmp/cpanm.pl</span>
<span class="s">mode=755</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install cpanm so that we can use the ansible cpanm module</span>
<span class="na">command</span><span class="pi">:</span> <span class="s">perl cpanm.pl App::cpanminus</span>
<span class="na">args</span><span class="pi">:</span>
<span class="na">chdir</span><span class="pi">:</span> <span class="s">/tmp/</span>
<span class="na">creates</span><span class="pi">:</span> <span class="s">/usr/local/bin/cpanm</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">add cpanm symbolic link to /usr/bin/</span>
<span class="na">file</span><span class="pi">:</span> <span class="s">src=/usr/local/bin/cpanm</span>
<span class="s">dest=/usr/bin/cpanm</span>
<span class="s">state=link</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install implicit Bio::Perl dependencies</span>
<span class="na">cpanm</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
<span class="na">with_items</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">Time::HiRes</span>
<span class="pi">-</span> <span class="s">LWP::UserAgent</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install BioPerl</span>
<span class="na">cpanm</span><span class="pi">:</span> <span class="s">name=Bio::Perl</span>
</code></pre></div></div>
<p>In the above it is worth noting that one can mix <code class="language-plaintext highlighter-rouge">roles</code> and <code class="language-plaintext highlighter-rouge">tasks</code> in the
same playbook. This is useful when one wants to create a playbook that makes
use of some reusable roles but which also needs to perform some non-reusable
tasks.</p>
<p>Now we can try running the playbook to make sure that we have not broken
anything. Note that the output now reflects the fact that the <code class="language-plaintext highlighter-rouge">install gcc</code>
task is being called from within the <code class="language-plaintext highlighter-rouge">build_tools</code> role.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ansible-playbook -i hosts playbook.yml
PLAY [all] ********************************************************************
GATHERING FACTS ***************************************************************
ok: [scicomp.example.com]
TASK: [build_tools | install gcc] *********************************************
changed: [scicomp.example.com]
...
</code></pre></div></div>
<p>Let us now create directory structures for the <code class="language-plaintext highlighter-rouge">cpanm</code> and <code class="language-plaintext highlighter-rouge">bio_perl</code> roles.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir -p roles/{cpanm,bio_perl}/tasks
</code></pre></div></div>
<p>For the <code class="language-plaintext highlighter-rouge">cpanm</code> role cut and paste the code below into the file
<code class="language-plaintext highlighter-rouge">roles/cpanm/tasks/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># Install and configure cpanm.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install cpan and perl-devel</span>
<span class="na">yum</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
<span class="s">state=present</span>
<span class="na">with_items</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">perl-devel</span>
<span class="pi">-</span> <span class="s">perl-CPAN</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">download cpanm</span>
<span class="na">get_url</span><span class="pi">:</span> <span class="s">url=https://cpanmin.us/</span>
<span class="s">dest=/tmp/cpanm.pl</span>
<span class="s">mode=755</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install cpanm so that we can use the ansible cpanm module</span>
<span class="na">command</span><span class="pi">:</span> <span class="s">perl cpanm.pl App::cpanminus</span>
<span class="na">args</span><span class="pi">:</span>
<span class="na">chdir</span><span class="pi">:</span> <span class="s">/tmp/</span>
<span class="na">creates</span><span class="pi">:</span> <span class="s">/usr/local/bin/cpanm</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">add cpanm symbolic link to /usr/bin/</span>
<span class="na">file</span><span class="pi">:</span> <span class="s">src=/usr/local/bin/cpanm</span>
<span class="s">dest=/usr/bin/cpanm</span>
<span class="s">state=link</span>
</code></pre></div></div>
<p>And the code below into the file <code class="language-plaintext highlighter-rouge">roles/bio_perl/tasks/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># Install and configure Bio::Perl.</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install implicit Bio::Perl dependencies</span>
<span class="na">cpanm</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
<span class="na">with_items</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">Time::HiRes</span>
<span class="pi">-</span> <span class="s">LWP::UserAgent</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install BioPerl</span>
<span class="na">cpanm</span><span class="pi">:</span> <span class="s">name=Bio::Perl</span>
</code></pre></div></div>
<p>Finally let us update the <code class="language-plaintext highlighter-rouge">playbook.yml</code> file so that it looks like the below.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
<span class="na">sudo</span><span class="pi">:</span> <span class="s">True</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">build_tools</span>
<span class="pi">-</span> <span class="s">cpanm</span>
<span class="pi">-</span> <span class="s">bio_perl</span>
</code></pre></div></div>
<p>That is much cleaner! Furthermore the roles can be reused in other
playbooks as and when we need them.</p>
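<p>Since roles are found purely by directory convention, a mistyped path is an easy mistake to make. Below is a minimal shell sketch, not part of the original workflow, that checks that each role named in the playbook has a <code class="language-plaintext highlighter-rouge">tasks/main.yml</code> file. The scratch directory and the placeholder files are assumptions made so that the sketch can be run anywhere; in a real project you would run only the second loop from the directory containing <code class="language-plaintext highlighter-rouge">playbook.yml</code>.</p>

```shell
#!/bin/sh
# Sketch: check that every role listed in the playbook has a tasks/main.yml.
# The scratch directory and placeholder files are illustrative assumptions.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Recreate the layout described in this post with placeholder task files.
for role in build_tools cpanm bio_perl; do
    mkdir -p "roles/$role/tasks"
    printf '%s\n' '---' > "roles/$role/tasks/main.yml"
done

# Report any role that is missing its entry point.
missing=0
for role in build_tools cpanm bio_perl; do
    if [ -f "roles/$role/tasks/main.yml" ]; then
        echo "ok: $role"
    else
        echo "missing: roles/$role/tasks/main.yml"
        missing=1
    fi
done
```

<p>If a role is reported as missing, the most likely cause is a typo in the directory name or a file that is not called <code class="language-plaintext highlighter-rouge">main.yml</code>.</p>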
<h2 id="adding-dependencies">Adding dependencies</h2>
<p>It might be obvious to us now that the <code class="language-plaintext highlighter-rouge">bio_perl</code> role depends on the
<code class="language-plaintext highlighter-rouge">build_tools</code> and <code class="language-plaintext highlighter-rouge">cpanm</code> roles. However, it may be less obvious as the
playbook grows or when we want to create a new playbook that makes use of the
<code class="language-plaintext highlighter-rouge">bio_perl</code> role.</p>
<p>It is possible to make dependencies explicit when using Ansible roles. To
do this we will need to add a <code class="language-plaintext highlighter-rouge">meta</code> directory to our <code class="language-plaintext highlighter-rouge">bio_perl</code> role.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ mkdir roles/bio_perl/meta
</code></pre></div></div>
<p>Now copy and paste the code below into the file <code class="language-plaintext highlighter-rouge">roles/bio_perl/meta/main.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="na">dependencies</span><span class="pi">:</span>
<span class="pi">-</span> <span class="pi">{</span> <span class="nv">role</span><span class="pi">:</span> <span class="nv">build_tools</span> <span class="pi">}</span>
<span class="pi">-</span> <span class="pi">{</span> <span class="nv">role</span><span class="pi">:</span> <span class="nv">cpanm</span> <span class="pi">}</span>
</code></pre></div></div>
<p>At this point one can reduce the <code class="language-plaintext highlighter-rouge">playbook.yml</code> file to include only the
<code class="language-plaintext highlighter-rouge">bio_perl</code> role, as the <code class="language-plaintext highlighter-rouge">build_tools</code> and <code class="language-plaintext highlighter-rouge">cpanm</code> roles will be
pulled in as dependencies.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
<span class="na">sudo</span><span class="pi">:</span> <span class="s">True</span>
<span class="na">roles</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">bio_perl</span>
</code></pre></div></div>
<p>The content of the playbook now really reflects the original intent: to install
<code class="language-plaintext highlighter-rouge">Bio::Perl</code>.</p>
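<p>Once dependencies live in a <code class="language-plaintext highlighter-rouge">meta</code> file they are no longer visible in the playbook itself, so it can be handy to list them from the command line. The sketch below is an illustration only: it recreates the meta file from this post in a scratch directory (an assumption, so that it runs anywhere) and extracts the role names with <code class="language-plaintext highlighter-rouge">sed</code>. The pattern only understands the inline <code class="language-plaintext highlighter-rouge">{ role: name }</code> form used here; it is not a YAML parser.</p>

```shell
#!/bin/sh
# Sketch: list the dependencies declared in a role's meta/main.yml.
# The scratch directory is an assumption made for illustration.
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# Recreate the meta file shown in this post.
mkdir -p roles/bio_perl/meta
cat > roles/bio_perl/meta/main.yml <<'EOF'
---
dependencies:
  - { role: build_tools }
  - { role: cpanm }
EOF

# Pull the role names out of the "- { role: name }" entries.
deps=$(sed -n 's/^ *- *{ *role: *\([A-Za-z0-9_]*\) *} *$/\1/p' \
    roles/bio_perl/meta/main.yml)

echo "bio_perl depends on:"
echo "$deps"
```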
<h2 id="summary">Summary</h2>
<p>Ansible has the concept of “roles” that can be used to create reusable
components. To create a role one simply needs to adhere to Ansible’s
conventions of naming files and structuring directories. In its most basic
form a role takes the form of tasks within a file named
<code class="language-plaintext highlighter-rouge">roles/name_of_role/tasks/main.yml</code>.</p>
<p>In this post we also used the file <code class="language-plaintext highlighter-rouge">roles/bio_perl/meta/main.yml</code> to specify
the dependencies of the role. This meant that the content of the final playbook
was succinct and reflected the intent for which it was created, namely to install
<code class="language-plaintext highlighter-rouge">Bio::Perl</code>. Furthermore, by explicitly stating the dependencies of the
<code class="language-plaintext highlighter-rouge">bio_perl</code> role we made it easier to reuse.</p>
<p>Finally, we also noted that it is possible to pick and mix roles and tasks
within a single playbook. This can be useful when creating playbooks that have
both reusable and non-reusable components within them.</p>
<h2 id="further-reading">Further reading</h2>
<p>The functionality of Ansible roles is not limited to what I have described in
this post. For more information on what they can do have a look at the <a href="https://docs.ansible.com/playbooks_roles.html">Ansible
documentation</a>.</p>
<p>In the
<a href="/blog/ansible-playbook-for-installing-the-gbrowse-genome-browser/">next post</a>
we will create a playbook for installing the GBrowse genome browser and learn
how to manage services, such as Apache, using Ansible.</p>
How to create automated and reproducible work flows for installing scientific software2015-04-02T00:00:00+00:00http://tjelvarolsson.com/blog/how-to-create-automated-and-reproducible-work-flows-for-installing-scientific-software<p><img src="/images/ansible-playbook.png" alt="Ansible playbook" /></p>
<p>In any organisation systems administration is a big role, which entails making
sure the systems everyone takes for granted just work. Email, internet, etc.;
everything needs to function 24/7.</p>
<p>But as computational scientists we need specialist software, written by and for
scientists. This means that we often have to rely on ourselves to do some basic
systems administration to install and manage scientific software.</p>
<p>The question then arises: <em>how can one effectively configure machines to run
scientific software?</em> This matters all the more because installing software written by other
scientists can often be a torturously painful process.</p>
<p>In this post I will outline a method for producing work flows that result in
automated and reproducible software installations.</p>
<p>Let us start from the assumption that we have been given a clean machine running
CentOS 6.5 by the IT department and that it is now up to us to configure it with our
scientific software.</p>
<h2 id="vagrant---create-your-own-virtual-machine">Vagrant - create your own virtual machine</h2>
<p>Let us refer to the machine given to us by the IT department as the production
machine. This could be a physical box or a virtual machine; it does not really
matter.</p>
<p>At this point we do not want to experiment with our production machine.
Instead we will create a virtual machine on our desktop, which we will refer to
as the testing machine. Depending on your interest in virtualisation you may
already have heard of and used <a href="https://www.virtualbox.org">VirtualBox</a>. It is
a tool for creating virtual machines. If you have not already installed
VirtualBox do so now (<a href="https://www.virtualbox.org/wiki/Downloads">VirtualBox
downloads</a>).</p>
<p>Rather than working with VirtualBox directly we will make use of
<a href="https://www.vagrantup.com">Vagrant</a>. Vagrant is a command line utility for
working with VirtualBox and other virtual machine providers such as VMware and
AWS. Here is a link to the <a href="https://www.vagrantup.com/downloads.html">Vagrant
downloads</a>.</p>
<p>We are now in a position to create and work with virtual machines solely from
the command line. Let us start by creating a Vagrant file for setting up a
CentOS 6.5 box.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vagrant init chef/centos-6.5
</code></pre></div></div>
<p>The command above creates a file named <code class="language-plaintext highlighter-rouge">Vagrantfile</code>, which in its most basic
form simply specifies the Linux image to provision the virtual machine with. In
this instance it is the image from
<a href="https://atlas.hashicorp.com/chef/boxes/centos-6.5">atlas.hashicorp.com/chef/boxes/centos-6.5</a>.
Let us have a quick look at the <code class="language-plaintext highlighter-rouge">Vagrantfile</code>.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Vagrant</span><span class="p">.</span><span class="nf">configure</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">box</span> <span class="o">=</span> <span class="s2">"chef/centos-6.5"</span>
<span class="k">end</span>
</code></pre></div></div>
<p>Above I have left out all the comments giving further suggestions on how to
configure the setup of the virtual machine.</p>
<p>Let us spin up the virtual machine and ssh into it.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vagrant up
$ vagrant ssh
Last login: Fri Mar 7 16:57:20 2014 from 10.0.2.2
[vagrant@localhost ~]$ pwd
/home/vagrant
</code></pre></div></div>
<p>As you can see Vagrant has configured ssh to allow the <code class="language-plaintext highlighter-rouge">vagrant</code> user to
log in without a password. Let’s close the ssh connection and find more details
about the ssh configuration.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[vagrant@localhost ~]$ exit
logout
Connection to 127.0.0.1 closed.
$ vagrant ssh-config
Host default
HostName 127.0.0.1
User vagrant
Port 2222
UserKnownHostsFile /dev/null
StrictHostKeyChecking no
PasswordAuthentication no
IdentityFile /home/olsson/.vagrant/machines/default/virtualbox/private_key
IdentitiesOnly yes
LogLevel FATAL
</code></pre></div></div>
<p>Finally, let us have a look at the Vagrant help.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vagrant help
Usage: vagrant [options] &lt;command&gt; [&lt;args&gt;]
-v, --version Print the version and exit.
-h, --help Print this help.
Common commands:
box manages boxes: installation, removal, etc.
connect connect to a remotely shared Vagrant environment
destroy stops and deletes all traces of the vagrant machine
global-status outputs status Vagrant environments for this user
halt stops the vagrant machine
help shows the help for a subcommand
init initializes a new Vagrant environment by creating a Vagrantfile
login log in to HashiCorp's Atlas
package packages a running vagrant environment into a box
plugin manages plugins: install, uninstall, update, etc.
provision provisions the vagrant machine
push deploys code in this environment to a configured destination
rdp connects to machine via RDP
reload restarts vagrant machine, loads new Vagrantfile configuration
resume resume a suspended vagrant machine
share share your Vagrant environment with anyone in the world
ssh connects to machine via SSH
ssh-config outputs OpenSSH valid configuration to connect to the machine
status outputs status of the vagrant machine
suspend suspends the machine
up starts and provisions the vagrant environment
version prints current and latest Vagrant version
For help on any individual command run `vagrant COMMAND -h`
Additional subcommands are available, but are either more advanced
or not commonly used. To see all subcommands, run the command
`vagrant list-commands`.
</code></pre></div></div>
<p>Note the <code class="language-plaintext highlighter-rouge">vagrant halt</code> and <code class="language-plaintext highlighter-rouge">vagrant destroy</code> commands to stop and delete
the vagrant machine respectively.</p>
<h2 id="ansible---configure-your-virtual-machine">Ansible - configure your virtual machine</h2>
<p>The aim of the game is to make the process of installing our scientific
software of interest reproducible and automated!</p>
<p>We will use the testing virtual machine provisioned using Vagrant to experiment
with scripts to configure it.</p>
<p>My favorite tool for configuring machines is
<a href="http://www.ansible.com/home">Ansible</a>. It is written in Python and communicates
with the machines it manages over SSH. Unlike many other
configuration tools, such as Puppet and Chef, Ansible is agentless. In other
words it does not require you to install an agent on the machine that you want
to configure, which makes it much easier to use. It is also very easy to
install; here is a link to the <a href="http://docs.ansible.com/intro_installation.html">Ansible installation
notes</a>.</p>
<p>Ansible uses the <a href="http://yaml.org">YAML</a> file format. Let us create
a file named <code class="language-plaintext highlighter-rouge">playbook.yml</code>.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="c1"># A basic playbook that simply checks who I logged in as.</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
<span class="na">tasks</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">run the whoami command</span>
<span class="na">command</span><span class="pi">:</span> <span class="s">whoami</span>
</code></pre></div></div>
<p>To configure the Vagrant testing machine we simply need to update the
<code class="language-plaintext highlighter-rouge">Vagrantfile</code> file; inserting the provisioning section below.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Vagrant</span><span class="p">.</span><span class="nf">configure</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">box</span> <span class="o">=</span> <span class="s2">"chef/centos-6.5"</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">provision</span> <span class="s2">"ansible"</span> <span class="k">do</span> <span class="o">|</span><span class="n">ansible</span><span class="o">|</span>
<span class="n">ansible</span><span class="p">.</span><span class="nf">playbook</span> <span class="o">=</span> <span class="s2">"playbook.yml"</span>
<span class="k">end</span>
<span class="k">end</span>
</code></pre></div></div>
<p>We can now configure the Vagrant testing machine using the command below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vagrant provision
==> default: Running provisioner: ansible...
PLAY [all] ********************************************************************
GATHERING FACTS ***************************************************************
ok: [default]
TASK: [run the whoami command] ************************************************
changed: [default]
PLAY RECAP ********************************************************************
default : ok=2 changed=1 unreachable=0 failed=0
</code></pre></div></div>
<p>At this point our Ansible playbook does not really do anything useful. It
simply uses the <code class="language-plaintext highlighter-rouge">command</code> module to run the <code class="language-plaintext highlighter-rouge">whoami</code> program.</p>
<p>Ansible comes with a whole host of built-in modules. For example
<a href="http://docs.ansible.com/yum_module.html">yum</a>,
<a href="http://docs.ansible.com/apt_module.html">apt</a> and
<a href="http://docs.ansible.com/homebrew_module.html">homebrew</a> are but a few of the
modules for operating system package management. It also has
<a href="http://docs.ansible.com/pip_module.html">pip</a>,
<a href="http://docs.ansible.com/cpanm_module.html">cpanm</a> and
<a href="http://docs.ansible.com/gem_module.html">gem</a> modules for managing Python
packages, Perl modules and Ruby gems respectively. There is also a vast array
of <a href="http://docs.ansible.com/list_of_files_modules.html">modules for working with
files</a>. For more
information check out the <a href="http://docs.ansible.com/modules_by_category.html">Ansible module
index</a>.</p>
<p>Below is a slightly more involved playbook for installing
the <code class="language-plaintext highlighter-rouge">Bio::Perl</code> module. The playbook deals with a number of complications.
It installs <code class="language-plaintext highlighter-rouge">gcc</code> to be able to compile some of the Perl
modules. It installs <code class="language-plaintext highlighter-rouge">cpan</code> and <code class="language-plaintext highlighter-rouge">cpanm</code> to make it easier to install
Perl modules. Further, <code class="language-plaintext highlighter-rouge">Bio::Perl</code> has some implicit dependencies that are
not taken care of automatically when installing it using <code class="language-plaintext highlighter-rouge">cpanm</code>, so the playbook
installs these dependencies first.</p>
<div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nn">---</span>
<span class="pi">-</span> <span class="na">hosts</span><span class="pi">:</span> <span class="s">all</span>
<span class="na">sudo</span><span class="pi">:</span> <span class="s">True</span>
<span class="na">tasks</span><span class="pi">:</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install gcc required to build some Perl modules</span>
<span class="na">yum</span><span class="pi">:</span> <span class="s">name=gcc</span>
<span class="s">state=present</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install cpan and perl-devel</span>
<span class="na">yum</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
<span class="s">state=present</span>
<span class="na">with_items</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">perl-devel</span>
<span class="pi">-</span> <span class="s">perl-CPAN</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">download cpanm</span>
<span class="na">get_url</span><span class="pi">:</span> <span class="s">url=https://cpanmin.us/</span>
<span class="s">dest=/tmp/cpanm.pl</span>
<span class="s">mode=755</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install cpanm so that we can use the ansible cpanm module</span>
<span class="na">command</span><span class="pi">:</span> <span class="s">perl cpanm.pl App::cpanminus</span>
<span class="na">args</span><span class="pi">:</span>
<span class="na">chdir</span><span class="pi">:</span> <span class="s">/tmp/</span>
<span class="na">creates</span><span class="pi">:</span> <span class="s">/usr/local/bin/cpanm</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">add cpanm symbolic link to /usr/bin/</span>
<span class="na">file</span><span class="pi">:</span> <span class="s">src=/usr/local/bin/cpanm</span>
<span class="s">dest=/usr/bin/cpanm</span>
<span class="s">state=link</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install implicit Bio::Perl dependencies</span>
<span class="na">cpanm</span><span class="pi">:</span> <span class="s">name={{ item }}</span>
<span class="na">with_items</span><span class="pi">:</span>
<span class="pi">-</span> <span class="s">Time::HiRes</span>
<span class="pi">-</span> <span class="s">LWP::UserAgent</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">install BioPerl</span>
<span class="na">cpanm</span><span class="pi">:</span> <span class="s">name=Bio::Perl</span>
</code></pre></div></div>
<p>We can now try out this Ansible playbook on the testing virtual machine.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vagrant provision
==> default: Running provisioner: ansible...
PLAY [all] ********************************************************************
GATHERING FACTS ***************************************************************
ok: [default]
TASK: [install gcc required to build some Perl modules] ***********************
changed: [default]
TASK: [install cpan and perl-devel] *******************************************
changed: [default] => (item=perl-devel,perl-CPAN)
TASK: [download cpanm] ********************************************************
changed: [default]
TASK: [install cpanm so that we can use the ansible cpanm module] *************
changed: [default]
TASK: [add cpanm symbolic link to /usr/bin/] **********************************
changed: [default]
TASK: [install implicit Bio::Perl dependencies] *******************************
ok: [default] => (item=Time::HiRes)
ok: [default] => (item=LWP::UserAgent)
TASK: [install BioPerl] *******************************************************
ok: [default]
PLAY RECAP ********************************************************************
default : ok=8 changed=5 unreachable=0 failed=0
</code></pre></div></div>
<p>Great, it works! Almost time to deploy to the production machine. However, first
let us commit our scripts to version control.</p>
<h2 id="git---tracking-what-you-are-doing">Git - tracking what you are doing</h2>
<p>One of the beauties of Ansible is that it uses the human readable YAML file
format. This means that you get descriptive configuration files that can be
used directly to configure your machines.</p>
<p>Another beauty of text files is that they can be tracked in version control.
This means that you can get an audit record of how the specification of the
configuration evolved over time. Furthermore, you can use the ability to add
comments to your commits to specify the reason why particular changes needed to
be made.</p>
<p>Let us commit our work to version control.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git init
$ git add Vagrantfile
$ git commit -m "Vagrant file with CentOS 6.5 configured by playbook.yml"
$ git add playbook.yml
$ git commit -m "Playbook for installing Bio::Perl"
</code></pre></div></div>
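<p>One detail worth adding here: Vagrant keeps machine-specific state, including the private key listed in the <code class="language-plaintext highlighter-rouge">vagrant ssh-config</code> output earlier, in a <code class="language-plaintext highlighter-rouge">.vagrant</code> directory next to the <code class="language-plaintext highlighter-rouge">Vagrantfile</code>. The commands above never add it, but it will keep showing up as untracked. A small sketch of ignoring it follows; the scratch directory is an assumption standing in for your project directory, and the commit message is only a suggestion.</p>

```shell
#!/bin/sh
# Sketch: tell Git to ignore Vagrant's per-machine state directory.
# The scratch directory stands in for the project directory (assumption).
set -eu

workdir=$(mktemp -d)
cd "$workdir"

# A trailing slash restricts the pattern to directories.
printf '%s\n' '.vagrant/' > .gitignore
cat .gitignore

# In the real project directory you would then commit the file:
#   git add .gitignore
#   git commit -m "Ignore Vagrant machine state"
```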
<h2 id="configuring-your-production-machine">Configuring your production machine</h2>
<p>Now that we have built up our Ansible configuration script and committed it to
version control we can use it to configure the production machine.</p>
<p>In order to achieve this we need to put our public ssh key on the production server.</p>
<p>If you have not already created an ssh key pair you can do so using
<code class="language-plaintext highlighter-rouge">ssh-keygen</code>. You can then append the public key to the <code class="language-plaintext highlighter-rouge">authorized_keys</code>
file in the <code class="language-plaintext highlighter-rouge">.ssh</code> directory on the production server. For more detail see, for example,
Etel Sverdlov’s blog post on <a href="https://www.digitalocean.com/community/tutorials/how-to-set-up-ssh-keys--2">How To Set Up SSH
Keys</a>.</p>
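<p>The append step is where things most often go wrong, typically because the key file is overwritten or the permissions end up too open for the ssh server to accept. The sketch below illustrates the mechanics only. It makes two loud assumptions: a scratch directory plays the part of your home directory on the production server, and the key is a dummy string rather than a real public key.</p>

```shell
#!/bin/sh
# Sketch: append a public key to authorized_keys with restrictive permissions.
# Assumptions: remote_home is a scratch stand-in for the server-side home
# directory, and pubkey is a dummy placeholder, not a real key.
set -eu

remote_home=$(mktemp -d)
pubkey='ssh-ed25519 AAAA-placeholder-not-a-real-key user@desktop'

# The .ssh directory should be accessible by its owner only.
mkdir -p "$remote_home/.ssh"
chmod 700 "$remote_home/.ssh"

# Append (>>) rather than overwrite (>), so existing keys are kept.
printf '%s\n' "$pubkey" >> "$remote_home/.ssh/authorized_keys"
chmod 600 "$remote_home/.ssh/authorized_keys"

ls -l "$remote_home/.ssh/authorized_keys"
```

<p>On many systems the <code class="language-plaintext highlighter-rouge">ssh-copy-id</code> utility performs these steps for you against a real server.</p>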
<p>Up until this point we have not used Ansible directly; we have only used it
through Vagrant. We will remedy that now.</p>
<p>First of all Ansible needs to know about the machines that you want it to talk
to. By default Ansible looks for these in <code class="language-plaintext highlighter-rouge">/etc/ansible/hosts</code>.
Alternatively, you can specify a “hosts” file using the command line option
<code class="language-plaintext highlighter-rouge">-i</code>. If your server’s host name were scicomp.example.com, you could
add it to a file named <code class="language-plaintext highlighter-rouge">hosts</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scicomp.example.com
</code></pre></div></div>
<p>A simple way to check that everything is set up as it should be is to make use
of Ansible’s <a href="http://docs.ansible.com/ping_module.html">ping</a> module. If
everything is working you will see something along the lines of the below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ansible -i hosts -m ping scicomp.example.com
scicomp.example.com | success >> {
"changed": false,
"ping": "pong"
}
</code></pre></div></div>
<p>Otherwise, you will see something along the lines of the below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ansible -i hosts -m ping scicomp.example.com
scicomp.example.com | FAILED => SSH encountered an unknown error during the connection. We recommend you re-run the command using -vvvv, which will enable SSH debugging output to help diagnose the issue
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">ansible</code> program can be an extremely effective way of issuing <em>ad-hoc</em>
commands to remote machines. However, we have a playbook that we want to run so
we want to use <code class="language-plaintext highlighter-rouge">ansible-playbook</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ansible-playbook -i hosts playbook.yml
PLAY [all] ********************************************************************
GATHERING FACTS ***************************************************************
ok: [scicomp.example.com]
TASK: [install gcc required to build some Perl modules] ***********************
changed: [scicomp.example.com]
TASK: [install cpan and perl-devel] *******************************************
changed: [scicomp.example.com] => (item=perl-devel,perl-CPAN)
TASK: [download cpanm] ********************************************************
changed: [scicomp.example.com]
TASK: [install cpanm so that we can use the ansible cpanm module] *************
changed: [scicomp.example.com]
TASK: [add cpanm symbolic link to /usr/bin/] **********************************
changed: [scicomp.example.com]
TASK: [install implicit Bio::Perl dependencies] *******************************
ok: [scicomp.example.com] => (item=Time::HiRes)
ok: [scicomp.example.com] => (item=LWP::UserAgent)
TASK: [install BioPerl] *******************************************************
ok: [scicomp.example.com]
PLAY RECAP ********************************************************************
scicomp.example.com : ok=8 changed=5 unreachable=0 failed=0
</code></pre></div></div>
<p>And now the production machine is configured with <code class="language-plaintext highlighter-rouge">Bio::Perl</code>!</p>
<h2 id="a-confession">A confession</h2>
<p>I did not actually have the IT department create a production machine for me
just for the purposes of this blog post. Instead I used Vagrant to create a
virtual one for me by simply removing the provisioning section we added earlier
and uncommenting the line for setting up the machine on a private network.</p>
<div class="language-ruby highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="no">Vagrant</span><span class="p">.</span><span class="nf">configure</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="k">do</span> <span class="o">|</span><span class="n">config</span><span class="o">|</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">box</span> <span class="o">=</span> <span class="s2">"chef/centos-6.5"</span>
<span class="n">config</span><span class="p">.</span><span class="nf">vm</span><span class="p">.</span><span class="nf">network</span> <span class="s2">"private_network"</span><span class="p">,</span> <span class="ss">ip: </span><span class="s2">"192.168.33.10"</span>
<span class="k">end</span>
</code></pre></div></div>
<p>To make sure I got the machine in a clean state I simply destroyed it and spun
it up again.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ vagrant destroy
default: Are you sure you want to destroy the 'default' VM? [y/N] y
==> default: Forcing shutdown of VM...
==> default: Destroying VM and associated drives...
$ vagrant up
</code></pre></div></div>
<p>I then used the Ansible <code class="language-plaintext highlighter-rouge">hosts</code> file below, all in one long line, to specify
how to connect to the machine.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>scicomp.example.com ansible_ssh_host=192.168.33.10 ansible_ssh_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/default/virtualbox/private_key
</code></pre></div></div>
<p>Using the above we have pretty much created a staging virtual machine in a
couple of minutes. Pretty cool!</p>
<h2 id="summary">Summary</h2>
<p>As a computational scientist you are likely to get exposed to systems
administration to some extent, in particular when installing scientific
software.</p>
<p>In an ideal world you should try to make the installation of your software as
reproducible and automated as possible, because your machine will fall over at
one point or another. When this happens you want to be in a position where you
simply need to press a button to get your new machine configured with all the
software that you need to work effectively.</p>
<p>Vagrant is a tool for spinning up virtual machines from the command line.
Virtual machines are great for testing scripts that you create to configure
your machines.</p>
<p>Ansible is a wonderful tool for scripting the configuration of your machines.
It is very powerful, yet easy to use. Make it your friend!</p>
<p>Finally, I highly recommend that you keep your Vagrant and Ansible files under
version control. It will give you more confidence when experimenting with new
setups and it provides a way for you to track the progression of your machines’
configurations.</p>
<p>In the <a href="/blog/how-to-create-reusable-ansible-components/">next post</a>
we will learn how to convert the playbook created in this post into reusable components.</p>
Object-oriented programming for scientists2015-03-22T00:00:00+00:00http://tjelvarolsson.com/blog/object-oriented-programming-for-scientists<figure>
<img src="/images/sandcastle.jpg" alt="Sandcastle" />
<figcaption>
Using a bucket to create sandcastles. That is what object-oriented
programming is all about.
</figcaption>
</figure>
<h2 id="introduction">Introduction</h2>
<p>For anyone not familiar with object-oriented programming it can sometimes come
across as something mysterious that is used by expert coders. Indeed, any
respectable text book on object-oriented programming will try to overwhelm the
reader with concepts such as “abstraction”, “encapsulation”, “inheritance” and
“polymorphism”.</p>
<p>However, object-oriented programming is not that difficult and can be very
useful when dealing with complex data structures. In this post I will
illustrate some object-oriented principles using a bioinformatics example, the
parsing of <a href="http://en.wikipedia.org/wiki/FASTA_format">FASTA files</a>.</p>
<p>The code will be written in Python because I like it, it has built-in support
for object-oriented programming and its syntax is relatively easy to understand.</p>
<h2 id="an-example-using-procedural-programming">An example using procedural programming</h2>
<p>To set the scene let us write some code using procedural programming to parse
the <code class="language-plaintext highlighter-rouge">example.fasta</code> file below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>>sp|O76074|PDE5A_HUMAN cGMP-specific 3',5'-cyclic phosphodiesterase OS=Homo sapiens GN=PDE5A PE=1 SV=2
MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE
RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG
TVSFLSDSEKKEQMPLTPPRFDHDEGDQCSRLLELVKDISSHLDVTALCHKIFLHIHGLI
SADRYSLFLVCEDSSNDKFLISRLFDVAEGSTLEEVSNNCIRLEWNKGIVGHVAALGEPL
NIKDAYEDPRFNAEVDQITGYKTQSILCMPIKNHREEVVGVAQAINKKSGNGGTFTEKDE
KDFAAYLAFCGIVLHNAQLYETSLLENKRNQVLLDLASLIFEEQQSLEVILKKIAATIIS
FMQVQKCTIFIVDEDCSDSFSSVFHMECEELEKSSDTLTREHDANKINYMYAQYVKNTME
PLNIPDVSKDKRFPWTTENTGNVNQQCIRSLLCTPIKNGKKNKVIGVCQLVNKMEENTGK
VKPFNRNDEQFLEAFVIFCGLGIQNTQMYEAVERAMAKQMVTLEVLSYHASAAEEETREL
QSLAAAVVPSAQTLKITDFSFSDFELSDLETALCTIRMFTDLNLVQNFQMKHEVLCRWIL
SVKKNYRKNVAYHNWRHAFNTAQCMFAALKAGKIQNKLTDLEILALLIAALSHDLDHRGV
NNSYIQRSEHPLAQLYCHSIMEHHHFDQCLMILNSPGNQILSGLSIEEYKTTLKIIKQAI
LATDLALYIKRRGEFFELIRKNQFNLEDPHQKELFLAMLMTACDLSAITKPWPIQQRIAE
LVATEFFDQGDRERKELNIEPTDLMNREKKNKIPSMQVGFIDAICLQLYEALTHVSEDCF
PLLDGCRKNRQKWQALAEQQEKMLINGESGQAKRN
>sp|Q9Y233|PDE10_HUMAN cAMP and cAMP-inhibited cGMP 3',5'-cyclic phosphodiesterase 10A OS=Homo sapiens GN=PDE10A PE=1 SV=1
MRIEERKSQHLTGLTDEKVKAYLSLHPQVLDEFVSESVSAETVEKWLKRKNNKSEDESAP
KEVSRYQDTNMQGVVYELNSYIEQRLDTGGDNQLLLYELSSIIKIATKADGFALYFLGEC
NNSLCIFTPPGIKEGKPRLIPAGPITQGTTVSAYVAKSRKTLLVEDILGDERFPRGTGLE
SGTRIQSVLCLPIVTAIGDLIGILELYRHWGKEAFCLSHQEVATANLAWASVAIHQVQVC
RGLAKQTELNDFLLDVSKTYFDNIVAIDSLLEHIMIYAKNLVNADRCALFQVDHKNKELY
SDLFDIGEEKEGKPVFKKTKEIRFSIEKGIAGQVARTGEVLNIPDAYADPRFNREVDLYT
GYTTRNILCMPIVSRGSVIGVVQMVNKISGSAFSKTDENNFKMFAVFCALALHCANMYHR
IRHSECIYRVTMEKLSYHSICTSEEWQGLMQFTLPVRLCKEIELFHFDIGPFENMWPGIF
VYMVHRSCGTSCFELEKLCRFIMSVKKNYRRVPYHNWKHAVTVAHCMYAILQNNHTLFTD
LERKGLLIACLCHDLDHRGFSNSYLQKFDHPLAALYSTSTMEQHHFSQTVSILQLEGHNI
FSTLSSSEYEQVLEIIRKAIIATDLALYFGNRKQLEEMYQTGSLNLNNQSHRDRVIGLMM
TACDLCSVTKLWPVTKLTANDIYAEFWAEGDEMKKLGIQPIPMMDRDKKDEVPQGQLGFY
NAVAIPCYTTLTQILPPTEPLLKACRDNLSQWEKVIRGEETATWISSPSVAQKAAASED
</code></pre></div></div>
<p>The aim is to find and print out the FASTA record with the UniProt identifier
<code class="language-plaintext highlighter-rouge">Q9Y233</code> (the second entry). The code below achieves this using procedural
programming.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="s">'example.fasta'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
<span class="n">match</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">fh</span><span class="p">:</span>
<span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="c1"># Remove newline at the end of the line.
</span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'>'</span><span class="p">):</span>
<span class="c1"># We have encountered a description line.
</span> <span class="c1"># That means the start of a new FASTA record.
</span> <span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="s">'Q9Y233'</span><span class="p">)</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span><span class="p">:</span>
<span class="c1"># We have matched our search criteria.
</span> <span class="n">match</span> <span class="o">=</span> <span class="bp">True</span>
<span class="k">else</span><span class="p">:</span>
<span class="c1"># We have encountered a new entry and it does
</span> <span class="c1"># not match the search criteria.
</span> <span class="n">match</span> <span class="o">=</span> <span class="bp">False</span>
<span class="k">if</span> <span class="n">match</span><span class="p">:</span>
<span class="c1"># We are currently in a section of the FASTA file
</span> <span class="c1"># that matches our search criteria.
</span> <span class="k">print</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>
<p>It is worth noting that I felt the need to add quite a few comments to explain
what was going on, a sign that everything is not as clear as it should be.
However, on the whole the code does a job and it works.</p>
<p>Now imagine that you wanted to add a filter based on the length of the
sequence. Is it immediately obvious what you would do? How can you ensure that
the code remains understandable?</p>
<h2 id="object-oriented-programming-to-the-rescue">Object-oriented programming to the rescue</h2>
<p>Object-oriented programming is all about grouping data and functionality
together. This allows one to abstract away some of the complexities of the
processing logic and to encapsulate the data.</p>
<p>Let us start by creating an object representing a FASTA record. Save the code
below to a file named <code class="language-plaintext highlighter-rouge">fasta.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">FastaRecord</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="s">"""Class representing a FASTA record."""</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">description_line</span><span class="p">):</span>
<span class="s">"""Initialise an instance of the FastaRecord class."""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">description</span> <span class="o">=</span> <span class="n">description_line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">sequences</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">def</span> <span class="nf">add_sequence_line</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sequence_line</span><span class="p">):</span>
<span class="s">"""
Add a sequence line to the FastaRecord instance.
This function can be called more than once.
"""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">sequences</span><span class="o">.</span><span class="n">append</span><span class="p">(</span> <span class="n">sequence_line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="p">)</span>
<span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""Representation of the FastaRecord instance."""</span>
<span class="n">lines</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">description</span><span class="p">,]</span>
<span class="n">lines</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sequences</span><span class="p">)</span>
<span class="k">return</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
</code></pre></div></div>
<p>There are a few things to note in the code above, particularly if you are new
to object-oriented programming and/or Python.</p>
<p>First of all we inherit functionality from the base class <code class="language-plaintext highlighter-rouge">object</code> (the first
line). This is for historical reasons: “new style” classes were added in
Python 2.2. To remain backwards compatible with the “classic” or “old style”
classes it was decided that one would have to inherit from <code class="language-plaintext highlighter-rouge">object</code> to access
the goodness of the new style class. In Python 3 all classes are new style, so
the explicit inheritance is optional there. There are more details on the <a href="https://wiki.python.org/moin/NewClassVsClassicClass">Python
wiki</a>.</p>
<p>Secondly, we make use of the “magic” method <code class="language-plaintext highlighter-rouge">__init__</code>. This is called
when an instance of a class is created.</p>
<p><em>Classes, objects, instances, what is up with all this terminology? What does
it all mean?</em></p>
<p>Okay, let us take a slight detour. You can think of classes as moulds, for
example a plastic bucket that you bring to the beach to make a sand castle. You
fill the bucket with sand, tip it upside down, pat it on the top and lift
it up. What remains is a tower made out of sand. This sand castle is an
“instance” of your bucket “class”. Finally, the term “object”, as in
object-oriented programming, tends to be used to refer to classes and instances
interchangeably.</p>
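<p>To make the analogy concrete, here is a minimal sketch. The <code class="language-plaintext highlighter-rouge">Bucket</code> class and
its <code class="language-plaintext highlighter-rouge">shape</code> attribute are made up purely for illustration; they are not part of
the FASTA code in this post.</p>

```python
class Bucket(object):
    """A mould for making sand castles (hypothetical example class)."""

    def __init__(self, shape):
        # Each instance remembers the shape it was created with.
        self.shape = shape


# Two sand castles (instances) made from the same bucket (class).
castle_one = Bucket('round')
castle_two = Bucket('square')

print(isinstance(castle_one, Bucket))  # Both come from the same mould.
print(castle_one is castle_two)        # Yet they are separate objects.
print(castle_one.shape)
print(castle_two.shape)
```

<p>The class is written once, but you can stamp out as many independent
instances from it as you like, each carrying its own data.</p>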
<p>Back to the <code class="language-plaintext highlighter-rouge">__init__</code> method, which is used to initialise an instance of
the class. The instance created is accessible via the <code class="language-plaintext highlighter-rouge">self</code> argument. During
the initialisation of the <code class="language-plaintext highlighter-rouge">FastaRecord</code> class we also provide the description
line.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">fasta</span> <span class="kn">import</span> <span class="n">FastaRecord</span>
<span class="o">>>></span> <span class="n">fasta_record</span> <span class="o">=</span> <span class="n">FastaRecord</span><span class="p">(</span><span class="s">'>sp|O76074|PDE5A_HUMAN'</span><span class="p">)</span>
</code></pre></div></div>
<p>Note that the <code class="language-plaintext highlighter-rouge">fasta_record</code> variable above is an instance of the
<code class="language-plaintext highlighter-rouge">FastaRecord</code> class. We can access the <code class="language-plaintext highlighter-rouge">description</code> attribute of the
<code class="language-plaintext highlighter-rouge">FastaRecord</code> instance directly.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">fasta_record</span><span class="o">.</span><span class="n">description</span>
<span class="s">'>sp|O76074|PDE5A_HUMAN'</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">add_sequence_line</code> method simply adds a sequence line to the
<code class="language-plaintext highlighter-rouge">sequences</code> (list) attribute.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">fasta_record</span><span class="o">.</span><span class="n">add_sequence_line</span><span class="p">(</span><span class="s">'MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">fasta_record</span><span class="o">.</span><span class="n">add_sequence_line</span><span class="p">(</span><span class="s">'RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG'</span><span class="p">)</span>
</code></pre></div></div>
<p>Finally, we have the “magic” <code class="language-plaintext highlighter-rouge">__repr__</code> method. At this point you are
probably screaming out loud, what is a “magic” method? A “magic” method is
basically a way to make an object behave like a built-in Python object. For
example the <code class="language-plaintext highlighter-rouge">__repr__</code> method is used to describe how the instance should be
represented. Let us illustrate this below.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">fasta_record</span>
<span class="o">></span><span class="n">sp</span><span class="o">|</span><span class="n">O76074</span><span class="o">|</span><span class="n">PDE5A_HUMAN</span>
<span class="n">MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE</span>
<span class="n">RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG</span>
</code></pre></div></div>
<p>For more information on “magic” methods have a look at Rafe Kettler’s blog post
<a href="http://www.rafekettler.com/magicmethods.html">A Guide to Python’s Magic
Methods</a>.</p>
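<p>As a further illustration of how “magic” methods hook into built-in
behaviour, the sketch below defines <code class="language-plaintext highlighter-rouge">__len__</code> alongside <code class="language-plaintext highlighter-rouge">__repr__</code>. The
<code class="language-plaintext highlighter-rouge">SequenceLines</code> class is a hypothetical stand-in that I made up for this
example, not something from the <code class="language-plaintext highlighter-rouge">fasta.py</code> module.</p>

```python
class SequenceLines(object):
    """Hypothetical container used to illustrate magic methods."""

    def __init__(self, lines):
        self.lines = list(lines)

    def __len__(self):
        # Called by the built-in len() function.
        return sum(len(line) for line in self.lines)

    def __repr__(self):
        # Called by repr() and when echoing the instance in the interpreter.
        return 'SequenceLines({!r})'.format(self.lines)


seq = SequenceLines(['MERAG', 'PSF'])
print(len(seq))
print(repr(seq))
```

<p>Because the class implements <code class="language-plaintext highlighter-rouge">__len__</code>, the built-in <code class="language-plaintext highlighter-rouge">len</code> function works
on it just as it does on a list or a string.</p>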
<h2 id="a-fasta-parser-object">A FASTA parser object</h2>
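<p>Now that we have a basic class for working with FASTA records let us create
another class for parsing FASTA files. Add it to the same <code class="language-plaintext highlighter-rouge">fasta.py</code> file so
that it can be imported alongside <code class="language-plaintext highlighter-rouge">FastaRecord</code>.</p>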
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">FastaParser</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="s">"""Class for parsing FASTA files."""</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">fpath</span><span class="p">):</span>
<span class="s">"""Initialise an instance of the FastaParser."""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">fpath</span>
<span class="k">def</span> <span class="nf">__iter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""Yield FastaRecord instances."""</span>
<span class="n">fasta_record</span> <span class="o">=</span> <span class="bp">None</span>
<span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
<span class="k">for</span> <span class="n">line</span> <span class="ow">in</span> <span class="n">fh</span><span class="p">:</span>
<span class="k">if</span> <span class="n">line</span><span class="o">.</span><span class="n">startswith</span><span class="p">(</span><span class="s">'>'</span><span class="p">):</span>
<span class="k">if</span> <span class="n">fasta_record</span><span class="p">:</span>
<span class="k">yield</span> <span class="n">fasta_record</span>
<span class="n">fasta_record</span> <span class="o">=</span> <span class="n">FastaRecord</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">else</span><span class="p">:</span>
<span class="n">fasta_record</span><span class="o">.</span><span class="n">add_sequence_line</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
<span class="k">yield</span> <span class="n">fasta_record</span>
</code></pre></div></div>
<p>In the example above I have used the <code class="language-plaintext highlighter-rouge">__iter__</code> magic method. This basically
defines what happens when an instance of the class is iterated over, for example
in a <code class="language-plaintext highlighter-rouge">for</code> loop. In this particular case we want it to <code class="language-plaintext highlighter-rouge">yield</code> <code class="language-plaintext highlighter-rouge">FastaRecord</code> instances as the FASTA
file is parsed.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">fasta</span> <span class="kn">import</span> <span class="n">FastaParser</span>
<span class="o">>>></span> <span class="n">fasta_parser</span> <span class="o">=</span> <span class="n">FastaParser</span><span class="p">(</span><span class="s">'example.fasta'</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">for</span> <span class="n">fasta_record</span> <span class="ow">in</span> <span class="n">fasta_parser</span><span class="p">:</span>
<span class="o">...</span> <span class="k">print</span><span class="p">(</span><span class="n">fasta_record</span><span class="o">.</span><span class="n">description</span><span class="p">)</span>
<span class="o">...</span>
<span class="o">></span><span class="n">sp</span><span class="o">|</span><span class="n">O76074</span><span class="o">|</span><span class="n">PDE5A_HUMAN</span> <span class="n">cGMP</span><span class="o">-</span><span class="n">specific</span> <span class="mi">3</span><span class="s">',5'</span><span class="o">-</span><span class="n">cyclic</span> <span class="n">phosphodiesterase</span> <span class="n">OS</span><span class="o">=</span><span class="n">Homo</span> <span class="n">sapiens</span> <span class="n">GN</span><span class="o">=</span><span class="n">PDE5A</span> <span class="n">PE</span><span class="o">=</span><span class="mi">1</span> <span class="n">SV</span><span class="o">=</span><span class="mi">2</span>
<span class="o">></span><span class="n">sp</span><span class="o">|</span><span class="n">Q9Y233</span><span class="o">|</span><span class="n">PDE10_HUMAN</span> <span class="n">cAMP</span> <span class="ow">and</span> <span class="n">cAMP</span><span class="o">-</span><span class="n">inhibited</span> <span class="n">cGMP</span> <span class="mi">3</span><span class="s">',5'</span><span class="o">-</span><span class="n">cyclic</span> <span class="n">phosphodiesterase</span> <span class="mi">10</span><span class="n">A</span> <span class="n">OS</span><span class="o">=</span><span class="n">Homo</span> <span class="n">sapiens</span> <span class="n">GN</span><span class="o">=</span><span class="n">PDE10A</span> <span class="n">PE</span><span class="o">=</span><span class="mi">1</span> <span class="n">SV</span><span class="o">=</span><span class="mi">1</span>
</code></pre></div></div>
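<p>If the combination of <code class="language-plaintext highlighter-rouge">__iter__</code> and <code class="language-plaintext highlighter-rouge">yield</code> is new to you, the stripped
down sketch below may help. It uses a hypothetical <code class="language-plaintext highlighter-rouge">Chunker</code> class, invented
for this example, that works on a string in memory rather than on a file.</p>

```python
class Chunker(object):
    """Hypothetical class that yields fixed-size chunks of a string."""

    def __init__(self, text, size):
        self.text = text
        self.size = size

    def __iter__(self):
        # Using yield makes __iter__ a generator, so a for loop
        # receives one chunk at a time.
        for start in range(0, len(self.text), self.size):
            yield self.text[start:start + self.size]


chunker = Chunker('MERAGPSFGQ', 4)
for chunk in chunker:
    print(chunk)
```

<p>Each <code class="language-plaintext highlighter-rouge">for</code> loop calls <code class="language-plaintext highlighter-rouge">__iter__</code> afresh, so the same instance can be
looped over more than once. The same holds for the <code class="language-plaintext highlighter-rouge">FastaParser</code>, which
reopens the file every time.</p>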
<h2 id="back-to-grouping-data-and-functionality">Back to grouping data and functionality</h2>
<p>At this point we could write a simple script to loop over the FASTA records and
find the hits of interest. However, where should we add the logic for finding
hits of interest?</p>
<p>I would argue that this is a great opportunity for abstracting away the logic
of identifying a hit by putting it in the <code class="language-plaintext highlighter-rouge">FastaRecord</code> class itself. Let us
extend the class to do this.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">FastaRecord</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
<span class="s">"""Class representing a FASTA record."""</span>
<span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">description_line</span><span class="p">):</span>
<span class="s">"""Initialise an instance of the FastaRecord class."""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">description</span> <span class="o">=</span> <span class="n">description_line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span>
<span class="bp">self</span><span class="o">.</span><span class="n">sequences</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">def</span> <span class="nf">add_sequence_line</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">sequence_line</span><span class="p">):</span>
<span class="s">"""
Add a sequence line to the FastaRecord instance.
This function can be called more than once.
"""</span>
<span class="bp">self</span><span class="o">.</span><span class="n">sequences</span><span class="o">.</span><span class="n">append</span><span class="p">(</span> <span class="n">sequence_line</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="p">)</span>
<span class="k">def</span> <span class="nf">matches</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">search_term</span><span class="p">):</span>
<span class="s">"""Return True if the search_term is in the description."""</span>
<span class="k">return</span> <span class="bp">self</span><span class="o">.</span><span class="n">description</span><span class="o">.</span><span class="n">find</span><span class="p">(</span><span class="n">search_term</span><span class="p">)</span> <span class="o">!=</span> <span class="o">-</span><span class="mi">1</span>
<span class="k">def</span> <span class="nf">__repr__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
<span class="s">"""Representation of the FastaRecord instance."""</span>
<span class="n">lines</span> <span class="o">=</span> <span class="p">[</span><span class="bp">self</span><span class="o">.</span><span class="n">description</span><span class="p">,]</span>
<span class="n">lines</span><span class="o">.</span><span class="n">extend</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">sequences</span><span class="p">)</span>
<span class="k">return</span> <span class="s">'</span><span class="se">\n</span><span class="s">'</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">lines</span><span class="p">)</span>
</code></pre></div></div>
<p>Note the addition of the <code class="language-plaintext highlighter-rouge">matches</code> method above. Also, note that the
addition of more functionality did not make the code any more difficult to
understand.</p>
<p>It is now trivial to write a script to do the analysis that we want.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">fasta</span> <span class="kn">import</span> <span class="n">FastaParser</span>
<span class="k">for</span> <span class="n">fasta_record</span> <span class="ow">in</span> <span class="n">FastaParser</span><span class="p">(</span><span class="s">'example.fasta'</span><span class="p">):</span>
<span class="k">if</span> <span class="n">fasta_record</span><span class="o">.</span><span class="n">matches</span><span class="p">(</span><span class="s">'Q9Y233'</span><span class="p">):</span>
<span class="k">print</span><span class="p">(</span><span class="n">fasta_record</span><span class="p">)</span>
</code></pre></div></div>
<p>Compare the descriptiveness of this code to that of the procedural example at
the beginning of this post.</p>
<p><em>But you had to write so much more code to get to this point, is it really
worth it?</em></p>
<p>I go back to the scenario outlined earlier in this post. Imagine that you had
to extend the logic of the code to be able to filter based on the length of the
sequence. Which code base would you rather use as a starting point? If you are
unsure, try adding this functionality to both code bases to find out which one
is more extensible.</p>
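<p>For the object-oriented code base, one possible approach is sketched below.
It is a trimmed-down version of the <code class="language-plaintext highlighter-rouge">FastaRecord</code> class, and the
<code class="language-plaintext highlighter-rouge">sequence_length</code> method and <code class="language-plaintext highlighter-rouge">min_length</code> threshold are names I have
assumed for the purposes of this illustration. The records are built in memory
so that the example runs without a FASTA file.</p>

```python
class FastaRecord(object):
    """Trimmed-down FASTA record with a length helper."""

    def __init__(self, description_line):
        self.description = description_line.strip()
        self.sequences = []

    def add_sequence_line(self, sequence_line):
        self.sequences.append(sequence_line.strip())

    def matches(self, search_term):
        """Return True if the search_term is in the description."""
        return search_term in self.description

    def sequence_length(self):
        """Return the total number of residues (hypothetical helper)."""
        return sum(len(line) for line in self.sequences)


# Build two small records in memory instead of reading a file.
short_record = FastaRecord('>short example')
short_record.add_sequence_line('MERAG')

long_record = FastaRecord('>long example')
long_record.add_sequence_line('MRIEERKSQH')
long_record.add_sequence_line('LTGLTDEKVK')

records = [short_record, long_record]
min_length = 10  # Arbitrary threshold chosen for this illustration.
hits = [r for r in records if r.sequence_length() >= min_length]
for record in hits:
    print(record.description)
```

<p>The filtering logic lives next to the data it operates on, so the script
that uses it stays as short and readable as before.</p>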
<h2 id="try-to-avoid-re-inventing-the-wheel">Try to avoid re-inventing the wheel</h2>
<p>The point of this post was to illustrate object-oriented programming, not to
re-invent the wheel. I used the example of parsing FASTA files in this post as
they are widely used in biological research and are conceptually easy to
understand. However, if you are serious about using Python for bioinformatics I
suggest that you check out <a href="http://biopython.org/wiki/Main_Page">Biopython</a>.</p>
<h2 id="conclusion">Conclusion</h2>
<p>Object-oriented programming can be very useful when dealing with complex data
structures. In particular it can be used to hide complexity by grouping data
and functionality together.</p>
<p>Furthermore, it can make your code more understandable and extensible.</p>
<p>Finally, do not let your lack of knowledge about “polymorphism” and
“inheritance” hold you back from making use of objects. Yes, these are
interesting topics, and please do read up on them. However, they are not
essential to your use of object-oriented programming (at least not in Python).</p>
<p>I hope you find this post useful and that it has encouraged you to try out
object-oriented programming. Send me a message if you need any help.</p>
Strategies to access content from Python functions that write to disk2015-03-12T00:00:00+00:00http://tjelvarolsson.com/blog/strategies-to-access-content-from-python-functions-that-write-to-disk<p>Have you ever worked with an API that has some sort of “save to file” function
only to find yourself wanting a function that returns the content as a string?
For example the Python image module <code class="language-plaintext highlighter-rouge">skimage.io</code> has a function named
<a href="http://scikit-image.org/docs/dev/api/skimage.io.html#imsave"><code class="language-plaintext highlighter-rouge">imsave</code></a> that
takes <code class="language-plaintext highlighter-rouge">fname</code> and <code class="language-plaintext highlighter-rouge">arr</code> as arguments and writes an image to disk. However,
what I wanted was a function that returned the image as a byte string. In other
words I wanted the behaviour of the Python Imaging Library’s
<a href="http://pillow.readthedocs.org/en/latest/reference/Image.html#PIL.Image.Image.tobytes"><code class="language-plaintext highlighter-rouge">PIL.Image.tobytes</code></a>
function. Unfortunately, I could not find one in scikit-image.</p>
<h2 id="strategy-1-make-use-of-stringio">Strategy 1: make use of <code class="language-plaintext highlighter-rouge">StringIO</code></h2>
<p>In these types of circumstances one can often make use of Python’s built-in
<a href="https://docs.python.org/2/library/stringio.html"><code class="language-plaintext highlighter-rouge">StringIO</code></a> module.
Let’s illustrate this using <code class="language-plaintext highlighter-rouge">PIL</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">StringIO</span> <span class="kn">import</span> <span class="n">StringIO</span>
<span class="o">>>></span> <span class="n">ar</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">zeros</span><span class="p">((</span><span class="mi">50</span><span class="p">,</span><span class="mi">50</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span> <span class="c1"># The array we want to get a png byte string for.
</span><span class="o">>>></span> <span class="n">img</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">fromarray</span><span class="p">(</span><span class="n">ar</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">img</span> <span class="o">=</span> <span class="n">img</span><span class="o">.</span><span class="n">convert</span><span class="p">(</span><span class="s">'RGB'</span><span class="p">)</span> <span class="c1"># Need to convert to RGB to save as PNG.
</span><span class="o">>>></span> <span class="n">output</span> <span class="o">=</span> <span class="n">StringIO</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">img</span><span class="o">.</span><span class="n">save</span><span class="p">(</span><span class="n">output</span><span class="p">,</span> <span class="nb">format</span><span class="o">=</span><span class="s">"PNG"</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">contents</span> <span class="o">=</span> <span class="n">output</span><span class="o">.</span><span class="n">getvalue</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">output</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="o">>>></span> <span class="k">assert</span><span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">contents</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">))</span>
</code></pre></div></div>
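<p>The session above targets Python 2, where <code class="language-plaintext highlighter-rouge">StringIO</code> objects hold byte strings.
On Python 3 the <code class="language-plaintext highlighter-rouge">StringIO</code> module no longer exists and <code class="language-plaintext highlighter-rouge">io.BytesIO</code> plays the
same role for binary data. The sketch below shows the same idea using only the
standard library; <code class="language-plaintext highlighter-rouge">save_report</code> is a hypothetical “save to file” function
invented for the illustration, standing in for any function that accepts a
file handle.</p>

```python
import io


def save_report(fh):
    """Hypothetical "save to file" function that writes bytes to a handle."""
    fh.write(b"header\n")
    fh.write(b"payload\n")


output = io.BytesIO()         # In-memory stand-in for a file opened in 'wb' mode.
save_report(output)
contents = output.getvalue()  # Everything written so far, as a byte string.
output.close()

assert isinstance(contents, bytes)
print(contents)
```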
<h2 id="strategy-2-write-read-delete">Strategy 2: write, read, delete</h2>
<p>However, one cannot use the approach above with <code class="language-plaintext highlighter-rouge">skimage.io.imsave</code> as it
does not provide a means to specify the format (the format seems to be
“automagically” determined from the file name). So we are forced to save the
image to disk and then read the contents of the file.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">os</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">skimage.io</span> <span class="kn">import</span> <span class="n">imsave</span>
<span class="o">>>></span> <span class="n">imsave</span><span class="p">(</span><span class="s">'tmp.png'</span><span class="p">,</span> <span class="n">ar</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">contents</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="s">'tmp.png'</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">os</span><span class="o">.</span><span class="n">unlink</span><span class="p">(</span><span class="s">'tmp.png'</span><span class="p">)</span>
<span class="o">>>></span> <span class="k">assert</span><span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">contents</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">))</span>
</code></pre></div></div>
<h2 id="strategy-3-create-a-context-manager">Strategy 3: create a context manager</h2>
<p>The code above is really ugly. What we want is something that can give us
a relatively safe temporary file path and delete it once we are done with it.
This is what Python’s context managers are for. Context managers are what let
you use the <code class="language-plaintext highlighter-rouge">with</code> statement for opening files etc. Jeff Preshing has
written a nice tutorial on context managers: <a href="http://preshing.com/20110920/the-python-with-statement-by-example/">The Python “with” Statement by
Example</a>.</p>
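<p>As a rough illustration of what the <code class="language-plaintext highlighter-rouge">with</code> statement does, the sketch below
records the order in which the two hooks of a context manager are called. The
<code class="language-plaintext highlighter-rouge">Tracer</code> class is a hypothetical example written for this purpose. Note that
<code class="language-plaintext highlighter-rouge">__exit__</code> still runs when the body raises an exception, which is what makes
context managers suitable for clean-up work.</p>

```python
class Tracer(object):
    """Hypothetical context manager that records when each hook is called."""

    def __init__(self, log):
        self.log = log

    def __enter__(self):
        self.log.append("enter")
        return self

    def __exit__(self, exc_type, exc_value, tb):
        # Called when the block ends, whether normally or via an exception.
        self.log.append("exit")
        return False  # Do not suppress exceptions raised in the block.


log = []
try:
    with Tracer(log):
        log.append("body")
        raise ValueError("something went wrong")
except ValueError:
    log.append("handled")

print(log)
```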
<p>Here I will use a test-driven development (TDD) approach to illustrate how we
can implement a context manager to help us work more safely with temporary file
paths. So, before we start working on an implementation, let us specify
the desired behaviour as a test. Add the code below to a file named
<code class="language-plaintext highlighter-rouge">tempfilepath.py</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">os.path</span>
    <span class="n">fpath</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">with</span> <span class="n">TemporaryFilePath</span><span class="p">()</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
        <span class="k">assert</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">))</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
            <span class="n">fh</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'Testing opening and writing...'</span><span class="p">)</span>
        <span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span>
    <span class="k">assert</span><span class="p">(</span><span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">fpath</span><span class="p">))</span>
</code></pre></div></div>
<p>The code above will raise a <code class="language-plaintext highlighter-rouge">NameError</code> stating that
<code class="language-plaintext highlighter-rouge">TemporaryFilePath</code> is not defined. Great, now we can start adding an
implementation to make the tests pass. I will do this incrementally as it is a
useful illustration of some of the aspects of TDD.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">os.path</span>
    <span class="n">fpath</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">with</span> <span class="n">TemporaryFilePath</span><span class="p">()</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
        <span class="k">assert</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">))</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
            <span class="n">fh</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'Testing opening and writing...'</span><span class="p">)</span>
        <span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span>
    <span class="k">assert</span><span class="p">(</span><span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">fpath</span><span class="p">))</span>
</code></pre></div></div>
<p>We now get the error message below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
File "tempfilepath.py", line 7, in <module>
with TemporaryFilePath() as tmp:
AttributeError: __exit__
</code></pre></div></div>
<p>In true TDD style let us add a minimal implementation to make the test pass.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="k">pass</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">os.path</span>
    <span class="n">fpath</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">with</span> <span class="n">TemporaryFilePath</span><span class="p">()</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
        <span class="k">assert</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">))</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
            <span class="n">fh</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'Testing opening and writing...'</span><span class="p">)</span>
        <span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span>
    <span class="k">assert</span><span class="p">(</span><span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">fpath</span><span class="p">))</span>
</code></pre></div></div>
<p>The implementation now gives the error below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
File "tempfilepath.py", line 10, in <module>
with TemporaryFilePath() as tmp:
AttributeError: __enter__
</code></pre></div></div>
<p>Let us add the minimal implementation to fix this.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">pass</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="k">pass</span>
</code></pre></div></div>
<p>Which reveals the error below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
File "tempfilepath.py", line 14, in <module>
assert(os.path.isfile(tmp.fpath))
AttributeError: 'NoneType' object has no attribute 'fpath'
</code></pre></div></div>
<p>Can you work out what we need to do to fix this? This is a bit subtle, and it
caught me out. The clue is that the <code class="language-plaintext highlighter-rouge">tmp</code> variable is of type <code class="language-plaintext highlighter-rouge">NoneType</code>, whereas
it should have been a <code class="language-plaintext highlighter-rouge">TemporaryFilePath</code> instance. The <code class="language-plaintext highlighter-rouge">with</code> statement binds
<code class="language-plaintext highlighter-rouge">tmp</code> to whatever the <code class="language-plaintext highlighter-rouge">__enter__</code>
method returns, and ours does not return anything. Let us fix it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="k">pass</span>
</code></pre></div></div>
<p>Now the context manager returns the object type we expect.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
File "tempfilepath.py", line 14, in <module>
assert(os.path.isfile(tmp.fpath))
AttributeError: 'TemporaryFilePath' object has no attribute 'fpath'
</code></pre></div></div>
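<p>The role of the <code class="language-plaintext highlighter-rouge">__enter__</code> return value can also be seen in isolation. The two
classes below are hypothetical and exist only for this comparison: the name
after <code class="language-plaintext highlighter-rouge">as</code> is bound to whatever <code class="language-plaintext highlighter-rouge">__enter__</code> returns, not to the object that
was created in the <code class="language-plaintext highlighter-rouge">with</code> line.</p>

```python
class ReturnsNothing(object):
    """Hypothetical context manager whose __enter__ has no return statement."""

    def __enter__(self):
        pass  # Implicitly returns None.

    def __exit__(self, exc_type, exc_value, tb):
        pass


class ReturnsSelf(object):
    """Hypothetical context manager whose __enter__ returns the instance."""

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, tb):
        pass


with ReturnsNothing() as a:
    first = a   # Bound to None, so attribute access on it would fail.

with ReturnsSelf() as b:
    second = b  # Bound to the ReturnsSelf instance.

print(type(first).__name__)
print(type(second).__name__)
```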
<p>Time to add the <code class="language-plaintext highlighter-rouge">fpath</code> attribute.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="s">'tmp.txt'</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="k">pass</span>
</code></pre></div></div>
<p>Now we are starting to get to the centre of the desired functionality.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
File "tempfilepath.py", line 17, in <module>
assert(os.path.isfile(tmp.fpath))
AssertionError
</code></pre></div></div>
<p>At this stage we just want to get the tests to pass so we add a “dumb” implementation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="s">'tmp.txt'</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
            <span class="k">pass</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="k">pass</span>
</code></pre></div></div>
<p>Which gets us to the second assertion statement.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
File "tempfilepath.py", line 23, in <module>
assert(not os.path.isfile(fpath))
AssertionError
</code></pre></div></div>
<p>Basically, we need to add some clean up functionality to the <code class="language-plaintext highlighter-rouge">__exit__</code> function.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>

<span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="s">'tmp.txt'</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
            <span class="k">pass</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="n">os</span><span class="o">.</span><span class="n">unlink</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">)</span>
</code></pre></div></div>
<p>And that makes all the tests pass. However, this code still has the ugly
side-effect of hijacking the <code class="language-plaintext highlighter-rouge">tmp.txt</code> file. It is time to refactor the code
to make it less nasty. Let us make use of the
<a href="https://docs.python.org/2/library/tempfile.html"><code class="language-plaintext highlighter-rouge">tempfile</code></a> module.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">tempfile</span>

<span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">name</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="n">os</span><span class="o">.</span><span class="n">unlink</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">)</span>
</code></pre></div></div>
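<p>As an aside, the effect of <code class="language-plaintext highlighter-rouge">delete=False</code> can be checked on its own. The sketch
below uses only the standard library and invents nothing beyond the variable
names: the temporary file survives the closing of its handle, so another
function can open it by path, and removing it becomes our responsibility.</p>

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    fpath = tmp.name

# The handle is closed here, but delete=False leaves the file on disk.
exists_after_close = os.path.isfile(fpath)

os.unlink(fpath)  # Cleaning up is now up to us.
exists_after_unlink = os.path.isfile(fpath)

print(exists_after_close)
print(exists_after_unlink)
```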
<p>Great, everything is working nicely. The only problem is that in order to be
able to save an image in png file format we need to be able to specify the
suffix of the file name.</p>
<p>At this stage it is very tempting to simply add the desired functionality.
This is where test-driven development really requires discipline. Let us be good
practitioners of TDD and add a test specifying the desired behaviour first.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">os.path</span>
    <span class="n">fpath</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">with</span> <span class="n">TemporaryFilePath</span><span class="p">()</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
        <span class="k">assert</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">))</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
            <span class="n">fh</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'Testing opening and writing...'</span><span class="p">)</span>
        <span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span>
    <span class="k">assert</span><span class="p">(</span><span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">fpath</span><span class="p">))</span>

    <span class="k">with</span> <span class="n">TemporaryFilePath</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="s">'.png'</span><span class="p">)</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
        <span class="k">assert</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.png'</span><span class="p">))</span>
</code></pre></div></div>
<p>Great, we now have a failing test.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
File "tempfilepath.py", line 27, in <module>
with TemporaryFilePath(suffix='.png') as tmp:
TypeError: __init__() got an unexpected keyword argument 'suffix'
</code></pre></div></div>
<p>Let us continue to work incrementally and only fix the error reported.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">name</span>
</code></pre></div></div>
<p>This moves us on to the actual assertion that we wanted to test.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ python tempfilepath.py
Traceback (most recent call last):
File "tempfilepath.py", line 28, in <module>
assert(tmp.fpath.endswith('.png'))
AssertionError
</code></pre></div></div>
<p>Let us try to fix it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span>
        <span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="n">suffix</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">name</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="n">os</span><span class="o">.</span><span class="n">unlink</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">)</span>
</code></pre></div></div>
<p>However, this results in a horrible error message.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Traceback (most recent call last):
  File "tempfilepath.py", line 20, in <module>
    with TemporaryFilePath() as tmp:
  File "tempfilepath.py", line 8, in __init__
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tempfile.py", line 462, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tempfile.py", line 237, in _mkstemp_inner
    file = _os.path.join(dir, pre + name + suf)
TypeError: cannot concatenate 'str' and 'NoneType' objects
</code></pre></div></div>
<p>The error message above is telling us that the <code class="language-plaintext highlighter-rouge">suffix</code>
argument must be a string: it gets concatenated onto the file name, so it cannot
default to <code class="language-plaintext highlighter-rouge">None</code>. We can verify this by looking at the Python 2
<a href="https://docs.python.org/2/library/tempfile.html#tempfile.NamedTemporaryFile">tempfile.NamedTemporaryFile</a>
documentation, which shows that the default is an empty string. Let us fix our code.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="s">''</span><span class="p">):</span>
        <span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="n">suffix</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">name</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="n">os</span><span class="o">.</span><span class="n">unlink</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">)</span>
</code></pre></div></div>
<p>And now all the tests pass. Below is the code in all its glory.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">os</span>
<span class="kn">import</span> <span class="nn">tempfile</span>

<span class="k">class</span> <span class="nc">TemporaryFilePath</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Context manager for handling temporary file paths."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">suffix</span><span class="o">=</span><span class="s">''</span><span class="p">):</span>
        <span class="k">with</span> <span class="n">tempfile</span><span class="o">.</span><span class="n">NamedTemporaryFile</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="n">suffix</span><span class="p">,</span> <span class="n">delete</span><span class="o">=</span><span class="bp">False</span><span class="p">)</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
            <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">name</span>

    <span class="k">def</span> <span class="nf">__enter__</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="bp">self</span>

    <span class="k">def</span> <span class="nf">__exit__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="nb">type</span><span class="p">,</span> <span class="n">value</span><span class="p">,</span> <span class="n">tb</span><span class="p">):</span>
        <span class="n">os</span><span class="o">.</span><span class="n">unlink</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">)</span>

<span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">"__main__"</span><span class="p">:</span>
    <span class="kn">import</span> <span class="nn">os.path</span>
    <span class="n">fpath</span> <span class="o">=</span> <span class="bp">None</span>
    <span class="k">with</span> <span class="n">TemporaryFilePath</span><span class="p">()</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
        <span class="k">assert</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">))</span>
        <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span> <span class="k">as</span> <span class="n">fh</span><span class="p">:</span>
            <span class="n">fh</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">'Testing opening and writing...'</span><span class="p">)</span>
        <span class="n">fpath</span> <span class="o">=</span> <span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span>
    <span class="k">assert</span><span class="p">(</span><span class="ow">not</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">isfile</span><span class="p">(</span><span class="n">fpath</span><span class="p">))</span>

    <span class="k">with</span> <span class="n">TemporaryFilePath</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="s">'.png'</span><span class="p">)</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
        <span class="k">assert</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">'.png'</span><span class="p">))</span>
</code></pre></div></div>
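<p>The same idea can also be sketched as a generator-based context manager using
<code class="language-plaintext highlighter-rouge">contextlib.contextmanager</code> from the standard library. This is
only an illustrative alternative, not a replacement for the class above: the name
<code class="language-plaintext highlighter-rouge">temporary_file_path</code> is made up for this example, and the
<code class="language-plaintext highlighter-rouge">try</code>/<code class="language-plaintext highlighter-rouge">finally</code> block is there so that the
file is removed even if the body of the <code class="language-plaintext highlighter-rouge">with</code> statement raises.</p>

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_file_path(suffix=''):
    """Yield the path to a temporary file and remove the file on exit."""
    # Create the file and close the handle straight away; only the path is kept.
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
        fpath = tmp.name
    try:
        yield fpath
    finally:
        # Runs whether or not the body of the with statement raised.
        if os.path.exists(fpath):
            os.unlink(fpath)
```

<p>Here the <code class="language-plaintext highlighter-rouge">with</code> statement binds the path itself rather than
an object with an <code class="language-plaintext highlighter-rouge">fpath</code> attribute, which is a design choice of
this sketch.</p>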
<p>We can now use this to get the content of our numpy array as an image in byte
string representation.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">tempfilepath</span> <span class="kn">import</span> <span class="n">TemporaryFilePath</span>
<span class="o">>>></span> <span class="k">with</span> <span class="n">TemporaryFilePath</span><span class="p">(</span><span class="n">suffix</span><span class="o">=</span><span class="s">'.png'</span><span class="p">)</span> <span class="k">as</span> <span class="n">tmp</span><span class="p">:</span>
<span class="o">...</span>     <span class="n">imsave</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="n">ar</span><span class="p">)</span>
<span class="o">...</span>     <span class="n">content</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">tmp</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
<span class="o">...</span>
<span class="o">>>></span> <span class="k">assert</span><span class="p">(</span><span class="nb">isinstance</span><span class="p">(</span><span class="n">content</span><span class="p">,</span> <span class="nb">bytes</span><span class="p">))</span>
</code></pre></div></div>
How to display objects as images in IPython2015-03-08T00:00:00+00:00http://tjelvarolsson.com/blog/how-to-display-objects-as-images-in-ipython<p>IPython has some neat functionality for displaying objects in ways that can be
more informative than the standard <code class="language-plaintext highlighter-rouge">__repr__</code> representation. Both the
IPython notebook and qtconsole support the display of png, jpeg and svg images.
Furthermore, the IPython notebook can also display html, javascript, json and
latex.</p>
<p>If you simply want to display an image you can achieve this using the
<code class="language-plaintext highlighter-rouge">IPython.display.Image</code> class.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">IPython.display</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="o">>>></span> <span class="n">image</span> <span class="o">=</span> <span class="n">Image</span><span class="p">(</span><span class="s">'tiny_tjelvar.png'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">image</span>
</code></pre></div></div>
<p>Evaluating <code class="language-plaintext highlighter-rouge">image</code> on the last line would result in the image below being displayed in the
IPython qtconsole/notebook.</p>
<p><img src="/images/tiny_tjelvar.png" alt="Tiny image of Tjelvar." /></p>
<p>However, suppose that you wanted to create an image representation of your own
class. Let us illustrate this with the hypothetical example of an
<code class="language-plaintext highlighter-rouge">ImageFile</code> class that simply stores the location of an image.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ImageFile</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Class for storing an image location."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">fpath</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">fpath</span>

    <span class="k">def</span> <span class="nf">_repr_png_</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">return</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</code></pre></div></div>
<p>The class above would be used along the following lines.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">im_file</span> <span class="o">=</span> <span class="n">ImageFile</span><span class="p">(</span><span class="s">'tiny_tjelvar.png'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">im_file</span>
</code></pre></div></div>
<p><img src="/images/tiny_tjelvar.png" alt="Tiny image of Tjelvar." /></p>
<p>The example above would fall over if the file was not in png format. Let us
make the code a little bit more robust by adding a naive file format check.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ImageFile</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Class for storing an image location."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">fpath</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">fpath</span>
        <span class="bp">self</span><span class="o">.</span><span class="nb">format</span> <span class="o">=</span> <span class="n">fpath</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'.'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">_repr_png_</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="nb">format</span> <span class="o">==</span> <span class="s">'png'</span><span class="p">:</span>
            <span class="k">return</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</code></pre></div></div>
<p>Finally, let us extend the class to be able to deal with jpeg and svg images as
well.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">class</span> <span class="nc">ImageFile</span><span class="p">(</span><span class="nb">object</span><span class="p">):</span>
    <span class="s">"""Class for storing an image location."""</span>

    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">fpath</span><span class="p">):</span>
        <span class="bp">self</span><span class="o">.</span><span class="n">fpath</span> <span class="o">=</span> <span class="n">fpath</span>
        <span class="bp">self</span><span class="o">.</span><span class="nb">format</span> <span class="o">=</span> <span class="n">fpath</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">'.'</span><span class="p">)[</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span>

    <span class="k">def</span> <span class="nf">_repr_png_</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="nb">format</span> <span class="o">==</span> <span class="s">'png'</span><span class="p">:</span>
            <span class="k">return</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_repr_jpeg_</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="nb">format</span> <span class="o">==</span> <span class="s">'jpeg'</span> <span class="ow">or</span> <span class="bp">self</span><span class="o">.</span><span class="nb">format</span> <span class="o">==</span> <span class="s">'jpg'</span><span class="p">:</span>
            <span class="k">return</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'rb'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>

    <span class="k">def</span> <span class="nf">_repr_svg_</span><span class="p">(</span><span class="bp">self</span><span class="p">):</span>
        <span class="k">if</span> <span class="bp">self</span><span class="o">.</span><span class="nb">format</span> <span class="o">==</span> <span class="s">'svg'</span><span class="p">:</span>
            <span class="k">return</span> <span class="nb">open</span><span class="p">(</span><span class="bp">self</span><span class="o">.</span><span class="n">fpath</span><span class="p">,</span> <span class="s">'r'</span><span class="p">)</span><span class="o">.</span><span class="n">read</span><span class="p">()</span>
</code></pre></div></div>
<p>You now have a class that stores image file paths and has the capability to
interactively display the images using either IPython qtconsole or IPython
notebook.</p>
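<p>The file format check can be made slightly less naive. The variant below is a
sketch under a few assumptions: it targets Python 3, where png and jpeg data need
to be read as bytes; it uses <code class="language-plaintext highlighter-rouge">os.path.splitext</code> and lower-cases
the extension so that a file named <code class="language-plaintext highlighter-rouge">photo.PNG</code> is also recognised;
and the <code class="language-plaintext highlighter-rouge">_read</code> helper is a name invented for this example.</p>

```python
import os


class ImageFile(object):
    """Class for storing an image location (illustrative Python 3 sketch)."""

    def __init__(self, fpath):
        self.fpath = fpath
        # splitext keeps the leading dot; lower() lets 'PNG' match 'png'.
        self.format = os.path.splitext(fpath)[1].lstrip('.').lower()

    def _read(self, mode):
        # Hypothetical helper: the with statement closes the file handle.
        with open(self.fpath, mode) as fh:
            return fh.read()

    def _repr_png_(self):
        if self.format == 'png':
            return self._read('rb')  # png data is binary

    def _repr_jpeg_(self):
        if self.format in ('jpeg', 'jpg'):
            return self._read('rb')  # jpeg data is binary

    def _repr_svg_(self):
        if self.format == 'svg':
            return self._read('r')  # svg is text (XML)
```

<p>Each method returns <code class="language-plaintext highlighter-rouge">None</code> when the format does not match,
which is the same behaviour as the class above.</p>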
<h2 id="useful-links">Useful links</h2>
<ul>
<li><a href="http://ipython.org/ipython-doc/dev/config/integrating.html">Integrating your own objects with
IPython</a></li>
<li><a href="http://nbviewer.ipython.org/github/ipython/ipython/blob/1.x/examples/notebooks/Part%205%20-%20Rich%20Display%20System.ipynb">IPython rich display
system</a></li>
<li><a href="http://nbviewer.ipython.org/github/ipython/ipython/blob/3607712653c66d63e0d7f13f073bde8c0f209ba8/docs/examples/notebooks/display_protocol.ipynb">Using the IPython display protocol for your own
objects</a></li>
</ul>
NorDevCon a day filled with passion!2015-03-06T00:00:00+00:00http://tjelvarolsson.com/blog/nordevcon-a-day-filled-with-passion<p><img src="http://photos1.meetupstatic.com/photos/event/3/a/1/1/600_434894865.jpeg" alt="NorDevCon 2015 closing keynote audio-visual bonanza." /></p>
<p>This year's Norfolk Developer Conference (NorDevCon) took place on the 27th of
February and what a feast it was! The conference, which traditionally has a heavy
focus on Tech and Agile, was this year bolstered by a new business track as well
as more design sessions.</p>
<p>The day started off with an introduction by Paul Grenyer
(<a href="https://twitter.com/pjgrenyer">@pjgrenyer</a>) who organised the event, followed
by a short speech by Huw Sayer (<a href="https://twitter.com/HuwSayer">@HuwSayer</a>)
congratulating Norwich on now being an official #TechCluster on the
(<a href="https://twitter.com/TechCityUK">@TechCityUK</a>) map. This is a great
achievement and it was interesting to hear about <a href="http://norfolkchamber.co.uk/knowledge/guest-blogs/how-build-innovation-and-tech-cluster-norfolk">the journey that led to the
official
recognition</a>.</p>
<p>The event was then officially opened by Jon Skeet
(<a href="https://twitter.com/jonskeet">@jonskeet</a>) giving the opening keynote, which
was all about passion! Jon started out on the premise that conferences are
useful for connecting and inspiring people and this comes from passion. In fact
Jon went as far as saying that conferences are all about inspiration and not
about learning.</p>
<p>One take-home message from Jon’s talk was that we cannot always work on things
that we are passionate about. However, we can always strive to find something
interesting in the dullest of tasks, for example by trying to understand the
problem space or by working out how to do something as well as possible. In
other words, be curious. If nothing else it will make you a more interesting
person to work with.</p>
<p>How do you share and grow passion with people that you meet and work with? Jon
suggested that you should listen, encourage, nurture and feed other people’s
passion. However, which ideas spread will be partly down to charisma. This
means that we all have a role to make sure that this does not become too
unbalanced. In other words we all have a role in echoing what other people have
done; to amplify other people’s message without taking them over. One of the
great benefits of nurturing passion is that a team with shared passion can
really “motor along” and achieve great things.</p>
<p>The dangers of passion were also dealt with. Jon identified three types of
destructive behaviour and outlined ways in which to deal with them.</p>
<p>The first scenario was when two people on a team disagree. It happens a lot. It
happens a lot with good people. It is inevitable as more often than not there
is more than one solution. What to do? The team needs to pick one way forward.
However, it is at this point that the whole team needs to look after the person
who “lost”. Otherwise the dynamics of the team can get unbalanced. In
particular if the same person “loses” several times there is the danger that
that person will feel that his/her opinions and suggestions are not appreciated
and he/she may well start looking for a different place to work and the team
will lose diversity.</p>
<p>The second danger scenario identified was inter-team disagreement. For example
the fast and furious team having to work with the really careful team. In this
case you get positive feedback loops in both camps, where all the negative
views are constantly being reinforced by the people you trust, i.e. the people
on your team. In this scenario there will be a need for compromise and the
teams will need to talk face-to-face.</p>
<p>The third danger scenario was a team with no disagreement at all. Although a
team with shared passion can really “motor along”, what happens if the team was
running in the wrong direction from the start? To avoid this scenario Jon’s
suggested solution was to take a step back and make sure that you think about
the business value. What is the ultimate goal of the project? Will it make
people happy?</p>
<p>After describing the dangers Jon did add the caveat that clearly the easy
solutions outlined above are not at all easy in real life.</p>
<p>Finally, Jon gave us a challenge. The challenge was to bathe in
passion! If a speaker did not inspire us with passion we should leave the
session and find passion somewhere else. When we introduced ourselves to
each other in the breaks we should do so with passion!</p>
<p>With those thoughts embedded in our minds we then went out to seek our fortune
in the rest of the conference.</p>
<p>I had a great time! I met up with old friends and I made new friends. I had my
appetite whetted for <a href="https://www.docker.com">Docker</a> by Dom Davis
(<a href="https://twitter.com/idomdavis">@idomdavis</a>). I learnt about browser APIs from
Ruth John (<a href="https://twitter.com/Rumyra">@Rumyra</a>), via the medium of
<a href="http://en.wikipedia.org/wiki/VJing">VJing</a>! I was inspired by Seb Rose
(<a href="https://twitter.com/sebrose">@sebrose</a>) to write more and better tests. I had
minor revelations on how to improve my coding style during the talk by Kevlin
Henney (<a href="https://twitter.com/KevlinHenney">@KevlinHenney</a>). It was great!</p>
<p>The day was rounded off by a fast and furious closing keynote by Harry Harrold
(<a href="https://twitter.com/harryharrold">@harryharrold</a>) and Rupert Redington
(<a href="https://twitter.com/rupertredington">@rupertredington</a>). It was an
audio-visual bonanza, shining a light on the Agile manifesto, ending up in
fireworks!</p>
<p>It was a day filled with passion and I am already looking forward to next year’s
instalment of NorDevCon!</p>
Three essential tips for improving your scientific code2015-02-28T00:00:00+00:00http://tjelvarolsson.com/blog/three-essential-tips-for-improving-your-scientific-code<p>Writing scientific code is not dissimilar to writing any other type of code.
What is different is that many people who end up coding during their PhD do
not have or get any formal training in software development best practices.</p>
<p>For me it was very much a case of trial and error and picking things up as I
went along. In the research group that I was in, my peers were doing lab work
culturing cells and characterising proteins, so there was no one to discuss
programming with. So I learnt by reading; reading blogs; reading magazines;
reading books; reading other people’s code. However, it was a slow process
sorting the wheat from the chaff. Furthermore many of the things that I read
seemed a bit over the top for a one-man band.</p>
<p>With hindsight I realise that another difficulty in learning software
development best practices is that sometimes the most fundamental aspects of
software development are not explicitly stated as they are taken for granted by
everyone that uses them.</p>
<p>Here are the three most valuable things that I have learnt both from my own
trial and error and by working with great software developers since finishing
my PhD. For anyone developing software professionally, none of this will be new.
However, if you are a scientist who has drifted into programming, these are the
three most important things that you can do to improve your productivity and
the quality of your code.</p>
<h2 id="use-version-control">Use version control</h2>
<p>Using version control is one of the simplest ways of increasing your
productivity. The reason is that it reduces your fear of changing existing
code, as you can always roll back to a previously working state. One of the tell-tale
signs that you need to use version control is if your project directory
contains files named along the lines of those below (as you can tell, I used to do
this before I saw the light).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>new_simulation.py
older_simulation.py
old_simulation.py
simulation.py
simumlation_100606.py
test_simulation.py
</code></pre></div></div>
<p>When I started using version control
<a href="https://subversion.apache.org/">Subversion</a> was the best open source tool
available. However, it was difficult to set up and I was never sure I got it
right. These days you have a choice of two largely equivalent systems:
<a href="http://git-scm.com/">Git</a> or <a href="http://mercurial.selenic.com/">Mercurial</a>. These
are very easy to set up and use.</p>
<p>Here I will illustrate how to use Git from the command line. To start a new project:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git init simulation
cd simulation
</code></pre></div></div>
<p>This creates a directory named <code class="language-plaintext highlighter-rouge">simulation</code>. The files in this directory can
now be managed by Git. Suppose that we create a <code class="language-plaintext highlighter-rouge">README</code> file in this
directory and want to add it to version control.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git add README
</code></pre></div></div>
<p>When a file is added it is staged to be committed. Let’s commit it as a
snapshot to version control.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>git commit -m "Added README file."
</code></pre></div></div>
<p>That’s it. You can keep using <code class="language-plaintext highlighter-rouge">git add</code> and <code class="language-plaintext highlighter-rouge">git commit</code> to add incremental
changes to your code base until you find that you need to use some more
powerful features of Git, at which point you can learn more about it.</p>
<p>If you already have a project that you want to start tracking using Git you can
use the commands below.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>cd my_existing_project
git init
git add "*"
git commit -m "Initial file import."
</code></pre></div></div>
<p>Note that the commands above will put all files in the project directory under
version control, so do not use them if your project directory contains
machine-generated files, for example output files from your program.</p>
<p>Once you have got a little bit of familiarity with Git or Mercurial I would
strongly recommend that you set up an account with
<a href="https://bitbucket.org/">BitBucket</a> or <a href="https://github.com/">GitHub</a> and host
your code there. This has several advantages: you can stop worrying about your
computer crashing and losing all your work, you can access your code from any
machine with an internet connection, and you can collaborate with other people on
your code.</p>
<h2 id="write-code-so-that-it-can-be-understood-by-someone-else">Write code so that it can be understood by someone else</h2>
<p>One thing that is different when developing code in an academic environment is
that it is not unusual to start a project from scratch. This is quite uncommon
when working for a company that develops software. In the latter case one of
the first tasks on the job is to get familiar with the company’s code base.
This means reading code, which is when one learns to appreciate:</p>
<ul>
<li>Explicit names of variables, functions and classes</li>
<li>Comments that explain the intent of code whenever it is not immediately intuitive</li>
<li>Documentation describing the overall architecture of the system</li>
<li>Consistent coding style</li>
</ul>
<p>However, when one starts coding by oneself on a new project none of the above
matters as the logic behind every decision and poorly named variable is
immediately obvious to oneself. Anyway, one tells oneself, “I will deal with
those niceties once the program is working”.</p>
<p>However, once the program is working those things which were immediately
obvious are now obscure and anyway the program is working so one can use it to
generate some results, which is much more interesting than cleaning up code.
Then one realises that the results are not quite as expected and something is
not quite right about the logic of the program. However, the logic of the
program is not immediately clear…</p>
<p>So go on, name your variable <code class="language-plaintext highlighter-rouge">temperature_increase</code> instead of <code class="language-plaintext highlighter-rouge">temp_inc</code> or
<code class="language-plaintext highlighter-rouge">ti</code>. You have to type a few more letters but you will gain so much more. By
the way, does <code class="language-plaintext highlighter-rouge">temp</code> stand for <code class="language-plaintext highlighter-rouge">temporary</code> or <code class="language-plaintext highlighter-rouge">temperature</code> and does
<code class="language-plaintext highlighter-rouge">inc</code> stand for <code class="language-plaintext highlighter-rouge">increment</code> or <code class="language-plaintext highlighter-rouge">increase</code>? Also, if you find that typing
out long explicit names for variables, functions and classes is causing you
frustration then you should go on the hunt for a better text editor (I use vim)
or an integrated development environment, that understands code and offers to
complete names for you.</p>
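<p>To make the point about names concrete, here is a small made-up example. Both
functions are hypothetical and do exactly the same thing; the only difference is
how much the reader has to guess.</p>

```python
# Terse: is ti a time index, a temporary integer or a temperature increase?
def upd(t, ti):
    return t + ti


# Explicit: the names carry the meaning, so no guessing is needed.
def apply_temperature_increase(temperature, temperature_increase):
    return temperature + temperature_increase
```

<p>A call such as <code class="language-plaintext highlighter-rouge">apply_temperature_increase(temperature, temperature_increase)</code>
documents itself, whereas <code class="language-plaintext highlighter-rouge">upd(t, ti)</code> sends the reader off to
find the function definition.</p>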
<p>In terms of commenting your code, the key is to realise that you should document
the intent not the actual code. In other words, I can read your code so I don’t
need it re-iterated using plain English. However, I cannot read your mind so
please tell me what the intention was.</p>
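<p>As a minimal sketch of the difference, consider the snippet below. The sensor,
its <code class="language-plaintext highlighter-rouge">-999.0</code> sentinel value and both function names are made up for
illustration. The comment explains why the filtering happens, which the code alone cannot
tell you.</p>

```python
def celsius_to_kelvin(temperature_celsius):
    """Convert a temperature from degrees Celsius to kelvin."""
    return temperature_celsius + 273.15


def usable_readings(readings_celsius):
    # Intent: the (hypothetical) sensor reports -999.0 when it fails to take
    # a reading, so those sentinel values must be dropped before any
    # statistics are computed. A comment such as "skip values equal to
    # -999.0" would only repeat what the code already says.
    return [r for r in readings_celsius if r != -999.0]


readings = [21.5, -999.0, 22.0]
kelvin = [celsius_to_kelvin(r) for r in usable_readings(readings)]
print(kelvin)
```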
<p>Describing the architecture of the system is just a fancy way of saying that
you should describe how the components of your software interact with each
other. Suppose for example that you were faced with a relatively simple code
base that contained the files:</p>
<ul>
<li><code class="language-plaintext highlighter-rouge">parser.py</code></li>
<li><code class="language-plaintext highlighter-rouge">database.py</code></li>
<li><code class="language-plaintext highlighter-rouge">simulation.py</code></li>
<li><code class="language-plaintext highlighter-rouge">experiment.py</code></li>
</ul>
<p>Can you describe how these Python modules interact with and depend on each
other? What is the difference between a simulation and an experiment?</p>
<p>Now suppose that the author of this hypothetical Python package had been kind
and spent five minutes including the lines below in the README file, would it
enable you to answer the questions above?</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>README
======

parser.py - module for parsing parameter files
database.py - module for storing results
simulation.py - module for running simulations
experiment.py - template for creating a new experiment

The ``experiment.py`` template uses the parser to read
in the parameters for the experiment. The parameters
are then passed on to the simulation (``simulation.py``).

Note that when you instantiate the ``simulation.Simulation``
class you need to provide it with a ``database.Database``
instance. The latter will be used to write the simulation
results to your database of choice.
</code></pre></div></div>
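<p>To make the README concrete, here is one way the pieces it describes could fit
together. This is only a sketch: the package is hypothetical, so every class body,
method name and parameter name below is an assumption. Only the relationship (the
simulation is handed a database instance and writes its results to it) comes from the
README.</p>

```python
class Database:
    """Stand-in for database.Database: keeps results in a list."""

    def __init__(self):
        self.results = []

    def write(self, result):
        self.results.append(result)


class Simulation:
    """Stand-in for simulation.Simulation: needs a Database to write to."""

    def __init__(self, database):
        self.database = database

    def run(self, parameters):
        # Toy calculation in place of a real simulation.
        result = sum(parameters.values())
        self.database.write(result)
        return result


def parse(text):
    """Stand-in for parser.py: read 'name value' pairs, one per line."""
    parameters = {}
    for line in text.splitlines():
        name, value = line.split()
        parameters[name] = float(value)
    return parameters


# What an experiment based on the experiment.py template might do.
parameters = parse("growth_rate 0.5\ndecay_rate 0.25")
database = Database()
simulation = Simulation(database)
simulation.run(parameters)
print(database.results)  # [0.75]
```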
<p>I won’t dwell too long on coding style. Basically be consistent and try to
use the standard one for your language; e.g. if you code in Python use
<a href="https://www.python.org/dev/peps/pep-0008/">PEP8</a>, if you write C code use
<a href="http://en.wikipedia.org/wiki/The_C_Programming_Language">K&R</a> style, and so
forth. If coding style interests you, please read <a href="http://www.artima.com/weblogs/viewpost.jsp?thread=74230">Style is
Substance</a> by Ken
Arnold.</p>
<h2 id="write-tests">Write tests</h2>
<p>This is the most difficult of the tips outlined in this post. Writing good
tests is hard, and continuing to write tests as your code base grows requires
discipline. Furthermore, many scientific algorithms have a stochastic nature to
them, which further compounds the situation.</p>
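<p>One common way of taming the stochastic part is to pass the random number
generator in explicitly, so that a test can fix the seed. The sketch below uses a toy
Monte Carlo estimate of pi. The function name and the choice of seed are arbitrary and
only serve as an illustration.</p>

```python
import random


def estimate_pi(number_of_points, rng):
    """Monte Carlo estimate of pi from points drawn in the unit square."""
    inside = 0
    for _ in range(number_of_points):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / number_of_points


# The same seed gives the same sequence of random numbers, so the result
# is exactly reproducible from one test run to the next.
first = estimate_pi(10000, random.Random(42))
second = estimate_pi(10000, random.Random(42))
assert first == second

# A statistical check needs a tolerance wide enough that a correct
# implementation is very unlikely to fail it.
assert abs(first - 3.14159) < 0.1
```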
<p>First of all, before you start writing tests, make sure that you find a suitable
testing framework so that you do not re-invent the wheel. For example, if
you are coding in Python you could use
<a href="https://docs.python.org/2/library/unittest.html">Unittest</a>.</p>
<p>If you already have code that is working but have no tests, start by adding
some integration tests. In other words treat your software as a black box that,
given a set of inputs, produces a known set of outputs. Write an automated test that
checks that this is true.</p>
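<p>A black box test of this kind might look like the sketch below. The three
functions stand in for an existing program and are invented for the example. The point is
that the test only calls the outermost entry point and compares its output with a value
that is known to be correct.</p>

```python
import unittest


def parse_parameters(lines):
    """Parse 'name = value' lines into a dictionary of floats."""
    parameters = {}
    for line in lines:
        name, value = line.split("=")
        parameters[name.strip()] = float(value)
    return parameters


def run_simulation(parameters):
    """Toy simulation: temperature after a number of constant increases."""
    return parameters["start"] + parameters["steps"] * parameters["increase"]


def run_analysis(lines):
    """Entry point of the hypothetical program: parse, then simulate."""
    return run_simulation(parse_parameters(lines))


class IntegrationTest(unittest.TestCase):

    def test_known_input_gives_known_output(self):
        # Only the entry point is called; the internals are not inspected.
        lines = ["start = 20.0", "steps = 4", "increase = 0.5"]
        self.assertAlmostEqual(run_analysis(lines), 22.0)
```

<p>Saved in a file, the test can be run with <code class="language-plaintext highlighter-rouge">python -m unittest</code>
followed by the file name.</p>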
<p>Now once you go in and work on a particular unit of your code, make sure that
you write a test for that particular unit first, then make the change that you
wanted to make.</p>
<p>When the unit that you are testing is so isolated that it does not depend on
any other code or systems (e.g. a database running in the background) then the
test is referred to as a unit test.</p>
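<p>One way of achieving that isolation is to hand the unit a simple stand-in
instead of the real dependency. In the sketch below the
<code class="language-plaintext highlighter-rouge">InMemoryDatabase</code> class and the
<code class="language-plaintext highlighter-rouge">store_mean</code> function are hypothetical. The test needs no
database server to be running.</p>

```python
import unittest


class InMemoryDatabase:
    """Stand-in for a real database, so the test needs no running server."""

    def __init__(self):
        self.rows = []

    def write(self, row):
        self.rows.append(row)


def store_mean(values, database):
    """Compute the mean of values and write it to the database."""
    mean = sum(values) / len(values)
    database.write({"mean": mean})
    return mean


class StoreMeanUnitTest(unittest.TestCase):

    def test_mean_is_written_to_database(self):
        database = InMemoryDatabase()
        store_mean([1.0, 2.0, 3.0], database)
        self.assertEqual(database.rows, [{"mean": 2.0}])
```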
<p>Unit tests have two advantages over integration tests. Firstly, they make it
easier to identify which part of your code is broken when they fail. Secondly, they
run quicker than integration tests so you can have more of them.</p>
<p>Why does the speed of the tests matter? Speed matters because once you have
automated tests in place you need to run them often, at a minimum before every
commit to version control.</p>
<p>At this point I recommend that you get a copy of Martin Fowler’s book
<a href="http://martinfowler.com/books/refactoring.html">Refactoring: Improving the Design of Existing
Code</a>. As the title suggests it
is about refactoring rather than testing. However, refactoring requires tests
and the book gives loads of practical advice on how to improve existing code by
writing tests and refactoring.</p>
<p>If you are starting out with a clean slate (i.e. no existing code), I highly
recommend that you start writing tests from the start. You could even go to the
extreme and use <a href="http://en.wikipedia.org/wiki/Test-driven_development">Test Driven
Development</a>, where you
write a test before you write any code. Initially the test will fail and then
you implement the code to make the test pass.</p>
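<p>In its simplest form the cycle could look like the sketch below, which reuses
the <code class="language-plaintext highlighter-rouge">temperature_increase</code> name from earlier in this post. The
function signature is an assumption made for the example.</p>

```python
import unittest


# Step 1: write the test first. Run on its own at this point, it fails
# with a NameError because temperature_increase() does not exist yet.
class TemperatureIncreaseTest(unittest.TestCase):

    def test_increase_is_final_minus_initial(self):
        self.assertEqual(temperature_increase(20, 25), 5)


# Step 2: write just enough code to make the test pass.
def temperature_increase(initial, final):
    return final - initial
```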
<p>Test driven development is a bit more complicated than what I outlined above;
notably, it includes a refactoring step. However, I will not go into more
detail here. If test driven development sounds interesting and you are
interested in web development as well I highly recommend Harry Percival’s book
<a href="http://chimera.labs.oreilly.com/books/1234000000754">Test-Driven Development with
Python</a>.</p>
<p>This all sounds like a lot of hard work, why do I need tests anyway? I won’t
dwell on this too much. However, if you don’t have tests how can you have any
confidence that your code is doing what it is supposed to do? Okay, so you have
done manual testing and the results are as expected. Fine, now suppose that you
want to add another feature. How can you be sure that you will not introduce a
bug somewhere else? Do you want to do all that manual testing again? If you do
not have tests you will get to the stage where you are afraid to touch the code
for fear of breaking it.</p>
<h2 id="summary">Summary</h2>
<p>This post turned out a bit longer than I initially thought. However, the
take-home message is simple:</p>
<ul>
<li>Use version control</li>
<li>Write code so that it can be understood by someone else</li>
<li>Write tests</li>
</ul>
<p>Using version control is easy: do it!</p>
<p>Another person who is likely to need to get familiar with your code is <em>you in
six months’ time</em>, so be kind and make your code easy to understand.</p>
<p>Writing good tests is initially hard, and the only way to learn is by practice (I’m still
learning). However, do write them, otherwise your code will hold you to ransom.</p>
<p>If you already do all of the above, great, I’m preaching to the converted,
please forward this post to someone less experienced than yourself.</p>
<h2 id="acknowledgements">Acknowledgements</h2>
<p>I’d like to thank Clare Macrae
(<a href="https://twitter.com/ClareMacraeUK">@ClareMacraeUK</a>) for helpful discussions
and feedback.</p>
How to save RGB images using PyLibTiff2015-02-18T00:00:00+00:00http://tjelvarolsson.com/blog/how-to-save-rgb-images-using-pylibtiff<p>In the <a href="/blog/saving-16bit-tiff-files-using-python/">previous post</a>
I showed how to read and write tiff files in Python using PyLibTiff. Here I
will illustrate how to use PyLibTiff to create an RGB tiff file.</p>
<figure>
<img src="/images/canny-fill-holes-segmentation.jpg" alt="Segmented coins" />
<figcaption>
Figure illustrating the Canny edge detection algorithm followed
by a binary filling of holes. The red, green and blue channels represent
the initial edges, the segments identified and the raw data respectively.
</figcaption>
</figure>
<p>The <a href="https://code.google.com/p/pylibtiff/">PyLibTiff on-line documentation</a> is
minimal, so I started off by simply trying to save a list containing three
<code class="language-plaintext highlighter-rouge">numpy.arrays</code>. This was a guess based upon how I would have liked the
package to work.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">libtiff</span> <span class="kn">import</span> <span class="n">TIFF</span>
<span class="o">>>></span> <span class="n">r</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">50</span><span class="p">,</span><span class="mi">50</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span> <span class="o">*</span> <span class="mi">30</span>
<span class="o">>>></span> <span class="n">g</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">50</span><span class="p">,</span><span class="mi">50</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span> <span class="o">*</span> <span class="mi">90</span>
<span class="o">>>></span> <span class="n">b</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">50</span><span class="p">,</span><span class="mi">50</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span> <span class="o">*</span> <span class="mi">120</span>
<span class="o">>>></span> <span class="n">tiff</span> <span class="o">=</span> <span class="n">TIFF</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'initial-test.tiff'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">write_image</span><span class="p">([</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">])</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>
<p>To my surprise this created a tiff file without complaining. However,
inspecting the tiff file using
<a href="http://www.sno.phy.queensu.ca/~phil/exiftool/">exiftool</a> revealed that the
file only had one sample per pixel, i.e. a single channel.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ exiftool initial-test.tiff
...
Bits Per Sample : 8
Compression : Uncompressed
Photometric Interpretation : BlackIsZero
...
</code></pre></div></div>
<p>I had in fact produced a multi-page tiff file. After some head scratching I
started digging around in PyLibTiff’s built-in documentation using <code class="language-plaintext highlighter-rouge">pydoc</code>.
This was very informative. It revealed that the <code class="language-plaintext highlighter-rouge">write_image()</code> function has
an argument named <code class="language-plaintext highlighter-rouge">write_rgb</code>, which by default is set to <code class="language-plaintext highlighter-rouge">False</code>; so I
set it to <code class="language-plaintext highlighter-rouge">True</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">tiff</span> <span class="o">=</span> <span class="n">TIFF</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'rgb-test.tiff'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">write_image</span><span class="p">([</span><span class="n">r</span><span class="p">,</span> <span class="n">g</span><span class="p">,</span> <span class="n">b</span><span class="p">],</span> <span class="n">write_rgb</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>
<p>Inspecting the new file revealed that it was indeed an RGB tiff file!</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ exiftool rgb-test.tiff
...
Bits Per Sample : 8 8 8
Compression : Uncompressed
Photometric Interpretation : RGB
...
</code></pre></div></div>
<h2 id="how-is-this-useful">How is this useful?</h2>
<p>Microscopy data often contains several channels of information, red and green
fluorescence are common, so it is useful to be able to save these to the red
and green channels respectively.</p>
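<p>If only two channels were recorded, the third still has to be supplied. A
minimal sketch, assuming the two fluorescence channels are 8-bit arrays of the same shape
(the arrays below are synthetic stand-ins, not real microscopy data), is to pad with an
empty blue channel.</p>

```python
import numpy as np

# Stand-ins for two fluorescence channels; real data would come from
# the microscope.
red_fluorescence = np.full((50, 50), 200, dtype=np.uint8)
green_fluorescence = np.full((50, 50), 80, dtype=np.uint8)

# No blue signal was recorded, so pad with zeros of matching shape and dtype.
empty_blue = np.zeros_like(red_fluorescence)

# This list has the same form as the [r, g, b] list written with
# write_rgb=True earlier in this post.
channels = [red_fluorescence, green_fluorescence, empty_blue]
```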
<p>Furthermore, it can be a quick and dirty way of annotating regions of interest.
Say for example that we wanted to visualise how a segmentation using the Canny
edge detection algorithm followed by a binary filling of holes works in the
context of the raw data. This can be achieved using the code snippet below,
which was used to produce the image at the top of this post.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">skimage</span> <span class="kn">import</span> <span class="n">data</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">skimage.feature</span> <span class="kn">import</span> <span class="n">canny</span>
<span class="o">>>></span> <span class="kn">from</span> <span class="nn">scipy.ndimage</span> <span class="kn">import</span> <span class="n">binary_fill_holes</span>
<span class="o">>>></span> <span class="n">coins</span> <span class="o">=</span> <span class="n">data</span><span class="o">.</span><span class="n">coins</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">edges</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">canny</span><span class="p">(</span><span class="n">coins</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span> <span class="o">*</span> <span class="mi">255</span>
<span class="o">>>></span> <span class="n">filled</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">array</span><span class="p">(</span><span class="n">binary_fill_holes</span><span class="p">(</span><span class="n">edges</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint8</span><span class="p">)</span> <span class="o">*</span> <span class="mi">255</span>
<span class="o">>>></span> <span class="n">tiff</span> <span class="o">=</span> <span class="n">TIFF</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'canny-fill-holes-segmentation.tiff'</span><span class="p">,</span> <span class="s">'w'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">write_image</span><span class="p">([</span><span class="n">edges</span><span class="p">,</span> <span class="n">filled</span><span class="p">,</span> <span class="n">coins</span><span class="p">],</span> <span class="n">write_rgb</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>
Saving 16-bit tiff files using Python2015-02-13T00:00:00+00:00http://tjelvarolsson.com/blog/saving-16bit-tiff-files-using-python<p>When working with microscopy data it is not uncommon to encounter image
files that have 16-bit channels. This presents a difficulty when
working with Python as many imaging libraries struggle to save <code class="language-plaintext highlighter-rouge">numpy.uint16</code>
arrays.</p>
<p>To illustrate the problem let us create a white 50x50 pixel 16-bit image using
<code class="language-plaintext highlighter-rouge">numpy</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>
<span class="o">>>></span> <span class="n">ar</span> <span class="o">=</span> <span class="n">np</span><span class="o">.</span><span class="n">ones</span><span class="p">((</span><span class="mi">50</span><span class="p">,</span><span class="mi">50</span><span class="p">),</span> <span class="n">dtype</span><span class="o">=</span><span class="n">np</span><span class="o">.</span><span class="n">uint16</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">ar</span> <span class="o">=</span> <span class="n">ar</span> <span class="o">*</span> <span class="n">np</span><span class="o">.</span><span class="n">iinfo</span><span class="p">(</span><span class="n">np</span><span class="o">.</span><span class="n">uint16</span><span class="p">)</span><span class="o">.</span><span class="nb">max</span>
</code></pre></div></div>
<p><a href="http://www.pythonware.com/products/pil/">PIL</a>/<a href="https://pillow.readthedocs.org/">Pillow</a>
simply, and helpfully, raises a <code class="language-plaintext highlighter-rouge">TypeError</code>.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">PIL</span> <span class="kn">import</span> <span class="n">Image</span>
<span class="o">>>></span> <span class="n">img</span> <span class="o">=</span> <span class="n">Image</span><span class="o">.</span><span class="n">fromarray</span><span class="p">(</span><span class="n">ar</span><span class="p">)</span>
<span class="n">Traceback</span> <span class="p">(</span><span class="n">most</span> <span class="n">recent</span> <span class="n">call</span> <span class="n">last</span><span class="p">):</span>
<span class="o">...</span>
<span class="nb">TypeError</span><span class="p">:</span> <span class="n">Cannot</span> <span class="n">handle</span> <span class="n">this</span> <span class="n">data</span> <span class="nb">type</span>
</code></pre></div></div>
<p><a href="http://www.scipy.org/">SciPy</a> does save the file, but it converts it to 8-bit.
Personally I do not like this behaviour as it has caused me confusion on
several occasions when subsequent steps of the analysis have read the file and
tried to extract meaningful information from it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">import</span> <span class="nn">scipy.misc</span>
<span class="o">>>></span> <span class="n">scipy</span><span class="o">.</span><span class="n">misc</span><span class="o">.</span><span class="n">imsave</span><span class="p">(</span><span class="s">'scipy.tiff'</span><span class="p">,</span> <span class="n">ar</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">ar2</span> <span class="o">=</span> <span class="n">scipy</span><span class="o">.</span><span class="n">misc</span><span class="o">.</span><span class="n">imread</span><span class="p">(</span><span class="s">'scipy.tiff'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">ar2</span><span class="o">.</span><span class="n">dtype</span>
<span class="n">dtype</span><span class="p">(</span><span class="s">'uint8'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">ar2</span><span class="p">)</span>
<span class="mi">0</span>
</code></pre></div></div>
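<p>To see why a silent conversion to 8-bit is a problem, consider what any such
conversion has to do: squeeze 65536 possible intensities into 256. The sketch below keeps
only the high byte of each value. This is one possible conversion chosen for illustration
and not necessarily the one SciPy applies.</p>

```python
import numpy as np

ar16 = np.array([0, 255, 256, 511, 65535], dtype=np.uint16)

# One possible 16-bit to 8-bit conversion: keep only the high byte.
ar8 = (ar16 >> 8).astype(np.uint8)

print(ar16.tolist())  # [0, 255, 256, 511, 65535]
print(ar8.tolist())   # [0, 0, 1, 1, 255]

# Five distinct intensities have collapsed into three.
print(len(np.unique(ar16)), len(np.unique(ar8)))  # 5 3
```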
<h3 id="pylibtiff-to-the-rescue">PyLibTiff to the rescue</h3>
<p><a href="https://code.google.com/p/pylibtiff/">PyLibTiff</a> is a package that provides a
wrapper to the <a href="http://www.remotesensing.org/libtiff/">libtiff</a> library. To use
it simply make sure that you have the libtiff library installed on your system
and then you can use <code class="language-plaintext highlighter-rouge">pip</code> to install PyLibTiff. On a Debian-based system:</p>
<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">sudo </span>apt-get <span class="nb">install </span>libtiff-dev
<span class="nb">sudo </span>pip <span class="nb">install </span>libtiff
</code></pre></div></div>
<p>Now let us look at how to save a file using PyLibTiff.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="kn">from</span> <span class="nn">libtiff</span> <span class="kn">import</span> <span class="n">TIFF</span>
<span class="o">>>></span> <span class="n">tiff</span> <span class="o">=</span> <span class="n">TIFF</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'libtiff.tiff'</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">'w'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">write_image</span><span class="p">(</span><span class="n">ar</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
</code></pre></div></div>
<p>To show that everything is working as expected let us open the tiff file and
read in the image from it.</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">>>></span> <span class="n">tiff</span> <span class="o">=</span> <span class="n">TIFF</span><span class="o">.</span><span class="nb">open</span><span class="p">(</span><span class="s">'libtiff.tiff'</span><span class="p">,</span> <span class="n">mode</span><span class="o">=</span><span class="s">'r'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">ar</span> <span class="o">=</span> <span class="n">tiff</span><span class="o">.</span><span class="n">read_image</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">tiff</span><span class="o">.</span><span class="n">close</span><span class="p">()</span>
<span class="o">>>></span> <span class="n">ar</span><span class="o">.</span><span class="n">dtype</span>
<span class="n">dtype</span><span class="p">(</span><span class="s">'uint16'</span><span class="p">)</span>
<span class="o">>>></span> <span class="n">np</span><span class="o">.</span><span class="nb">max</span><span class="p">(</span><span class="n">ar</span><span class="p">)</span>
<span class="mi">65535</span>
</code></pre></div></div>
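<p>A check like the one above can be wrapped in a small helper and reused
whenever an image is written and read back. The helper name
<code class="language-plaintext highlighter-rouge">assert_round_trip</code> is made up for this sketch, and it uses only
NumPy, so it works regardless of which library did the saving.</p>

```python
import numpy as np


def assert_round_trip(original, reloaded):
    """Raise AssertionError if a reloaded image differs from the saved one."""
    assert reloaded.dtype == original.dtype, (
        "dtype changed from %s to %s" % (original.dtype, reloaded.dtype))
    assert np.array_equal(reloaded, original), "pixel values changed"


# A white 16-bit image, as used earlier in this post.
saved = np.full((50, 50), np.iinfo(np.uint16).max, dtype=np.uint16)

# An identical copy passes silently; an 8-bit version would raise.
assert_round_trip(saved, saved.copy())
```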
<h3 id="other-options">Other options</h3>
<p>Another option for working with 16-bit tiff files is
<a href="http://docs.opencv.org/trunk/doc/py_tutorials/py_tutorials.html">OpenCV-Python</a>.
I also believe that
<a href="http://www.lfd.uci.edu/~gohlke/code/tifffile.py.html">tifffile.py</a> can handle
them, although I have not tested this myself. The reason I prefer PyLibTiff
over these is that it can be installed into a virtual environment using pip.</p>