Tjelvar Olsson

Using relative paths in Linux scripts

2020-05-15T00:00:00+00:00

In the previous post I discussed the difference between absolute and relative paths.

So which is better, absolute or relative paths? Which one should be used when one needs to refer to a file in a script?

Let me prefix my answer with the caveat that all paths are a pain. However, absolute paths are more of a pain than relative paths. This is because absolute paths make it difficult to restructure the way that directories are organised on your computer. They also make it difficult for you to share your scripts with collaborators because they would need their computer to be structured in exactly the same way as yours. It is possible to get around some of these issues by using relative paths.

This post will show you how to create scripts that have a clear separation between raw and derived data using relative paths. As a bonus the scripts will also be more portable and less fragile with respect to reorganisations of your directory structure.

This post makes use of some more advanced Linux skills including the use of environment variables, the creation of a Bash script and adding execute permissions to the script. You don’t need to worry too much about these details if they are new to you. An environment variable is a means to store a piece of information for use later, a Bash script is a text file with commands to run, and execute permissions allow the script to be run by referencing its path. If you would like to learn more about these topics please let me know.

First of all, let us recap and get set up to where the previous post ended. We need two subdirectories, raw_data and scripts. The mkdir command below will create these if they do not already exist (the -p flag means that no errors are generated if the directories already exist).

$ mkdir -p raw_data scripts

We also need a file with raw data. The command below creates this file if it does not exist and overwrites it if it already exists.

$ echo "Raw data isn't baked data" > raw_data/raw_data.txt
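The overwriting behaviour of the > redirect can be checked with a minimal sketch like the one below. The temporary directory and the file name notes.txt are made up for illustration. The >> redirect is included for contrast because it appends to the file rather than replacing it.

```shell
# Work in a throwaway directory so that no real files are touched.
work_dir=$(mktemp -d)

# The > redirect creates the file if it does not exist...
echo "first" > "$work_dir/notes.txt"

# ...and replaces the content if it does.
echo "second" > "$work_dir/notes.txt"
overwritten=$(cat "$work_dir/notes.txt")

# The >> redirect appends to the file instead.
echo "third" >> "$work_dir/notes.txt"
line_count=$(wc -l < "$work_dir/notes.txt")

echo "$overwritten"
echo "$line_count"
```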

To illustrate the use of relative paths in scripts create a file named analysis.sh in your scripts directory, i.e. with the relative path ./scripts/analysis.sh, and copy and paste the code below into it.

#!/bin/bash

# Save the current working directory in an environment variable.
INITIAL_WORKING_DIRECTORY=$(pwd)

# This line changes the current working directory to where
# the analysis.sh file is. Stop if the directory change fails.
cd "$(dirname "$0")" || exit 1

# Create an environment variable with the relative path to the
# derived data directory.
DERIVED_DATA_DIRECTORY=../derived_data

# Create the derived data directory if it does not already exist.
mkdir -p "$DERIVED_DATA_DIRECTORY"

# This code streams the content of the raw data file into the sed
# stream editor. The sed stream editor is used to edit the content
# of the stream. Finally, the output of sed is redirected to a
# derived_data.txt file in the derived data directory.
cat ../raw_data/raw_data.txt  \
        | sed -e "s/Raw/Fudged/"  \
        | sed -e "s/isn't/is/"  \
        > "$DERIVED_DATA_DIRECTORY/derived_data.txt"

# Go back to where we were before changing into the
# scripts directory.
cd "$INITIAL_WORKING_DIRECTORY"

The code above works with relative paths. The paths are relative to the scripts directory. That means that the outcome of the script will be independent of the directory one is in when running the script, i.e. no nasty side effects of input files not being found or output files being written to the wrong directory.

To achieve this the script first makes a note of the directory you are currently in and stores it in the INITIAL_WORKING_DIRECTORY environment variable. The script then changes the working directory to be that of the analysis.sh script. At this point the script can start working with paths relative to the scripts directory.

The details of the analysis in this script do not really matter. It creates a directory for derived data (../derived_data) if it does not already exist. It then takes the raw data as input and transforms it before writing it to a file in the derived data directory.

Finally, and importantly, the script changes the working directory back to whatever it was before the script was invoked.
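If you want to convince yourself that the result does not depend on where the script is run from, the sketch below may help. It is not the analysis.sh script itself but a cut-down stand-in that uses the same cd "$(dirname "$0")" idea. The temporary directory and the variable names demo_dir and result are assumptions made for this illustration.

```shell
# Build a throwaway project in a temporary directory.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/raw_data" "$demo_dir/scripts"
echo "Raw data isn't baked data" > "$demo_dir/raw_data/raw_data.txt"

# Write a cut-down stand-in for analysis.sh. The quoted 'EOF' stops the
# shell from expanding $0 and $(...) while the file is being written.
cat > "$demo_dir/scripts/analysis.sh" <<'EOF'
#!/bin/bash
cd "$(dirname "$0")" || exit 1
mkdir -p ../derived_data
sed -e "s/Raw/Fudged/" -e "s/isn't/is/" ../raw_data/raw_data.txt \
    > ../derived_data/derived_data.txt
EOF
chmod +x "$demo_dir/scripts/analysis.sh"

# Run the script from an unrelated directory.
cd / || exit 1
"$demo_dir/scripts/analysis.sh"

# The output still ends up next to the scripts directory.
result=$(cat "$demo_dir/derived_data/derived_data.txt")
echo "$result"
```

Here the script is called from the root directory, yet the derived data is still written relative to the location of the script.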

To test the script, ensure that you are not in the scripts directory. In the command below I change into my home directory.

$ cd /home/olssont

In the above I’m using an absolute path to make it clear which directory I am referring to. Depending on your setup this path may be different. For clarity, I am referring to the directory in which you created the raw_data and scripts directories.

Before we run the script we need to use the chmod command to make it executable.

$ chmod +x ./scripts/analysis.sh

We can then run the script by calling its path.

$ ./scripts/analysis.sh

This will have created a directory called derived_data at the same level as the scripts directory.

$ ls
derived_data  raw_data  scripts

Let us also use the cat command to look at the content of the ./derived_data/derived_data.txt file.

$ cat ./derived_data/derived_data.txt
Fudged data is baked data

In a previous post about data management I talked about the need to keep raw data separate from derived data. In this post I have given you some tips on how you can accomplish this. Setting up scripts in the fashion outlined above also has the benefit that it is easier to rename and reorganise directories without your scripts breaking. Furthermore, it will make it easier for you to share your scripts with collaborators.
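The claim about renaming directories can be tried out with a rough sketch like the one below. It sets up a throwaway project with two hypothetical scripts, absolute.sh and relative.sh, renames the project directory and then checks which of the two still works. All names here are made up for the illustration.

```shell
# Build a throwaway project in a temporary directory.
base_dir=$(mktemp -d)
mkdir -p "$base_dir/project/raw_data" "$base_dir/project/scripts"
echo "Raw data isn't baked data" > "$base_dir/project/raw_data/raw_data.txt"

# A script that hard codes the absolute path to the raw data.
# The unquoted EOF means $base_dir is expanded when the file is written.
cat > "$base_dir/project/scripts/absolute.sh" <<EOF
#!/bin/bash
cat "$base_dir/project/raw_data/raw_data.txt"
EOF

# A script that uses a path relative to its own location.
cat > "$base_dir/project/scripts/relative.sh" <<'EOF'
#!/bin/bash
cd "$(dirname "$0")" || exit 1
cat ../raw_data/raw_data.txt
EOF
chmod +x "$base_dir/project/scripts/absolute.sh" "$base_dir/project/scripts/relative.sh"

# Reorganise: rename the project directory.
mv "$base_dir/project" "$base_dir/renamed_project"

# Record whether each script can still find the raw data.
if "$base_dir/renamed_project/scripts/absolute.sh" > /dev/null 2>&1; then
    absolute_status=works
else
    absolute_status=broken
fi
if "$base_dir/renamed_project/scripts/relative.sh" > /dev/null 2>&1; then
    relative_status=works
else
    relative_status=broken
fi
echo "absolute path script: $absolute_status"
echo "relative path script: $relative_status"
```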

Relative and absolute paths in Linux

2020-05-05T00:00:00+00:00

Paths are a topic that causes a lot of confusion for people who want to learn how to make use of the command line in Linux. In this post I will explain what paths are, and the difference between absolute and relative paths. By the end of this post you should be able to understand the diagram below.

The file system on Unix-like systems is built up like a tree starting from the root directory (/). One can view the content of the root directory by typing in the command ls /.

$ ls /
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root
run  sbin  srv  sys  tmp  usr  var

The term “path” refers to the name of a file or directory that can be used to uniquely identify it in the file system. For example, the command pwd prints the name of the current working directory, in this case my home directory.

$ pwd
/home/olssont

In the above /home/olssont is the path to the directory that I am currently in. To illustrate this further we can create a file in this directory. The command below creates a file named raw_data.txt.

$ echo "Raw data isn't baked data" > raw_data.txt

We can see the files in the working directory using the command ls.

$ ls
raw_data.txt

Depending on how many files you have in your working directory you may see more output from the command above.

To print the content of the file one can use the command cat.

$ cat raw_data.txt
Raw data isn't baked data

In the above the text raw_data.txt could be used to find the file that we just created. This was because the file was present in the working directory.

So how can we refer to a file if it is not present in the working directory? This is where the use of absolute and relative paths comes into play.

To illustrate this we will use the command mkdir to create a new directory called raw_data.

$ mkdir raw_data

Then we will use the command mv to move the file into that directory.

$ mv raw_data.txt raw_data/

Before illustrating the correct way to refer to the file let us see what happens if we use the same command as previously.

$ cat raw_data.txt
cat: raw_data.txt: No such file or directory

The cat command is no longer able to find the file (or rather the file is no longer there, because we moved it into the raw_data directory). There are two methods that you could use to refer to the file in the raw_data directory. The first is to use the absolute path, in this case /home/olssont/raw_data/raw_data.txt. Note that the absolute path to your file will be different if your username is different to mine and/or if you are not working in your home directory.

$ cat /home/olssont/raw_data/raw_data.txt
Raw data isn't baked data

The second method is to make use of a relative path, that means a path that is relative to your current working directory, in this case raw_data/raw_data.txt. The forward slash (/) is the symbol used to separate directories and files.

$ cat raw_data/raw_data.txt
Raw data isn't baked data

A different way to represent this relative path is to prepend it with ./, where the dot is a symbol that is used to represent the current working directory. Some people prefer this way because it makes it easier to see that it is a relative path.

$ cat ./raw_data/raw_data.txt
Raw data isn't baked data

To illustrate the concept of relative paths further we will create another directory called scripts.

$ mkdir scripts

Now we will change our working directory to be scripts using the cd (change directory) command.

$ cd scripts

Let us see what happens if we run the previous cat command again (you should be able to get back to this by using the up and down arrows on your keyboard to navigate through the command line history).

$ cat ./raw_data/raw_data.txt
cat: ./raw_data/raw_data.txt: No such file or directory

The command above fails because it expects to be able to find a directory named raw_data in the current working directory, and because we have moved into the scripts directory there is no such directory.

In order to make this work we need to be able to specify that we want to go up one level in the directory tree. This can be achieved using double dots, i.e. using the prefix ../. To refer to something two levels up in the directory tree one would use the prefix ../../, and so forth.

$ cat ../raw_data/raw_data.txt
Raw data isn't baked data
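One way to see what the ./ and ../ prefixes resolve to is to change into the directory they point at and print its absolute path with pwd. The sketch below does this in a throwaway directory. The variable names are made up for the illustration, and the parentheses run each cd in a subshell so that the working directory of the main shell is left alone.

```shell
# Recreate the raw_data and scripts directories in a temporary location.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/raw_data" "$demo_dir/scripts"
cd "$demo_dir/scripts" || exit 1

# ./ is the current working directory.
current=$(cd ./ && pwd)

# ../ is the parent of the current working directory.
parent=$(cd ../ && pwd)

# ../raw_data goes up one level and then down into raw_data.
sibling=$(cd ../raw_data && pwd)

echo "$current"
echo "$parent"
echo "$sibling"
```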

To summarise:

Paths are used to uniquely refer to files and directories
An absolute path starts at the top of the directory tree and includes all parent directories separated by slashes as well as the file or directory of interest. Examples of absolute paths include /home/olssont (a directory), /home/olssont/raw_data/raw_data.txt (a file)
Relative paths are used to refer to files and directories with respect to the current working directory
In a relative path the prefix ./ means the current working directory, the prefix ../ means the parent directory, and the prefix ../../ means the parent’s parent directory, and so forth

This should be enough to get you started with paths in the Linux command line. In the next post I will show how relative paths can be used to make scripts more portable, and how they can be used to improve your data management.

Homeworking: opportunities for scientists

2020-03-24T00:00:00+00:00

Covid-19 is causing worldwide chaos and it is terrible. However that is not what this post is about. This post is about staying positive and finding opportunities. Why? Amidst this chaos it is important to stay sane. You are no good to anyone if you become a nervous wreck. You need to stay strong to be able to help your family and friends.

My work is basically centered around helping people get more scientific value out of the computational resources we have available to us. Some of it is technical; working with computers. And some of it is softer; working with people. Just over a week ago I started working from home. The technical aspects of my work have been relatively easy to transition. The softer parts of my work still have some way to go. I’m learning more and more about different types of video conferencing software. To me this presents an opportunity because I have wanted to do more home working for a while. The current situation has accelerated that process and I’m keen to make sure I learn to work from home more efficiently.

Clearly it is easier to work from home if your research is mainly computational. However, if you are a bench scientist perhaps this presents new opportunities for you as well. Perhaps this presents an opportunity to learn more about computational approaches? Perhaps this is the time to learn R? If you are looking for something like this I have written a book to help you: The Biologist’s Guide to Computing (it is free).

Or perhaps this presents an opportunity to go over all your old data? (Or, heaven forbid, an opportunity to do some data management? I have also written software to make this easier: dtool. It is also free.) Going over old data with fresh eyes can sometimes lead to new insights and generate ideas for new manuscripts. Speaking of manuscripts perhaps this period of working at home presents an opportunity for you to finish up and submit those manuscripts that have been weighing on your mind for the past couple of years.

On a different note, schools closed in the UK on Monday and they are likely to remain that way for the next four months or so. I’ll therefore be one of the many parents juggling home schooling and working at the same time.

What opportunities present themselves here?

I’ll be spending much more time with my son. That means that I’ll have an opportunity to teach him about things where I have specialist knowledge. In particular I am hoping to teach him Python programming. There are lots of great books out there for teaching kids how to program. I’ve invested in a copy of Computer Coding Python Games for Kids, because I had a good experience with a similar book called Coding Games in Scratch from the same series.

I do not have the skill to do anything to help stop the spread of Covid-19. However, I do have the skills to help you develop your computational and data management expertise. Please get in touch if this is something that you would be interested in. Stay safe, stay sane and stay positive!

New journal for "Patterns" in data

2019-11-14T00:00:00+00:00

This summer I went to a Research Data Alliance meeting in London. The meeting was about persistent identifiers, and my goal was to meet like-minded people that care about data. During the meeting I got talking to Sarah Callaghan. Sarah described Patterns, the new data science journal she was setting up. After the meeting our discussion continued via email and Skype and Sarah asked me to become a member of the academic advisory board.

Data science is an emerging discipline that is gaining more and more attention both in academia and industry. It is a multi-disciplinary field and it is not limited to data analysis. It also includes topics such as data cleaning, computational infrastructure, as well as legal and policy aspects of data. This can present problems for academic researchers. How do you publish, and get credit for, work that you have done on developing data science infrastructure or policy?

In this interview Sarah describes how Cell Press are creating a new journal, called Patterns, to try to help alleviate this and other problems with knowledge sharing in data science. Sarah has a 20-year career in creating, managing, and analysing scientific data and she is Patterns’ Editor-in-Chief.

Tjelvar Olsson: What prompted the creation of Patterns?

Sarah Callaghan: Data are everywhere, and we’re producing more and more of it as time goes by. A lot of the time that data creation is intentional, like when a researcher designs and runs an experiment and collects the data. Sometimes that data creation is unintentional, like when a supermarket customer buys one brand instead of another. But regardless of how the data was created, there are uses for it, whether that’s in developing new science, or in figuring out how to market a new type of toothpaste.

One common trend across all the domains that create and manage data is this: everyone has common problems in dealing with data. Everyone, whether they’re an astronomer or a zoologist, has problems with data collection, cleaning, sharing with other researchers, understanding the legal and policy aspects of data, analysing it, and publishing it. And researchers in different domains have come up with solutions to those problems that work in their particular domain, but could also be usefully shared across domains.

That cross-disciplinary knowledge sharing is growing, but it’s not quite there yet. Patterns is all about providing a forum for researchers to share their data-related solutions, tools, methods and analyses across multiple domains. There is a lot of really exciting and innovative work out there that has not gotten out to the wider world – Patterns is here to help change that!

TO: That sounds fantastic. I certainly know the feeling of struggling to find out how others have tackled problems that I’m facing working with scientific infrastructure and data management. It would be great to be able to read about solutions and lessons learnt by others.

I’ve never met anyone who has set up a new journal before. What are the challenges involved in this?

SC: Lots! First and foremost, the main challenge is getting the word out. People can’t submit their articles to a journal that they don’t know exists. So (at the risk of going all marketing-speak) building a journal brand is important. That includes setting a scope and aims that will suit the journal audience, and recruiting an advisory board who will promote the journal in their own networks.

Getting people enthused and interested in the journal is also vital, especially in the case of Patterns, where we’re bringing together different communities into a new, more-inclusive group. Data are fundamental to research, regardless of what your domain is, so Patterns is bringing together computer scientists, data stewards and engineers, and researchers in data-intensive domains in order to share solutions and knowledge.

Commissioning papers for a new journal is also a challenge. Because Patterns is new, there can be a bit of convincing required to get authors to submit their articles to my journal, rather than a more established one. This is where Patterns’ cross-disciplinary focus and open access nature have added value – it allows researchers to reach readers outside their usual domains.

From a personal point of view, setting up a new journal means a lot of travelling to conferences and meetings, and even more talking to people about their research in order to commission papers (which to be fair, I do enjoy). And email. Lots of email!

TO: It sounds like you get to talk to lots of people about data science. I think this means that you have your finger on the pulse in this field. How do you think data science and management will develop over the next ten years?

SC: I think there’ll be a lot more of it, and there will be different variations in the roles and job titles associated with data. At the moment, a “data scientist” role can cover a wide range of skills and talents, and as a title, it means different things to different people.

I also think that we’re on the cusp of a change in the way that data is produced and dealt with. The closest analogy I can think of is the industrial revolution, where goods moved from being produced as piecework, done by individuals, to being produced in factories. Historically, with data, datasets have been hand created in their own formats by individuals or small groups. The landscape has moved to large scale data creation, and to deal with the issues that come out of that, you need things like infrastructure and standards to drive tools and services.

Academics aren’t the only ones doing research into data science – there is a lot of very interesting and exciting work being done in the business domain. I expect, in the next decade or so, that we will see more of the innovations developed by business to work with data rolled out more widely across research. This is already happening with advances in computer vision for example. And it’s only a little stretch to see how the same artificial intelligence network that can count people in a crowd could be repurposed to count antelope in a herd.

TO: What types of manuscripts should people be submitting to Patterns?

SC: I am always looking out for exciting, innovative original research where a data science solution has been applied to a problem in a research domain, and that solution has the potential to be applied to different domains too. The solution doesn’t have to be complete, in fact Patterns has developed a Data Science Maturity Level scale in order to help readers understand what stage the research is at.

Patterns also publishes descriptor articles – which are papers that describe a data science resource, whether that’s a dataset, piece of software, infrastructure, workflow, algorithm, even a piece of hardware. As long as the resource can be uniquely and unambiguously identified and is useful to the wider community, then an article about it is in scope. This allows the researchers who spend their time building, for example, infrastructures, to gain academic credit for their work.

I am also interested in opinion pieces and reviews on topics of interest to the community. Reviews can be on the literature around a certain topic in data science (e.g. GANs, blockchain, knowledge graphs, etc.) or can be on types of software and tools, highlighting their strengths, weaknesses and uses for the community.

Fundamentally, I want to publish interesting, exciting and innovative work that people from a wide range of domains want to read!

TO: Where can people find out more about Patterns and how to submit manuscripts to it?

SC: We have a very pretty and informative website at http://www.cell.com/patterns where you can find all the information needed by authors to write and submit their article. This includes details of the article types, and the aims and the scope of Patterns. There’s also the link on that page to the system where you can submit your manuscript, and also another link so that you can get the journal e-table of contents delivered free to your inbox when each issue is released.

We’re on Twitter too (@Patterns_CP) where we’ll be promoting our content and also sharing cool and interesting data science things (and pretty pictures of interesting patterns I come across when out and about).

And of course, if the readers of this interview have any other questions, or want to discuss whether or not their research is suitable for Patterns, then I’d be very glad to hear from you! My email is s.callaghan@cell.com

I’d just like to finish up by saying that the future for data science is bright – let’s make it together!

Packaging data and metadata using dtool

2019-03-19T00:00:00+00:00

Introduction

In the previous post I described four principles for effective data management.

Make it clear who is responsible for what
Keep raw data safe and separate from derived data
Standardise the location and structure of data
Provide metadata

Getting a research group together and discussing these principles can lead to a more coherent strategy for managing data. However, the fourth principle presents a challenge in that currently there is no perfect solution for associating metadata with data.

What is metadata anyway?

Metadata is data about data. Take, for example, an experiment comparing the expression profiles of different tissues in mouse. In this example the species, Mus musculus, is a key piece of metadata that needs to be recorded and associated with the data. Another key piece of metadata to make sense of the data is the tissue. In other words one would need to record the tissue associated with each expression profile. These types of metadata are called descriptive metadata. Without these descriptive metadata it would be impossible to draw any conclusions from the data.

When working with digital files one can also think of sizes and checksums of the files themselves as metadata. This type of metadata is called structural metadata. Structural metadata can be useful to ensure that files have not become corrupted. For example, sequencing companies typically provide MD5 checksums alongside the raw data files so that the downloaded sequence files can be verified to contain the expected content.
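As a minimal sketch of how such a check works, the commands below record an MD5 checksum for a file and then verify the file against it, both before and after the file has been altered. This assumes that the md5sum utility from GNU coreutils is available. The file name reads.fastq and its content are made up for the illustration.

```shell
# Work in a throwaway directory with a made-up data file.
work_dir=$(mktemp -d)
cd "$work_dir" || exit 1
echo "ACGTACGT" > reads.fastq

# Record the checksum, as a data provider might do.
md5sum reads.fastq > reads.fastq.md5

# Verify the untouched file against the recorded checksum.
if md5sum -c reads.fastq.md5 > /dev/null 2>&1; then
    intact_status=ok
else
    intact_status=corrupted
fi

# Simulate corruption by altering the file, then verify again.
echo "unexpected extra line" >> reads.fastq
if md5sum -c reads.fastq.md5 > /dev/null 2>&1; then
    altered_status=ok
else
    altered_status=corrupted
fi

echo "before alteration: $intact_status"
echo "after alteration: $altered_status"
```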

There is also a third type of metadata called administrative metadata. Administrative metadata is used to manage data as a resource. For example a UniProt identifier is a piece of administrative metadata used to manage a protein in the UniProt database.

Although metadata is essential for making sense of data, finding solutions for managing metadata can be difficult. In some cases metadata resides inside the heads of individuals. This is not a sustainable solution!

One strategy for associating metadata with files is to include the metadata in directory structures and file names. This takes the form of file paths along the lines of replicate_1/chitin/col0_leaf_1.tif. Here the file name tells us that the image is from leaf sample one from the Columbia-0 ecotype of A. thaliana. The directory structure encodes that this is replicate one and that the sample has been treated with chitin.

Using file names and directory structures to store metadata is better than keeping it in one’s head. However, it is also fragile in that the metadata can easily be lost if one moves or renames the file.
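The fragility can be illustrated with a small sketch. The helper function treatment_from_path below is hypothetical and assumes that the treatment is always the second component of the path. It gives the right answer for the original path, but once the file has been moved to a made-up to_sort directory the treatment can no longer be recovered.

```shell
# Hypothetical helper: assumes the treatment is the second path component.
treatment_from_path() {
    # Strip the first path component (the replicate directory).
    without_replicate=${1#*/}
    # Keep everything up to the next slash.
    echo "${without_replicate%%/*}"
}

original_path="replicate_1/chitin/col0_leaf_1.tif"
moved_path="to_sort/col0_leaf_1.tif"

# With the original directory structure the treatment is recovered.
original_treatment=$(treatment_from_path "$original_path")

# After the move the same logic returns the file name instead.
moved_treatment=$(treatment_from_path "$moved_path")

echo "$original_treatment"
echo "$moved_treatment"
```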

In this post I will describe our approach to overcoming this problem.

Executive summary

Our solution was to develop dtool, a utility to package metadata with data and treat the two as a unified whole. In dtool terminology the packaged data and metadata is referred to as a dataset.

A dataset can be likened to a box with items in it and a label on it describing its content. The items in the box are the data and the label the metadata.

There are several benefits to this approach, some of which only become apparent once one has spent some time using dtool to manage data. However, in brief, dtool prevents accidental loss of metadata when moving data around. It also enables researchers to store and work with data in a variety of storage solutions, and it has built-in support for verifying the integrity of a dataset. In other words dtool automates a lot of tedious work associated with data management, and gives researchers peace of mind that their data are safe and secure.

The dtool software was recently published in PeerJ: Lightweight data management with dtool.

The hairy details

At its core dtool is a command line utility (with a Python API) that can be used to create and interact with datasets.

First of all one needs to install the software. This can be done using the Python package installer pip.

$ pip install dtool

dtool can be used to retrieve and display the descriptive metadata of a dataset. In the example below the URL refers to a dataset hosted in the cloud.

$ dtool readme show http://bit.ly/Ecoli-ref-genome
description: U00096.3 genome with Bowtie2 indices
organism: Escherichia coli str. K-12 substr. MG1655
accession_id: U00096.3
link: https://www.ebi.ac.uk/ena/data/view/U00096.3
index_builder: bowtie2-build version 2.3.3
index_build_cmd: bowtie2-build U00096.3.fasta reference

From this metadata one can discern that this dataset contains an E. coli reference genome with Bowtie2 indices.

Using dtool it is possible to list the files in the dataset.

$ dtool ls http://bit.ly/Ecoli-ref-genome
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74  reference.2.bt2
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf  reference.3.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e  reference.1.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc  U00096.3.fasta
37e2d68bb38271036d96b6979d24666e0d4fd814  reference.rev.1.bt2
23ebd7cd21a905d5f255919ca1d0491901cb8718  reference.4.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404  reference.rev.2.bt2

The output above lists identifiers and the relative paths of all the files in the dataset. In dtool terminology the files in a dataset are referred to as items.

It is also possible to get administrative and structural metadata from a dataset. This can be achieved using the dtool summary command.

$ dtool summary http://bit.ly/Ecoli-ref-genome
name: Escherichia-coli-ref-genome
uuid: 8ecd8e05-558a-48e2-b563-0c9ea273e71e
creator_username: olssont
number_of_items: 7
size: 18.8MiB
frozen_at: 2018-09-26

From this one can see, amongst other things, that the data is 18.8MiB in size and that it has been given the Universally Unique Identifier (UUID) 8ecd8e05-558a-48e2-b563-0c9ea273e71e.

This particular dataset can be useful if one has E. coli RNA sequencing data that one wants to align using Bowtie2. However, in order to make use of the dataset one needs to download it from the cloud to the local filesystem. In the example below a directory for storing datasets is created, and dtool is used to download the dataset into this directory.

$ mkdir datasets
$ dtool cp -q http://bit.ly/Ecoli-ref-genome datasets/
file:///Users/olssont/datasets/Escherichia-coli-ref-genome

The command above achieved a lot. It downloaded all the data and metadata from a dataset stored in the cloud, in an Amazon S3 bucket to be precise, and reconstructed the dataset on local disk. Note that this involved working with two different storage technologies, both S3 object storage and filesystem.

All the commands that we have been using on the dataset hosted in the cloud work the same on the dataset stored on the local filesystem.

$ dtool readme show datasets/Escherichia-coli-ref-genome
description: U00096.3 genome with Bowtie2 indices
organism: Escherichia coli str. K-12 substr. MG1655
accession_id: U00096.3
link: https://www.ebi.ac.uk/ena/data/view/U00096.3
index_builder: bowtie2-build version 2.3.3
index_build_cmd: bowtie2-build U00096.3.fasta reference

The structure of the dataset on the local filesystem is shown below.

$ tree datasets/Escherichia-coli-ref-genome
datasets/Escherichia-coli-ref-genome
├── README.yml
└── data
    ├── U00096.3.fasta
    ├── reference.1.bt2
    ├── reference.2.bt2
    ├── reference.3.bt2
    ├── reference.4.bt2
    ├── reference.rev.1.bt2
    └── reference.rev.2.bt2

1 directory, 8 files

From the above we can see that the data files are stored in a subdirectory named data. The descriptive metadata is stored in the README.yml file.

$ cat datasets/Escherichia-coli-ref-genome/README.yml
description: U00096.3 genome with Bowtie2 indices
organism: Escherichia coli str. K-12 substr. MG1655
accession_id: U00096.3
link: https://www.ebi.ac.uk/ena/data/view/U00096.3
index_builder: bowtie2-build version 2.3.3
index_build_cmd: bowtie2-build U00096.3.fasta reference

On the filesystem the data and metadata are stored in files. Furthermore, the metadata files are plain text and make use of open standards. This makes it possible to read and understand them without the need for specialised tools.
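To give a feel for what this means in practice, the sketch below pulls a single field out of a README.yml file using nothing but sed. The file is recreated in a temporary directory with a few of the lines shown above so that the example is self-contained. It relies on the metadata being simple key: value lines, which is an assumption that may not hold for more deeply nested YAML.

```shell
# Recreate part of the README.yml in a throwaway directory.
work_dir=$(mktemp -d)
cat > "$work_dir/README.yml" <<'EOF'
description: U00096.3 genome with Bowtie2 indices
organism: Escherichia coli str. K-12 substr. MG1655
accession_id: U00096.3
EOF

# Print only the value of the organism field.
organism=$(sed -n 's/^organism: //p' "$work_dir/README.yml")
echo "$organism"
```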

In this section three important features of dtool have been highlighted:

The dtool command line interface can be used to inspect a dataset’s metadata allowing one to understand the content of the dataset.
When copying a dataset with dtool both the data and the metadata are copied across. This means that it is possible to copy datasets, for example to long-term storage systems, without fear of losing metadata.
dtool supports several storage systems including both filesystem and Amazon S3 object storage. This makes it possible to copy datasets between different storage systems without having to learn the specifics (and quirks) of the various storage systems.

Creating a dataset

So far the use and benefits of dtool have been illustrated using an existing dataset. Now we will go through the process of creating a dataset.

The creation of a dataset happens in three stages:

One creates a “proto” dataset that one can add data and metadata to
One adds the data and metadata to the proto dataset
One converts the proto dataset into a dataset by “freezing” it

This can be likened to creating an open box (the proto dataset), putting items (data) into it, sticking a label (metadata) on it, and closing the box (freezing the dataset).

Now we will create a minimal dataset containing a single file with the content Hola Mundo. The command below creates a dataset named hello in the datasets directory.

$ dtool create hello datasets/
Created proto dataset file:///Users/olssont/datasets/hello
Next steps:
1. Add raw data, eg:
   dtool add item my_file.txt file:///Users/olssont/datasets/hello
   Or use your system commands, e.g:
   mv my_data_directory /Users/olssont/datasets/hello/data/
2. Add descriptive metadata, e.g:
   dtool readme interactive file:///Users/olssont/datasets/hello
3. Convert the proto dataset into a dataset:
   dtool freeze file:///Users/olssont/datasets/hello

Now we add a file named greeting.txt to the proto dataset.

$ echo "Hola Mundo" > datasets/hello/data/greeting.txt

There are several ways to add descriptive metadata to a dataset. Below we make use of dtool’s built-in template to interactively prompt for metadata to describe the dataset.

$ dtool readme interactive datasets/hello
description [Dataset description]: Hello World greeting in Spanish
project [Project name]: dtool demo
confidential [False]:
personally_identifiable_information [False]:
name [Tjelvar Olsson]:
email [tjelvar.olsson@dtool-solutions.com]:
username [olssont]:
Updated readme
To edit the readme using your default editor:
dtool readme edit file:///Users/olssont/datasets/hello

Finally, we need to convert the proto dataset into a dataset by freezing it.

$ dtool freeze datasets/hello
Generating manifest  [####################################]  100%  greeting.txt
Dataset frozen file:///Users/olssont/datasets/hello

Congratulations, you have just created your first dtool dataset!

Validating the integrity of a dataset

The dtool freeze command generates a manifest containing structural metadata. In the manifest each file in the data directory is given an identifier that is the SHA1 checksum of the file’s relative path in the data directory. The identifiers are used to create one record for each data item containing the file’s relative path, size, checksum and timestamp. Below is the content of the manifest file.

$ cat datasets/hello/.dtool/manifest.json
{
  "dtoolcore_version": "3.8.0",
  "hash_function": "md5sum_hexdigest",
  "items": {
    "0ce56d0a6e9baa0c5d170001592c9b9c65d19276": {
      "hash": "b4b9e397fb7e08bfeaa54090d2989e53",
      "relpath": "greeting.txt",
      "size_in_bytes": 11,
      "utc_timestamp": 1551631241.827989
    }
  }
} 

This information can be used to verify the integrity of the dataset by checking that the expected items are present and that they have the correct size and content.
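To get a feel for where the values in such a record come from, here is a minimal sketch using Python's hashlib module. It is not dtool's implementation: the function names are hypothetical, and I am assuming that the relative path is encoded as UTF-8 before it is hashed. The timestamp is left out for brevity.

```python
import hashlib


def item_identifier(relpath):
    """Return the SHA1 hex digest of a relative path (UTF-8 assumed)."""
    return hashlib.sha1(relpath.encode("utf-8")).hexdigest()


def item_record(relpath, content):
    """Build a manifest-style record for an item held in memory as bytes."""
    return {
        "hash": hashlib.md5(content).hexdigest(),
        "relpath": relpath,
        "size_in_bytes": len(content),
    }


relpath = "greeting.txt"
content = b"Hola Mundo\n"  # What echo "Hola Mundo" writes, newline included.

identifier = item_identifier(relpath)
record = item_record(relpath, content)
print(identifier)
print(record)
```

Note that the size is 11 bytes: the ten characters of the greeting plus the newline added by echo.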

Using dtool this type of integrity check can be performed using the dtool verify command.

$ dtool verify --full datasets/hello
All good :)

In the command above we use the --full flag to include the step to compute and compare the checksum. Only item identifiers and sizes are verified by default as computing checksums can be time consuming for datasets that contain lots of large files.

We can simulate data corruption by editing the data/greeting.txt file in the dataset.

$ echo "Bonjour le Monde" > datasets/hello/data/greeting.txt

The data/greeting.txt file no longer contains the expected content; it has been corrupted. Let’s see the output of the dtool verify command.

$ dtool verify --full datasets/hello
Altered item size: 0ce56d0a6e9baa0c5d170001592c9b9c65d19276 greeting.txt
Altered item hash: 0ce56d0a6e9baa0c5d170001592c9b9c65d19276 greeting.txt

In the above the content of the hello/data directory is compared against the expected content stored in the manifest. In this case both the file size and checksum of the greeting.txt file are different and this is reported back to the user.
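The comparison described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how dtool verify is implemented. The function name verify_item is hypothetical and the messages simply mimic the output shown above. The sketch reads the whole file into memory, which is fine for a greeting but not for large sequencing files.

```python
import hashlib
import os
import tempfile


def verify_item(data_dir, record):
    """Return a list of problems found for one manifest-style record."""
    path = os.path.join(data_dir, record["relpath"])
    if not os.path.isfile(path):
        return ["Missing item"]
    with open(path, "rb") as fh:
        content = fh.read()
    problems = []
    if len(content) != record["size_in_bytes"]:
        problems.append("Altered item size")
    if hashlib.md5(content).hexdigest() != record["hash"]:
        problems.append("Altered item hash")
    return problems


# Set up a temporary data directory with the original greeting.
data_dir = tempfile.mkdtemp()
path = os.path.join(data_dir, "greeting.txt")
original = b"Hola Mundo\n"
with open(path, "wb") as fh:
    fh.write(original)

# Record the expected size and checksum, as a manifest would.
record = {
    "relpath": "greeting.txt",
    "size_in_bytes": len(original),
    "hash": hashlib.md5(original).hexdigest(),
}
before = verify_item(data_dir, record)

# Simulate corruption by overwriting the file.
with open(path, "wb") as fh:
    fh.write(b"Bonjour le Monde\n")
after = verify_item(data_dir, record)

print(before)
print(after)
```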

DISCUSSION

In this post I have shown how one can use dtool to package data and metadata into a unified whole. Using dtool to manage data provides several benefits:

It prompts people to add metadata to describe their data, making the data more reusable
It standardises the structure of the metadata, making it easier to access the metadata
It makes it possible to verify the integrity of a dataset, providing peace of mind that data is intact
It makes it possible to copy a dataset without fear of losing metadata
It makes it possible to copy a dataset between different types of storage systems, e.g. from filesystem to Amazon S3 object storage

There are several aspects of dtool this post did not go into. For example, it is possible to customise the template used to prompt for descriptive metadata. This, and other more advanced topics, will be the topics of future blog posts.

If you are keen to find out more about dtool I suggest having a look at the paper Lightweight data management with dtool and the dtool documentation.

If you have made it this far you deserve a lollipop!

Data management for biologists

2019-02-26T00:00:00+00:00

Introduction

Data management is a great challenge in the biological sciences and discussing it is often difficult because it is a multi-faceted problem and the term “data management” often means different things to different people.

Over the past couple of years I have become more and more involved in helping biological research groups manage their data. The first step in this process is typically a meeting that gathers all the members of the research group or project with the aim of getting people onto the same page.

During these sessions the participants are often surprised to find out how different their points of view are from those of other people in the group. There is typically a split along two axes. One axis is project leaders vs group members: the former are more concerned with the long term safety and viability of the data produced by the group, and the latter with the limitations of the tools available for them to do their day to day work. The other axis is experimental biologists vs bioinformaticians: the former have pain points around managing distributed versions of Word and Excel files, and the latter struggle with having enough storage quota on the computer cluster to analyse the high-throughput sequencing data produced by the group.

Having mediated many such meetings it has become clear that there is not one solution that fits all. Each research group and project has its own quirks and the members need to find a solution that works for them. However, there are some general principles that can help guide a group towards a more consistent and coherent way of managing their data.

In this post I’d like to share these guiding principles that I use to mediate these types of group data management sessions.

Principle 1: Make it clear who is responsible for what

In terms of data management, responsibilities are often implicitly assumed to be with someone else. Let’s illustrate this with an example.

Ambitious Anna is an established group leader who has started making more and more use of next generation sequencing. Two of the people in Ambitious Anna’s group are Fastidious Fatima and Binary Beatrice. Fastidious Fatima, an experimental biology post doc, prepares a large batch of samples and sends it off for sequencing with Nebulous New Sequencing Ltd. After a month of waiting Nebulous New Sequencing sends Fastidious Fatima an email with instructions for how to download her 100GB of sequencing data. Fastidious Fatima is busy preparing more samples and she asks Binary Beatrice, the group bioinformatician, to download the data. Binary Beatrice is happy to help, particularly as she needs to process the data anyway.

Ambitious Anna, the group leader, has touched neither the experimental samples nor the raw data produced by Nebulous New Sequencing Ltd. So Ambitious Anna implicitly assumes that the people in her group are managing the data.

Fastidious Fatima, the experimental biologist, has a record of her experimental work and samples in her lab notebook. However, since she did not download or process the sequencing data produced by Nebulous New Sequencing Ltd she assumes that Binary Beatrice and Ambitious Anna are managing that data.

Binary Beatrice, the bioinformatician, is overworked. As well as analysing data produced by Fastidious Fatima she also has another six experimental biologists to support. On top of this she needs to find another post-doc as her contract runs out in three months’ time. Binary Beatrice therefore thinks that it is Ambitious Anna’s job, as the group leader, to ensure that the data is managed properly.

In this contrived example all the actors implicitly assume that the management of data is somebody else’s responsibility.

Getting everyone into a room to discuss data management can help improve this situation. By explicitly stating who is responsible for what, data are less likely to fall between the cracks.

One may consider using the template below for assigning responsibilities.

Ultimately data management is the responsibility of the group leader. However, in practice the group leader is unlikely to be working with data on a day to day basis so he or she needs to delegate this responsibility to a data champion. The data champion then becomes responsible for ensuring that the existing and new members of the group are aware of the group data management processes.

Principle 2: Keep raw data safe and separate from derived data

Most researchers are aware that they should keep their data safe by backing it up. If possible it is also worth protecting raw data by making it read only. This means that you cannot accidentally delete or modify it. More good suggestions on this topic can be found in Ten Simple Rules for Digital Data Storage.
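As a rough illustration of what making raw data read only can look like on Linux or macOS, the sketch below removes the write permission bits from a file using Python's standard library. The file name and the helper make_read_only are hypothetical. Bear in mind that this only guards against accidents: the owner can add the permission back, and whether a file can be deleted also depends on the permissions of the directory that contains it.

```python
import os
import stat
import tempfile


def make_read_only(path):
    """Remove write permission for user, group and others."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    write_bits = stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH
    os.chmod(path, mode & ~write_bits)


# Create a small stand-in for a raw data file in a temporary directory.
raw_data_dir = tempfile.mkdtemp()
raw_file = os.path.join(raw_data_dir, "reads.fastq")
with open(raw_file, "w") as fh:
    fh.write("@read1\nACGT\n+\nIIII\n")

make_read_only(raw_file)

# Check whether the owner still has write permission.
is_writable = bool(os.stat(raw_file).st_mode & stat.S_IWUSR)
print(is_writable)
```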

However, here I would like to emphasize another point, the importance of keeping raw data separate from derived data.

Let’s illustrate this with another story. Once upon a time Binary Beatrice was making the transition from experimental biology to bioinformatics. She had got her first sequencing data and was eager to analyse it.

Binary Beatrice wanted to run a tool called The Latest & Greatest Aligner, which, after she had spent three weeks installing it, was ready for her to use. Half a year earlier, as preparation, Binary Beatrice had attended the institute’s cluster computing course and she had learnt how to write a batch submission script to submit jobs to the cluster. She therefore wrote such a batch submission script to run her Latest & Greatest Aligner. The Latest & Greatest Aligner needed to know where the data was so she put the batch submission script next to the raw data. That way Binary Beatrice did not have to worry about file paths (the bane of scientific computing).

To Binary Beatrice’s surprise The Latest & Greatest Aligner worked out of the box and produced great results. It also produced lots and lots of files. However, her analysis did not end there; she also had to run The Latest & Greatest Normaliser and The Greatest and Latest Plotter. These tools produced even more files.

Then something terrible happened. Binary Beatrice hit her storage quota and could not write any more files. At this point she had a directory filled with millions of files. Some of them were raw data, some of them were batch submission scripts, some of them were intermediate files and some of them were figures that she wanted to use in her paper.

Because all the derived file names were based on the names of the raw data files Binary Beatrice did not dare create an expression for deleting files in bulk. She therefore spent two weeks cleaning up her data.

At this point Binary Beatrice made a promise to herself to always keep raw data separate from derived data. In fact all her new projects have a structure with four directories: raw_data, scripts, intermediate_data, and final_data. When she hits her quota it is now easy for her to remove the files in the intermediate_data directory.
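A project skeleton like Binary Beatrice's can be created with a short script. The sketch below is one possible way of doing it in Python. The function names are hypothetical and the project is created in a temporary directory purely so that the example can be run safely. Because derived files live in their own directory, clearing them out becomes a single, safe operation.

```python
import os
import shutil
import tempfile

PROJECT_DIRS = ["raw_data", "scripts", "intermediate_data", "final_data"]


def create_project(project_root):
    """Create the four project directories if they do not already exist."""
    for name in PROJECT_DIRS:
        path = os.path.join(project_root, name)
        if not os.path.isdir(path):
            os.makedirs(path)


def clear_intermediate_data(project_root):
    """Delete everything in intermediate_data, leaving the directory empty."""
    path = os.path.join(project_root, "intermediate_data")
    shutil.rmtree(path)
    os.makedirs(path)


project_root = os.path.join(tempfile.mkdtemp(), "my_project")
create_project(project_root)
layout = sorted(os.listdir(project_root))
print(layout)

# Simulate an intermediate file and then free up the space.
tmp_file = os.path.join(project_root, "intermediate_data", "aligned.tmp")
with open(tmp_file, "w") as fh:
    fh.write("intermediate results\n")
clear_intermediate_data(project_root)
remaining = os.listdir(os.path.join(project_root, "intermediate_data"))
print(remaining)
```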

In the fictional example above Binary Beatrice learnt from her mistake immediately. This is not always the case. In real life many people ask to get their storage quota increased and don’t learn the lesson of separating raw data from derived data. Eventually, when these people leave the group, no one can work out what their raw/derived data is.

If you are interested in some practical tips on how to do this in Linux have a look at this post.

Principle 3: Standardise the location and structure of data

It is natural, and common, for PhD students and post docs to think of the data that they generate as their own. This tends to lead to a situation where the data is organised per research group member. For example, Ambitious Anna might have a shared folder for her group and at the top level are the folders with names of the group members Fastidious-Fatima, Binary-Beatrice, etc. Fastidious Fatima then organises her work and her data in the Fastidious-Fatima folder and Binary Beatrice organises her work and her data in the Binary-Beatrice folder.

This is not necessarily a bad way to organise the group’s data. Group leaders sometimes find it easier to remember data based on who generated it. However, it is important to realise that (unless otherwise stated) the data generated when working in a research group does not belong to the individual generating it. The data belongs to the group leader. If this is not stated explicitly, and made clear within the group, it is easy for each member of the group to invent their own way of structuring the data within their own folder. When this happens files and data often become incomprehensible once the person who organised them leaves the group.

It is therefore highly recommended that the location and structure of data is standardised, and ideally that this standard is recorded in a document that can be read by everyone at the top level of the shared folder. If a data champion has been nominated it is his/her responsibility to ensure that this document is kept up to date and that the other members of the group know that they need to follow the standards for organising data outlined in this document.

I also highly recommend having a separate folder at the top level of the shared folder dedicated to storing raw data, see Principle 2 above. Below is an example that still gives individuals their own working space.

Ambitious-Anna
├── GROUP-MEMBERS
│   ├── Binary-Beatrice
│   └── Fastidious-Fatima
├── RAW-DATA
└── README.txt

Below is an example that structures work based on projects rather than individuals. This can be useful if more than one person is working on a project.

Ambitious-Anna
├── PROJECTS
│   ├── Cure-Cancer
│   └── Feed-The-World
├── RAW-DATA
└── README.txt

Obviously it is possible to mix and match according to need. However, it is useful to document the rationale of the structure and how it is intended to be used. In the examples above this information is recorded in the README.txt file.

Principle 4: Provide metadata

Metadata is a fancy term for information that puts data into context. For example, in a microscopy experiment the pixels captured are data and information about the experiment such as the magnification and the X/Y scales are metadata. More formally, metadata is data about data. Without metadata raw data is meaningless as it cannot be understood.

It is important to think about what metadata one needs to capture. Often this is closely linked to the design of the experiment. For example, if one is performing a time series study, it is important to associate the date/time with each data point.

This type of metadata is called descriptive metadata. Descriptive metadata is important as it allows data to be put into the context of a scientific question. For example, if one performs an RNA sequencing experiment to compare expression profiles in different tissues it is important to record which data are associated with which tissues.

When thinking about how to organise data it is worth thinking about how descriptive metadata should be recorded and associated with the data. This is a non-trivial problem. It is not uncommon for metadata to be stored in an individual’s memory. This is not a safe strategy! Another common approach is to store descriptive metadata in file names and directory structures. This is better, but is also fragile as it is easy to lose metadata when moving and/or renaming files.

Another type of metadata is structural metadata and includes things such as sizes and checksums of files. Structural metadata can be used to verify that the raw data files have not become corrupted. For example, sequencing companies typically provide MD5 checksums alongside the raw data files so that one can verify that the downloaded files contain the expected content.

This fourth principle is more complicated than the previous ones. Although it is easy to understand that metadata is important, there is currently not an easy way to bundle arbitrary metadata with files on disk. The poor man’s solution is to capture this metadata using some sort of directory structure. However, this is fragile and makes it difficult to add more metadata on an ad-hoc basis.

Furthermore, there is not really a neat solution for capturing structural metadata such as sizes and checksums of files. It is therefore rarely done within research groups. Ideally this is something that should be automated as it is not a productive use of researchers’ time to calculate and record these types of file properties.
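To give an idea of what automating this could look like, below is a minimal sketch that walks a directory and records the relative path, size and MD5 checksum of every file. It is not a recommendation of a particular tool: the function names and the record layout are my own assumptions, and the files are created in a temporary directory so that the example can be run as is. The checksum is calculated in chunks so that large files do not need to fit in memory.

```python
import hashlib
import json
import os
import tempfile


def md5sum(path, chunk_size=1024 * 1024):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    hasher = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def structural_metadata(data_dir):
    """Return one record per file below data_dir, sorted by relative path."""
    records = []
    for dirpath, _dirnames, filenames in os.walk(data_dir):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            records.append({
                "relpath": os.path.relpath(path, data_dir),
                "size_in_bytes": os.path.getsize(path),
                "md5": md5sum(path),
            })
    return sorted(records, key=lambda record: record["relpath"])


# Create two small stand-in files to have something to record.
data_dir = tempfile.mkdtemp()
for name, content in [("sample_1.txt", b"ACGT\n"), ("sample_2.txt", b"TTGACA\n")]:
    with open(os.path.join(data_dir, name), "wb") as fh:
        fh.write(content)

records = structural_metadata(data_dir)
print(json.dumps(records, indent=2))
```

Storing the resulting records next to the data, for example as a JSON file, makes it possible to check at a later date that nothing has gone missing or been altered.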

Discussion

Using these principles to mediate discussions about working practices can result in a much more coherent strategy to managing data.

The first three principles are relatively easy for a research group to get to grips with. They can be implemented by discussing how the group think things should be done and by coming to a mutual understanding and agreement on how data should be structured and organised.

The fourth principle highlights the importance of recording metadata. However, having metadata separate from data, for example in directory structures and file names, is fragile. The metadata can easily be lost when moving and renaming files. In the next post I will describe our solution to this problem.

Python for biologists

2016-10-13T00:00:00+00:00

Python is a high-level scripting language that is growing in popularity in the scientific community. It uses a syntax that is relatively easy to get to grips with and that encourages code readability.

This post aims to give you a flavour of what it feels like to work with Python. We will use Python to calculate the guanine-cytosine (GC) content of a DNA sequence. In the process you will also learn about some key aspects of programming namely variables, functions and loops.

Getting a flavour of Python

The most traditional way of working with Python is to write your code in a script and run it using the python command. For example, if you had your code in a file named analysis.py you could run it using the command below.

$ python analysis.py

However, there are other ways of interacting with Python.

These days so called “notebooks” are becoming more and more popular. They are used for creating and sharing documents that include explanations of code as well as code blocks that can be run interactively. Check out the Jupyter project for more details.

Python can also be run interactively in your terminal using its interactive mode.

To start Python in its interactive mode simply type python into your terminal.

$ python
Python 2.7.10 (default, Jul 14 2015, 19:46:27)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.39)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

This prints out information about the version of Python that is being used and how it was compiled before leaving you at the interactive prompt. In this instance I am using Python version 2.7.10.

The three greater than signs (>>>) represent the primary prompt into which commands can be entered.

>>> 1 + 2
3

There is also a secondary prompt that is represented by three dots (...). It is used as a continuation line.

>>> line = ">myseq1"
>>> if line.startswith(">"):
...     print(line)
...
>myseq1

The rest of this post will use this “interactive” Python format. You can try to follow along using Python running interactively in a terminal or using a Python notebook. You can access a Python notebook from Try Jupyter.

Variables

A variable is a means of storing a piece of information using a descriptive name. The use of variables is encouraged as it allows us to avoid having to repeat ourselves.

In Python variables are assigned using the equals sign.

>>> pi = 3.14

When naming variables being explicit is more important than being succinct. One reason for this is that you will spend more time reading your code than you will writing it. Avoiding the mental overhead of trying to understand what all the acronyms mean is a good thing. For example, suppose that we wanted to create a variable for storing the radius of a circle. Please avoid the temptation of naming the variable r, and go for the longer but more explicit name radius.

>>> radius = 1.5

Determining the GC count of a sequence

One feature of interest when examining DNA is the guanine-cytosine (GC) content. DNA with high GC-content is more stable than DNA with low GC-content.

Suppose that we had a string representing a DNA sequence.

>>> dna_string = "attagcgcaatctaactacactactgccgcgcggcatatatttaaatata"
>>> print(dna_string)
attagcgcaatctaactacactactgccgcgcggcatatatttaaatata

A string is a data type for representing text. As such it is not ideal for data processing purposes. In this case the DNA sequence would be better represented using a “list”, with each item in the list representing a DNA letter. A list, also known as an array, is a data structure representing a collection of elements with a specific order.

In Python we can convert a string into a list using the built-in list() function.

>>> dna_list = list(dna_string)
>>> print(dna_list)
['a', 't', 't', 'a', 'g', 'c', 'g', 'c', 'a', 'a', 't', 'c',
 't', 'a', 'a', 'c', 't', 'a', 'c', 'a', 'c', 't', 'a', 'c',
 't', 'g', 'c', 'c', 'g', 'c', 'g', 'c', 'g', 'g', 'c', 'a',
 't', 'a', 't', 'a', 't', 't', 't', 'a', 'a', 'a', 't', 'a',
 't', 'a']

Python’s list has got a method called count() that we can use to find out the counts of particular elements in the list.

>>> dna_list.count("a")
17

To find out the total number of items in a list one can use Python’s built-in len() function, which returns the length of the list.

>>> len(dna_list)
50

When using Python you need to be careful when dividing integers, because in Python 2 the default is to use integer division, i.e. to discard the remainder.

>>> 3 / 2
1

One can work around this by ensuring that at least one of the numbers is represented using floating point.

>>> 3 / 2.0
1.5

Warning: In Python 3 the behaviour of the division operator has been changed, and dividing two integers will result in true (floating point) division, i.e. 3 / 2 gives 1.5.
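If you want to be explicit about which type of division you are after, Python has a separate floor division operator (//) that discards the remainder in both Python 2 and Python 3. The short sketch below is written for Python 3. In Python 2 one can get the same behaviour for the / operator by adding the line "from __future__ import division" to the top of the script.

```python
# Written for Python 3.
true_division = 3 / 2    # The remainder is kept.
floor_division = 3 // 2  # The remainder is discarded.

print(true_division)
print(floor_division)
```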

One can convert an integer to a floating point number using Python’s built-in float() function.

>>> float(2)
2.0

We now have all the information required to calculate the GC-content of the DNA sequence.

>>> gc_count = dna_list.count("g") + dna_list.count("c")
>>> gc_frac = float(gc_count) / len(dna_list)
>>> 100 * gc_frac
38.0

Creating reusable functions

Suppose that we wanted to calculate the GC-content for several sequences. In this case it would be very annoying, and error prone, to have to enter the commands above into the Python shell manually for each sequence. Rather, it would be advantageous to be able to create a piece of code that could be called repeatedly to calculate the GC-content. We can achieve this using the concept of functions. In other words functions are a means for programmers to avoid repeating themselves.

Let us create a simple function that adds two items together.

>>> def add(a, b):
...     return a + b
...
>>> add(2, 3)
5

In Python functions are defined using the def keyword. Note that the def keyword is followed by the name of the function. The name of the function is followed by a parenthesized set of arguments, in this case the function takes two arguments a and b. The first line of the function definition ends with a colon.

The body of the function, in this example the return statement, needs to be indented. The standard in Python is to use four white spaces to indent code blocks. In this case the function body only contains one line of code. However, a function can include several indented lines of code.

Warning: Whitespace really matters in Python! If your code is not correctly aligned you will see IndentationError messages telling you that everything is not as it should be. You will also run into IndentationError messages if you mix white spaces and tabs.

Now we can create a function for calculating the GC-content of a sequence. As with variables explicit trumps succinct in terms of naming.

>>> def gc_content(sequence):
...     gc_count = sequence.count("g") + sequence.count("c")
...     gc_fraction = float(gc_count) / len(sequence)
...     return 100 * gc_fraction
...
>>> gc_content(dna_list)
38.0
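The function above assumes lower case letters and a sequence that is not empty. As a hedged variation, the sketch below works directly on a string (strings also have a count() method and a length), accepts upper or lower case letters, and raises a descriptive error for an empty sequence. The name gc_content_from_string is simply one that I have made up for this illustration.

```python
def gc_content_from_string(sequence):
    """Return the GC-content of a DNA string as a percentage."""
    if len(sequence) == 0:
        raise ValueError("Cannot calculate GC-content of an empty sequence")
    sequence = sequence.lower()  # Treat "G" and "g" the same.
    gc_count = sequence.count("g") + sequence.count("c")
    return 100 * float(gc_count) / len(sequence)


first_ten_bases = "ATTAGCGCAA"
result = gc_content_from_string(first_ten_bases)
print(result)
```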

List slicing

Suppose that we wanted to look at local variability in GC-content. To achieve this we would like to be able to select segments of our initial list. This is known as “slicing”, as in slicing up a salami.

In Python slicing uses a [start:end] syntax that is inclusive for the start index and exclusive for the end index. To illustrate slicing let us first create a list to work with.

>>> zero_to_five = ["zero", "one", "two", "three", "four", "five"]

To get the first two elements we therefore use 0 for the start index, as Python uses a zero-based indexing system, and 2 for the end index as the element from the end index is excluded.

>>> zero_to_five[0:2]
['zero', 'one']

Note that the start position for the slicing is 0 by default so we could just as well have written.

>>> zero_to_five[:2]
['zero', 'one']

To get the last three elements we use 3 for the start index and leave out the end index.

>>> zero_to_five[3:]
['three', 'four', 'five']

We can use list slicing to calculate the local GC-content measurements of our DNA.

>>> gc_content(dna_list[:10])
40.0
>>> gc_content(dna_list[10:20])
30.0
>>> gc_content(dna_list[20:30])
70.0
>>> gc_content(dna_list[30:40])
50.0
>>> gc_content(dna_list[40:50])
0.0

Loops

It can get a bit repetitive, tedious, and error prone specifying all the ranges manually. A better way to do this is to make use of a loop construct. A loop allows a program to cycle through the same set of operations a number of times.

In lower level languages while loops are common because they operate in a way that closely mimics how the hardware works. The code below illustrates a typical setup of a while loop.

>>> cycle = 0
>>> while cycle < 5:
...     print(cycle)
...     cycle = cycle + 1
...
0
1
2
3
4

In the code above Python moves through the commands in the while loop executing them in order, i.e. printing the value of the cycle variable and then incrementing it. The logic then moves back to the while statement and the conditional (cycle < 5) is re-evaluated. If true the commands in the while statement are executed in order again, and so forth until the conditional is false. In this example the print(cycle) command was called five times, i.e. until the cycle variable incremented to 5 and the cycle < 5 conditional evaluated to false.

However, when working in Python it is much more common to make use of for loops. For loops are used to iterate over elements in data structures such as lists.

>>> for item in [0, 1, 2, 3, 4]:
...     print(item)
...
0
1
2
3
4

In the above we had to manually write out all the numbers that we wanted. However, because iterating over a range of integers is such a common task Python has a built-in function for generating such lists.

>>> range(5)
[0, 1, 2, 3, 4]

So a typical for loop might look like the below.

>>> for item in range(5):
...     print(item)
...
0
1
2
3
4

The range() function can also be told to start at a larger number. Say for example that we wanted a list including the numbers 5, 6 and 7.

>>> range(5, 8)
[5, 6, 7]

As with slicing the start value is included whereas the end value is excluded.

It is also possible to alter the step size. To do this we must specify the start and end values explicitly before adding the step size.

>>> range(0, 50, 10)
[0, 10, 20, 30, 40]
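The output above is from Python 2, where range() returns a list. If you are following along using Python 3 you will instead get a range object that produces the numbers on demand. It can be used in a for loop in exactly the same way, and it can be converted to a list if you want to see the values, as sketched below.

```python
# Written for Python 3, where range() does not return a list.
steps = range(0, 50, 10)
steps_as_list = list(steps)

print(steps)
print(steps_as_list)
```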

We are now in a position where we can create a naive loop for calculating the local GC-content of our DNA.

>>> for start in range(0, 50, 10):
...     end = start + 10
...     print(gc_content(dna_list[start:end]))
...
40.0
30.0
70.0
50.0
0.0

Loops are really powerful. They provide a means to iterate over lots of items and as such to automate repetitive tasks.
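The loop above has the sequence length and the window size hard coded. As a small next step one could wrap the loop in a function that works out the ranges from its arguments. The sketch below is one way of doing this. The function name local_gc_content and its window_size argument are my own choices for this illustration. Note that the last window will be shorter than the others if the length of the sequence is not a multiple of the window size.

```python
def gc_content(sequence):
    """Return the GC-content of a sequence as a percentage."""
    gc_count = sequence.count("g") + sequence.count("c")
    return 100 * float(gc_count) / len(sequence)


def local_gc_content(sequence, window_size):
    """Return the GC-content of consecutive, non-overlapping windows."""
    values = []
    for start in range(0, len(sequence), window_size):
        end = start + window_size
        values.append(gc_content(sequence[start:end]))
    return values


dna_list = list("attagcgcaatctaactacactactgccgcgcggcatatatttaaatata")
local_values = local_gc_content(dna_list, 10)
print(local_values)
```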

Summary

I hope this post has given you a flavour of what it feels like to work with Python.

The key take home messages were:

You can explore Python’s syntax using its interactive mode
Variables and functions help us avoid having to repeat ourselves
When naming variables and functions explicit trumps succinct
Loops are really powerful; they form the basis of automating repetitive tasks

If you enjoyed this post please check out the book that I am working on, The Biologist’s Guide to Computing!

Biologist's Guide to Python string manipulation

2016-10-01T00:00:00+00:00

Because information about DNA and proteins is often stored in plain text files, many aspects of biological data processing involve manipulating text. In computing text is often referred to as strings of characters. String manipulation is therefore a common task both for processing biological sequences and for interpreting sequence identifiers.

This post provides a quick summary of how Python can be used for such string manipulation, using the FASTA description line as an example.

The Python string object

When reading in strings from a text file one often has to deal with lines that have leading and/or trailing white spaces. Commonly one wants to get rid of them. This can be achieved using the strip() method built into the Python string object.

>>> "  text with leading/trailing spaces ".strip()
'text with leading/trailing spaces'

Another common use case is to replace a word in a line. For example, when we strip out the leading and trailing white spaces one might want to update the word “with” to “without” to make the resulting string reflect its current state. This can be achieved using the replace() method.

>>> "  text with leading/trailing spaces ".strip().replace("with", "without")
'text without leading/trailing spaces'

In the example above we chain the strip() and replace() methods together. In practice this means that the replace() method acts on the return value of the strip() method.

Python’s string object also comes with a startswith() method. This can, for example, be used to identify FASTA description lines.

>>> ">MySeq1|description line".startswith(">")
True

The endswith() method complements the startswith() method and is often used to examine file extensions.

>>> "/home/olsson/images/profile.png".endswith("png")
True

The example above only works if the file extension is in lower case.

>>> "/home/olsson/images/profile.PNG".endswith("png")
False

However, we can overcome this issue by adding a call to the lower() method, which converts the string to lower case.

>>> "/home/olsson/images/profile.PNG".lower().endswith("png")
True

Another common use case is to search for a particular string within another string. For example one might want to find out if the UniProt identifier “Q6GZX4” is present in a FASTA description line. To achieve this one can use the find() method, which returns the index position (zero-based) where the search term was first identified.

>>> ">sp|Q6GZX4|001R_FRG3G".find("Q6GZX4")
4

If the search term is not identified find() returns -1.

>>> ">sp|P31946|1433B_HUMAN".find("Q6GZX4")
-1

When iterating over lines in a file one often wants to split the line based on a delimiter. This can be achieved using the split() method. By default this splits on white space characters and returns a list of strings.

>>> "text without leading/trailing spaces".split()
['text', 'without', 'leading/trailing', 'spaces']

A different delimiter can be used by providing it as an argument to the split() method.

>>> ">sp|Q6GZX4|001R_FRG3G".split("|")
['>sp', 'Q6GZX4', '001R_FRG3G']
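The string methods described so far can be combined into a small helper for working with lines read from a FASTA file. The sketch below is only an illustration; the function name parse_description is hypothetical and it assumes that fields in the description line are separated by the | character.

```python
def parse_description(line):
    """Return the fields of a FASTA description line, or None for other lines."""
    line = line.strip()            # Drop leading/trailing white space.
    if not line.startswith(">"):   # Sequence lines do not start with ">".
        return None
    return line[1:].split("|")     # Remove the ">" and split on the delimiter.


print(parse_description("  >sp|Q6GZX4|001R_FRG3G\n"))
print(parse_description("AATCG"))
```

Here the slice line[1:] is used to drop the leading > character before splitting.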

There are many variations on the string operators described above. It is useful to familiarise yourself with the Python documentation on strings.

Regular expressions

Regular expressions can be defined as a series of characters that define a search pattern.

Regular expressions can be very powerful. However, they can be difficult to build up. Often it is a process of trial and error. This means that once they have been created, and the trial and error process has been forgotten, it can be extremely difficult to understand what the regular expression does and why it is constructed the way it is.

Warning: only use regular expressions as a last resort!

A good rule of thumb is to always try to use string operations to implement the desired functionality and only switch to regular expressions when the code implemented using these becomes more difficult to understand than the equivalent regular expression.

To use regular expressions in Python we need to import the re module. The re module is part of Python’s standard library. Importing modules in Python is achieved using the import keyword.

>>> import re

Let us store a FASTA description line in a variable.

>>> fasta_desc = ">sp|Q6GZX4|001R_FRG3G"

Now, let us search for the UniProt identifier Q6GZX4 within the line.

>>> re.search(r"Q6GZX4", fasta_desc)  # doctest: +ELLIPSIS
<_sre.SRE_Match object at 0x...>

There are two things to note here:

We use a raw string to represent our regular expression, i.e. the string prefixed with an r
The regular expression search() method returns a match object (or None if no match is found)

What is a “raw” string? In Python “raw” strings differ from regular strings in that the backslash \ character is interpreted literally. For example the regular string equivalent of r"\n" would be "\\n" where the first backslash is used to escape the effect of the second (remember that \n represents a newline). Raw strings were introduced in Python to make it easier to create regular expressions that rely heavily on the use of literal backslashes.

The index of the first matched character can be accessed using the match object’s start() method. The match object also has an end() method that returns the index of the last character + 1.

>>> match = re.search(r"Q6GZX4", fasta_desc)
>>> if match:
...     print(fasta_desc[match.start():match.end()])
...
Q6GZX4

In the above we make use of the fact that Python strings support slicing. Slicing is a means to access a subsection of a sequence. The [start:end] syntax is inclusive for the start index and exclusive for the end index.

>>> "012345"[2:4]
'23'

To see the merit of regular expressions we need to create one that matches more than one thing. For example a regular expression that could match all the patterns id0, id1, …, id9.

Now suppose that we had a list containing FASTA description lines with these types of identifiers.

>>> fasta_desc_list = [">id0 match this",
...                    ">id9 and this",
...                    ">id100 but not this (initially)",
...                    "AATCG"]
...

Note that the list above also contains a sequence line that we never want to match.

Let us loop over the items in this list and print out the lines that match our identifier regular expression.

>>> for line in fasta_desc_list:
...     if re.search(r">id[0-9]\s", line):
...         print(line)
...
>id0 match this
>id9 and this

There are two noteworthy aspects of the regular expression. Firstly, the [0-9] syntax means match any digit. Secondly, the \s regular expression meta character means match any white space character.

If one wanted to create a regular expression to match an identifier with an arbitrary number of digits one can make use of the * meta character, which causes the regular expression to match the preceding expression 0 or more times.

>>> for line in fasta_desc_list:
...     if re.search(r">id[0-9]*\s", line):
...         print(line)
...
>id0 match this
>id9 and this
>id100 but not this (initially)
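One consequence of "0 or more times" is that the expression above would also match an identifier with no digits at all. The sketch below contrasts * with the + meta character, which requires the preceding expression to match at least once. The list of lines is made up for illustration.

```python
import re

lines = [">id42 has digits", ">id has no digits", "AATCG"]

# "*" allows zero digits, so ">id " on its own is enough for a match.
star_matches = [line for line in lines if re.search(r">id[0-9]*\s", line)]

# "+" requires at least one digit between ">id" and the white space.
plus_matches = [line for line in lines if re.search(r">id[0-9]+\s", line)]

print(star_matches)
print(plus_matches)
```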

It is possible to extract specific pieces of information from a line using regular expressions. This uses a concept known as “groups”, which are indicated using parentheses. Let us try to extract the UniProt identifier from a FASTA description line.

>>> print(fasta_desc)
>sp|Q6GZX4|001R_FRG3G
>>> match = re.search(r">sp\|([A-Z0-9]*)\|", fasta_desc)

Note how horrible and incomprehensible the regular expression is!

It took me a couple of attempts to get this regular expression right as I forgot that | is a regular expression meta character that needs to be escaped using a backslash \.

The regular expression representing the UniProt identifier [A-Z0-9]* means match capital letters (A-Z) and digits (0-9) zero or more times (*). Note that the ranges inside the square brackets are not separated by a comma; a comma there would be matched as a literal comma. The UniProt regular expression is enclosed in parentheses. The parentheses denote that the UniProt identifier is a group that we would like access to. In other words, the purpose of a group is to give the user access to a section of interest within the regular expression.

>>> match.groups()
('Q6GZX4',)
>>> match.group(0)  # Everything matched by the regular expression.
'>sp|Q6GZX4|'
>>> match.group(1)
'Q6GZX4'

Note that there is a difference between the groups() and the group() methods. The former returns a tuple containing all the groups defined in the regular expression. The latter takes an integer as input and returns a specific group. However, confusingly group(0) returns everything matched by the regular expression and group(1) returns the first group; making the group() method appear as if it used a one-based indexing scheme.
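A regular expression can contain more than one group. As a sketch, the expression below captures both the identifier and the entry name that follows it; the variable names accession and entry_name are my own labels for the two fields rather than anything defined by the re module.

```python
import re

fasta_desc = ">sp|Q6GZX4|001R_FRG3G"

# Two groups: one for the identifier and one for the entry name.
match = re.search(r">sp\|([A-Z0-9]+)\|([A-Z0-9_]+)", fasta_desc)
if match:
    accession, entry_name = match.groups()  # Tuple with one item per group.
    print(accession)
    print(entry_name)
    print(match.group(2))  # The second group, same as entry_name.
```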

Finally, let us have a look at a common pitfall when using regular expressions in Python: the difference between the methods search() and match().

>>> print(re.search(r"cat", "my cat has a hat"))  # doctest: +ELLIPSIS
<_sre.SRE_Match object at 0x...>
>>> print(re.match(r"cat", "my cat has a hat"))  # doctest: +ELLIPSIS
None

Basically match() only looks for a match at the beginning of the string to be searched. For more information see the search() vs match() section in the Python documentation.
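The difference can be illustrated with a few more calls on the same string. In this sketch match() succeeds for a pattern found at the start of the string, and search() behaves like match() once the pattern is anchored with the ^ meta character.

```python
import re

text = "my cat has a hat"

at_start = re.match(r"my", text)             # "my" is at the beginning.
not_at_start = re.match(r"cat", text)        # "cat" is not at the beginning.
anchored_search = re.search(r"^cat", text)   # "^" anchors search() to the start.
plain_search = re.search(r"cat", text)       # search() scans the whole string.

print(at_start is not None)
print(not_at_start)
print(anchored_search)
print(plain_search.start())
```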

There is a lot more to regular expressions, in particular all the meta characters. For more information have a look at the regular expressions operations section in the Python documentation.

This blog post was adapted from a section in the book that I am working on: The Biologist’s Guide to Computing. Please check it out if you found this post useful!

Biologist's Guide to Computing - almost there

2016-09-24T00:00:00+00:00

Last week I announced the launch of the Biologist’s Guide to Computing website and the response was tremendous.

I've created a website for the book that I am working on: The biologist's guide to computing. Please spread the word https://t.co/zfsLikrsTq
— Tjelvar Olsson (@tjelvar_olsson) September 17, 2016

Thank you all!

I thought I’d take the opportunity to provide a status update.

The first draft of the book is finished. Yay! In its current form it has 47,000 words, spread over 195 pages, split into 14 chapters.

Although I call it the first draft, each chapter has undergone several revisions as I have received feedback from many of my colleagues, to whom I am very grateful. Thanks to Nadia Radzman, Sam Mugford and Anna Stavrinides for providing feedback on early versions of the initial chapters. Many thanks to Tyler McCleary for continued in depth feedback and suggestions for improvements. Thanks also to Nick Pullen for feedback and discussions on the data visualisation chapter. Finally, many thanks to Matthew Hartley for all the discussions and encouragement.

So what happens now?

I will go over the draft with a red pen and I’m sure that I will find plenty of things that need fixing.

Then I will make the book available more broadly under a Creative Commons licence to get more feedback.

I’m currently debating if I should try to get a publisher or if I should self publish. Does anyone have any thoughts or recommendations with regards to this?

If I go down the self publishing route I’d like to find a copy editor that has some grasp of both coding and biology. Does anyone know of such a person?

Finally, I am pondering how I can publicise the book. Please do help me spread the word by pointing people at the Biologist’s Guide to Computing website. The website has a newsletter sign up form, please do sign up to it. By signing up you encourage me to finish the project! Also, if you are a blogger and would consider writing a review of the book I will give you early access to it. Please do get in touch if you would be interested in this.

Taking the effort out of server configuration using Ansible

2016-03-06T00:00:00+00:00

This article was originally published in the NorDevCon 2016 conference programme.

Ansible is an IT automation tool that is growing in popularity. It is ideally suited for configuration management, i.e. automating the configuration of your development and production infrastructure.

Ansible is a relatively new addition to the “DevOps” arena (first released in 2012) and it has quite a different philosophy to some of the more well established players in the field. Most notably, it is “agent-less”; i.e. there is no need to have an “agent” pulling updates from a “master” configuration manager.

Ansible has been designed to be easy to use and it achieves this through two aspects of its architecture:

It uses a push based method to interact with the hosts (the machines to be configured)
It uses OpenSSH as its authentication method

What this means in practice is that you can install Ansible on your laptop and as long as you have set up password-less ssh to the machines you want to interact with you are ready to go. In other words no master, no databases, no services; no fighting the system that is meant to be making your life easier!

Listing your inventory

Ansible has the concept of an inventory where you list all of the hosts that you want to be able to interact with through Ansible. The inventory is a plain text file using an INI-like format. The default path to the inventory file is /etc/ansible/hosts. However, you can provide an alternative path as a command line argument using the -i option.

Below is an example hosts file that groups three web servers into a webservers group.

[webservers]
web1.example.com
web2.example.com
web3.example.com
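To make the structure of the file concrete, below is a rough Python sketch that reads such a file into a dictionary mapping group names to hosts. It is only an illustration of the INI-like layout and is not how Ansible itself parses inventories; the function name parse_inventory is hypothetical and it assumes that blank lines and lines starting with # can be ignored.

```python
def parse_inventory(text):
    """Map each [group] header to the list of host lines beneath it."""
    groups = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # Assumption: skip blank lines and comments.
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]      # Strip the square brackets.
            groups[current] = []
        elif current is not None:
            groups[current].append(line)
    return groups


inventory = """
[webservers]
web1.example.com
web2.example.com
web3.example.com
"""
print(parse_inventory(inventory))
```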

It is also possible to create aliases and specify host specific variables. Below is an example of an alias to enable Ansible to communicate with a Vagrant generated virtual machine.

testserver ansible_ssh_host=127.0.0.1 ansible_ssh_port=2222 ansible_ssh_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/default/virtualbox/private_key ansible_sudo=yes

Interacting with Ansible

There are two different ways of interacting with Ansible: ad-hoc commands and playbooks. Ad-hoc commands are useful for quick tasks that you don’t want to save for later, whereas playbooks provide a means to specify reproducible configuration recipes.

Ad-hoc commands are accessed using the ansible program and can be useful if you want to do something quickly. For example, suppose that there was a critical security patch for bash that you needed to apply to all your webservers rapidly.

ansible webservers -m apt -a "name=bash state=latest update_cache=yes"

There are a few things to note in the command above. The first argument (webservers) is the host group defined in the inventory. The -m apt option states that we want to use Ansible’s apt module, in other words our servers are running on a Debian based OS such as Ubuntu. Finally, the -a ... option specifies the arguments to supply, i.e. we want to install the latest version of bash.
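The quotes around the module arguments matter: they make the shell pass everything after -a as a single argument. The Python sketch below uses the standard library shlex module to show how the command line is split, and then breaks the module arguments into key/value pairs. It does not run Ansible; it only illustrates the structure of the command.

```python
import shlex

command = 'ansible webservers -m apt -a "name=bash state=latest update_cache=yes"'

# shlex.split() follows shell-like quoting rules.
args = shlex.split(command)
print(args)

# The last item holds the module arguments as space separated key=value pairs.
module_args = dict(pair.split("=", 1) for pair in args[-1].split())
print(module_args)
```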

It is worth noting that Ansible does not try to abstract away the layer of package management. As such there are separate modules for apt, yum, homebrew, etc. This is useful in that the functionality of Ansible modules is not limited by the constraint of a set of common features. For example in this case we make use of the update_cache option to run the equivalent of apt-get update before we try to install the latest version of bash.

Batteries included as Ansible modules

Ansible comes with a large number of built-in modules for configuring your systems. These include modules for package management, performing systems administration tasks, working with files, interacting with source control programs, and much more.

Reproducible configuration scripts using playbooks

Ansible playbooks allow you to create reproducible recipes for configuring your servers. Playbooks are written in YAML and are meant to be human readable. As YAML is simply a data serialisation language, a playbook can be thought of as “infrastructure as data”.

Below is a basic playbook for configuring a firewall using firewalld. In real life a playbook would be written to configure the entire system.

---

- name: configure a web server firewall using firewalld
  hosts: testserver
  
  tasks:
    - name: install firewalld 
      apt: name=firewalld state=present

    - name: ensure that firewalld is started and enabled at boot
      service: name=firewalld enabled=yes state=started

    - name: open up port 80 for tcp
      firewalld: port=80/tcp permanent=yes state=enabled
      notify: restart firewalld

  handlers:
    - name: restart firewalld
      service: name=firewalld state=restarted

There are a few things to note in the playbook above. The hosts that the playbook is applied to are defined in the hosts entry. In this case it is only run against the Vagrant testserver alias we defined in the inventory earlier. The apt module is used to install firewalld and the service module is used to ensure that firewalld is started and enabled at boot. Finally, we use Ansible’s firewalld module to open up port 80 for TCP traffic.

Note that this task makes use of the notify action to trigger the restart firewalld handler, which we define at the end of the playbook.
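Because a playbook is just data, the same information can be written down as a plain Python data structure. The sketch below mirrors the playbook above as a dictionary and checks that every handler named in a notify entry is actually defined. This check is my own illustration of the "infrastructure as data" idea and not something Ansible requires you to write.

```python
play = {
    "name": "configure a web server firewall using firewalld",
    "hosts": "testserver",
    "tasks": [
        {"name": "install firewalld",
         "apt": "name=firewalld state=present"},
        {"name": "ensure that firewalld is started and enabled at boot",
         "service": "name=firewalld enabled=yes state=started"},
        {"name": "open up port 80 for tcp",
         "firewalld": "port=80/tcp permanent=yes state=enabled",
         "notify": "restart firewalld"},
    ],
    "handlers": [
        {"name": "restart firewalld",
         "service": "name=firewalld state=restarted"},
    ],
}

# Handlers are looked up by name, so a typo in "notify" would go unnoticed.
handler_names = {handler["name"] for handler in play["handlers"]}
notified = {task["notify"] for task in play["tasks"] if "notify" in task}
missing = notified - handler_names
print(missing)
```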

What now?

There is much more to Ansible than what has been outlined here. The key is to build things up bit by bit. You don’t need to use every feature of Ansible to get a job done.

For more information about Ansible have a look at the Ansible documentation. If you liked this article you may also be interested in the other Ansible tutorials on this site, which illustrate the use of Ansible as a tool to install scientific software.

Biologist's Guide to Computing - a work in progress

2015-12-05T00:00:00+00:00

The reason there has been a bit of a radio silence on this blog for the past couple of months is that I have been spending most of my spare time working on a booklet about computing.

The booklet is intended for biologists that want to learn more about data analysis. It will provide an introduction to some fundamental aspects of computing required for learning scripting and programming. Furthermore, as well as outlining basic principles of programming it will introduce some best practices for keeping track of work and collaborating on projects.

These days many parts of the biological sciences are becoming more and more data driven. Technological advancements have led to a huge increase in the generation of biological data. Data analysis is required to extract biological insights from this data. To a large extent the rate limiting factor in generating biological insight is the lack of appropriate data analysis tools.

In these instances computers can be powerful allies. They are ideal for automating repetitive tasks. Furthermore, they can perform calculations and analysis that would be infeasible for the human brain alone.

The purpose of this booklet is not to provide a bundle of useful scripts and regular expressions. Rather, its purpose is to outline a more productive way of working that will make you a better scientist.

If this sounds interesting please encourage me to spend more time writing by spreading the word on Twitter and signing up for the monthly newsletter.

How to build a basic image viewer using FreeImage and SDL2

2015-10-10T00:00:00+00:00

In this blog post we will use FreeImage and SDL2 to create a basic image viewer in C. FreeImage is an open source library for working with image files. It supports over 30 file formats, gives access to meta-data and provides basic image manipulation routines. SDL (Simple DirectMedia Layer) is a cross-platform library which provides low level access to things like the keyboard and mouse as well as graphics hardware using OpenGL and Direct3D. SDL provides official support for Windows, Mac, Linux, iOS and Android.

By the end of this post we will have created a C program that can be used to view RGB and grayscale images from the command line.

Argument parsing

Let us start by adding some basic argument parsing. Add the code below to a file named see.c.

#include <stdlib.h>
#include <stdio.h>

char *parse_args_get_filename(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "Usage: %s FILENAME\n", argv[0]);
        exit(2);
    }
    char *filename = argv[1];
    return filename;
}

There are a few things going on in the above. We include the <stdlib.h> and <stdio.h> header files. The former provides the exit() function and the latter the fprintf() function.

We also define the function parse_args_get_filename() that returns a pointer to a char array, the char array that will hold our input file name. Note that the argc variable is an integer that holds the number of arguments supplied from the command line and argv is an array of pointers to char arrays holding the values of the strings supplied from the command line. In our program argv will contain two such pointers, one to the name of the program and one to the filename of the image we want to view.

Reading in the image using FreeImage

We will now use FreeImage to read in an image. Let us start by including the FreeImage header.

#include <stdlib.h>
#include <stdio.h>

#include <FreeImage.h>

Now let us add a function for creating a FreeImage bitmap from an image file. The function will return a pointer to the bitmap.

/** Initialise a FreeImage bitmap and return a pointer to it. */
FIBITMAP *get_freeimage_bitmap(char *filename) {
    FREE_IMAGE_FORMAT filetype = FreeImage_GetFileType(filename, 0);
    FIBITMAP *freeimage_bitmap = FreeImage_Load(filetype, filename, 0);
    return freeimage_bitmap;
}

Here we use the FreeImage_GetFileType() function to determine the image file type by analysing the bitmap signature. According to the FreeImage documentation the second parameter (size) is currently not in use and can be set to 0.

We then use the FreeImage_Load() function to initialise the bitmap from the input file name. The third parameter (flags) can be used to change the loading behaviour for certain file types. As we are not interested in using this functionality we can set it to 0.

Note that the FreeImage bitmap is flipped vertically with respect to the coordinate system used by SDL. We will deal with this in the next section.

Creating a SDL surface from a FreeImage bitmap

It is time to start using SDL. Let us therefore include the SDL header.

#include <stdlib.h>
#include <stdio.h>

#include <FreeImage.h>

#include <SDL2/SDL.h>

Now we will create a function that takes our FreeImage bitmap and returns a SDL_Surface (a structure containing pixel information). To achieve this we will make use of the SDL_CreateRGBSurfaceFrom() function.

/** Initialise a SDL surface and return a pointer to it.
 *
 *  This function flips the FreeImage bitmap vertically to make it compatible
 *  with SDL's coordinate system.
 *
 *  If the input image is in grayscale a custom palette is created for the
 *  surface.
 */
SDL_Surface *get_sdl_surface(FIBITMAP *freeimage_bitmap, int is_grayscale) {

    // Loaded image is upside down, so flip it.
    FreeImage_FlipVertical(freeimage_bitmap);

    SDL_Surface *sdl_surface = SDL_CreateRGBSurfaceFrom(
        FreeImage_GetBits(freeimage_bitmap),
        FreeImage_GetWidth(freeimage_bitmap),
        FreeImage_GetHeight(freeimage_bitmap),
        FreeImage_GetBPP(freeimage_bitmap),
        FreeImage_GetPitch(freeimage_bitmap),
        FreeImage_GetRedMask(freeimage_bitmap),
        FreeImage_GetGreenMask(freeimage_bitmap),
        FreeImage_GetBlueMask(freeimage_bitmap),
        0 
    );

    if (sdl_surface == NULL) {
        fprintf(stderr, "Failed to create surface: %s\n", SDL_GetError());
        exit(1);
    }

    if (is_grayscale) {
        // To display a grayscale image we need to create a custom palette.
        SDL_Color colors[256];
        int i;
        for(i = 0; i < 256; i++) {
            colors[i].r = colors[i].g = colors[i].b = i;
        }
        SDL_SetPaletteColors(sdl_surface->format->palette, colors, 0, 256);
    }

    return sdl_surface;
}

We start off by flipping the bitmap vertically. Since this is done in memory it is a side-effect of the function.

We then create the SDL surface using the SDL_CreateRGBSurfaceFrom() function, which (amongst others) takes as input the red, green and blue masks of the FreeImage bitmap. The functions for accessing these masks (FreeImage_GetRedMask(), etc) work even if the FreeImage bitmap comes from a single channel input image (gray scale). If the input image is in gray scale we therefore need to create a custom palette for it and associate this with the SDL surface that we have created.
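For readers more familiar with Python, the two ideas in this function can be sketched without any libraries. The code below is only a conceptual illustration and does not use FreeImage or SDL: the image is assumed to be a list of rows, the flip reverses the row order in place (which is why the caller sees the change), and the palette maps each of the 256 gray levels to an RGB triplet with equal red, green and blue values.

```python
def flip_vertical(rows):
    """Reverse the order of the rows in place, i.e. as a side effect."""
    rows.reverse()


def grayscale_palette():
    """Return 256 (r, g, b) entries where entry i is the gray level i."""
    return [(i, i, i) for i in range(256)]


image = [[1, 2], [3, 4], [5, 6]]
flip_vertical(image)   # The caller's list is modified; nothing is returned.
print(image)

palette = grayscale_palette()
print(palette[0], palette[128], palette[255])
```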

Creating a SDL window

Our image viewer will display the image in a SDL window. For the sake of minimalism (and simplicity) this will be a border-less window displayed in the centre of the screen.

/** Initialise a SDL window and return a pointer to it. */
SDL_Window *get_sdl_window(int width, int height) {
    if (SDL_Init(SDL_INIT_VIDEO) < 0) {
        fprintf(stderr, "SDL couldn't initialise: %s.\n", SDL_GetError());
        exit(1);
    }

    SDL_Window *sdl_window;
    sdl_window = SDL_CreateWindow( "Image",
        SDL_WINDOWPOS_CENTERED,
        SDL_WINDOWPOS_CENTERED,
        width,
        height,
        SDL_WINDOW_BORDERLESS);

    return sdl_window;
}

Rendering the surface as a texture in the window (a.k.a. displaying the image)

We now need some code to render the surface as a texture in the window. In the code we will do this back to front, by using the window to generate a renderer and using the renderer to generate a texture. Finally, the renderer is cleared before adding the texture and presenting it.

It is worth noting that a SDL_Texture is a structure that contains an efficient, driver-specific representation of pixel data, which means that, unlike a SDL_Surface, it can be processed by the GPU.

/** Display the image by rendering the surface as a texture in the window. */
void render_image(SDL_Window *window, SDL_Surface *surface) {
    SDL_Renderer* renderer = SDL_CreateRenderer(window, -1, 0);
    if ( renderer == NULL ) {
        fprintf(stderr, "Failed to create renderer: %s\n", SDL_GetError());
        exit(1);
    }

    SDL_Texture* texture = SDL_CreateTextureFromSurface(renderer, surface);
    if ( texture == NULL ) {
        fprintf(stderr, "Failed to load image as texture\n");
        exit(1);
    }

    SDL_RenderClear(renderer);
    SDL_RenderCopy(renderer, texture, NULL, NULL);
    SDL_RenderPresent(renderer);
}

Note that the third parameter of the SDL_RenderCopy() function (srcrect) is a pointer to the source rectangle and can be used to implement zooming. However, here we set it to NULL to display the entire texture.

Giving the user the chance to view the image

At this point we need some sort of event loop to make sure that the image does not vanish instantaneously after having been rendered in the window. Below is a simple event loop that ends when the user presses a key on the keyboard.

/** Loop until a key is pressed. */
void event_loop() {
    int done = 0;
    SDL_Event e;
    while (!done) {
        while (SDL_PollEvent(&e)) {
            if (e.type == SDL_KEYDOWN) {
                done = 1;
            }
        }
    }
}

Putting it all together

Finally we add the main logic of the code. This includes some functionality for checking if the input image is gray scale. We achieve this by checking if the FreeImage colour type is FIC_MINISBLACK. Other colour types include FIC_RGB and FIC_CMYK.

As gray scale images can be more than 8-bits (quite common when dealing with microscopy images) we make sure that we compress the data using the FreeImage_ConvertToGreyscale() function.

int main(int argc, char *argv[]) {
    char *filename = parse_args_get_filename(argc, argv);
    FIBITMAP *freeimage_bitmap = get_freeimage_bitmap(filename);

    int is_grayscale = 0;
    if (FreeImage_GetColorType(freeimage_bitmap) == FIC_MINISBLACK) {
        // Single channel so ensure image is compressed to 8-bit.
        is_grayscale = 1;
        FIBITMAP *tmp_bitmap = FreeImage_ConvertToGreyscale(freeimage_bitmap);
        FreeImage_Unload(freeimage_bitmap);
        freeimage_bitmap = tmp_bitmap;
    }

    int width = FreeImage_GetWidth(freeimage_bitmap);
    int height = FreeImage_GetHeight(freeimage_bitmap);
    SDL_Window *sdl_window = get_sdl_window(width, height);
    SDL_Surface *sdl_surface = get_sdl_surface(freeimage_bitmap, is_grayscale);

    render_image(sdl_window, sdl_surface);
    event_loop();

    FreeImage_Unload(freeimage_bitmap);
    SDL_FreeSurface(sdl_surface);
    return 0;
}

Note that we free up the dynamically allocated bitmap and surface memory using FreeImage_Unload() and SDL_FreeSurface() before we exit the program.

Compiling and linking

Now we can compile the code.

$ gcc -c see.c

This creates the object file see.o which contains machine code as well as information that allows a linker to find out which symbols (global objects, functions, etc) it requires in order to work.

Let us link our object file.

$ gcc -o see see.o -lfreeimage -lSDL2

This produces the executable file see, which we can test using the command below.

$  ./see image.png

Conclusion

FreeImage and SDL are useful C libraries for working with images and graphical user interfaces, respectively. In this post we have used the two in combination to create a basic image viewer that can parse over 30 image file formats and display both RGB and gray scale images correctly.

Acknowledgements

This blog post was inspired by and based on some of the code in the github.com/JIC-CSB/eye project.

How to continuously test your Python code on Windows using AppVeyor

2015-09-04T00:00:00+00:00

In the previous post I illustrated how to set up continuous integration testing of your Python code using Travis CI. Travis CI is great when working on Linux. However, what can you do if you want to set up automated continuous integration testing on Windows?

To me, a Linux enthusiast, this problem sounded almost insurmountable…

AppVeyor to the rescue

However, it turns out that AppVeyor has provided a service for solving this problem.

One simply needs to create an appveyor.yml file to configure the running of the test suite. The code below creates a testing matrix for running the test suite on 32-bit Python 2.7, 3.3 and 3.4 using the nosetests test runner.

build: false

environment:
  matrix:
    - PYTHON: "C:\\Python27"
      PYTHON_VERSION: "2.7.8"
      PYTHON_ARCH: "32"

    - PYTHON: "C:\\Python33"
      PYTHON_VERSION: "3.3.5"
      PYTHON_ARCH: "32"

    - PYTHON: "C:\\Python34"
      PYTHON_VERSION: "3.4.1"
      PYTHON_ARCH: "32"


init:
  - "ECHO %PYTHON% %PYTHON_VERSION% %PYTHON_ARCH%"

install:
  - "%PYTHON%/Scripts/pip.exe install nose"
  - "%PYTHON%/Scripts/pip.exe install coverage"

test_script:
  - "%PYTHON%/Scripts/nosetests"

Note that we use pip to install the nose and coverage packages before we run the test suite.

Commit and push this file and log in to AppVeyor using your GitHub account. Sync your GitHub repositories and then select the project you want AppVeyor to run continuous integration testing on.

Job done!

Using Miniconda to test projects that depend on the `numpy`/`scipy` stack

Again, testing projects that depend on numpy and scipy presents problems in that these packages take too long to build from scratch. However, just like in the previous post we can make use of Miniconda.

In fact the kind people at AppVeyor have already deployed Miniconda to their build workers (github.com/appveyor/ci/issues/359).

So to test a project that depends on numpy and scipy one can simply use the appveyor.yml file below.

build: false

environment:
  matrix:
    - PYTHON_VERSION: 2.7
      MINICONDA: C:\Miniconda
    - PYTHON_VERSION: 3.4
      MINICONDA: C:\Miniconda3

init:
  - "ECHO %PYTHON_VERSION% %MINICONDA%"

install:
  - "set PATH=%MINICONDA%;%MINICONDA%\\Scripts;%PATH%"
  - conda config --set always_yes yes --set changeps1 no
  - conda update -q conda
  - conda info -a
  - "conda create -q -n test-environment python=%PYTHON_VERSION% numpy scipy nose"
  - activate test-environment
  - pip install coverage

test_script:
  - nosetests

The script above installs numpy, scipy and nose using the Conda package manager. However, the Conda package manager does not contain the coverage package. We therefore install that using pip instead after the virtual environment has been activated.

The fact that Miniconda is included on the AppVeyor build workers makes it trivial to test Python code with scientific dependencies.

Great stuff!

Five steps to add the 'bling' factor to your Python package

2015-08-30T00:00:00+00:00

Introduction

In previous posts I have shown how to create a Python package.

We started by using Cookiecutter to generate a basic structure for our project. We then looked at how to set up and use clean development environments. This was followed by an outline of Python tools for testing and the implementation of the Python package using test-driven development. Finally we looked at how to generate beautiful technical documentation using Sphinx.

Now it is time to show off our hard work. In this post I will show you how to make use of cloud services to host your documentation, run continuous integration tests and distribute your package. Furthermore, we will add neat looking badges to the README file.

Step 1: Host the documentation on readthedocs

You have spent hours documenting your package using Sphinx. It is time to share it with the world. Register with readthedocs and sync your GitHub account with it. Then you can simply select the project that you want readthedocs to host documentation for.

If you are using Sphinx’s autodoc functionality and your package depends on numpy/scipy/matplotlib you may run into trouble as Readthedocs’ server may not be able to compile the C extensions. The first thing to try is to go into the advanced settings section of your project in Readthedocs’ web interface and make sure that the project is set to install into a virtual environment and that the option to “Give the virtual environment access to the global site-packages dir” is selected. The system packages now appear to include numpy, scipy, and matplotlib so this should go a long way. However, if you are still running into trouble you may need to mock out the dependencies.
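Mocking out the dependencies can be sketched as below. This is a minimal sketch, assuming Python 3 (where the mock library is available as unittest.mock) and that the lines are added near the top of the Sphinx conf.py file, before autodoc imports your package. The list of module names is an assumption and would need to match whatever your package actually imports.

```python
import sys
from unittest.mock import MagicMock

# Assumption: these are the modules that cannot be installed on the server.
MOCK_MODULES = ["numpy", "scipy", "matplotlib", "matplotlib.pyplot"]

for module_name in MOCK_MODULES:
    # Any later "import numpy" will now receive the mock object instead.
    sys.modules[module_name] = MagicMock()
```

With the mocks in place autodoc can import the package and read its docstrings, even though none of the mocked functionality would actually work.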

Step 2: Set up continuous integration testing on Travis CI

You have spent hours using test-driven development to create a solid Python package. It is time to automate the running of the test suite and to get automatic testing of the code on different versions of Python.

Sign into Travis CI using your GitHub account. Select the project that you want to test and add a .travis.yml file to the root of your code repository.

Below is a simple setup for testing a Python package with no dependencies on Python versions 2.7, 3.2, 3.3 and 3.4 using the nose test runner.

language: python
python:
  - "2.7"
  - "3.2"
  - "3.3"
  - "3.4"
script: nosetests

If your code includes dependencies on numpy and scipy things get a little bit trickier as Travis CI can time out trying to install these from source. The solution is to make use of Miniconda.

The .travis.yml file below is based on the template from the conda documentation and Dan Blanchard’s post Quicker Travis builds that rely on numpy and scipy using Miniconda. It installs Miniconda with numpy, scipy and nose and runs the test suite on Python 2.7, 3.3 and 3.4.

language: python
python:
  # We don't actually use the Travis Python, but this keeps it organized.
  - "2.7"
  - "3.3"
  - "3.4"
install:
  - sudo apt-get update
  # We do this conditionally because it saves us some downloading if the
  # version is the same.
  - if [[ "$TRAVIS_PYTHON_VERSION" == "2.7" ]]; then
      wget https://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh -O miniconda.sh;
    else
      wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -O miniconda.sh;
    fi
  - bash miniconda.sh -b -p $HOME/miniconda
  - export PATH="$HOME/miniconda/bin:$PATH"
  - hash -r
  - conda config --set always_yes yes --set changeps1 no
  - conda update -q conda
  # Useful for debugging any issues with conda
  - conda info -a
  - conda create -q -n test-environment python=$TRAVIS_PYTHON_VERSION numpy scipy nose
  - source activate test-environment
  - python setup.py install
# command to run tests
script: nosetests

Step 3: Calculate your code coverage using Codecov

As you have developed your code using test-driven development you have a high degree of code coverage. It is time to integrate the code coverage calculation into the Travis CI testing. We will use Codecov to do this.

Sign in using your GitHub account, sync your repos and add the project that you want to measure the code coverage for. Then edit the .travis.yml file to look like the below.

language: python
python:
  - "2.7"
  - "3.2"
  - "3.3"
  - "3.4"
script: nosetests
before_install:
  pip install codecov
after_success:
  codecov

Step 4: Upload your package to PyPI

You have developed a great Python package and it is time to share it with the world. This is done, most effectively, by uploading it to PyPI.

Peter Downs has written a great post explaining how to submit a package to PyPI.

Hosting your package on PyPI makes it easy for people to install using pip.

Step 5: Add badges to your project’s README file

Finally the part that we have all been waiting for: cool looking badges!

Readthedocs, Travis CI and Codecov all provide badges as part of their service. For the PyPI package we will make use of Version Badge.

Below is part of the reStructuredText markup I use for my tinyfasta package.

.. image:: https://badge.fury.io/py/tinyfasta.svg
   :target: http://badge.fury.io/py/tinyfasta
   :alt: PyPI package

.. image:: https://travis-ci.org/tjelvar-olsson/tinyfasta.svg?branch=master
   :target: https://travis-ci.org/tjelvar-olsson/tinyfasta
   :alt: Travis CI build status (Linux)

.. image:: https://codecov.io/github/tjelvar-olsson/tinyfasta/coverage.svg?branch=master
   :target: https://codecov.io/github/tjelvar-olsson/tinyfasta?branch=master
   :alt: Code Coverage

.. image:: https://readthedocs.org/projects/tinyfasta/badge/?version=latest
   :target: https://readthedocs.org/projects/tinyfasta/?badge=latest
   :alt: Documentation Status

The images in the README.rst file get rendered by GitHub into a neat looking header with the badges below.

Conclusion

You should now have a Python package that looks loved and cared for.

It is easy to install using pip
It has online documentation
It gets tested every time code is pushed to GitHub
It has its code coverage measured

Day 12: Multi-level modelling in morphogenesis

2015-07-24T00:00:00+00:00

The twelfth day of the multi-level modelling in morphogenesis course was started by Dr Christen Mirth outlining the methods and theory of evolutionary-developmental biology (eco-devo).

In nature (and the lab) one can observe polymorphism. Sources of these polymorphic variations include both genetic and environmental factors, meaning that the phenotype arises from interactions between the genotype and the environment. This results in phenotypic plasticity, the ability of an organism to react to an environmental input with a change in form, state, or behaviour.

Many different organisms show striking examples of phenotypic plasticity. Famous examples include:

Daphnia cucullata
Nemoria arizonaria
Precis coenia

Furthermore there are many different inducers of phenotypic plasticity: predators, food, and temperature and day length, respectively, for the examples listed above.

Looking at different phenotypes one can observe that they can have differing degrees of plasticity.

In order to go from an environmental cue to a difference in phenotype several mechanistic steps are required. The cue must be sensed by the organism. Some sort of signal must then be sent to the relevant tissue. The tissue must then interpret the signal and respond accordingly.

Dr Mirth then used the example of horn development in male dung beetles as a case study to illustrate how these processes may occur. Male dung beetles can either develop horns or become hornless. This dimorphism depends on larval nutrition provided by the mother.

One of the central problems for this particular example is that the horned and hornless dung beetles grow to the same size given the same amount of nutrition. So how does it regulate the size of the horn independently of the body size?

This is where the degrees of plasticity come into play: the initial levels of nutrients during development affect the plasticity of the horn's relative growth.

In the last section of her talk Dr Mirth discussed the patterning of Drosophila wings, focussing in particular on the question of pattern coordination. From an impressive set of experiments looking at the expression levels of different hormones during development Dr Mirth managed to show, by perturbing the system, that the organ patterning is coordinated by a set of specific milestones.

The participants were then invited to experiment with computational models linking network evolutions to cellular behaviour.

After lunch Dr Mirth gave a keynote lecture expanding on the concept outlined in the morning describing phenotypic plasticity and the evolution of polyphenisms.

During her talk Dr Mirth described how nutrition can affect two different aspects of plasticity in Drosophila:

Body size
Ovarian size

The latter example was analogous to the male dung beetle horn development where the size of the tissue is reprogrammed in a fashion that is independent of the whole body size.

Through a set of beautiful experiments Dr Mirth was able to show that different processes during different stages can be used to reprogram tissue growth.

After the keynote the course was wrapped up by the participants being handed their certificates and everyone breathing a sigh of relief before saying goodbye to their newfound friends.

Stay in touch!

Day 11: Multi-level modelling in morphogenesis

2015-07-23T00:00:00+00:00

The eleventh day of the multi-level modelling in morphogenesis course was started by Dr Arthur D. Lander giving a pedagogical lecture on morphogen gradients from an engineering viewpoint.

Dr Lander started off by showing a painting by Hanabusa Itcho representing the allegory of blind monks examining an elephant, where each monk, holding onto a different part of the elephant, is convinced that his “view” of the elephant is the truth.

Many people working on biological problems approach them using a physics mindset. In physics the goal is to understand the phenomena. Organisation arises by emergence. Understanding therefore means grasping how causal laws produce complex phenomena. In essence the question under investigation is: how does it work?

An alternative approach is that used in engineering. In engineering the goal is to understand performance. Organisation arises by selection for performance. Understanding therefore means grasping how complex systems achieve specific goals. In essence the question under investigation is: why is it built that way?

These different approaches have led to contrasting views of how biological complexity has arisen. The physics mindset can lead to thinking that complexity has arisen by chance as “frozen accidents”, whereas complexity using the engineering mindset appears to be something that has arisen out of necessity. These two contrasting views are very similar to the blind monks examining the elephant.

Dr Lander then turned his attention to the vein patterning of Drosophila wings, which is controlled by two morphogen gradients. The “textbook” representation of the gene regulatory network involved in this wing patterning is deceptively simple. But in fact when one looks at all the data available for the network it balloons into a complex “hairball” diagram.

Creating a morphogen gradient is not inherently difficult. All you need is:

Localised production
Random diffusion
Any sort of decay process

So in this instance the question is not how, but why. Why is the gene regulatory network for patterning the Drosophila wing so complex if it is easy to create a morphogen gradient?
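To make the three ingredients listed above concrete, here is a minimal sketch of a one-dimensional row of cells with production in the first cell, diffusion between neighbours and linear decay everywhere. It is not taken from the lecture: the function name, the parameter values, the no-flux boundaries and the explicit finite-difference scheme are all illustrative assumptions.

```python
def simulate_gradient(n_cells=50, diffusion=1.0, decay=0.05, production=1.0,
                      dt=0.1, n_steps=5000):
    """Return morphogen concentrations along a row of cells (assumed toy model)."""
    conc = [0.0] * n_cells
    for _ in range(n_steps):
        new = conc[:]
        for i in range(n_cells):
            # No-flux boundaries: a missing neighbour is replaced by the cell itself.
            left = conc[i - 1] if i > 0 else conc[i]
            right = conc[i + 1] if i < n_cells - 1 else conc[i]
            diffusion_term = diffusion * (left - 2 * conc[i] + right)
            new[i] = conc[i] + dt * (diffusion_term - decay * conc[i])
        # Localised production: only the first cell makes the morphogen.
        new[0] += dt * production
        conc = new
    return conc


gradient = simulate_gradient()
```

With these assumed parameter values the concentration falls off roughly exponentially away from the source, with a decay length close to the square root of the diffusion constant divided by the decay rate.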

To answer this one needs to consider what the performance objectives might be. Reasonable biological performance objectives include:

Stability
Efficiency (especially important for bacteria)
Timing
Evolvability
Robustness

The reliability of morphogenesis suggests it has high robustness. Examples include the formation of complex tissues such as the heart and the striking similarities of monozygotic twins.

Robustness can be defined as being able to achieve a desired goal in face of perturbations.

So what might the goals be? Some reasonable goals include:

Constancy (homeostasis)
Accurately reaching an endpoint
Adaptation

Perturbations include:

Altered system parameters (e.g. temperature)
Altered initial conditions
Noise (extrinsic or intrinsic)

Dr Lander then went into some detail on how one can use the engineering concept of sensitivity coefficients to quantify robustness in a unitless way. This looks at what the fold change in the output is in response to any fold change of the input. In engineering, a process is considered reasonably robust if the sensitivity coefficient is lower than 0.3.
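As a rough illustration of the idea, a sensitivity coefficient can be estimated numerically as the slope of log(output) against log(input). The sketch below is an assumption-laden example rather than the analysis from the lecture: the helper names are hypothetical, and the output chosen is the decay length of an exponential gradient formed by diffusion and linear decay.

```python
import math


def sensitivity(output_fn, value, rel_step=1e-4):
    """Estimate d ln(output) / d ln(input) with a central difference in log space."""
    up = output_fn(value * (1 + rel_step))
    down = output_fn(value * (1 - rel_step))
    return (math.log(up) - math.log(down)) / (
        math.log(1 + rel_step) - math.log(1 - rel_step)
    )


def decay_length(decay_rate, diffusion=1.0):
    # Assumed example output: decay length of a diffusion/decay gradient.
    return math.sqrt(diffusion / decay_rate)


# Sensitivity of the decay length to the decay rate.
s = sensitivity(decay_length, 0.05)
```

Because the decay length scales with the inverse square root of the decay rate, the coefficient has a magnitude of 0.5, which is above the 0.3 threshold mentioned above.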

Dr Lander then illustrated how one could analytically evaluate the robustness of different models for generating morphogen gradients using sensitivity. This concept was expanded on, after some coffee, when the participants of the course were invited to experiment with the concept analytically using Mathematica.

After lunch Dr Nick Monk gave a talk outlining some of the concepts from his paper The Inheritance of Process: A Dynamical Systems Approach.

As scientists we want to understand evolution. So far most of our understanding has been based on the assumption that the map relating genotype to phenotype is one-to-one (bijective).

However, the “traditional” view of the genotype-phenotype map as a one-to-one relationship is probably too simplistic. Take for example the case of polyphenism, an ecologically important trait that helps organisms adapt to variable environments. Polyphenism is widespread and occurs in many plants and animals.

This means that one needs to take the environment into account when thinking about the genotype-phenotype map: a genotype actually encodes a set of potentialities. Which potentiality is realised is influenced by the environment and can be stochastic. It is not a simple bijective mapping.

Dr Monk continued to illustrate how the traditional view of the genotype-phenotype map falls short in terms of helping us understand evolution as a process before presenting a new formalism for genotype-phenotype maps by representing them as dynamical systems.

In this formalism a genotype is specified by:

A network topology
A set of interaction parameters

These can be written down using a network model (e.g. a Boolean network, or a set of ODEs). The genotype-phenotype map can then be represented using phase space, which has a number of descriptors that can be useful for understanding the mapping.

Attractors (fixed points, cycles, chaotic)
Basins of attraction
Separatrices
Trajectories
Initial conditions
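The descriptors listed above can be made tangible with a very small Boolean network. The sketch below enumerates the attractors and basins of attraction of a two-gene network in which each gene represses the other. Both the network and the synchronous update rule are illustrative assumptions, not the model presented in the talk.

```python
from itertools import product


def step(state):
    """Synchronous update of an assumed two-gene network with mutual repression."""
    a, b = state
    return (int(not b), int(not a))


def find_attractor(state):
    """Follow the trajectory from an initial condition until a state repeats."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    cycle = seen[seen.index(state):]
    # Rotate the cycle so that each attractor has one canonical representation.
    start = cycle.index(min(cycle))
    return tuple(cycle[start:] + cycle[:start])


# Map each attractor to the initial conditions (its basin) that lead to it.
basins = {}
for initial in product((0, 1), repeat=2):
    basins.setdefault(find_attractor(initial), []).append(initial)
```

For this toy network the same topology yields two fixed points and one two-state cycle, so which end point is reached depends on the initial conditions.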

Thinking of a genotype as a network means that one has the possibility to trigger a number of different phenotypes. In other words a single genotype (network) can have multiple attractors (end points) and phenotypes correspond to trajectories (not just to attractors). The latter is useful as phenotypes are plastic.

Dr Monk finished by stating that rather than focussing on how evolution can change the genotype-phenotype map, perhaps we should be thinking about how evolution can change phase space. Perhaps we should think of evolution as an inheritance of process where discrete changes in genome sequence and allele frequency result in a change to the underlying continuous dynamics.

Dr Monk’s talk was followed by a keynote talk by Dr Lander which expanded on some of the concepts outlined in the morning’s lecture.

Biology is driven by performance objectives. The mechanisms that exist to achieve these kinds of goals are collectively referred to as control.

What are the problems that you get into when you want to achieve “control”?

Basically control makes things more complex. In fact there is a very strong relationship between the two and in engineering this is referred to as the “no-free-lunch principle”. In other words, achieving good performance in one arena often comes at the expense of good performance in another.

In fact curious things happen when you try to achieve control over a great many things at the same time. The possible solutions start diverging and scatter exponentially (for more detail see Landscape analysis of constraint satisfaction problems; Krzakala and Kurchan; 2007).

Dr Lander then illustrated the interplay between control, complexity and tradeoffs using the example of Drosophila wing patterning. Using modelling, Dr Lander showed many examples of how introducing a process for controlling a particular aspect of the morphogen gradient also resulted in loss of control for a different aspect.

A fundamental issue, in terms of Drosophila wing patterning, may be that there is not enough information in a single morphogen gradient. Dr Lander then illustrated how more control can be realised by using two morphogen gradients, particularly in conjunction with the toggle-switch architecture.

Dr Lander concluded by stating that if we wish to understand not just what happens in biology, but why biological systems are built the way they are, we need to interpret biological organisation in light of principles of control, and the constraints imposed by selection for control. Trade-offs (the no-free-lunch principle) are likely to drive the evolution of complexity. By focussing on performance, trade-offs and control, we can find potential explanations for at least some of the intricate feedback and feed-forward interactions that we observe in patterning systems.

Day 10: Multi-level modelling in morphogenesis

2015-07-22T00:00:00+00:00

The tenth day of the multi-level modelling in morphogenesis course was started by Professor Enrico Coen giving a pedagogical lecture on modelling growth and deformation.

Most of us are used to thinking about geometrical deformations. However, the mathematical principles behind geometrical deformations are quite different from the principles behind biological deformation. For example, transforming a square into a trapezoid, as if viewing the box from a perspective, is easy using a geometrical deformation/transformation. However, achieving this type of deformation in a biological system using growth is non-trivial. One of the complicating factors is that the tissue in a biological system is an interconnected material.

What is growth anyway? Does it depend on the number of cells? Is it something continuous or something discrete? Is the process of growth absolute (accretion) or relative to the size of the tissue? Much time was spent discussing the implications of these questions with regards to coming up with a definition of growth.

For the purposes of his talk Professor Coen defined growth as a tissue getting bigger, independently of the number of cells. And on the flip-side he defined a tissue reducing its size as “shrinkage”.

Professor Coen then expanded on the concept of growth rates, which he defined to be relative and continuous in the context of the growth of a tissue. He then explained how the growth rates can be inferred from velocities. If a velocity is changing one is observing either growth or shrinkage. Similar to Dr Kabla, the previous day, Professor Coen took the derivative of the velocity field to get a growth tensor. The growth tensor can be represented as:

Growth rate
Anisotropy
Direction
Rotation
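As a rough sketch of how these four quantities can be read off in two dimensions, the code below splits a velocity gradient into its symmetric part (growth) and antisymmetric part (rotation). The function name is hypothetical, and the definitions used here (areal growth rate as the sum of the principal growth rates, anisotropy as their difference) are assumptions that may differ from the conventions used in the lecture.

```python
import math


def decompose_growth(dvx_dx, dvx_dy, dvy_dx, dvy_dy):
    """Split a 2D velocity gradient into growth rate, anisotropy, direction, rotation."""
    # Symmetric part of the velocity gradient: the strain-rate (growth) part.
    e_xx, e_yy = dvx_dx, dvy_dy
    e_xy = 0.5 * (dvx_dy + dvy_dx)
    # Antisymmetric part: the local rotation rate.
    rotation = 0.5 * (dvy_dx - dvx_dy)
    # Principal growth rates are the eigenvalues of the symmetric part.
    mean = 0.5 * (e_xx + e_yy)
    radius = math.hypot(0.5 * (e_xx - e_yy), e_xy)
    g_max, g_min = mean + radius, mean - radius
    return {
        "growth_rate": g_max + g_min,   # assumed definition: areal growth rate
        "anisotropy": g_max - g_min,    # assumed definition: difference of principal rates
        "direction": 0.5 * math.atan2(2 * e_xy, e_xx - e_yy),  # angle of main growth axis
        "rotation": rotation,
    }


# Tissue growing at 3% per unit time along x and 1% along y, without rotation.
result = decompose_growth(0.03, 0.0, 0.0, 0.01)
```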

The growth tensor can be estimated from microscopy time-lapse movies, where one can track cell vertices over time.

Furthermore, the growth tensor concept can be used to model tissue growth. By specifying the growth rate, anisotropy and direction one can get deformations, conflicts and rotations as emergent features.

The participants were then invited to experiment with these concepts by modelling tissue growth in three dimensions.

Day 9: Multi-level modelling in morphogenesis

2015-07-21T00:00:00+00:00

The ninth day of the multi-level modelling in morphogenesis course was started by Professor Enrico Coen who continued on the theme of polarity.

Professor Coen did not set out to study polarity. He was interested in how tissues grew. However, after some time he came to the conclusion that to understand tissue growth he would have to understand polarity.

One of the main players of polarity in plants is the hormone auxin. In fact some of the main markers for polarity in plants are the PIN proteins, which actively transport auxin across the plasma membrane. By coordinating the location of PIN proteins to one side of all cells a tissue can create a polarity field. Furthermore, auxin gradients have been shown to regulate tissue polarity.

At the time there were two main models describing auxin-regulated tissue cell polarity:

Cell-cell comparison, a.k.a. “up-the-gradient” hypothesis
Flux or gradients at the interface, a.k.a. “with-the-flow” hypothesis

The former could explain PIN locations and the latter veins in leaves. However, both models also had issues. How could a cell, in a cell-cell comparison scenario, know the concentration of its neighbours? On the flip-side, in the “with-the-flow” hypothesis, it was unclear how the cell would be able to measure the flux.

Professor Coen then turned his attention to the animal models that had been proposed to coordinate the hair orientation in Drosophila wings. Again there were two major classes of models:

Cell-cell comparison models
Interface models, i.e. receptors interacting at interfaces between cells

These were very similar to the plant models, the main difference being that plant cells could not interact directly as they are separated by a cell wall.

However, all the plant and animal models had an assumption in common, namely that cells are unpolarised in the absence of an asymmetric ligand distribution or polarised neighbours.

But we know of at least some systems where this assumption is not true, for example budding yeast and migrating neutrophils.

Professor Coen then showed that by assuming that cells have an intrinsic polarity one can create a model where the polarised cells arrange themselves through local interactions with their neighbours. For example, in an animal model, one can imagine a scenario where the cells’ “front” and “back” factors directly bind with their neighbour’s “front” and “back” factors. This leads to coordination. However, the emerging pattern is a bunch of spirals. He then showed that very strict, orientated tissue polarity can be established if organizers are located somewhere on the tissue, or the polarities interact with some kind of concentration gradient.

One way to use this to model how a plant organises polarity is to assume high auxin efflux at one end of the tissue and no export of auxin at the other end. By reading out the same concentrations between the cells, both cells tend to align. Thus, using this indirect-signalling mode, one gets a similar result to the cell-cell comparison models. This is a bit counterintuitive. The patterning is working to remove the signal that is causing the pattern.

However, one can use the same model to create a different emergent tissue polarity, for example with high auxin production at one end of the tissue and low auxin degradation at the other end. In this case one ends up with results similar to those from the “with-the-flow” hypothesis.

So in essence we have a model that produces two different behaviours, previously thought to be two different processes altogether, and consolidates them in a parsimonious and locally-based manner.

Professor Coen’s talk was followed by a presentation by Dr Alexandre J. Kabla on the mechanobiology of cell migration and cell rearrangements.

Dr Kabla started off by illustrating that there is a massive amount of motion during development. In fact most shapes are created by cell migration.

This motion arises from several different processes:

Sheet bending/folding
Convergence extension
Collective migration

Dr Kabla then described a methodology for understanding some of these processes, in particular the latter two.

Using microscopy one can record time-lapse movies of developing embryos. These images can then be segmented into individual cells, which can be tracked over time. By looking at the motions of individual cells one can calculate velocities. All of the velocities can then be used to create a velocity field. By differentiating the velocity field one obtains a deformation field, which is a useful representation for trying to understand tissue formation by motion during development. The deformation field can, in fact, be used to identify separate tissues from a blob of cells.

Dr Kabla then went on to describe how one could analyse cell intercalation (convergence extension) in more detail using the deformation field representation.

The talk was followed by lunch, which was followed by another talk by Dr Kabla describing how modelling can be used to study collective migration. After the talk the participants of the course were invited to try out some of these analyses using data simulated with the cellular Potts model programs used earlier in the course.

Day 8: Multi-level modelling in morphogenesis

2015-07-20T00:00:00+00:00

The eighth day of the multi-level modelling in morphogenesis course was started by Dr Yogi Jaeger giving an introductory lecture to parameter estimation using reverse-engineering. The talk was illustrated using a case study of segmentation during fly embryogenesis.

Along the way Dr Jaeger highlighted many of the pitfalls that one needs to be aware of when modelling biological systems. Particular emphasis was put on the model being a tool, not reality! An outcome of this is that one needs to pay attention to the model and understand its limitations.

In its most simple form reverse-engineering can be split into three stages:

Creating a dynamical model
Obtaining quantitative measurements of data
Fitting the model to the data

When fitting the model to the data there are two main questions to consider.

How do you measure the similarity between the data and the model?
Which algorithm are you going to use to fit the data?

One of the simplest ways of measuring the similarity between the model and the data is to calculate the root mean square residual. However, other measures are available and the selection of one over another is context dependent. It is therefore something that one needs to pay attention to.
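For reference, the root mean square residual can be computed in a few lines. This is a minimal sketch with a hypothetical function name and made-up numbers, not code from the lecture.

```python
import math


def rms_residual(observed, predicted):
    """Root mean square of the differences between data and model output."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up example values: only the last point deviates from the data.
score = rms_residual([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
```

A value of zero means the model reproduces the data exactly; larger values mean a poorer fit.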

Fitting the model to the data, i.e. estimating the parameters, is a global optimisation problem and there are a number of algorithms available to tackle it. Traditionally people have been using evolutionary strategy and simulated annealing algorithms for these types of problems. Evolutionary strategy algorithms are relatively quick. However, when using them one suffers from not knowing whether or not the solution identified is the real global minimum. Simulated annealing algorithms can be more robust, but they are also slower.
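To give a feel for how simulated annealing works, the sketch below fits the two parameters of an exponential decay to synthetic data. Everything in it is an illustrative assumption: the model, the cooling schedule, the step size and the function names are chosen for brevity and are not the setup used in Dr Jaeger's lab.

```python
import math
import random


def model(x, amplitude, rate):
    # Assumed toy model: exponential decay.
    return amplitude * math.exp(-rate * x)


def cost(params, xs, ys):
    """Root mean square residual between the model and the data."""
    residuals = [(model(x, *params) - y) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(residuals) / len(residuals))


def anneal(xs, ys, start, n_steps=20000, t_start=1.0, t_end=1e-4, seed=0):
    rng = random.Random(seed)
    current, current_cost = list(start), cost(start, xs, ys)
    best, best_cost = list(current), current_cost
    for i in range(n_steps):
        # Geometric cooling schedule from t_start towards t_end.
        temperature = t_start * (t_end / t_start) ** (i / n_steps)
        candidate = [p + rng.gauss(0.0, 0.1) for p in current]
        candidate_cost = cost(candidate, xs, ys)
        delta = candidate_cost - current_cost
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta < 0 or rng.random() < math.exp(-delta / temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


# Synthetic, noise-free data generated from known parameters (2.0, 0.5).
xs = list(range(11))
ys = [model(x, 2.0, 0.5) for x in xs]
start = [1.0, 1.0]
best, best_cost = anneal(xs, ys, start)
```

Accepting some worse moves early on is what lets the search escape local minima, at the price of many more evaluations of the cost function.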

Dr Jaeger then mentioned that his lab has had great success with the scatter search algorithm. In his hands it can be up to ten times quicker than simulated annealing.

Once one has found a solution one needs to ask whether or not it is appropriate. This can be achieved by parameter identifiability analysis through bootstrapping, i.e. fitting the model to noisy data. The results can be projected onto the parameter landscape as ellipsoid confidence regions. However, this can be slow. A quicker way to estimate these confidence regions is to calculate the Hessian matrix of the system using linear approximation.

After lunch Dr Veronica Grieneisen gave a talk about cell polarity and how one can understand it through breaks of symmetry.

If one considers a morphogen gradient, how can it be “read” by cells? Further, how can this lead to coordinated cell orientations? Any solution will require some process of comparison.

The talk then took a slight detour into physics.

How can you make a compass? You can take a needle and magnetize it with an external field. If you have many needles they will all align in the field. Importantly each subunit (needle) will have a “north-south” polarity in the magnetic field.

Without going too far with the analogy Dr Grieneisen noted that by giving a cell the concept of polarity it is given a mechanism for aligning within a larger polarity such as a chemical gradient or a tissue polarity.

Dr Grieneisen then presented work using the cellular Potts model illustrating how small G-proteins, which can act as molecular switches, can give rise to cell polarity. However, the modelling found that there was an additional requirement. The inactive form had to be able to diffuse on a quicker time-scale than the active form. In this particular case this was achieved by the active form being constitutively membrane bound, whereas the inactive form was able to diffuse freely in the cytosol.

Dr Grieneisen’s pedagogical lecture was followed by a keynote lecture by Dr Jaeger giving a more detailed exposition of his top-down approach to extracting structures of networks from gene expression data and his analysis of the model by looking at phase space and the attractors within it.

By using this reverse-engineering approach Dr Jaeger managed to establish that the AC/DC circuit is a recurring motif in the gap gene network. The AC/DC circuit is interesting in that it can act both as a positive and a negative feedback loop. By analysing the phase space and attractors in the AC/DC circuit one finds that this simple network can give rise to different functions. Specifically it can act as a:

switch
oscillator
damped oscillator

Dr Jaeger then showed that the AC/DC circuit in the gap system could be used to create:

Stable boundaries in the anterior of the fly embryo, set by attractors
Moving boundaries in the posterior of the fly embryo, governed by a damped oscillator

The participants of the course were then invited to an open-panel session to discuss “what models are for”. This was followed by more hands-on computational exercises simulating cell polarity in animal and plant cells.

Day 5: Multi-level modelling in morphogenesis

2015-07-17T00:00:00+00:00

The fifth day of the multi-level modelling in morphogenesis course was started by Professor Przemyslaw Prusinkiewicz giving an introductory lecture to L-systems.

However, before going into L-systems Professor Prusinkiewicz introduced the topic of computational modelling more generally. In particular he highlighted J. C. R. Licklider, who amongst other things was one of the founders of the internet. In his essay Man-Computer Symbiosis Licklider had the vision that people would start interacting with computers in the same way as they would interact with a colleague.

Professor Prusinkiewicz also highlighted the work of Alan Kay who had the vision of A Personal Computer for Children of All Ages in 1972.

At the time the computational resources were not available for either of these visions. However, now they are, which means that this is a great time to be a modeller. You can now treat your laptop as a colleague whose skills supplement your own.

Professor Prusinkiewicz then went on to discuss some of the issues of modelling in developmental biology. There are two main issues. First of all development is a spatio-temporal process. Secondly, a developing organism is a dynamical system with a dynamical structure. For example, one could look at a plant as a system of components that are all developing over time.

These issues were dealt with by Aristid Lindenmayer in Mathematical models for cellular interactions in development. The formalisms developed by Lindenmayer are now often referred to as L-systems or Lindenmayer systems.

An L-system basically consists of an alphabet, a set of productions (rules for converting letters of the alphabet into new strings of letters from the alphabet), and an axiom (the starting string). Using a recursive algorithm a string, starting from the axiom, is continually transformed by the production rules.
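The rewriting step is short enough to sketch in a few lines of Python. The example below uses the classic two-letter "algae" system often used to introduce L-systems (A becomes AB, B becomes A). The function name is hypothetical and the example is not taken from the lecture or from the Vlab software.

```python
def rewrite(axiom, productions, n_iterations):
    """Apply the production rules to every letter of the string, n_iterations times."""
    string = axiom
    for _ in range(n_iterations):
        # All letters are replaced in parallel; letters without a rule are copied.
        string = "".join(productions.get(letter, letter) for letter in string)
    return string


productions = {"A": "AB", "B": "A"}
generations = [rewrite("A", productions, n) for n in range(5)]
# generations is ["A", "AB", "ABA", "ABAAB", "ABAABABA"]
```

The lengths of the successive strings (1, 2, 3, 5, 8) follow the Fibonacci sequence, which hints at how much structure such simple rules can generate.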

Using turtle geometry L-systems can be used to create fractal structures using very simple rules.
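As a sketch of the turtle interpretation, the code below treats F as "move forward one step" and + and - as left and right turns, and returns the points visited instead of drawing them. The production rule is a common Koch-curve variant with right-angle turns. Both the rule and the function names are assumptions chosen for illustration.

```python
import math


def rewrite(axiom, productions, n_iterations):
    """Apply the production rules to every letter of the string, n_iterations times."""
    string = axiom
    for _ in range(n_iterations):
        string = "".join(productions.get(letter, letter) for letter in string)
    return string


def turtle_path(commands, angle_degrees=90.0, step=1.0):
    """Interpret an L-system string as turtle commands and return the visited points."""
    x, y, heading = 0.0, 0.0, 0.0
    points = [(x, y)]
    for command in commands:
        if command == "F":      # move forward, recording the new position
            x += step * math.cos(math.radians(heading))
            y += step * math.sin(math.radians(heading))
            points.append((x, y))
        elif command == "+":    # turn left
            heading += angle_degrees
        elif command == "-":    # turn right
            heading -= angle_degrees
    return points


koch = rewrite("F", {"F": "F+F-F-F+F"}, 1)
points = turtle_path(koch)
```

Increasing the number of iterations replaces every straight segment with a copy of the whole motif, which is where the fractal structure comes from. The list of points can be passed to any plotting library to draw the curve.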

Professor Prusinkiewicz then went on to describe how the L-system can be used to model the development of plants by having an alphabet of basic plant modules such as stems, branches, flowers, etc.

This was followed by a practical session where people got to experiment with L-system modelling using the Vlab software.

The practical session was followed by a pedagogical lecture on phyllotaxis by Dr Yves Couder.

Phyllotaxis is the arrangement of leaves on a plant stem. There are a very small number of archetypes of organisation. Leaves can be organised into spiral modes or whorled modes.

Spiral patterns are also important in parastichy, which can be observed for example in pine cones and the organisation of sunflower seeds.

Interestingly the spiral mode pattern of leaves can be related to parastichy by compressing the stem.

Previously many people thought that these types of complex patterns could only be the result of complex biological processes, particularly as these patterns were only observed in botany.

However, Dr Couder illustrated how he could reproduce these patterns using a physical system consisting of a ferrous material dropped into oil on a petri dish. The ferrous drops were pushed towards the edge of the petri dish by a magnetic field. As more ferrous drops were added the parastichy pattern emerged by virtue of the repulsive dipole interactions with the nearest neighbour drops that had already been deposited in the oil.

Dr Couder then went into a more mathematical description of these patterns and how they relate to Fibonacci numbers and the golden ratio. For more details on this fascinating work relating botany to mathematics see the excellent Science News article The Mathematical Lives of Plants.
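The link to the golden ratio can be illustrated with a purely geometric construction in which each new element is rotated by the golden angle relative to the previous one. Note that this is a simple geometric model (often attributed to Vogel) swapped in for illustration. It is not a simulation of Dr Couder's ferrofluid experiment, and the function name and the square-root radius rule are assumptions.

```python
import math

GOLDEN_RATIO = (1 + math.sqrt(5)) / 2
# The golden angle divides the circle in the golden ratio (about 137.5 degrees).
GOLDEN_ANGLE = 360.0 * (1 - 1 / GOLDEN_RATIO)


def spiral_positions(n_points, scale=1.0):
    """Place points at successive golden-angle rotations, moving slowly outwards."""
    positions = []
    for n in range(n_points):
        radius = scale * math.sqrt(n)
        theta = math.radians(n * GOLDEN_ANGLE)
        positions.append((radius * math.cos(theta), radius * math.sin(theta)))
    return positions


seeds = spiral_positions(200)
```

Plotting the resulting points gives a sunflower-like arrangement in which the visible spirals tend to come in consecutive Fibonacci numbers.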

After lunch the participants were invited to continue thinking about modelling by going out into the field and taking photographs of interesting plant patterns. This was followed by a show and tell session where people were encouraged to think about and discuss possible mechanisms that could be used to produce the pattern in question. This was followed by more computer modelling using L-systems.

Day 4: Multi-level modelling in morphogenesis

2015-07-16T00:00:00+00:00

The fourth day of the multi-level modelling in morphogenesis course was started by Dr Stan Maree giving a talk outlining how modelling has helped us understand the life-cycle of cellular slime mold.

Under normal conditions cellular slime molds act as individual cells feeding on bacteria. However, during starvation hundreds of thousands of these cells come together to form a slug-like creature that can migrate across a surface guided by, amongst other things, thermotaxis. Finally the slug culminates by transforming itself into a fruiting body consisting of a spore head on a small, tapering stalk.

Dr Maree then illustrated, through modelling, that all of the aspects of the life-cycle unfold by combining only a few processes. The main processes are:

excitable media through cAMP
chemotaxis towards cAMP
differential cell adhesion
cell differentiation

The fact that this parsimonious description can account for the complex behaviour of development is mainly due to the fact that one can get differing behaviours by operating on different levels, e.g. individual cells versus clusters of cells versus a slug of cells. For example, pressure waves emerge and guide the culmination stage solely due to cell adhesion and excitable media, while adhesion is a property linked to cell membranes which ensures that cells adopt different shapes and that clusters of cells also develop certain topologies. At the highest level the Dictyostelium slug can even act as a lens and use this effect to navigate up light gradients, something that could be understood through modelling. This combination of modelling at many different scales is a core aspect of this course.

The participants were then invited to work through workshop material on aggregation and cAMP waves, aggregation and slug formation, thermotaxis and culmination.

In the afternoon there was a keynote lecture by Dominique Bergmann.

Dr Bergmann’s group is interested in the development of stomata, in particular the spatial organization of the stomatal lineage. Dr Bergmann’s group is largely experimental. However, she is very keen to interact with modellers and it was great to see her inviting the participants of the course to tackle questions that her group is currently battling with, questions which could potentially be answered by modelling.

Dr Bergmann’s talk gave fascinating insight into the mechanisms by which stomata are patterned across leaves, starting from the simple rule that no stomata may touch each other. The talk revealed that flexible patterned development can arise by regulating the expression of key transcription factors through positive and negative feedback loops. Furthermore, by studying the differences between the regulatory networks in grass and Arabidopsis Dr Bergmann managed to refine important features of the mechanism of patterning, highlighting the value of studying different organisms.

Day 3: Multi-level modelling in morphogenesis

2015-07-15T00:00:00+00:00

The third day of the multi-level modelling in morphogenesis course started off with a lecture by Dr Veronica Grieneisen, giving an introduction to the cellular Potts model.

The talk started off with a statement that biophysics, or simply physics, constrains and drives tissue development. And that (embryonic) tissues share properties with fluids.

So why do clusters of cells form similar structures to froths and bubbles? Basically some of the principles are the same: the tendency to minimise area and conserve (topological) constraints gives rise to a frustration which generates characteristic configurations.

In 1964 Steinberg formulated the differential adhesion hypothesis in which he made the comparison between cells and immiscible fluids. He used the idea that cell types present different adhesive and cohesive interactions to postulate that the final configurations are established by obtaining a minimal interface free energy through successive changes in cell contacts.

This means that differential inter-cellular adhesion is one of the most important factors in cell sorting.

It did however take some time before this could be modelled. People tried doing it using cellular automata. However, this turned out to be (close to) impossible.

Some time later people started experimenting with the cellular Potts model. A significant difference between the cellular Potts model and cellular automata is the representation of a biological cell. In cellular automata a cell is represented by a single pixel, whereas in the cellular Potts model a biological cell is represented by lots and lots of pixels. The latter is therefore able to represent cell shape.

The cellular Potts model is simply a Hamiltonian consisting of terms for adhesion, volume conservation and cortical tension. The latter being a more recent addition.

The cellular Potts model is driven by Monte Carlo sampling where the edges of the cells are allowed to change state into neighboring cells. The energy of each pixel is then evaluated and if the energy is lowered the change is accepted. If a change increases the energy the change can still be accepted. The likelihood of accepting an energetically unfavourable change is evaluated using a probability function that makes it more unlikely as the energy difference increases. The reason for accepting some energetically unfavoured changes is so that the system can be driven towards the global minimum over time.
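
The acceptance rule described above can be sketched in a few lines of Python. The sketch assumes the Boltzmann-style form exp(-dH/T) that is commonly used in Metropolis sampling; the lecture summary only states that unfavourable changes become less likely as the energy difference grows, so the exact function, the parameter name temperature and both function names are assumptions on my part.

```python
import math
import random


def acceptance_probability(delta_energy, temperature):
    """Return the probability of accepting a proposed change.

    Changes that lower (or keep) the energy are always accepted.
    Changes that raise the energy become exponentially less likely
    as the energy difference increases (assumed Boltzmann form).
    """
    if delta_energy <= 0:
        return 1.0
    return math.exp(-delta_energy / temperature)


def accept_change(delta_energy, temperature, rng=random):
    """Draw a random number and decide whether to accept the change."""
    return rng.random() < acceptance_probability(delta_energy, temperature)


for delta in (-1.0, 0.5, 2.0, 5.0):
    print(delta, acceptance_probability(delta, temperature=1.0))
```

In a full simulation delta_energy would be the change in the Hamiltonian caused by copying the state of a neighbouring pixel, and a higher temperature would correspond to larger membrane fluctuations.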

As it turns out the cellular Potts model can capture both the stochastic nature of cell dynamics and cell sorting behaviour.

The formalisms of the cellular Potts model were then described in some detail before the talk was concluded by stating that the cellular Potts model is an energy-based model that describes surface mechanics (adhesion, membrane fluctuations, internal pressures, cortical tension). As a result one can use it to talk about macroscopic phenomena such as cell shape, cell sorting, tissue surface tension, cell movement, stresses and shape changes, as well as stresses and strains through a tissue.

It is also possible to extend the cellular Potts model to tackle specific biological problems such as chemotaxis and cell differentiation. Furthermore it can be used in conjunction with other models to look at sub-cellular details such as gene regulatory networks. It therefore serves as a great tool for cell-based modelling of morphogenesis.

The lecture was followed by a practical session where the participants got hands on experience of using the cellular Potts model for studying cell sorting.

After lunch Dr Stan Maree took over where the morning’s lecture had ended by illustrating how the cellular Potts model can be used to model cellular movement and morphogenesis.

The first example illustrated how the cellular Potts model could be extended to model movements of cells in lymph nodes. The movement of T-cells in the lymph node can be described as random persistent motion. The reason for this type of movement is that T-cells want to move past as many dendritic cells as possible (and vice versa) in as short a time as possible.

The model was created from:

T-cells set to be persistently moving
Dendritic cells (including extensions)
Reticular network (undeformable)
Correct sizes, densities, and shapes of cells
Fitting speed and motility

Where the persistent T-cell movement was created from:

Continuously adjusting the target direction
Continuously adjusting the directional persistence
Adjustment according to the reticular network

The resulting model managed to reproduce short term persistent motion, long term random motion, and the experimentally observed “stop-and-go” behaviour. The latter had not been incorporated into the model and was previously thought to arise from a synchronised clock in the T-cells. These and other simulations then prompted further and longer time-lapse experiments from the experimental biologists that disproved the internal clock hypothesis.

The second example illustrated how the cellular Potts model could be combined with chemotaxis, cell differentiation and gene regulatory networks to model complex developmental changes during gastrulation.

Dr Maree’s lecture was followed by a keynote talk by Professor Shigeru Kondo.

Professor Kondo gave a fascinating talk describing his quest to find evidence of reaction-diffusion systems in animals.

He did this by turning the traditional work flow of a molecular biologist on its head. Usually a molecular biologist would:

Find mutants
Identify all the genes involved in the phenomenon
Identify the functions of all those genes
Clarify the whole interaction network
Do calculations to make sure the identified system can reproduce the phenomenon of interest

However, Professor Kondo decided that if he was to find evidence for the reaction-diffusion system he would need to use the theory before doing the experiments. As such he set out to:

Do many computer simulations
Extract important characteristics
Predict something unexpected
Show that it can happen!

He then illustrated how his group had applied this methodology of modelling first and experimenting second to find extraordinary evidence of the reaction-diffusion system in fish.

For example, one prediction made from the Turing reaction-diffusion system was that fish stripes should be able to migrate, specifically stripes that bifurcate. At the time Professor Kondo asked leading experts of fish developmental biology if this had ever been observed. However, they replied that it had not. Undeterred, Professor Kondo started looking for evidence of this behaviour and was able to find it in the skin of marine angelfish, see Kondo and Asai; Nature (1995).

After several other striking examples of prediction followed by experimental evidence Professor Kondo concluded by stating that Turing systems had proved an effective tool for understanding patterning. However, he made the point clear that the underlying mechanism is probably encoded in cell motility rules rather than necessarily in an activator and inhibitor.

Day 2: Multi-level modelling in morphogenesis

2015-07-14T00:00:00+00:00

The theme of the second day of the multi-level modelling in morphogenesis course was emergent patterns and morphogens.

The day started with a presentation by Dr Stan Maree who posed the question:

Can we get spontaneous formation of patterns out of nothing?

During the previous day we learnt that equilibria can be stable or unstable. Now imagine a reaction with a stable equilibrium and combine it with diffusion. Can we get patterns from that combination?

In 1952 Alan Turing showed, theoretically, that a stable equilibrium could become unstable solely due to the diffusion of the chemicals involved.

Dr Maree then worked through the logic of Turing’s original paper The Chemical Basis of Morphogenesis. In the end Turing showed that in a system where A and I activate and inactivate each other (or vice versa) an instability will occur if A also activates its own production, while I also inhibits its own production and the diffusion of I is sufficiently faster than the diffusion of A.
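
The verbal conditions above correspond to the standard linear stability conditions for a two-component reaction-diffusion system. The sketch below checks those textbook conditions in Python for a Jacobian evaluated at the equilibrium. The function name, the argument names and the example numbers are all assumptions made for illustration; they are not taken from the lecture or from Turing's paper.

```python
def turing_unstable(jacobian, diffusion_a, diffusion_i):
    """Return True if diffusion destabilises an otherwise stable equilibrium.

    jacobian is ((f_a, f_i), (g_a, g_i)): the partial derivatives of the
    production rates of A (first row) and I (second row) with respect to
    A and I, evaluated at the equilibrium.
    """
    (f_a, f_i), (g_a, g_i) = jacobian
    trace = f_a + g_i
    determinant = f_a * g_i - f_i * g_a

    # Without diffusion the equilibrium must be stable.
    stable_without_diffusion = trace < 0 and determinant > 0

    # With diffusion some spatial wavelength must grow.
    h = diffusion_i * f_a + diffusion_a * g_i
    unstable_with_diffusion = (
        h > 0 and h ** 2 > 4 * diffusion_a * diffusion_i * determinant
    )
    return stable_without_diffusion and unstable_with_diffusion


# Hypothetical numbers: A activates itself and I; I inhibits itself and A.
example = ((1.0, -1.0), (2.0, -1.5))
print(turing_unstable(example, diffusion_a=1.0, diffusion_i=20.0))
print(turing_unstable(example, diffusion_a=1.0, diffusion_i=1.0))
```

With these made-up numbers the first call reports an instability and the second does not, which mirrors the requirement that I diffuses sufficiently faster than A.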

The patterns that are formed by the reaction-diffusion system are known as Turing patterns.

Dr Stan Maree’s talk was followed by a more sociological talk by Dr Nick Monk who examined the interactions between experimental and theoretical biologists in the context of reaction-diffusion systems.

In the early days the reaction-diffusion system did not really gain traction with embryologists, who did not like it as a model for development. They thought it was too sensitive to environmental conditions and as such too messy for development.

In the 1970s Turing patterns started to gain traction with experimental biologists, primarily through the work of Hans Meinhardt. These models did not have a molecular basis that could be linked to experimental data. In the spirit of the time, they were posed in terms of “activity”. A significant problem was that there was no easy or obvious way of linking experimental data to the models.

Unfortunately things took a turn for the worse in the 1980s, when perhaps modellers tried to overreach. At the time molecular genetics really got into its stride. For a few systems, intense effort brought into focus the molecular complexity of processes such as Drosophila segmentation and limb development. At the same time, some reaction-diffusion modellers became bold about the role of their models. Unfortunately there was a lack of iterative modelling and experimentation. Some modellers also failed to realise that a similarity between one’s model and an experimental system does not mean that the model is correct.

At the time George Oster provided a voice of reason:

However, many developmental biologists are now taking a hard look at the actual contributions pattern formation models have made to their field, and I sense some disillusionment.

He used lots of data and examined lots of models to find that:

physical and chemical mechanisms hypothesized by the models may be quite different, they all predict very similar kinds of spatial patterns. Therefore, since the underlying mechanism cannot in general be deduced from the pattern itself, other criteria must be applied in evaluating the usefulness of pattern formation models.

Unfortunately the paper was published in Mathematical Biosciences 90: 265-286 1988 and as such received little attention from experimental biologists. Instead a paper by Michael Akam, Making stripes inelegantly, made the headlines. The paper, which appeared as a News and Views article in Nature, claimed that patterns are just made by messy specific systems and that reaction-diffusion systems were of little importance. This paper had a big negative impact on the ability of modellers to get a hearing from experimentalists for most of the 1990s.

However, things are getting much better now. Experimentalists are starting to take reaction-diffusion systems seriously again and interpretations of the models are becoming more nuanced. Some experimentalists are even being inspired by Turing to measure diffusion rates in order to be able to create better models. Furthermore, people are realising that reaction-diffusion systems act in concert with other mechanisms such as gene regulatory networks. The latter point was something that Turing pointed out in his original paper:

most of an organism, most of the time, is developing from one pattern into another, rather than from homogeneity into a pattern

Dr Monk finished off by reiterating the key message of his talk, a message that was made by Oster in 1988: it is the type of instability that is important. That is what will give you insight. Then you can try to understand which underlying players allow the mechanism to give rise to the right pattern.

The participants were then invited to work through a number of exercises and play around with various programs illustrating various aspects of Turing patterns.

After lunch Dr Veronica Grieneisen gave a lecture on morphogen gradients and plant development.

The lecture started off by outlining the French Flag Theory created by Wolpert in 1969, noting in particular that the original paper was not concerned with morphogen gradients per se, but with how morphogen gradients can be used to tackle the issue of scaling.

Important considerations were then outlined in terms of:

Spatial scales: what are the characteristic length and relevant tissue growth?
Temporal scales: what is the time required to spread a signal growth?
Robustness: how sensitive is the system’s behaviour to perturbations?

The latter point of robustness is multi-faceted. It can refer to parametric robustness (dosage of genes, levels or rates of enzymes) as well as to the precision of gradients, which needs to be considered in terms of natural variation among individuals vs. stochasticity within an individual's growth.

These topics were then considered in the context of a mesoscopic modelling exercise of the auxin levels at the quiescent centre of the root tip during root growth. As it turned out, the high concentration of auxin at the quiescent centre was achieved by a reflux-driven maximum. Interestingly, this type of maximum turned out to have isomorphic counterparts in volcanic micro currents as well as in counter current transport in kidneys.

Dr Grieneisen’s lecture was followed by a keynote lecture by Dr Yves Couder in which he outlined how the patterns formed by leaf venation are similar to the cracks observed in old porcelain, as well as dried mud. In all cases the cracks join up with each other, as opposed to the patterns observed during crystal growth, which do not join up.

The cracks observed in old porcelain can be explained by growth in a tensor field. Dr Couder then went on to illustrate several beautiful examples where they managed to replicate various leaf vein patterns by drying gels on glass plates (where the static glass plate provided the stress for the tensor field).

The talk then went into more theory and experiments looking at the role for mechanical stress in growth. Concluding that:

Externally applied mechanical stress generates an orientation of the microtubules (and cell divisions) along the direction of main stress
In a normal meristem the microtubules become oriented along the direction of the main stresses induced in the L1 layer by turgor pressure
The deposited microfibrils and the new cell walls strengthen the tissue along the direction of largest stress

After all the talks the participants of the course were invited back to experiment with workshop material on morphogen gradients, diffusion and permeability, source-decay models, directed transport and the reflux model.
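
As a small illustration of the simplest of these ideas, the source-decay model, the Python sketch below uses the textbook steady-state solution for a morphogen that is held at a fixed concentration at the source, diffuses, and is degraded at a constant rate. Under those assumptions (one dimension, a domain much longer than the gradient) the profile is an exponential with characteristic length sqrt(D/k). The function names and parameter values are my own and are not taken from the workshop material.

```python
import math


def characteristic_length(diffusion_coefficient, decay_rate):
    """Return the length over which the gradient falls by a factor of e."""
    return math.sqrt(diffusion_coefficient / decay_rate)


def steady_state_concentration(x, source_concentration,
                               diffusion_coefficient, decay_rate):
    """Return the steady-state morphogen concentration at distance x."""
    length = characteristic_length(diffusion_coefficient, decay_rate)
    return source_concentration * math.exp(-x / length)


# Hypothetical values in arbitrary units.
D = 1.0
k = 0.01
print(characteristic_length(D, k))
for x in (0, 10, 20, 30):
    print(x, steady_state_concentration(x, 1.0, D, k))
```

The characteristic length gives a first answer to the spatial scale question raised in the lecture, and 1/k gives a rough time scale for the gradient to respond to changes.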

Day 1: Multi-level modelling in morphogenesis

2015-07-13T00:00:00+00:00

Today was the first day of a two week course on multi-level modelling in morphogenesis.

During the introductory lecture, given by Dr Veronica Grieneisen, the goals of the course were outlined.

A goal of the course is to spread a broader understanding of developmental biology and biological modelling. More specifically to:

Define and understand generic principles guiding developmental biology
Learn how to identify and unravel processes
Understand and discuss at what level one should “model” a phenomenon
Get exposed to different biological paradigm systems, as well as modelling formalisms
Obtain models with predictive value and explanatory power that create isomorphisms

In this context isomorphisms are corresponding abstractions and conceptual models that can be applied to different phenomena.

At another level a goal of the course is to discuss practical aspects of modelling. As scientists we need techniques to be able to express ourselves. As such the course aims to open the black boxes that biologists often make use of.

At yet another level the course is about communication. In particular enabling communication with a common language. The participants of the course are intentionally a mix of experimental and computational biologists from many different fields of biology. Bridging the gap between experimental and computational biologists is a central theme of the course. As is learning how to express oneself to people from different fields.

The section describing the goals was followed by a brief introduction to partial differential equations starting from the conservation equation (the differential form of the continuity equation), leading into a discussion about flux, Fick’s first law and its relation to diffusion.
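
The relationship between these ideas can be sketched numerically. In the Python snippet below Fick's first law gives the flux between neighbouring grid cells, and the conservation equation turns the difference between the flux entering and leaving a cell into a change in concentration. It is a minimal explicit finite-difference sketch with no-flux boundaries; the function name and the parameter values are my own assumptions and it is not the code used in the course.

```python
def diffusion_step(concentration, diffusion_coefficient, dx, dt):
    """Advance a 1D concentration profile by one explicit time step."""
    # Fick's first law: flux across the face between cell i and cell i + 1.
    fluxes = [
        -diffusion_coefficient * (concentration[i + 1] - concentration[i]) / dx
        for i in range(len(concentration) - 1)
    ]
    # No-flux boundaries: nothing enters or leaves at either end.
    fluxes = [0.0] + fluxes + [0.0]
    # Conservation: the change in a cell is flux in minus flux out.
    return [
        c + dt * (fluxes[i] - fluxes[i + 1]) / dx
        for i, c in enumerate(concentration)
    ]


profile = [0.0, 0.0, 1.0, 0.0, 0.0]
for _ in range(3):
    profile = diffusion_step(profile, diffusion_coefficient=1.0, dx=1.0, dt=0.1)
    print(profile, sum(profile))
```

Because every flux is subtracted from one cell and added to its neighbour, the total amount of material stays constant. The explicit scheme is only stable when D * dt / dx**2 is at most 0.5.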

The overview of partial differential equations was followed by a discussion on cellular automata, “to have an object or not to have an object”. Three examples were described:

The participants were then immersed in a hands-on workshop exploring majority voting and Conway’s game of life, followed by an exploration of diffusion simulated by partial differential equations and Margolus alternation. The latter was accomplished by an exercise exploring diffusion-limited aggregation. The purpose of these exercises was to make biologists more familiar with thinking algorithmically and for everyone to think critically about the impact of the choice of modelling technique. What can be seen as a feature in one instance can be an artifact in another. It all depends on the phenomenon that one is trying to model.
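
For readers who have not met cellular automata before, here is a minimal Python sketch of one update step of Conway's game of life. It stores only the coordinates of live cells in a set, which is one of several possible representations and is my own choice; the workshop software may well have used a fixed grid instead.

```python
from collections import Counter


def life_step(live_cells):
    """Return the set of live cells after one generation."""
    # Count, for every position, how many live neighbours it has.
    neighbour_counts = Counter(
        (x + dx, y + dy)
        for x, y in live_cells
        for dx in (-1, 0, 1)
        for dy in (-1, 0, 1)
        if (dx, dy) != (0, 0)
    )
    # A cell is alive next generation if it has exactly three live
    # neighbours, or if it is alive now and has exactly two.
    return {
        cell
        for cell, count in neighbour_counts.items()
        if count == 3 or (count == 2 and cell in live_cells)
    }


# A "blinker" oscillates between a horizontal and a vertical bar.
blinker = {(0, 1), (1, 1), (2, 1)}
print(sorted(life_step(blinker)))
print(sorted(life_step(life_step(blinker))))
```

Swapping the birth and survival rule for a different one, such as majority voting over the neighbourhood, changes the behaviour completely while leaving the rest of the code untouched.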

Then it was time for lunch and socialising.

After lunch Dr Stan Maree introduced three seemingly different phenomena: cellular slime mold chemotaxis, the Belousov-Zhabotinsky reaction (chemistry) and action potentials in neurophysiology.

The Hodgkin-Huxley model was described in all its complexity. This was followed by a statement that it can be described as “unpleasantly complex” and a quote from FitzHugh that “the usefulness of an equation to an experimental physiologist (…) depends on his understanding of how it works”. The FitzHugh-Nagumo model was then briefly introduced. However, the details and the implications of the model were not described as it was to be explored during the afternoon’s practical session.

Instead the focus shifted to how one can gain an understanding of systems of linear ordinary differential equations. Time plots were contrasted with phase plane plots, and the importance of visualising nullclines as lines of zero change for a particular variable was highlighted, in particular the fact that one can identify all equilibria from the intersections of nullclines in a phase plane plot.

Stability of equilibria was then discussed and simple rules for quickly analysing the stability of equilibria were derived from the fact that:

For an equilibrium to be stable its eigenvalues need to have negative real parts
Summing two eigenvalues results in the trace
Multiplying two eigenvalues results in the determinant

So by plotting the trace vs the determinant we can get a plot illustrating different types of equilibria, see also.
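
The rules above translate directly into a small classifier for the equilibria of a two-variable system. The Python sketch below takes the 2x2 Jacobian at an equilibrium and uses only its trace and determinant. The function name and the wording of the returned labels are my own; the classification itself is the standard trace-determinant scheme.

```python
def classify_equilibrium(jacobian):
    """Classify an equilibrium from the 2x2 Jacobian ((a, b), (c, d))."""
    (a, b), (c, d) = jacobian
    trace = a + d                      # sum of the two eigenvalues
    determinant = a * d - b * c        # product of the two eigenvalues
    discriminant = trace ** 2 - 4 * determinant

    if determinant < 0:
        # Eigenvalues of opposite sign.
        return "saddle (unstable)"
    if determinant == 0 or trace == 0:
        return "borderline (linear analysis inconclusive)"
    # Real eigenvalues give a node, complex eigenvalues give a spiral.
    kind = "node" if discriminant >= 0 else "spiral"
    stability = "stable" if trace < 0 else "unstable"
    return "{} {}".format(stability, kind)


print(classify_equilibrium(((-1.0, 0.0), (0.0, -2.0))))
print(classify_equilibrium(((0.5, -1.0), (1.0, 0.5))))
print(classify_equilibrium(((1.0, 0.0), (0.0, -1.0))))
```

A stable equilibrium therefore sits in the quadrant of the plot with negative trace and positive determinant.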

The Jacobian matrix was then introduced and the idea that the Jacobian can be approximated by plotting the nullclines on the phase plane plot and making small perturbations around the equilibria.

The participants were then invited to explore the temporal dynamics of the FitzHugh-Nagumo model using the software grind. This was followed by exercises looking at the spatio-temporal dynamics of the same model using partial differential equations. This led on to looking at spirals formed when introducing a temporary barrier, and it was highlighted that these spirals could never have been identified if one did not take the spatial regime into account. Finally the link to the other phenomena outlined at the beginning of the afternoon session, slime mold chemotaxis and the Belousov-Zhabotinsky reaction, was pointed out and the isomorphic nature of these phenomena was highlighted.
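
If you do not have grind to hand, the temporal dynamics can be explored with a few lines of Python. The sketch below integrates a commonly quoted form of the FitzHugh-Nagumo equations with a simple forward Euler scheme. The equations, the parameter values (a=0.7, b=0.8, epsilon=0.08) and the function names are assumptions on my part; the course material may have used a different parameterisation.

```python
def fitzhugh_nagumo(v, w, current=0.0, a=0.7, b=0.8, epsilon=0.08):
    """Return (dv/dt, dw/dt) for the assumed FitzHugh-Nagumo equations."""
    dv = v - v ** 3 / 3 - w + current   # fast, excitable variable
    dw = epsilon * (v + a - b * w)      # slow recovery variable
    return dv, dw


def simulate(v0, w0, dt=0.01, steps=20000, **params):
    """Integrate with forward Euler and return the list of (v, w) states."""
    v, w = v0, w0
    trajectory = [(v, w)]
    for _ in range(steps):
        dv, dw = fitzhugh_nagumo(v, w, **params)
        v, w = v + dt * dv, w + dt * dw
        trajectory.append((v, w))
    return trajectory


trajectory = simulate(0.0, 0.0)
print(trajectory[-1])
```

Without an input current the system should relax towards its resting equilibrium (roughly v = -1.2 for these parameter values). Plotting v against w gives the phase plane view, and plotting v against the step number gives the time plot.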

How to generate beautiful technical documentation

2015-07-11T00:00:00+00:00

In the previous post I gave some motivational tips to inspire you to document your coding project. In this post I will illustrate how you can convert documentation written as plain text files into beautiful HTML documentation using a tool called Sphinx.

Installing Sphinx

Sphinx is a documentation generation tool written in Python and it can be installed using pip. If you do not yet have pip installed on your system please have a look at the pip installation notes.

Let us install Sphinx.

$ sudo pip install -U Sphinx

Generating boilerplate files for the documentation

Suppose that we are at the early stages of our project. All we have is a README file with the content below.

README
======

This project aims to inspire people to write more and better documentation.

However, we know that we want to store more extensive documentation in a subdirectory named docs. Let us create that directory and add some Sphinx boilerplate files to it.

$ mkdir docs
$ cd docs
$ sphinx-quickstart

The last command will prompt you for answers to a number of questions on how you want to set up your documentation and what extensions you want to enable. I tend to accept the defaults for everything except the question on whether or not I want to separate the source and build directories.

> Separate source and build directories (y/n) [n]: y

The input fields for project name, author name(s) and project version require you to provide some information. Below are the answers that I gave to these questions in this instance.

> Project name: Better documentation
> Author name(s): Tjelvar Olsson
> Project version: 0.0.1

Let’s see what was generated.

$ tree
.
├── Makefile
├── build
├── make.bat
└── source
    ├── _static
    ├── _templates
    ├── conf.py
    └── index.rst

4 directories, 4 files

Let us go through the files one by one. The Makefile allows us to build the documentation using make. The make.bat file allows us to build the documentation on Windows based systems. The source/conf.py file contains configurations for building the documentation (we will edit this later). The index.rst file is the root file of the documentation we are about to write.

Let’s build some documentation

Before we do anything else let us see what we get when we build the documentation.

$ make html

This will create output in the directory build/html. Open the build/html/index.html file in your browser of choice. You should see a page titled “Welcome to Better documentation’s documentation!”.

Now have a look at the content of the source/index.rst file.

.. Better documentation documentation master file, created by
   sphinx-quickstart on Mon Jun 29 11:00:21 2015.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Welcome to Better documentation's documentation!
================================================

Contents:

.. toctree::
   :maxdepth: 2



Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

The first section is a comment (the section starting with ..). This is followed by a header (denoted by the = underline). The .. toctree:: section is Sphinx’s way of denoting that a list of other files should be included (at the moment we have none). Finally, in the Indices and tables section there are links to index, module and search pages. If you are documenting a Python package the module page will contain links to the modules in your package.

Adding some more content

Let us add some more content. Create the file source/intro.rst and copy and paste the text below into it.

Introduction
============

The purpose of this project is to help scientists write better documentation.

Now add a link to it in source/index.rst.

Contents:

.. toctree::
   :maxdepth: 2

   intro

Note that the reference of the file to be included does not need the .rst extension. Furthermore it needs to be indented to the same level as :maxdepth: (by default this is three spaces). The latter has caught me out many times as I tend to indent four spaces.

If you rebuild the documentation using make html you will see the content of the source/intro.rst file included in the documentation.

reStructuredText markup

You may have noticed that we can create headers by underlining them with special characters. Sphinx uses reStructuredText as a markup language. For a quick introduction to the reStructuredText syntax have a look at A ReStructuredText Primer followed by Quick reStructuredText. Another good source is Sphinx’s ReStructuredText Primer.

Including code snippets in the documentation

Sphinx has taken advantage of the fact that reStructuredText is extensible and has added directives of its own. We have already seen one of these: the toctree directive.

Let us have a look at the code-block directive, which can be used to include code snippets. Create the file source/code_example.rst and add the text below to it.

Code example
============

Here is a Python function.

.. code-block:: python

    def greet(name):
        print("Hello {}".format(name))

Here is a C function.

.. code-block:: C

    int add(int a, int b) {
        return a + b;
    }

Remember to include the file into the toctree of source/index.rst.

.. toctree::
   :maxdepth: 2

   intro
   code_example

Use make html to build the documentation and behold the beautifully generated code snippets included in your documentation.

It is also possible to include whole files of source code in your documentation. Copy and paste the text below into a file named source/example_script.py.

"""This is an example script."""
import sys

def greet(name):
    """Return greeting."""
    return "Hello {}!".format(name)

if __name__ == "__main__":
    name = sys.argv[1]
    print(greet(name))

Now we will use Sphinx’s literalinclude directive to include the content of this script in the “Code example” page. Add the lines below to the end of the source/code_example.rst file.

Below is the content of a Python sample script.

.. literalinclude:: example_script.py
   :language: python

Sphinx also has several options for styling the display of your code snippets. For example you can add line numbers and emphasize particular lines. For more inspiration on how to include code snippets in your documentation have a look at Showing code examples in the Sphinx documentation.

Generating API documentation for Python projects

Sphinx has got particularly good support for documenting Python projects.

Let us create a module named chemistry for us to document at the root level of the project.

$ cd ../
$ mkdir chemistry
$ ls
README
docs
chemistry

Create the file chemistry/__init__.py and add the code below to it.

"""Basic chemistry module.

The :mod:`chemistry` module contains three classes:

- :class:`chemistry.Atom`
- :class:`chemistry.Bond`
- :class:`chemistry.Molecule`

One can use the :func:`chemistry.Molecule.add_atom` and
:func:`chemistry.Molecule.add_bond` functions to build up a molecule.

Example illustrating how to create a methane molecule.

>>> from chemistry import Molecule
>>> mol = Molecule('Methane')
>>> carbon_index = mol.add_atom(atomic_number=6)
>>> hydrogen1_index = mol.add_atom(atomic_number=1)
>>> hydrogen2_index = mol.add_atom(atomic_number=1)
>>> hydrogen3_index = mol.add_atom(atomic_number=1)
>>> hydrogen4_index = mol.add_atom(atomic_number=1)
>>> bond1_index = mol.add_bond(carbon_index, hydrogen1_index)
>>> bond2_index = mol.add_bond(carbon_index, hydrogen2_index)
>>> bond3_index = mol.add_bond(carbon_index, hydrogen3_index)
>>> bond4_index = mol.add_bond(carbon_index, hydrogen4_index)
"""

class Atom(object):
    """Class representing an atom."""

    def __init__(self, atomic_number):
        self.atomic_number = atomic_number
        self.bonds = []

    def bond_to(self, other_atom):
        """Return the :class:`chemistry.Bond` formed between the two atoms.

        :param other_atom: :class:`chemistry.Atom` to form :class:`chemistry.Bond` to
        :returns: :class:`chemistry.Bond`
        """
        bond = Bond(self, other_atom)
        self.bonds.append(bond)
        other_atom.bonds.append(bond)
        return bond

class Bond(object):
    """Class representing a bond between two atoms."""
    
    def __init__(self, atom1, atom2):
        self.atoms = (atom1, atom2)

class Molecule(object):
    """Class representing a molecule consisting of atoms and bonds."""

    def __init__(self, identifier):
        self.identifier = identifier
        self.atoms = []
        self.bonds = []

    def add_atom(self, atomic_number):
        """Return the list index of the atom added to the molecule.

        :param atomic_number: atomic number of the atom to be added
        :returns: index of the atom in the molecule
        """
        atom = Atom(atomic_number)
        self.atoms.append(atom)
        return len(self.atoms) - 1

    def add_bond(self, atom1_index, atom2_index):
        """Return the list index of the bond added to the molecule.

        :param atom1_index: atom's index in molecule
        :param atom2_index: atom's index in molecule
        :returns: index of the bond in the molecule
        """
        atom1 = self.atoms[atom1_index]
        atom2 = self.atoms[atom2_index]
        bond = atom1.bond_to(atom2)
        self.bonds.append(bond)
        return len(self.bonds) - 1

We will now use Sphinx’s autodoc functionality to generate API documentation for this module. First of all we need to add the sphinx.ext.autodoc extension to the docs/source/conf.py file.

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = ['sphinx.ext.autodoc']

In the same file we also need to specify the path to the module that we want to generate documentation for.

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('../../'))

Now create the file docs/source/api.rst and copy and paste the text below into it.

API documentation
=================

.. automodule:: chemistry
   :members:

We also need to remember to include the api.rst file in the toctree. Edit the docs/source/index.rst file to match the below.

.. toctree::
   :maxdepth: 2

   intro
   code_example
   api

Finally, regenerate the documentation by running make html in the docs directory and behold the beautifully generated API documentation.

If you interact with the generated HTML documentation you will note that the constructs following the pattern below have been converted into hyperlinks.

:mod:`chemistry`
:class:`chemistry.Molecule`
:func:`chemistry.Molecule.add_atom`

These roles can be used anywhere in your documentation to link to the relevant section in the API documentation. Having descriptive documentation that contains links to the more technical API documentation is very pleasant and these roles make it very easy to achieve.

It is also worth commenting on the :param: and :returns: fields used in the docstrings. These are part of a larger set of info fields that are formatted nicely by Sphinx. For more information have a look at the info field list section in the Sphinx documentation.
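To give a feel for some of the other info fields, here is a small sketch. The molecular_mass() function is hypothetical and not part of the chemistry module above; it only exists to show the :type:, :rtype: and :raises: fields alongside :param: and :returns:.

```python
def molecular_mass(atomic_masses):
    """Return the sum of the atomic masses.

    :param atomic_masses: masses of the atoms in the molecule
    :type atomic_masses: list of float
    :returns: total mass of the molecule
    :rtype: float
    :raises ValueError: if the list of masses is empty
    """
    if not atomic_masses:
        raise ValueError("molecule has no atoms")
    return sum(atomic_masses)


# Rough masses for one carbon and four hydrogens (illustrative values only).
methane_mass = molecular_mass([12.0, 1.0, 1.0, 1.0, 1.0])
print(methane_mass)
```

When autodoc picks up a docstring like this, each field is rendered as its own labelled entry in the generated API documentation.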

What about the original README file?

Let us finish off by including the content of the original README file into the generated HTML documentation.

Create the file docs/source/README.rst and copy and paste the text below into it.

.. include:: ../../README

This will include the content of the top level README file into the documentation.

Styling the documentation

The default theme of Sphinx is currently Alabaster. It is very beautiful. However, personally I prefer the Sphinx ReadTheDocs theme, in particular because of its left hand side navigation bar. Let’s check it out.

First of all we install the theme using pip.

$ pip install sphinx_rtd_theme

Now we need to edit the theme section in docs/source/conf.py to look like the below.

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
#html_theme = 'alabaster'

# on_rtd is whether we are on readthedocs.org
on_rtd = os.environ.get('READTHEDOCS', None) == 'True'

if not on_rtd:  # only import and set the theme if we're building docs locally
    import sphinx_rtd_theme
    html_theme = 'sphinx_rtd_theme'
    html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]

# otherwise, readthedocs.org uses their theme by default, so no need to specify it

Note that the code above includes some logic for handling the cases where one hosts the documentation on readthedocs.

Regenerate the documentation by running make html in the docs directory and explore the look and feel of this new theme. Note in particular the behaviour of the left hand side navigation bar and the clear “next” and “previous” buttons at the bottom of each page.

For more information on the ReadTheDocs theme have a look here.

ReadTheDocs

Whilst on the subject it is worth mentioning the ability to host your documentation on readthedocs. Simply sign-up for an account, link your GitHub/BitBucket account and then you can select the projects that you want to host on readthedocs. It is great!

It is worth noting that if your project documentation includes links to packages such as numpy and scipy you will need to mock these out in the conf.py file. For more information have a look at this readthedocs faq. For a real life example have a look at this conf.py.
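As a rough sketch of what such mocking can look like, the snippet below registers stand-in objects for the modules before autodoc imports your package. The list of module names is an assumption for illustration; adapt it to whatever your project imports. Note that submodules (for example scipy.stats) would most likely need their own entries in the list.

```python
import sys
from unittest.mock import MagicMock

# Modules assumed to be unavailable on the machine building the docs.
MOCK_MODULES = ['numpy', 'scipy']

# Register a stand-in for each module so that "import numpy" succeeds
# when autodoc imports the package being documented.
for mod_name in MOCK_MODULES:
    sys.modules[mod_name] = MagicMock()
```

Sphinx's autodoc extension also has an autodoc_mock_imports setting that serves the same purpose, so it is worth checking whether that is enough for your project before adding code like the above to conf.py.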

Five tips to help you document your coding project

2015-06-28T00:00:00+00:00

Do you want other people to make use of the project that you are working on?

If you answered yes to the question above you need to write some form of documentation outlining how to make use of it.

Do you enjoy writing documentation?

If you answered no to the question above please read on to find some tips to make the experience more enjoyable.

Tip 1: Start early and start small

A common scenario is to treat documentation as an afterthought. However, if a project is nearing completion and one has no documentation, the thought of writing it can be daunting. As a result one never starts working on the BIG documentation task, but rather spends time on smaller and more satisfying tasks such as adding nice-to-have features.

The solution is to start early and to start small. Before you write any code create a README file and include a sentence stating what problem the project solves.

Once you have some code add some basic instructions on how to run it to the README file.

Tip 2: Include documentation in your definition of done

Suppose that you have implemented a new feature. You are proud of it. You have even written tests for it! Don’t stop there! Complete the task by writing some descriptive documentation outlining how to make use of the feature. Furthermore, if you have release notes add a bullet point with a link to the section that you have just written.

Tip 3: Reap the benefits of explaining your code to someone else

Writing documentation is an act of trying to explain something to someone else. What often happens when one tries to explain a solution to someone else is that one finds the solution lacking or suboptimal. I often find that the act of documenting a feature results in me realising that the feature is not actually fit for purpose in its current state, giving me the opportunity to fix it before it is released.

Discovering improvements by writing documentation is similar to rubber duck debugging, where one tries to discover the source of a bug by explaining code line by line to a rubber duck.

Tip 4: Store your documentation alongside your code in version control as plain text files

Documentation should be stored alongside your code in version control as plain text files.

Storing your code and documentation in the same repository allows them to be kept in line with each other.

The benefits of plain text files are outlined in The Pragmatic Programmer. In fact the book has an entire chapter devoted to it.

What is so special about plain text files?

In short: they are portable, easy to use and there is no lock-in. For a more extensive answer have a look at CM Smith’s Lifehack post Why Geeks Love Plain Text (And Why You Should Too).

Tip 5: Make use of tools that can convert your plain text files to beautifully formatted documents

Although text files have many advantages they are not ideal for consuming (reading) documentation. When reading documentation you want it to be pleasant on the eye and easy to navigate.

I highly recommend using Sphinx. It is a great tool for writing technical documentation. It can produce a range of output formats including HTML and PDF. It has great support for cross-referencing and the HTML output has built-in support for searching. Furthermore, if you use Sphinx you can host your documentation on Read the Docs. I will explain how to use Sphinx in my next post.

Conclusion

Documentation is a sign that someone cares about a project. This makes it easier for other people to care about it too.

I hope that you found this post useful. If nothing else I hope that it has given you the motivation to add a README file to your current project with a line explaining what problem the project solves.

Test-driven development for scientists

2015-06-13T00:00:00+00:00

The test-driven development cycle.

Introduction

In Three essential tips for improving your scientific code I talked about the importance of writing tests for your scientific code base. Tests provide a means to verify that new code does what it is intended to do and a means to alert you if you inadvertently break an existing piece of functionality when modifying the code base.

Furthermore, if you have a well tested code base you feel less scared of making changes to it. Whilst coding have you ever thought to yourself:

I could really do with re-writing this to make it simpler, but I'm not sure what else I would break....

If your code base had better test coverage you would not feel this way. Having tests gives you the ability to make sweeping changes to the code whilst retaining confidence that you have not broken any vital piece of functionality.

Tests also provide a type of living documentation of your code base, a specification of how the code is intended to work.

In fact tests are so important that some people write them before they write any code in a method known as test-driven development.

In this post we will make use of the skills we built up in Four tools for testing your Python code to explore test-driven development. We will use test-driven development to create a Python FASTA parser package.

What is test-driven development?

Test-driven development, often abbreviated as TDD, can be thought of as a three step process.

Write a test for the functionality that you have in mind and watch it fail
Write minimal code to make the test pass
Refactor the code if required

Don’t worry if the above sounds a bit abstract. The purpose of the rest of this post is to illustrate how this works in practice.

What are the benefits of test-driven development?

The three main reasons I love test-driven development are:

It makes me think about how I want my code to behave up front
It makes me write tests
It is fun

Of course I could write tests after having implemented a piece of code. However, in practice when I code first and test later, the “test later” rarely happens.

This may sound silly, but it is not much fun writing a test for something that already works. It feels like a menial task. On the other hand, writing a test before an implementation exists stimulates my brain: I have to think about how I want my code to behave.

Furthermore, a failing test is like a challenge. In writing a failing test I am giving myself a tiny puzzle to solve. The test-driven development cycle essentially gamifies my working day, with the positive side-effect of producing an extensive test suite.

For a more exhaustive list of benefits of test-driven development have a look at Mark Levison’s post: Advantages of TDD.

If you are interested in this topic I also recommend reading Kane Mar’s three part post: The benefits of TDD are neither clear nor are they immediately apparent.

Spiking

It is not wrong to develop code without tests. However, if you are doing test-driven development you should treat such exploratory code as “throw away” and use it as a guide to write tests when doing things properly. In this context “properly” means writing the tests first. People who practise test-driven development refer to such exploratory coding as a spike. Here we will treat the exploration from the previous FASTA post as a spike.

Creating a project template

We will start by creating a project template using cookiecutter.

$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
...
repo_name (default is "mypackage")? tinyfasta
version (default is "0.0.1")? 
authors (default is "Tjelvar Olsson")? 
...
$ cd tinyfasta

And setting up a clean Python development environment.

$ virtualenv ~/virtualenvs/tinyfasta
$ source ~/virtualenvs/tinyfasta/bin/activate
(tinyfasta)$ python setup.py develop

Note that you can view this project and its progression on GitHub.

Start with a functional test

When practising test-driven development it is often useful to start with a functional test. A functional test differs from a unit test in that it tests a slice of functionality in the system as opposed to an individual unit. The rationale for starting with a functional test is that it allows us to take a step back and think about the larger picture.

We can translate the learning from our spike into a functional test. The code below parses FASTA records from the dummy.fasta file and writes the records to another file tmp.fasta. The test then ensures that the contents of the two files are identical.

ebcc524 tests/tests.py

    def test_output_is_consistent_with_input(self):
        from tinyfasta import FastaParser
        input_fasta = os.path.join(DATA_DIR, "dummy.fasta")
        output_fasta = os.path.join(TMP_DIR, "tmp.fasta")
        with open(output_fasta, "w") as fh:
            for fasta_record in FastaParser(input_fasta):
                fh.write("{}\n".format(fasta_record))
        input_data = open(input_fasta, "r").read()
        output_data = open(output_fasta, "r").read()
        self.assertEqual(input_data, output_data)

Here is a link to the input FASTA file tests/data/dummy.fasta.

Start building up functionality using unit tests

Another reason for starting with a functional test is that it can act as a guide for what to implement. When we run the functional test we immediately find out that we need a class named FastaParser.

Traceback (most recent call last):
  File "/Users/olssont/junk/tinyfasta/tests/tests.py", line 31, in test_output_is_consistent_with_input
    from tinyfasta import FastaParser
ImportError: cannot import name FastaParser

At this point we add a unit test for initialising a FastaParser instance.

a6d2253 tests/tests.py

    def test_FastaParser_initialisation(self):
        from tinyfasta import FastaParser
        fasta_parser = FastaParser('test.fasta')
        self.assertEqual(fasta_parser.fpath, 'test.fasta')

After having run the test and watched it fail we add minimal code to make the unit test pass.

a6d2253 tinyfasta/__init__.py

class FastaParser(object):
    """Class for parsing FASTA files."""

    def __init__(self, fpath):
        """Initialise an instance of the FastaParser."""
        self.fpath = fpath

The implementation makes the unit test pass. So we continue by running the functional test again.

Traceback (most recent call last):
  File "/Users/olssont/junk/tinyfasta/tests/tests.py", line 40, in test_output_is_consistent_with_input
    for fasta_record in FastaParser(input_fasta):
TypeError: 'FastaParser' object is not iterable

Okay, so we need a test to make sure that the class is iterable.

abfdeee tests/tests.py

    def test_FastaParser_is_iterable(self):
        from tinyfasta import FastaParser
        fasta_parser = FastaParser('test.fasta')
        self.assertTrue(hasattr(fasta_parser, '__iter__'))

At this point it may be worth reflecting on how we should make this test pass. In test-driven development we want to add minimal implementation to get the tests to pass. The code below is pretty minimal and it makes the test pass.

abfdeee tinyfasta/__init__.py

    def __iter__(self):
        """Yield FastaRecord instances."""
        yield None

As the docstring above suggests we want the FastaParser to yield FastaRecord instances. So at this point we can start building up the FastaRecord class using small incremental steps of test and code. To get a feel for this have a look at the commits:

At this point we have all the functionality we need to add a proper implementation of the FastaParser.__iter__() method, which we hope will make the functional test pass.

75e3272 tinyfasta/__init__.py

    def __iter__(self):
        """Yield FastaRecord instances."""
        fasta_record = None
        with open(self.fpath, 'r') as fh:
            for line in fh:
                if line.startswith('>'):
                    if fasta_record:
                        yield fasta_record
                    fasta_record = FastaRecord(line)
                else:
                    fasta_record.add_sequence_line(line)
        yield fasta_record
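
The __iter__() method above relies on a FastaRecord class that accepts a description line, collects sequence lines and can be turned back into text. The sketch below is my guess at a minimal class that satisfies those three requirements. It is an assumption for illustration, not the implementation from the tinyfasta commits.

```python
class FastaRecord(object):
    """Hypothetical minimal FASTA record (not the tinyfasta code)."""

    def __init__(self, description):
        # Strip the trailing newline from the description line.
        self.description = description.strip()
        self.sequence_lines = []

    def add_sequence_line(self, sequence_line):
        """Store one line of sequence without its trailing newline."""
        self.sequence_lines.append(sequence_line.strip())

    def __str__(self):
        """Return the record as text, one line per original line."""
        return "\n".join([self.description] + self.sequence_lines)


record = FastaRecord(">seq1|example\n")
record.add_sequence_line("ATGC\n")
record.add_sequence_line("GGCC\n")
print(record)
```

Because __str__() returns the lines without a trailing newline, the "{}\n".format(fasta_record) call in the functional test adds it back when writing the record to file.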

Let us make sure that all the tests pass.

$ nosetests
........
Name        Stmts   Miss  Cover   Missing
-----------------------------------------
tinyfasta      26      0   100%
----------------------------------------------------------------------
Ran 8 tests in 0.027s

OK

Great, we have a basic working implementation of our tinyfasta package.

And iterate

Now that we have the basics implemented we want to add more functionality and by now you know what that means: another test. As we want to add new functionality we start all over again with another functional test.

In the commit history of the tinyfasta project one can see how functionality for searching the FASTA description line was added.

Followed by functionality for searching the biological sequence.

Refactoring

Up until this point we have followed the work flow below

Write a test
Write minimal code to make the test pass

However, this is not the whole story as it leaves out an important aspect of test-driven development: refactoring.

Let us start with a simple example of factoring out code duplication. After having added functionality for using either strings or compiled regular expressions to search the description and sequence we notice that there is a lot of code duplication.

e748ac3 tinyfasta/__init__.py

    def description_matches(self, search_term):
        """Return True if the search_term is in the description."""
        if hasattr(search_term, "search"):
            return search_term.search(self.description) is not None
        return self.description.find(search_term) != -1

    def sequence_matches(self, search_motif):
        """Return True if the motif is in the sequence.

        :param search_motif: string or compiled regex
        :returns: bool
        """
        if hasattr(search_motif, "search"):
            return search_motif.search(self.sequence) is not None
        return self.sequence.find(search_motif) != -1

As we have been using test-driven development we have tests for all the functionality of interest. We can therefore refactor the code to the below.

2b988b9 tinyfasta/__init__.py

    @staticmethod
    def _match(string, search_term):
        """Return True if the search_term is in the string.
        :param string: string to be searched
        :param search_term: string or compiled regex
        :returns: bool
        """
        if hasattr(search_term, "search"):
            return search_term.search(string) is not None
        return string.find(search_term) != -1

    def description_matches(self, search_term):
        """Return True if the search_term is in the description.
        :param search_term: string or compiled regex
        :returns: bool
        """
        return FastaRecord._match(self.description, search_term)

    def sequence_matches(self, search_motif):
        """Return True if the motif is in the sequence.
        :param search_motif: string or compiled regex
        :returns: bool
        """
        return FastaRecord._match(self.sequence, search_motif)

And run the tests.

$ nosetests
......................
Name        Stmts   Miss  Cover   Missing
-----------------------------------------
tinyfasta      43      0   100%   
----------------------------------------------------------------------
Ran 22 tests in 0.032s

OK

As all the tests pass we can have some level of confidence that everything is still working as intended.
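
To see why one helper can serve both kinds of search term, here is a standalone sketch of the same check written as a plain function. The name match() and the example sequence are made up for illustration. The point is that a compiled regular expression has a search() method whereas a plain string does not.

```python
import re


def match(string, search_term):
    """Return True if the search_term is found in the string."""
    if hasattr(search_term, "search"):
        # Compiled regular expressions expose a search() method.
        return search_term.search(string) is not None
    # Anything else is treated as a plain substring.
    return string.find(search_term) != -1


sequence = "ATGCCC"
found_substring = match(sequence, "GCC")
found_regex = match(sequence, re.compile(r"C{3}$"))
found_missing = match(sequence, "TTT")
print(found_substring, found_regex, found_missing)
```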

Improving the design of the code

At some point whilst documenting how to use the tinyfasta package I realised that the function names description_matches and sequence_matches were a little bit misleading and that the names description_contains and sequence_contains would be more appropriate. This was a relatively simple change to make, see commit 0496373.

However, some time later I realised that it would be much nicer if the API of the tinyfasta package would allow code that looked like the below. Note that the description is no longer a function, but an instance of some sort which has a contains function.

>>> from tinyfasta import FastaParser
>>> for fasta_record in FastaParser("tests/data/dummy.fasta"):
...     if fasta_record.description.contains('seq1'):
...         print(fasta_record)
...
>seq1|contains 2x78 A's
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Although the change to the feel of the API is minor (an underscore swapped for a full stop), the change to the underlying behaviour of the tinyfasta package is major.

However, because of all the tests the change was not so hard to implement. First I went into the tests and changed all the calls to the description_contains and sequence_contains to description.contains and sequence.contains. Then I simply “listened to my tests” as they guided me through all the changes that needed to be made for the package to become functional again. Have a look at commit 7fb248f to see the resulting changes to the code base.
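
To make the idea of “an instance of some sort which has a contains function” more concrete, here is one way such an object could be sketched. The class name SearchableString is hypothetical and the design is an assumption on my part; see the commit mentioned above for what tinyfasta actually does.

```python
import re


class SearchableString(object):
    """Hypothetical wrapper around a string with a contains() method."""

    def __init__(self, string):
        self.string = string

    def __str__(self):
        return self.string

    def contains(self, search_term):
        """Return True if the string or compiled regex is found."""
        if hasattr(search_term, "search"):
            return search_term.search(self.string) is not None
        return self.string.find(search_term) != -1


description = SearchableString(">seq1|contains 2x78 A's")
print(description.contains("seq1"))
print(description.contains(re.compile(r"seq\d")))
```

A record holding its description and sequence as objects of this kind would support calls such as fasta_record.description.contains('seq1').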

Conclusion

I hope this post inspires you to try out test-driven development. However, don’t be surprised if you find that it is harder than it looks. Like everything it requires practice. If you feel really stuck, try using a spike to get you going and then use the resulting code to inspire a functional test.

I can also highly recommend Harry Percival’s book Test-Driven Development with Python. It is what inspired me to start using test-driven development.

Happy coding!

Four tools for testing your Python code

2015-05-30T00:00:00+00:00

Introduction

It is important to test your code. Tests provide a means to verify that code does what it is intended to do. However, repeated manual testing is tedious and error prone.

In this post I will highlight four tools for helping you automate the testing of your code base.

Background

In a previous post we discussed how to set up clean Python development environments using virtualenv and cookiecutter.

$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
...
repo_name (default is "mypackage")? awesome
...
$ cd awesome
$ virtualenv ~/virtualenvs/awesome
$ source ~/virtualenvs/awesome/bin/activate
(awesome)$ python setup.py develop

In this post we will make use of some of the files generated using this setup.

1. Unittest - a Python module for creating tests

Python comes with batteries included and built into the standard library is a module named unittest, which can be used to write tests.

As a side note: tests can be classified into many different types: unit tests, integration tests, functional tests, acceptance tests. Mark Simpson has written a nice overview of the different types of tests on Stack Overflow. As the post implies the subject of classifying tests is rather subjective and you get different answers depending on where you look. Personally, I simply use two broad categories: unit tests and functional tests, where the latter incorporates both acceptance and integration tests.

No matter how you classify your tests you can use Python’s unittest module to write them.

Below is a bare bones skeleton for writing a test using the unittest module. To write a test we create a subclass of the unittest.TestCase base class. Any methods in our test class whose names start with test_ will be run when we call the unittest.main() function. Copy and paste the code below into a file named basic_unittest.py.

import unittest

class MyTest(unittest.TestCase):
    def test_something(self):
        pass

if __name__ == "__main__":
    unittest.main()

Let’s see what happens when we run this code.

(awesome)$ python basic_unittest.py
.
----------------------------------------------------------------------
Ran 1 test in 0.000s

OK

Okay, now let us have a look at the tests/tests.py file generated earlier on by our cookiecutter template.

import unittest
import os
import os.path
import shutil

HERE = os.path.dirname(__file__)
DATA_DIR = os.path.join(HERE, 'data')
TMP_DIR = os.path.join(HERE, 'tmp')


class UnitTests(unittest.TestCase):

    def test_can_import_package(self):
        # Raises import error if the package cannot be imported.
        import awesome

    def test_package_has_version_string(self):
        import awesome
        self.assertTrue(isinstance(awesome.__version__, str))

class FunctionalTests(unittest.TestCase):

    def setUp(self):
        if not os.path.isdir(TMP_DIR):
            os.mkdir(TMP_DIR)

    def tearDown(self):
        shutil.rmtree(TMP_DIR)


if __name__ == "__main__":
    unittest.main()

There are several things to note here.

Let us start by looking at the test_package_has_version_string() function. It makes use of unittest.TestCase.assertTrue() to check that the version number of the awesome package we are developing is a string. There are many other useful “assert” functions built into the unittest.TestCase base class, one of the most used ones being unittest.TestCase.assertEqual().
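
As a small illustration of a few of these assert methods, the sketch below uses assertEqual(), assertIn() and assertRaises(). The test class and its contents are made up for this example. The tests are run through a TextTestRunner rather than unittest.main() simply so that the result object can be inspected afterwards.

```python
import unittest


class StringTests(unittest.TestCase):

    def test_upper(self):
        # assertEqual checks that the two values compare equal.
        self.assertEqual("acgt".upper(), "ACGT")

    def test_membership(self):
        # assertIn checks that the first value is contained in the second.
        self.assertIn("G", "ACGT")

    def test_invalid_integer(self):
        # assertRaises checks that the block raises the given exception.
        with self.assertRaises(ValueError):
            int("not a number")


suite = unittest.defaultTestLoader.loadTestsFromTestCase(StringTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```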

At the top of the file we import several additional modules: os, os.path, shutil. The os.path module is used to create some variables for defining input and output directories for our functional tests.

The unittest.TestCase.setUp() and unittest.TestCase.tearDown() functions provide a way to ensure test isolation. They are run before and after each individual test function in a test class. The os module is used to create the tests/tmp directory during the set up of a functional test and similarly the shutil module is used to remove the tests/tmp directory when a functional test is finished.
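
The sketch below shows the same idea in a self-contained form. It is a variation on the template above, not a copy of it: it uses tempfile.mkdtemp() to create a fresh scratch directory for every test instead of a fixed tests/tmp directory, and the test names are invented for the example.

```python
import os
import shutil
import tempfile
import unittest


class TmpDirTests(unittest.TestCase):

    def setUp(self):
        # Runs before each test: create a fresh scratch directory.
        self.tmp_dir = tempfile.mkdtemp()

    def tearDown(self):
        # Runs after each test: remove the scratch directory again.
        shutil.rmtree(self.tmp_dir)

    def test_can_write_file(self):
        fpath = os.path.join(self.tmp_dir, "out.txt")
        with open(fpath, "w") as fh:
            fh.write("hello")
        self.assertTrue(os.path.isfile(fpath))

    def test_directory_starts_empty(self):
        # Passes regardless of test order because of setUp and tearDown.
        self.assertEqual(os.listdir(self.tmp_dir), [])


suite = unittest.defaultTestLoader.loadTestsFromTestCase(TmpDirTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

The second test would fail if files written by the first test were left behind, which is exactly the kind of coupling between tests that setUp() and tearDown() help to avoid.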

Hopefully this quick overview has provided enough detail for you to get started writing your own tests. For more information have a look at the unittest documentation.

2. Nose - a test runner for your tests

As you build up more and more tests you want to have a way of running them all automatically. One way to do this is to use nose.

Let us install it using pip.

(awesome)$ pip install nose

Now we can run the test suite using the nosetests command.

(awesome)$ nosetests
nose.plugins.cover: ERROR: Coverage not available: unable to import coverage module
..
----------------------------------------------------------------------
Ran 2 tests in 0.004s

OK

There are two things to note in the above. First of all, the nosetests command automatically found and ran our tests. Yay!

Secondly, it complained about not being able to import the coverage module. There are two reasons for this:

We have not installed the coverage module yet
The awesome/setup.cfg file specifies that it should be used

[nosetests]
detailed-errors=1
with-coverage=1
cover-package=awesome
cover-erase=1
verbosity=1

What is coverage all about anyway?

3. Coverage - measuring your code coverage

The coverage module measures code coverage. Code coverage is a measure of how many lines of code are being exercised by your tests. It is particularly useful for identifying areas of the code-base that need more tests.

Let us install it.

(awesome)$ pip install coverage

Now let us run the tests again.

(awesome)$ nosetests
..
Name      Stmts   Miss  Cover   Missing
---------------------------------------
awesome       1      0   100%
----------------------------------------------------------------------
Ran 2 tests in 0.009s

OK

Awesome, we have 100% test coverage!

Let us add some more functionality to see what happens when we have code that is not tested. Add the fpaths_in_dir() function to the awesome/__init__.py file.

"""awesome package."""
import os

__version__ = "0.0.1"

def fpaths_in_dir(directory):
    """Return the paths to the files in the directory."""
    fpaths = []
    for fname in os.listdir(directory):
        fpaths.append(os.path.join(directory, fname))
    return fpaths

If we run the tests again we find out that lines 8-11 have not been covered by the tests.

(awesome)$ nosetests
..
Name      Stmts   Miss  Cover   Missing
---------------------------------------
awesome       7      4    43%   8-11
----------------------------------------------------------------------
Ran 2 tests in 0.010s

OK

Let’s add a test for them! But wait… Errr…

How do we add a reliable test for something that wants to read information from the file system?

4. Mock - faking objects for unit tests

We can make use of mock objects to solve these types of problems. Mock objects mimic the behaviour of real objects in controllable ways. For more background have a look at the Mock object wikipedia page.

As of Python 3.3 mock is part of the standard library, where it is available as unittest.mock. Users of older versions of Python can install it using pip, which provides it as a top-level module named mock.

(awesome)$ pip install mock

Now we can write a test for our function. Add the test function below to the UnitTests class in the awesome/tests/tests.py file.

    def test_fpaths_in_dir(self):
        from mock import MagicMock
        from awesome import fpaths_in_dir
        os.listdir = MagicMock(return_value=['test1.txt', 'test2.txt'])
        fpaths = fpaths_in_dir('some/dir')
        expected = ['some/dir/test1.txt', 'some/dir/test2.txt']
        self.assertEqual(fpaths, expected)

Let us run the tests again.

(awesome)$ nosetests
...
Name      Stmts   Miss  Cover   Missing
---------------------------------------
awesome       7      0   100%
----------------------------------------------------------------------
Ran 3 tests in 0.043s

OK

Great, all the tests are passing! Now we can relax again.

The mock module can do much more than what I have shown above. Have a look at the mock documentation for some more inspiration.
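
One thing worth knowing is that assigning a MagicMock directly to os.listdir, as in the test above, leaves the fake in place for any code that runs afterwards. The sketch below shows an alternative using mock.patch() as a context manager, which restores the real function when the with block ends. It uses the standard library's unittest.mock and redefines fpaths_in_dir() locally so that the example is self-contained.

```python
import os
from unittest import mock


def fpaths_in_dir(directory):
    """Return the paths to the files in the directory."""
    return [os.path.join(directory, fname) for fname in os.listdir(directory)]


# os.listdir is only replaced inside the with block.
with mock.patch("os.listdir", return_value=["test1.txt", "test2.txt"]) as fake_listdir:
    fpaths = fpaths_in_dir("some/dir")

# The mock records how it was called, which can be checked afterwards.
fake_listdir.assert_called_once_with("some/dir")
print(fpaths)
```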

Conclusion

Python comes with lots of useful tools for helping you test your code base. In this post I have described some of the most established ones. However there are others around. Experiment and find out what works for you.

In the next post I will continue the theme of testing by illustrating some aspects of test-driven development.

Five exercises to master the Python debugger

2015-05-15T00:00:00+00:00

How do you do your debugging?

Introduction

When programming (in Python) it is common to find oneself inserting print statements all over the code when trying to find out why things are not working as expected. This can often be a quick way of working out what is going on.

However, it can become tedious whenever the problem is not resolved by the first print statement. I have often found myself spending significant amounts of time scattering print statements all over my code to work out what is going on. Usually this is followed by me spending time hunting through my code for the print statements so that I can delete them. After which I often realise that I still needed them.

There is a more powerful way of finding out what a program is doing: using a debugger. However, people often shy away from debuggers because of their arcane interfaces. This post contains five exercises to help you master the Python debugger.

By the end of this post I hope that you will be substituting your print statements with import pdb; pdb.set_trace().

Exercise 1: stepping through a program

Let us start by stepping through a simple program. Copy and paste the code snippet below into a file named pdb_exercise_1.py.

name = 'alice'
greeting = 'hello ' + name
print(greeting)

Now invoke the script using the python debugger via the command below.

python -m pdb pdb_exercise_1.py

In the above pdb is the three letter acronym for the Python Debugger. You should be greeted by the prompt below.

> pdb_exercise_1.py(1)<module>()
-> name = 'alice'
(Pdb) 

The debugger shows the next line to be executed (-> name = 'alice') as well as the prompt for interacting with the debugger ((Pdb)).

Type in n, short for next, to execute the line displayed. You should now see the output below.

(Pdb) n
> pdb_exercise_1.py(2)<module>()
-> greeting = 'hello ' + name

Let us check the value of the newly assigned name variable. Type in p name (p as in “print”). It should tell you that the name is “alice”. Type in n again to execute the next command. The greeting variable should now have been assigned the string “hello alice”.

(Pdb) p name
'alice'
(Pdb) n
> pdb_exercise_1.py(3)<module>()
-> print(greeting)
(Pdb) p greeting
'hello alice'

When debugging it is quite easy to lose the frame of reference as to where one is in the code. To put things into context, type in l (as in “list”) to list the source code for the current file. You should see the output below.

(Pdb) l
  1     name = 'alice'
  2     greeting = 'hello ' + name
  3  -> print(greeting)
[EOF]

Okay, so we are almost at the end. Type in n again to execute the last command.

(Pdb) n
hello alice

Finally, type in q to quit the debugger.

Well done! You have just used the Python debugger to step through a program.

Exercise 2: stepping into functions

Let us create a script with a function. Copy and paste the code snippet below into a file named pdb_exercise_2.py.

def greet(name):
    greeting = 'hello ' + name
    return greeting

greeting = greet('alice')
print(greeting)

Start the debugger.

python -m pdb pdb_exercise_2.py

This time, rather than stepping through the program, press c (which stands for “continue execution”). You should see the output below.

(Pdb) c
hello alice
The program finished and will be restarted
> pdb_exercise_2.py(1)<module>()
-> def greet(name):
(Pdb) 

Basically the program ran from beginning to end, printing out the greeting, and then it restarted itself leaving us at the (Pdb) prompt.

This time use n to walk through the script. Note that you only need to enter n three times to get to the end of the program and that the debugger does not step into the greet() function. You should see the output below.

> pdb_exercise_2.py(1)<module>()
-> def greet(name):
(Pdb) n
> pdb_exercise_2.py(5)<module>()
-> greeting = greet('alice')
(Pdb) n
> pdb_exercise_2.py(6)<module>()
-> print(greeting)
(Pdb) n
hello alice

In other words n continues execution until the next line in the current function is reached or it returns.

Press c to restart the program and press n once to get to the line where the greet function is about to be called.

> pdb_exercise_2.py(1)<module>()
-> def greet(name):
(Pdb) n
> pdb_exercise_2.py(5)<module>()
-> greeting = greet('alice')

This time we will use s to step into the greet() function, then we will continue walking through the program using n. Note the difference now that you have stepped into the greet() function.

(Pdb) s
--Call--
> pdb_exercise_2.py(1)greet()
-> def greet(name):
(Pdb) n
> pdb_exercise_2.py(2)greet()
-> greeting = 'hello ' + name
(Pdb) n
> pdb_exercise_2.py(3)greet()
-> return greeting
(Pdb) n
--Return--
> pdb_exercise_2.py(3)greet()->'hello alice'
-> return greeting
(Pdb) n
> pdb_exercise_2.py(6)<module>()
-> print(greeting)
(Pdb) n
hello alice
--Return--
> pdb_exercise_2.py(6)<module>()->None
-> print(greeting)
(Pdb) 

Finally let us have a look at the r command, which stands for return. This is similar to the c command, but rather than continuing to the end of the program r runs to the end of the function.

Let us try it out. Start off by entering c to restart the program, then enter n and s. You should now be in the greet() function.

(Pdb) c
The program finished and will be restarted
> pdb_exercise_2.py(1)<module>()
-> def greet(name):
(Pdb) n
> pdb_exercise_2.py(5)<module>()
-> greeting = greet('alice')
(Pdb) s
--Call--
> pdb_exercise_2.py(1)greet()
-> def greet(name):
(Pdb)

As a sanity check, use l to list where you are in the code. You should see the below.

(Pdb) l
  1  -> def greet(name):
  2         greeting = 'hello ' + name
  3         return greeting
  4     
  5     greeting = greet('alice')
  6     print(greeting)
[EOF]
(Pdb) 

Now press r as in “return”.

(Pdb) r
--Return--
> pdb_exercise_2.py(3)greet()->'hello alice'
-> return greeting
(Pdb) 

Note that we are immediately placed at the end of the function where it is about to deliver its return value.
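The same session can also be scripted, which is handy if you want to replay it without typing. The sketch below is only an illustration and assumes Python 3: it feeds the commands r and c to the debugger from a string instead of the keyboard, using the stdin and stdout arguments of pdb.Pdb together with its runcall() method. The variable names commands and transcript are my own.

```python
import io
import pdb


def greet(name):
    greeting = 'hello ' + name
    return greeting


# Commands that would otherwise be typed at the (Pdb) prompt:
# r runs to the end of greet(), c then lets the program continue.
commands = io.StringIO('r\nc\n')
transcript = io.StringIO()

# readrc=False ignores any personal .pdbrc file; nosigint=True leaves
# signal handling untouched.
debugger = pdb.Pdb(stdin=commands, stdout=transcript,
                   nosigint=True, readrc=False)
result = debugger.runcall(greet, 'alice')

print(transcript.getvalue())
print(result)
```

The transcript should contain the same --Return-- marker that the interactive session displayed after pressing r.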

Exercise 3: getting help

When using a tool infrequently it is easy to forget what the commands are named and what they do. However, using the help command it is easy to refresh your memory.

(Pdb) help

Documented commands (type help <topic>):
========================================
EOF    bt         cont      enable  jump  pp       run      unt   
a      c          continue  exit    l     q        s        until 
alias  cl         d         h       list  quit     step     up    
args   clear      debug     help    n     r        tbreak   w     
b      commands   disable   ignore  next  restart  u        whatis
break  condition  down      j       p     return   unalias  where 

Miscellaneous help topics:
==========================
exec  pdb

Undocumented commands:
======================
retval  rv

Let us have a look at the help descriptions of the commands that we have been using so far.

(Pdb) help n
n(ext)
Continue execution until the next line in the current function
is reached or it returns.
(Pdb) help s
s(tep)
Execute the current line, stop at the first possible occasion
(either in a function that is called or in the current function).
(Pdb) help c
c(ont(inue))
Continue execution, only stop when a breakpoint is encountered.
(Pdb) help r
r(eturn)
Continue execution until the current function returns.
(Pdb) help l
l(ist) [first [,last]]
List source code for the current file.
Without arguments, list 11 lines around the current line
or continue the previous listing.
With one argument, list 11 lines starting at that line.
With two arguments, list the given range;
if the second argument is less than the first, it is a count.
(Pdb) help help
h(elp)
Without argument, print the list of available commands.
With a command name as argument, print help about that command
"help pdb" pipes the full documentation file to the $PAGER
"help exec" gives help on the ! command
(Pdb) 

Exercise 4: interacting with the program under inspection

Up until this point we have not actually had any errors in our scripts to correct. Let us change that. Copy and paste the code below into a file named pdb_exercise_4.py.

import sys

def magic(x, y):
    return x + y * 2

x = sys.argv[1]
y = sys.argv[1]

answer = magic(x, y)
print('The answer is: {}'.format(answer))

Suppose that we run this script with the inputs 1 and 50 expecting the result 101.

python pdb_exercise_4.py 1 50
The answer is: 111

What is going on?

Now, rather than inserting print statements all over the code to work it out, let us examine the code in the debugger.

python -m pdb pdb_exercise_4.py 1 50

Let us get to the point where we have access to the variables x and y.

> pdb_exercise_4.py(1)<module>()
-> import sys
(Pdb) n
> pdb_exercise_4.py(3)<module>()
-> def magic(x, y):
(Pdb) n
> pdb_exercise_4.py(6)<module>()
-> x = sys.argv[1]
(Pdb) n
> pdb_exercise_4.py(7)<module>()
-> y = sys.argv[1]
(Pdb) n
> pdb_exercise_4.py(9)<module>()
-> answer = magic(x, y)
(Pdb) 

First of all let us see what attributes are available in the scope of the program. We can do this using p for print.

(Pdb) p dir()
['__builtins__', '__file__', '__name__', '__package__', 'magic', 'sys', 'x', 'y']

There is also pp for pretty print.

(Pdb) pp dir()
['__builtins__',
 '__file__',
 '__name__',
 '__package__',
 'magic',
 'sys',
 'x',
 'y']

So what is x?

(Pdb) p x
'1'

Hey, that looks suspiciously like a string. Note that we can use raw Python within the debugger. Let us find out what type x is.

(Pdb) type(x)
<type 'str'>
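The string type is what explains the unexpected 111. For strings the + operator concatenates and the * operator repeats, so the arithmetic in magic() turns into text manipulation. A minimal sketch of the difference, using the values seen in the debugger:

```python
x = '1'
y = '1'

# '*' binds tighter than '+', so y is repeated first and x is then prepended.
as_strings = x + y * 2

# Converting to integers first gives ordinary arithmetic.
as_integers = int(x) + int(y) * 2

print(as_strings)   # 111
print(as_integers)  # 3
```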

The fact that we can execute Python within the debugger means that we can change the input variables dynamically.

(Pdb) x = int(x)
(Pdb) y = int(y)

Let us just check the values before we run the program.

(Pdb) p x, y
(1, 1)

What? y is 1, not 50?

Inspecting the code we find that I forgot to update the index when I copied the input parsing line (note line 7 in the code listing below).

(Pdb) l
  4         return x + y * 2
  5     
  6     x = sys.argv[1]
  7     y = sys.argv[1]
  8     
  9  -> answer = magic(x, y)
 10     print('The answer is: {}'.format(answer))
[EOF]
(Pdb) 

Ok, let us just change the value of y to 50 in the debugger before checking if the code works as expected by letting it run to completion.

(Pdb) y = 50
(Pdb) c
The answer is: 101
The program finished and will be restarted
> pdb_exercise_4.py(1)<module>()
-> import sys
(Pdb) 

Ok, so the example is a little bit naff. However, I hope it illustrates the power of working with the debugger, particularly if you are working on a more complicated code base.
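For completeness, a corrected version of the script might look like the sketch below. It assumes that the inputs are meant to be integers, and the helper name parse_inputs is hypothetical. In the real script you would pass sys.argv rather than a hard-coded list.

```python
def magic(x, y):
    return x + y * 2


def parse_inputs(argv):
    """Return the first two command line arguments as integers."""
    # Note the distinct indices: argv[1] for x and argv[2] for y.
    return int(argv[1]), int(argv[2])


# Stand-in for sys.argv when running: python pdb_exercise_4.py 1 50
x, y = parse_inputs(['pdb_exercise_4.py', '1', '50'])
answer = magic(x, y)
print('The answer is: {}'.format(answer))
```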

Exercise 5: using breakpoints

So far we have been stepping through the scripts from beginning to end. However, when working on larger programs this is often not practical. To simulate such a situation, copy and paste the code below into a file named pdb_exercise_5.py.

import time

def slow_subtractor(a, b):
    """Return a minus b."""
    time.sleep(5)
    return a - b

some = slow_subtractor(12, 8)
crazy = slow_subtractor(12, 78)
scientific = slow_subtractor(56, 31)
experiment = slow_subtractor(101, 64)

total = some + crazy + scientific + experiment

experimental_fraction = experiment / total

When we run this code we get a ZeroDivisionError.

$ python pdb_exercise_5.py
Traceback (most recent call last):
  File "pdb_exercise_5.py", line 15, in <module>
    experimental_fraction = experiment / total
ZeroDivisionError: integer division or modulo by zero

Stepping through the code in the debugger would be annoying as you would have to press n, and wait five seconds, every time the slow_subtractor() function was called. Let us instead insert a breakpoint before the line that generates the error. This is achieved by importing the pdb module and using the pdb.set_trace() function.

import time

def slow_subtractor(a, b):
    """Return a minus b."""
    time.sleep(5)
    return a - b

some = slow_subtractor(12, 8)
crazy = slow_subtractor(12, 78)
scientific = slow_subtractor(56, 31)
experiment = slow_subtractor(101, 64)

total = some + crazy + scientific + experiment

import pdb; pdb.set_trace()
experimental_fraction = experiment / total

If we run the code now we get dumped into a debugger session before the offending line is executed.

python pdb_exercise_5.py
> pdb_exercise_5.py(16)<module>()
-> experimental_fraction = experiment / total
(Pdb) p total
0
(Pdb) p some, crazy, scientific, experiment
(4, -66, 25, 37)

Ok, so it looks like there is something funny going on with the crazy variable. Perhaps the input arguments were given the wrong way around.

The take home message is that setting breakpoints is a powerful way of getting to the point of interest in your code when you want to examine what is going on.
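A breakpoint does not have to be unconditional. The sketch below wraps the division in a hypothetical helper named fraction() that only drops into the debugger in the suspicious case where the total is zero. It assumes Python 3 division, and it assumes, purely for illustration, that the arguments to the second call really were the wrong way around (78 - 12 = 66).

```python
import pdb


def fraction(part, total):
    """Return part / total, pausing in the debugger if total is zero."""
    if total == 0:
        # Only drop into the debugger in the suspicious case.
        pdb.set_trace()
    return part / total


# With the second value corrected to 66 the total is no longer zero,
# so the debugger is never entered here.
total = 4 + 66 + 25 + 37
experimental_fraction = fraction(37, total)
print(experimental_fraction)
```

On Python 3.7 and later the built-in breakpoint() function can be used in place of import pdb; pdb.set_trace(); by default it starts the same debugger.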

Conclusion

In this post we have worked our way through some rather academic exercises to get ourselves familiar with the Python debugger and how to interact with it. Hopefully you now feel that you have the skill to step through and query the state of your program from within the debugger.

However, if you only take one thing away from this post please let it be the commitment to insert the line import pdb; pdb.set_trace() just above your code of interest the next time you feel tempted to print the value of a variable in a program that is not behaving as expected.

Beginner's Guide: creating clean Python development environments

2015-05-09T00:00:00+00:00

Introduction

Code interacts with its environment. For example, you can only run a Python script if you have Python installed on the system. Furthermore, a Python script will only run without raising ImportError exceptions if all the required packages are installed.

It therefore becomes important for you as a developer / computational scientist to understand and control the environment in which your code operates.

In this post I will illustrate a work flow for creating clean Python development environments.

Example: developing a Python package

In the previous post I illustrated how you could use a static code generator (cookiecutter) to create a basic template to develop a Python package.

Now suppose that we wanted to develop a Python package named “awesome”. Let us use a GitHub hosted Cookiecutter template to create a basic project layout.

$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
Cloning into 'cookiecutter-pypackage'...
remote: Counting objects: 48, done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 48 (delta 13), reused 37 (delta 8), pack-reused 0
Unpacking objects: 100% (48/48), done.
Checking connectivity... done.
repo_name (default is "mypackage")? awesome
version (default is "0.0.1")? 
authors (default is "Tjelvar Olsson")?

This creates the directory awesome.

$ cd awesome/

It contains a number of files and directories.

$ tree
.
├── README.rst
├── awesome
│   └── __init__.py
├── docs
│   ├── Makefile
│   ├── make.bat
│   └── source
│       ├── README.rst
│       ├── conf.py
│       └── index.rst
├── setup.cfg
├── setup.py
└── tests
    ├── __init__.py
    └── tests.py

4 directories, 11 files

You may notice that there are some tests included by default. Let us try to run them.

$ python tests/tests.py
EE
======================================================================
ERROR: test_can_import_package (__main__.UnitTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/tests.py", line 15, in test_can_import_package
    import awesome
ImportError: No module named awesome

======================================================================
ERROR: test_package_has_version_string (__main__.UnitTests)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/tests.py", line 18, in test_package_has_version_string
    import awesome
ImportError: No module named awesome

----------------------------------------------------------------------
Ran 2 tests in 0.001s

FAILED (errors=2)

That’s not very good! What is going on? It seems that we cannot import the awesome module.

Depending on your level of familiarity with Python the problem may be obvious to you. However, when I started out with Python this caused me a lot of confusion. I clearly could import the awesome module!

$ python -c "import awesome; print(awesome.__version__)"
0.0.1

One of the places where Python looks for modules is the directory of the calling script or, when using python -c as above, the current working directory. This is why the command above works. However, when we run the tests/tests.py script the search starts in the tests directory, where there is no awesome package to be found, as illustrated below.

$ cd tests/
$ python -c "import awesome; print(awesome.__version__)"
Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named awesome
$ cd ../
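If you want to check from within Python where modules are searched for, and whether a given module can be found, something along the lines of the sketch below may help. It assumes Python 3 (importlib.util.find_spec is not available in Python 2) and the helper name can_import is my own.

```python
import importlib.util
import sys


def can_import(name):
    """Return True if Python can locate a top-level module called name."""
    return importlib.util.find_spec(name) is not None


# The first entries of sys.path are searched first.
print(sys.path[:3])

# A standard library module should always be found.
print(can_import('json'))

# Whether this one is found depends on where Python is started from.
print(can_import('awesome'))
```

Running this from the project directory and then from the tests directory should reproduce the difference seen above.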

At this point one could start manually configuring the PYTHONPATH environment variable. However, let us look at a more elegant solution.

Making use of `setuptools`

In the previous post we started building up a basic setup.py file, which made use of the setuptools module.

You are probably already familiar with setuptools from installing other Python packages using the command python setup.py install. This installs the package of interest into your Python distribution’s site-packages directory.

However, this is not what we want to do because the package would be copied there and any changes that we made to our local development files would not take effect until we reinstalled the package. We want to be able to edit our local development files and see the effects take place immediately.

The solution to this problem is to use python setup.py develop which creates an .egg-link to our local development directory in the site-packages directory.

$ sudo python setup.py develop
Password:
running develop
...
Processing dependencies for awesome==0.0.1
Finished processing dependencies for awesome==0.0.1

Let us re-run the tests now that site-packages contains an awesome.egg-link.

$ python tests/tests.py 
..
----------------------------------------------------------------------
Ran 2 tests in 0.000s

OK

Great, we have a working development environment!

Before continuing let us tidy up by removing the development package we just installed.

$ sudo python setup.py develop --uninstall
Password:
running develop
...
Removing awesome 0.0.1 from easy-install.pth file

So far so good, but there are two issues with what we are currently doing. First of all we need to have root permissions to run python setup.py develop and python setup.py develop --uninstall when using the system’s Python. Secondly, when using the system’s Python we are using a (potentially) polluted environment.

Let me expand on the second issue. Suppose that you had the PyYAML package installed in your system’s Python. It is a very useful package, but it is not part of Python’s standard library. Suppose further that your package needed to be able to parse YAML files. You therefore start using PyYAML. Some time later you want to share your code with your friend Alice. You run your tests (of course you are writing tests as you go along) and they all pass. You feel happy and send the package to Alice. However, Alice has not yet installed the PyYAML package and consequently her first experience of your code is an ImportError.

This ImportError could have been avoided by adding pyyaml as a requirement to our setup.py file. For more details see the “Specifying Dependencies” section in Scott Torborg’s How To Package Your Python Code.

However, the question is how could we have detected this issue before sending our code to Alice?

Creating a virtual Python development environment

There is a way to avoid making use of the system’s “polluted” Python, which also lets us work without requiring root privileges. When I first heard about this it sounded like magic.

The solution is to make use of virtualenv. From the virtualenv website:

virtualenv is a tool for creating isolated Python environments.

Let us install virtualenv using pip.

$ pip install virtualenv

Now we can create a virtual environment for our project. However, before we do that let me give you a tip: create a separate directory for storing all your virtual environments and give each virtual environment a descriptive name.

$ mkdir ~/virtualenvs

If you are anything like me you will end up having at least one virtual environment for each project you are working on.

$ virtualenv ~/virtualenvs/awesome
New python executable in /home/tjelvar/virtualenvs/awesome/bin/python
Installing setuptools, pip...done.

Note that this creates a directory named awesome in the ~/virtualenvs directory. You could have named it anything, but I like to use the same name as the project for which I intend to use the virtual environment. The ~/virtualenvs/awesome directory contains the virtual environment.

To make use of a virtual environment we need to “activate” it. This is done by sourcing the activate script in the bin directory of the virtual environment.

To get a feel for the effect of activating the virtual environment let us use which to find the path to python before and after we activate the virtual environment.

$ which python
/usr/bin/python

Now let us activate the virtual environment we just created.

$ source ~/virtualenvs/awesome/bin/activate
(awesome)$ which python
/home/tjelvar/virtualenvs/awesome/bin/python

When we source the activate script above it basically alters the PATH and PS1 environment variables. It also defines a deactivate function that one can use to reset the environment variables to their original state.

(awesome)$ deactivate
$ which python
/usr/bin/python
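The same check can be made from within Python. The sketch below is a rough heuristic rather than an official API: it relies on the observation that older virtualenv releases set sys.real_prefix, whereas newer ones (and the standard library venv module) set sys.base_prefix to the location of the original Python installation. The function name in_virtual_environment is my own.

```python
import sys


def in_virtual_environment():
    """Return True if the running Python appears to be a virtual environment."""
    # Fall back to sys.prefix if neither attribute exists, in which case
    # the function reports False.
    base_prefix = (getattr(sys, 'real_prefix', None)
                   or getattr(sys, 'base_prefix', sys.prefix))
    return base_prefix != sys.prefix


print(sys.executable)
print(in_virtual_environment())
```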

Tying it all together

That was a lot of preamble to be able to show a simple and effective work flow for setting up clean Python development environments.

Generate a new Python project template.

$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
...
repo_name (default is "mypackage")? awesome
...
$ cd awesome

Create a virtual environment for the project.

$ virtualenv ~/virtualenvs/awesome

Activate the virtual environment and use setuptools to create a development environment.

$ source ~/virtualenvs/awesome/bin/activate
(awesome)$ python setup.py develop

Run the tests!

Tests are great, they let us know that things are working as intended. Let us make sure that our setup is sound.

(awesome)$ python tests/tests.py
..
----------------------------------------------------------------------
Ran 2 tests in 0.000s

OK

Discussion

In this post I have shown you how to use setuptools and virtualenv to create reproducible, clean and isolated Python development environments.

However, the work flow is not limited to development environments. It is just as applicable to production environments and it is extensively used in the Python web development community. In fact, having the same work flow for setting up your development and production environments is a great bonus as it gives you more confidence in the end product.

Using Cookiecutter - a passive code generator

2015-05-04T00:00:00+00:00

Using a tool to generate a templated result.

In The Pragmatic Programmer Andrew Hunt and David Thomas talk about the importance of code generators when faced with the task of producing the same thing over and over. They further separate code generators into two types: passive and active.

A passive code generator being one that saves on typing. It is run once, the result is placed into version control and then the code is built upon by hand.

Whereas an active code generator is used to produce complete code by converting a source of meta-data into language(s) of interest. Active code generators are run frequently and as the resulting code is reproducible it is also disposable, hence it does not need to be tracked in version control.

In this post I will show you how you can use a passive code generator to create a basic layout for a Python package.

Cookiecutter: a passive code generator

A classic example where passive code generators are useful is in setting up an initial project structure. Let us take the example of creating a Python package. In the simplest case you will want to create a setup.py file and a directory with the desired package name containing an __init__.py file. Scott Torborg has created a great tutorial on How To Package Your Python Code.

Several tools exist to deal with this type of scenario. However, I quite like Audrey Roy’s Cookiecutter. Let us illustrate its use by creating a minimal template for a Python package.

First of all we install it using pip.

$ sudo pip install cookiecutter

Now we will create a funny looking directory structure. It is funny looking because it uses the Jinja2 templating syntax.

$ mkdir -p mypyproject/{{cookiecutter.repo_name}}/{{cookiecutter.repo_name}}

Now create the file mypyproject/cookiecutter.json and add the code below to it.

{
  "repo_name": "mypackage",
  "version": "0.0.1",
  "author": "Your Name"
}

Let us have a look at the directory structure we have created.

$ tree mypyproject/
mypyproject/
├── cookiecutter.json
└── {{cookiecutter.repo_name}}
    └── {{cookiecutter.repo_name}}

2 directories, 1 file

We now have enough boilerplate to run cookiecutter. Actually we have more than enough: at this point we do not need the version and author variables.

Let us create an “awesome” Python package to see it in action.

$ cookiecutter mypyproject/
repo_name (default is "mypackage")? awesome
version (default is "0.0.1")? 
author (default is "Your Name")? Tjelvar Olsson

Note that the prompts and default values are the key/value pairs specified in the cookiecutter.json file.

Let us have a look at what was produced.

$ tree awesome/
awesome/
└── awesome

1 directory, 0 files

Ok, great - let us add an __init__.py file to the leaf mypyproject/{{cookiecutter.repo_name}}/{{cookiecutter.repo_name}} directory.

$ touch mypyproject/\{\{cookiecutter.repo_name\}\}/\{\{cookiecutter.repo_name\}\}/__init__.py

In the above we need to escape the { and } characters when using bash. If you are not already using tab completion when using bash this may be a good point to try it out (just start typing the name of the file/directory of interest and then press the tab key).

Let’s run cookiecutter again to see what we get now that we have added the __init__.py file.

$ cookiecutter mypyproject/
repo_name (default is "mypackage")? awesome
version (default is "0.0.1")? 
author (default is "Your Name")? Tjelvar Olsson

$ tree awesome/
awesome/
└── awesome
    └── __init__.py

1 directory, 1 file

Great, we now automatically get an __init__.py file added to our project when we create it. Now let us add a basic, but all the same templated, setup.py file to our project layout. Create the file mypyproject/{{cookiecutter.repo_name}}/setup.py and copy and paste the code below into it.

from setuptools import setup

setup(name="{{ cookiecutter.repo_name }}",
      version="{{ cookiecutter.version }}",
      author="{{ cookiecutter.author }}"
)

Let us try this out.

$ cookiecutter mypyproject/
repo_name (default is "mypackage")? awesome
version (default is "0.0.1")? 
author (default is "Your Name")? Tjelvar Olsson

$ tree awesome/
awesome/
├── awesome
│   └── __init__.py
└── setup.py

1 directory, 2 files

$ cat awesome/setup.py 
from setuptools import setup

setup(name="awesome",
      version="0.0.1",
      author="Tjelvar Olsson"
)

Great, we now have a basic layout for building up a Python project!

Now that you know the principles you can use them to automate the generation of your boilerplate code.
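To get a feel for what happens to the {{ cookiecutter.repo_name }} style placeholders, here is a deliberately simplified sketch of the substitution step. It is not how Cookiecutter is implemented (Cookiecutter uses the Jinja2 templating engine, which can do far more); it merely mimics the simple variable replacement seen above using a regular expression. The names PLACEHOLDER and render are my own.

```python
import re

# Matches {{ cookiecutter.some_key }} with or without inner spaces.
PLACEHOLDER = re.compile(r'\{\{\s*cookiecutter\.(\w+)\s*\}\}')


def render(template, context):
    """Replace {{ cookiecutter.key }} placeholders with values from context."""
    return PLACEHOLDER.sub(lambda match: str(context[match.group(1)]), template)


# The answers given at the prompts, keyed as in cookiecutter.json.
context = {
    'repo_name': 'awesome',
    'version': '0.0.1',
    'author': 'Tjelvar Olsson',
}

print(render('setup(name="{{ cookiecutter.repo_name }}")', context))
print(render('mypyproject/{{cookiecutter.repo_name}}', context))
```

The same substitution is applied to file contents and to file and directory names, which is why the generated directory ends up being called awesome.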

Making use of GitHub

Once you start building up your template make sure that you save it on GitHub or Bitbucket. You are already using version control, right?

A nice feature of Cookiecutter is that it has built in functionality for making use of templates stored in GitHub/Bitbucket. For example, my default Python package layout includes:

setup.py
test suite layout using nose and coverage
sphinx docs layout using read the docs theme

To make use of it you can simply use the command below.

$ cookiecutter gh:tjelvar-olsson/cookiecutter-pypackage
Cloning into 'cookiecutter-pypackage'...
remote: Counting objects: 48, done.
remote: Compressing objects: 100% (37/37), done.
remote: Total 48 (delta 13), reused 37 (delta 8), pack-reused 0
Unpacking objects: 100% (48/48), done.
Checking connectivity... done.
repo_name (default is "mypackage")? awesome
version (default is "0.0.1")? 
authors (default is "Tjelvar Olsson")? 

Alternatively, for an even more extensive setup have a look at Audrey Roy’s ultimate python package template.

Summary

When you find yourself repeatedly doing the same thing it may be time to start thinking about using a code generator. In this post I have shown you how to use cookiecutter to produce a basic Python package template.

However, it is not limited to Python package projects. You could use it to automate the setup of CMake / HTML / LaTeX files; the world is your oyster.

Happy code generating!

How to manage firewalls using ferm and Ansible

2015-04-24T00:00:00+00:00

In the previous post we created an Ansible playbook for installing the GBrowse genome browser. As the name implies GBrowse is a browser based application and it serves web pages over http using Apache. If one is installing this software as a service to be made more widely accessible one needs to start thinking about security. In this post we will therefore configure a firewall for our machine.

iptables

The standard tool for setting up firewalls on Linux is iptables. It is a way to set up policy chains to allow or block traffic to, from and through the machine of interest. If you have not come across or managed iptables before I recommend that you have a look at howtogeek’s Beginner’s Guide to iptables, the Linux Firewall and Major Hayden’s Best practices: iptables.

ferm

However, managing firewalls using iptables can be a pain. Several tools have therefore evolved to make things easier. In this post we will be using a program called ferm (for Easy Rule Making).

When configuring a firewall it is easy to lock oneself out of the machine one is configuring. The most common scenario for this is setting the default policy to drop incoming connections and then accidentally flushing the connection rules, including the rule to accept ssh connections, leaving the server inaccessible. To avoid this scenario we will configure the default policy to accept incoming connections and to secure the server we will include a rule to drop any incoming connections that do not match any other rules.

Below is a list stating the behaviour that we want from the INPUT chain of our firewall.

We want the default policy to accept incoming connections
We want to enable connection tracking
We want to be able to ping the machine
We want to be able to ssh into the machine
We want to be able to add custom rules using Ansible
Finally, we want to drop any incoming connections that do not match any rules

The behaviours that we want from the OUTPUT and FORWARD chains are simpler. We do not want to limit any outgoing connections so we will set the output policy to accept all connections and because we are not configuring a router we will set the policy of the forward chain to drop all connections.

We can configure the behaviour above using the ferm.conf file below.

# Ferm script for configuring iptables.

table filter {

    chain INPUT {
        # Set the default policy to ACCEPT to avoid getting
        # accidentally locked out.
        policy ACCEPT;

        # Connection tracking.
        mod state state INVALID DROP;
        mod state state (ESTABLISHED RELATED) ACCEPT;

        # Allow local connections.
        interface lo ACCEPT;

        # Respond to ping.
        proto icmp icmp-type echo-request ACCEPT;

        # Allow ssh connections.
        proto tcp dport ssh ACCEPT;

        # Ansible specified rules.
        
        # Because the default policy is to ACCEPT we DROP
        # everything that comes through to this stage.
        DROP;
    }

    # Outgoing connections are not limited.
    chain OUTPUT policy ACCEPT;

    # This is not a router.
    chain FORWARD policy DROP;
}

If you have ferm installed you can apply the firewall above using the command below.

$ sudo ferm ferm.conf

Note that in the ferm.conf file above we have an empty section marked by the comment # Ansible specified rules.. We will use this to dynamically alter the firewall rules during the running of our Ansible playbook.
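The reasoning behind combining an ACCEPT policy with a trailing DROP rule can be illustrated with a toy model in Python. This is only a sketch of first-match rule evaluation, not of how iptables or ferm work internally: rules are tried in order, the first one that matches decides the verdict, and the chain policy only applies when nothing matches. The function name evaluate and the packet dictionaries are made up for the illustration.

```python
def evaluate(rules, policy, packet):
    """Return the verdict of the first matching rule, else the chain policy."""
    for matches, verdict in rules:
        if matches(packet):
            return verdict
    return policy


# A cut-down version of the INPUT chain: accept ssh, drop everything else.
rules = [
    (lambda packet: packet['dport'] == 22, 'ACCEPT'),
    (lambda packet: True, 'DROP'),
]
ssh = {'proto': 'tcp', 'dport': 22}
http = {'proto': 'tcp', 'dport': 80}

print(evaluate(rules, 'ACCEPT', ssh))   # ACCEPT
print(evaluate(rules, 'ACCEPT', http))  # DROP
# After an accidental flush the rules are gone and only the policy remains.
print(evaluate([], 'ACCEPT', ssh))      # ACCEPT: still reachable
print(evaluate([], 'DROP', ssh))        # DROP: locked out
```

With the rules in place the two set-ups behave the same; the difference only shows once the rules have been flushed.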

Integrating ferm with Ansible

Let us create an Ansible role for installing and configuring ferm. Copy and paste the code below into a file named roles/ferm/tasks/main.yml.

---
# Install and configure ferm.
#

# The ferm program is in the epel repository so we need
# to enable it. This could be a separate role, but this
# is left as an exercise for the reader.
- name: enable the epel repo
  yum: name=epel-release
       state=present

# We need to install libselinux-python on the target
# machine to be able to use Ansible to copy the ferm.conf
# file to the /etc/ferm/ directory. It would be reasonable
# to move this task into a separate role for installing common
# software, again this is left as an exercise for the reader.
- name: install libselinux-python
  yum: name=libselinux-python
       state=present

- name: install ferm
  yum: name=ferm
       state=present

- name: add /etc/ferm directory
  file: path=/etc/ferm
        mode=0700
        state=directory

- name: add the ferm.conf file to /etc/ferm
  copy: src=ferm.conf
        dest=/etc/ferm/ferm.conf
  notify: run ferm

Note that the last task copies the ferm.conf file we created above to the target machine. However, for this to work Ansible expects the ferm.conf file to be located in the directory named roles/ferm/files/. Let us therefore create this directory and move the file there.

$ mkdir roles/ferm/files
$ mv ferm.conf roles/ferm/files/

In the previous post I introduced the concept of handlers that could be notified by other tasks. Let us create a handler for applying the ferm rules. Copy and paste the code below into a file named roles/ferm/handlers/main.yml.

---

- name: run ferm
  command: ferm /etc/ferm/ferm.conf
  notify: save iptables

- name: save iptables
  command: service iptables save

Now we have a handler named run ferm, which when notified will run the command ferm /etc/ferm/ferm.conf and in turn notify the save iptables handler, which makes sure that the firewall rules persist if the machine is rebooted.
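To make the notification chain concrete, here is a toy model in Python. It is only a sketch of the behaviour described above, not how Ansible is implemented: the run_play function and its data layout are made up for illustration, and it assumes that a handler only notifies handlers defined after it.

```python
# Toy model of Ansible handlers (an illustration, not Ansible's implementation).

def run_play(tasks, handlers):
    """Run tasks, then flush notified handlers once each.

    tasks:    list of (name, changed, handler_to_notify_or_None)
    handlers: list of (name, handler_to_notify_or_None), in definition order
    """
    log = []
    notified = set()
    for name, changed, notify in tasks:
        log.append("task: " + name)
        # A task only notifies its handler when it reports a change.
        if changed and notify is not None:
            notified.add(notify)
    # Handlers run after the tasks, at most once, in the order they are defined.
    for name, notify in handlers:
        if name in notified:
            log.append("handler: " + name)
            if notify is not None:
                notified.add(notify)
    return log


handlers = [("run ferm", "save iptables"), ("save iptables", None)]
tasks = [
    ("add the ferm.conf file to /etc/ferm", True, "run ferm"),
    ("open up the http and https ports", True, "run ferm"),
]
for line in run_play(tasks, handlers):
    print(line)
```

Even though two tasks notify run ferm, the handler appears only once in the log, followed by save iptables.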

Let us add this role to our playbook. Update the gbrowse.yml file so that it looks like the below (we have only added the ferm role).

---
- hosts: all
  sudo: True

  roles:
    - ferm
    - gbrowse

However, if you run the gbrowse.yml playbook at this point the GBrowse application will stop working as port 80 will be closed. Let us therefore add a task to open up ports 80 (http) and 443 (https) to the apache role. Edit the file roles/apache/tasks/main.yml to look like the below.

---
# Install and configure Apache.

- name: install apache
  yum: name=httpd
       state=present

- name: start apache and enable at boot
  service: name=httpd
           enabled=yes
           state=started

- name: open up the http and https ports
  lineinfile: dest=/etc/ferm/ferm.conf
              line='proto tcp dport (http https) ACCEPT;'
              insertafter='# Ansible specified rules.'
  notify: run ferm

In the above we make use of Ansible’s lineinfile module to insert a new rule into the ferm.conf file, directly after the # Ansible specified rules. comment.
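The way the line and insertafter options interact can be sketched in a few lines of Python. This is a simplified model based on my understanding of the module: the insert_after function is hypothetical, it ignores the indentation of the real ferm.conf file, and it treats insertafter as a regular expression (so the trailing full stop matches any character).

```python
import re


def insert_after(lines, line, insertafter):
    """Return a copy of lines with `line` present, loosely mimicking lineinfile."""
    if line in lines:
        return list(lines)  # already present: nothing to change
    pattern = re.compile(insertafter)
    matches = [i for i, existing in enumerate(lines) if pattern.search(existing)]
    result = list(lines)
    if matches:
        result.insert(matches[-1] + 1, line)  # after the last matching line
    else:
        result.append(line)  # no match: fall back to the end of the file
    return result


conf = [
    "proto tcp dport ssh ACCEPT;",
    "# Ansible specified rules.",
    "DROP;",
]
rule = "proto tcp dport (http https) ACCEPT;"
once = insert_after(conf, rule, "# Ansible specified rules.")
twice = insert_after(once, rule, "# Ansible specified rules.")
print("\n".join(once))
print(once == twice)
```

Running the insertion a second time leaves the list unchanged, which is the property that makes it safe to run the playbook repeatedly.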

Results

Let us run the playbook and find out what the resulting iptables firewall looks like. Here I am using the same Vagrant/Ansible setup as described in how to create automated and reproducible work flows for installing scientific software.

$ ansible-playbook -i hosts gbrowse.yml
...
$ vagrant ssh
Last login: Thu Apr 16 02:03:00 2015 from 192.168.33.1
[vagrant@localhost ~]$ sudo iptables -nL
Chain INPUT (policy ACCEPT)
target     prot opt source               destination         
DROP       all  --  0.0.0.0/0            0.0.0.0/0           state INVALID 
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           state RELATED,ESTABLISHED 
ACCEPT     all  --  0.0.0.0/0            0.0.0.0/0           
ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0           icmp type 8 
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:22 
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:80 
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:443 
DROP       all  --  0.0.0.0/0            0.0.0.0/0           

Chain FORWARD (policy DROP)
target     prot opt source               destination         

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination
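If you want to check the result in a script rather than by eye, something along the lines of the Python sketch below could be used. The accepted_tcp_ports function is a hypothetical helper and the sample text is a shortened copy of the output above; it assumes the column layout printed by iptables -nL.

```python
import re


def accepted_tcp_ports(iptables_output, chain="INPUT"):
    """Return the TCP destination ports that have an ACCEPT rule in a chain."""
    ports = []
    in_chain = False
    for line in iptables_output.splitlines():
        if line.startswith("Chain "):
            # Lines such as "Chain INPUT (policy ACCEPT)" start a new chain.
            in_chain = line.split()[1] == chain
            continue
        if not in_chain:
            continue
        match = re.search(r"^ACCEPT\s+tcp\b.*\bdpt:(\d+)", line)
        if match:
            ports.append(int(match.group(1)))
    return ports


sample = """\
Chain INPUT (policy ACCEPT)
target     prot opt source               destination
ACCEPT     icmp --  0.0.0.0/0            0.0.0.0/0           icmp type 8
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:22
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:80
ACCEPT     tcp  --  0.0.0.0/0            0.0.0.0/0           tcp dpt:443
DROP       all  --  0.0.0.0/0            0.0.0.0/0

Chain FORWARD (policy DROP)
target     prot opt source               destination
"""
print(accepted_tcp_ports(sample))
```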

Discussion

If you have public facing machines you need to think about security. However, managing firewalls using iptables directly can be a pain.

In this post I have outlined how you can integrate ferm and Ansible to manage your firewall. The cool thing about this approach is that the role of interest, in this case apache, is responsible for opening up the relevant ports.

Furthermore, as the /etc/ferm/ferm.conf file will be re-written every time you run the playbook, your rules will be updated whenever you add or remove roles from the playbook. In other words, if you removed the apache role and ran the playbook, ports 80 and 443 would be closed at the end when the handlers were executed (handlers notified during a playbook are executed at the end of it).

Finally, note that security is a complex topic and that the reading of this post should not be taken as a substitute for a proper understanding of how to manage firewalls. That is a roundabout way of stating that I do not take responsibility for any security breaches that you encounter.

Ansible playbook for installing the GBrowse genome browser

2015-04-18T00:00:00+00:00

In previous posts I have described how to use Ansible to create automated and reproducible work flows for installing scientific software and how to create reusable Ansible components. In this post we will create a playbook for installing the genome browser GBrowse and in the process we will learn how to install and manage services, such as Apache, using Ansible.

Adding `Bio::Graphics` to the `bio_perl` role

GBrowse does not only depend on Bio::Perl; it also depends on Bio::Graphics. At this point we could add a role for installing Bio::Graphics. However, I prefer to add the installation of it to the existing bio_perl role.

It turns out that Bio::Graphics depends on GD, which I struggle to install using cpanm. However, it is available in a pre-compiled form from the CentOS repositories so we can install perl-GD from there using yum.

Furthermore, it turned out that Bio::Graphics had an implicit dependency on the CGI module.

Please update the roles/bio_perl/tasks/main.yml file to look like the below.

---
# Install and configure Bio::Perl and Bio::Graphics.

# Bio::Graphics requires GD.
# However, I cannot work out how to install GD using cpanm,
# so installing it using yum instead.
- name: install perl-GD
  yum: name=perl-GD
       state=present

- name: install implicit Bio::Perl dependencies
  cpanm: name={{ item }}
  with_items:
    - Time::HiRes
    - LWP::UserAgent

- name: install implicit Bio::Graphics dependencies
  cpanm: name=CGI

- name: install BioPerl
  cpanm: name={{ item }}
  with_items:
    - Bio::Perl
    - Bio::Graphics

Installing and configuring Apache

As the name implies, GBrowse is a web based tool, so to serve it we need to install Apache. Let us create a new role for this. Copy and paste the text below into the file roles/apache/tasks/main.yml.

---
# Install and configure Apache.

- name: install apache
  yum: name=httpd
       state=present

- name: start apache and enable at boot
  service: name=httpd
           enabled=yes
           state=started

The code above introduces us to the Ansible service module. The service module is used to interact with services managed by init (or systemd on CentOS 7). In the service task above we ask for the service to be started and for it to be enabled at boot.

Now suppose that we wanted to restart Apache at some point in our Ansible script, for example after having installed another piece of software that was served by Apache, such as GBrowse. This can be achieved using Ansible’s concept of handlers. Let us therefore add a handler for restarting Apache. Copy and paste the code below into the file roles/apache/handlers/main.yml.

---

- name: restart apache
  service: name=httpd
           state=restarted

Now any task in a playbook that makes use of the apache role can restart Apache by adding the directive notify: restart apache. We will see an example of this later on in the post towards the end of the gbrowse role.

Creating the `gbrowse` role

We are now in a position to create the gbrowse role for configuring and installing the GBrowse software. Let us start by defining the Ansible roles it depends on. Copy and paste the code below into the file roles/gbrowse/meta/main.yml.

---
dependencies:
  - { role: apache }
  - { role: bio_perl }

GBrowse has got pretty good installation notes and following them we only need to deal with a couple of issues: some undocumented Perl module dependencies and the fact that the resulting Build script requires interactive answers. The former is easy to deal with: we simply install the missing Perl modules using cpanm. However, the latter is more tricky.

Ansible is not really meant to deal with interactive tasks. This means that installers that ask a lot of questions pose a problem. Fortunately, in this case the ./Build config command provides sensible defaults that we can accept, and we can simply answer no to all the questions posed by ./Build install. This means that we can use a workaround outlined in a post by Craig Marvelley.

Copy and paste the code below into the file roles/gbrowse/tasks/main.yml.

---
# Install and configure the gbrowse genome browser.

- name: install undocumented dependencies
  cpanm: name={{ item }}
  with_items:
    - Date::Parse
    - Term::ReadKey

- name: install remaining perl module dependencies
  cpanm: name={{ item }}
  with_items:
    - CGI::Session
    - Digest::MD5
    - File::Temp
    - IO::String
    - JSON
    - Storable
    - Statistics::Descriptive
    - DBI
    - Net::SMTP
    - DBD::SQLite

- name: download the gbrowse tarball
  get_url: url=http://search.cpan.org/CPAN/authors/id/L/LD/LDS/GBrowse-2.54.tar.gz
           dest=/tmp/

- name: unpack the gbrowse tarball
  command: tar -zxf GBrowse-2.54.tar.gz
  args:
    chdir: /tmp/
    creates: /tmp/GBrowse-2.54/LICENSE

- name: build the installer
  command: perl Build.PL
  args:
    chdir: /tmp/GBrowse-2.54/
    creates: /tmp/GBrowse-2.54/Build

# For more detail on ``yes '' |`` syntax for accepting default values see:
# http://marvelley.com/blog/2014/04/23/handling-interactive-ansible-tasks/
- name: configure the install accepting all default values
  shell: yes '' | ./Build config
  args:
    chdir: /tmp/GBrowse-2.54/

- name: install gbrowse answering no to all interactive questions
  shell: yes 'n' | ./Build install
  args:
    chdir: /tmp/GBrowse-2.54/
    creates: /etc/httpd/conf.d/gbrowse2.conf
  notify: restart apache

Note the notify: restart apache directive added to the final task above. This will ensure that Apache is restarted after GBrowse has been installed.

One of the questions we answer “no” to in the interactive installer is to register our use of GBrowse. If you find this tool useful the developers of it would appreciate if you registered. You can do this at any point by running the command ./Build register.
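The yes '' | and yes 'n' | pipes in the tasks above work by feeding the same line to the installer every time it asks a question. The Python sketch below mimics that idea with a made-up two-question program standing in for ./Build; both FAKE_INSTALLER and answer_everything are hypothetical names used only for this illustration.

```python
import subprocess
import sys

# Hypothetical stand-in for an interactive installer: it reads two answers
# from standard input and echoes what it received.
FAKE_INSTALLER = r'''
import sys
for question in ("install path", "register"):
    reply = sys.stdin.readline().rstrip("\n")
    print(question + " -> " + repr(reply))
'''


def answer_everything(answer, repeats=10):
    """Run the fake installer, feeding it the same answer on every line."""
    completed = subprocess.run(
        [sys.executable, "-c", FAKE_INSTALLER],
        input=(answer + "\n") * repeats,  # like: yes ANSWER | installer
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout


print(answer_everything(""), end="")   # empty line: accept every default
print(answer_everything("n"), end="")  # answer no to every question
```

An empty answer corresponds to pressing return, which is why yes '' accepts the defaults offered by ./Build config.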

Creating the playbook

Now create a playbook named gbrowse.yml at the same level as your roles directory with the code below.

---
- hosts: all
  sudo: True

  roles:
    - gbrowse

I am using the same Vagrant setup as outlined in the post on how to create automated and reproducible work flows for installing scientific software. So to run the playbook I simply use the command:

$ ansible-playbook -i hosts gbrowse.yml

When the playbook finished running I could view the GBrowse application in my browser by going to the url http://192.168.33.10/gbrowse2/ (192.168.33.10 being the private network IP address specified in the Vagrant file from the previous post).

Conclusion

In this post I have shown you how to create a reproducible and automated work flow for installing the GBrowse genome browser using Ansible.

We created a role for installing and managing Apache. This introduced us to Ansible’s service module and the concept of “handlers” that can be “notified” by other tasks in a playbook.

In the next post we will look into how we can manage the firewall of our machine using Ansible and ferm.

How to create reusable Ansible components

2015-04-11T00:00:00+00:00

In the previous post I described how to create reproducible and automated work flows for installing scientific software using Ansible. In the end we had an Ansible playbook for installing Bio::Perl. The playbook did many things. It installed gcc and cpanm as well as Bio::Perl. In this post I will show how we can split these tasks out into reusable components using Ansible’s concept of “roles”.

Let us have a look at the Ansible playbook from the end of the previous post.

---
- hosts: all
  sudo: True

  tasks:
    - name: install gcc required to build some Perl modules
      yum: name=gcc
           state=present

    - name: install cpan and perl-devel
      yum: name={{ item }}
           state=present
      with_items:
        - perl-devel
        - perl-CPAN

    - name: download cpanm
      get_url: url=https://cpanmin.us/
               dest=/tmp/cpanm.pl
               mode=755

    - name: install cpanm so that we can use the ansible cpanm module
      command: perl cpanm.pl App::cpanminus
      args:
        chdir: /tmp/
        creates: /usr/local/bin/cpanm

    - name: add cpanm symbolic link to /usr/bin/
      file: src=/usr/local/bin/cpanm
            dest=/usr/bin/cpanm
            state=link

    - name: install implicit Bio::Perl dependencies
      cpanm: name={{ item }}
      with_items:
        - Time::HiRes
        - LWP::UserAgent

    - name: install BioPerl
      cpanm: name=Bio::Perl

Looking at the above there are at least three reusable roles: build_tools (for installing gcc; this role could grow to include more build tools in the future), cpanm (for installing and configuring cpanm), and bio_perl (for installing Bio::Perl and its implicit dependencies). I guess one could argue that the implicit dependencies of Bio::Perl could be split out into individual roles, but for now I think that would be too granular.

To create Ansible roles we need a directory named roles. Let us create it along with the directories required for the build_tools role.

$ mkdir -p roles/build_tools/tasks

Now we move the task of installing gcc into the build_tools role by copying and pasting the text below into the file roles/build_tools/tasks/main.yml.

---
# Install and configure build tools.

- name: install gcc
  yum: name=gcc
       state=present

We now need to remove the gcc task from the playbook and add the build_tools role. Modify the playbook.yml file to look like the below.

---
- hosts: all
  sudo: True

  roles:
    - build_tools

  tasks:

    - name: install cpan and perl-devel
      yum: name={{ item }}
           state=present
      with_items:
        - perl-devel
        - perl-CPAN

    - name: download cpanm
      get_url: url=https://cpanmin.us/
               dest=/tmp/cpanm.pl
               mode=755

    - name: install cpanm so that we can use the ansible cpanm module
      command: perl cpanm.pl App::cpanminus
      args:
        chdir: /tmp/
        creates: /usr/local/bin/cpanm

    - name: add cpanm symbolic link to /usr/bin/
      file: src=/usr/local/bin/cpanm
            dest=/usr/bin/cpanm
            state=link

    - name: install implicit Bio::Perl dependencies
      cpanm: name={{ item }}
      with_items:
        - Time::HiRes
        - LWP::UserAgent

    - name: install BioPerl
      cpanm: name=Bio::Perl

In the above it is worth noting that one can mix roles and tasks in the same playbook. This is useful when one wants to create a playbook that makes use of some reusable roles but which also needs to perform some non-reusable tasks.

Now we can try running the playbook to make sure that we have not broken anything. Note that the output now reflects the fact that the install gcc task is being called from within the build_tools role.

$ ansible-playbook -i hosts playbook.yml

PLAY [all] ********************************************************************

GATHERING FACTS ***************************************************************
ok: [scicomp.example.com]

TASK: [build_tools | install gcc] *********************************************
changed: [scicomp.example.com]

...

Let us now create directory structures for the cpanm and bio_perl roles.

$ mkdir -p roles/{cpanm,bio_perl}/tasks

For the cpanm role cut and paste the code below into the file roles/cpanm/tasks/main.yml.

---
# Install and configure cpanm.

- name: install cpan and perl-devel
  yum: name={{ item }}
       state=present
  with_items:
    - perl-devel
    - perl-CPAN

- name: download cpanm
  get_url: url=https://cpanmin.us/
           dest=/tmp/cpanm.pl
           mode=755

- name: install cpanm so that we can use the ansible cpanm module
  command: perl cpanm.pl App::cpanminus
  args:
    chdir: /tmp/
    creates: /usr/local/bin/cpanm

- name: add cpanm symbolic link to /usr/bin/
  file: src=/usr/local/bin/cpanm
        dest=/usr/bin/cpanm
        state=link

And the code below into the file roles/bio_perl/tasks/main.yml.

---
# Install and configure Bio::Perl.

- name: install implicit Bio::Perl dependencies
  cpanm: name={{ item }}
  with_items:
    - Time::HiRes
    - LWP::UserAgent

- name: install BioPerl
  cpanm: name=Bio::Perl

Finally let us update the playbook.yml file so that it looks like the below.

---
- hosts: all
  sudo: True

  roles:
    - build_tools
    - cpanm
    - bio_perl

That is much cleaner! Furthermore the roles can be reused in other playbooks as and when we need them.

Adding dependencies

It might be obvious to us now that the bio_perl role depends on the build_tools and cpanm roles. However, it may be less obvious as the playbook grows or when we want to create a new playbook that makes use of the bio_perl role.

It is possible to make dependencies explicit when using Ansible roles. To do this we will need to add a meta directory to our bio_perl role.

$ mkdir roles/bio_perl/meta

Now copy and paste the code below into the file roles/bio_perl/meta/main.yml.

---
dependencies:
  - { role: build_tools}
  - { role: cpanm }

At this point one can reduce the playbook.yml file to include only the bio_perl role as the build_tools and cpanm roles will be pulled in as dependencies.

---
- hosts: all
  sudo: True

  roles:
    - bio_perl

The content of the playbook now really reflects the original intent: to install Bio::Perl.
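One way to picture how the dependencies get pulled in is the small Python sketch below. It is a toy model under my own assumptions (dependencies run before the role that lists them, and each role runs once); the execution_order function is hypothetical and does not guard against circular dependencies.

```python
def execution_order(role, dependencies, seen=None):
    """Return roles in the order they would run: dependencies first, each once."""
    if seen is None:
        seen = []
    for dependency in dependencies.get(role, []):
        execution_order(dependency, dependencies, seen)
    if role not in seen:
        seen.append(role)
    return seen


# Mirrors roles/bio_perl/meta/main.yml from above.
dependencies = {"bio_perl": ["build_tools", "cpanm"]}
print(execution_order("bio_perl", dependencies))
```

This prints the three roles in the same order as they were listed explicitly in the earlier version of the playbook.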

Summary

Ansible has the concept of “roles” that can be used to create reusable components. To create a role one simply needs to adhere to Ansible’s conventions of naming files and structuring directories. In its most basic form a role takes the form of tasks within a file named roles/name_of_role/tasks/main.yml.

In this post we also used the file roles/bio_perl/meta/main.yml to specify the dependencies of the role. This meant that the content of the final playbook was succinct and reflected the intent for which it was created, namely to install Bio::Perl. Furthermore, by explicitly stating the dependencies of the bio_perl role we made it easier to reuse.

Finally, we also noted that it is possible to pick and mix roles and tasks within a single playbook. This can be useful when creating playbooks that have both reusable and non-reusable components within them.

How to create automated and reproducible work flows for installing scientific software

2015-04-02T00:00:00+00:00

In any organisation systems administration is a big role, which entails making sure the systems everyone takes for granted just work. Email, internet, etc; everything needs to function 24/7.

But as computational scientists we need specialist software, written by and for scientists. This means that we often have to rely on ourselves to do some basic systems administration to install and manage scientific software.

The question then arises: how can one effectively configure machines to run scientific software? Particularly as installing software written by other scientists can often be a torturously painful process.

In this post I will outline a method for producing work flows that result in automated and reproducible software installations.

Let us start on the assumption that we have been given a clean machine running CentOS 6.5 by the IT department and now it is up to us to configure it with our scientific software.

Vagrant - create your own virtual machine

Let us refer to the machine given to us by the IT department as the production machine. This could be a physical box or a virtual machine, it does not really matter.

At this point we do not want to experiment with our production machine. Instead we will create a virtual machine on our desktop, which we will refer to as the testing machine. Depending on your interest in virtualisation you may already have heard of and used VirtualBox. It is a tool for creating virtual machines. If you have not already installed VirtualBox do so now (VirtualBox downloads).

Rather than working with VirtualBox directly we will make use of Vagrant. Vagrant is a command line utility for working with VirtualBox and other virtual machine providers such as VMWare and AWS. Here is a link to the Vagrant downloads.

We are now in a position to create and work with virtual machines solely from the command line. Let us start by creating a Vagrant file for setting up a CentOS 6.5 box.

$ vagrant init chef/centos-6.5

The command above creates a file named Vagrantfile, which in its most basic form simply specifies the Linux image to provision the virtual machine with. In this instance the image comes from atlas.hashicorp.com/chef/boxes/centos-6.5. Let us have a quick look at the Vagrantfile file.

Vagrant.configure(2) do |config|

  config.vm.box = "chef/centos-6.5"

end

Above I have left out all the comments giving further suggestions on how to configure the setup of the virtual machine.

Let us spin up the virtual machine and ssh into it.

$ vagrant up
$ vagrant ssh
Last login: Fri Mar  7 16:57:20 2014 from 10.0.2.2
[vagrant@localhost ~]$ pwd
/home/vagrant

As you can see, Vagrant has configured ssh to allow the vagrant user to log in without a password. Let’s close the ssh connection and find more details about the ssh configuration.

[vagrant@localhost ~]$ exit
logout
Connection to 127.0.0.1 closed.
$ vagrant ssh-config
Host default
  HostName 127.0.0.1
  User vagrant
  Port 2222
  UserKnownHostsFile /dev/null
  StrictHostKeyChecking no
  PasswordAuthentication no
  IdentityFile /home/olsson/.vagrant/machines/default/virtualbox/private_key
  IdentitiesOnly yes
  LogLevel FATAL

Finally, let us have a look at the Vagrant help.

$ vagrant help
Usage: vagrant [options] <command> [<args>]

    -v, --version                    Print the version and exit.
    -h, --help                       Print this help.

Common commands:
     box             manages boxes: installation, removal, etc.
     connect         connect to a remotely shared Vagrant environment
     destroy         stops and deletes all traces of the vagrant machine
     global-status   outputs status Vagrant environments for this user
     halt            stops the vagrant machine
     help            shows the help for a subcommand
     init            initializes a new Vagrant environment by creating a Vagrantfile
     login           log in to HashiCorp's Atlas
     package         packages a running vagrant environment into a box
     plugin          manages plugins: install, uninstall, update, etc.
     provision       provisions the vagrant machine
     push            deploys code in this environment to a configured destination
     rdp             connects to machine via RDP
     reload          restarts vagrant machine, loads new Vagrantfile configuration
     resume          resume a suspended vagrant machine
     share           share your Vagrant environment with anyone in the world
     ssh             connects to machine via SSH
     ssh-config      outputs OpenSSH valid configuration to connect to the machine
     status          outputs status of the vagrant machine
     suspend         suspends the machine
     up              starts and provisions the vagrant environment
     version         prints current and latest Vagrant version

For help on any individual command run `vagrant COMMAND -h`

Additional subcommands are available, but are either more advanced
or not commonly used. To see all subcommands, run the command
`vagrant list-commands`.

Note the vagrant halt and vagrant destroy commands to stop and delete the vagrant machine respectively.

Ansible - configure your virtual machine

The aim of the game is to make the process of installing our scientific software of interest reproducible and automated!

We will use the testing virtual machine provisioned using Vagrant to experiment with scripts to configure it.

My favorite tool for configuring machines is Ansible. It is written in Python and communicates with the machines it manages over SSH (using OpenSSH). Unlike many other configuration tools, such as Puppet and Chef, Ansible is agentless. In other words it does not require you to install an agent on the machine that you want to configure, which makes it much easier to use. It is also very easy to install; here is a link to the Ansible installation notes.

Ansible uses the YAML file format. Let us create a file named playbook.yml.

---
# A basic playbook that simply checks who I logged in as.
- hosts: all
  tasks:
  - name: run the whoami command
    command: whoami

To configure the Vagrant testing machine we simply need to update the Vagrantfile file; inserting the provisioning section below.

Vagrant.configure(2) do |config|

  config.vm.box = "chef/centos-6.5"

  config.vm.provision "ansible" do |ansible|
    ansible.playbook = "playbook.yml"
  end

end

We can now configure the Vagrant testing machine using the command below.

$ vagrant provision
==> default: Running provisioner: ansible...

PLAY [all] ******************************************************************** 

GATHERING FACTS *************************************************************** 
ok: [default]

TASK: [run the whoami command] ************************************************ 
changed: [default]

PLAY RECAP ******************************************************************** 
default                    : ok=2    changed=1    unreachable=0    failed=0   

At this point our Ansible playbook does not really do anything useful. It simply uses the command module to run the whoami program.

Ansible comes with a whole host of built in modules. For example yum, apt and homebrew are but a few of the modules for operating system package management. It also has pip, cpanm and gem modules for managing Python packages, Perl modules and Ruby gems respectively. There is also a vast array of modules for working with files. For more information check out the Ansible module index.

Below is a slightly more involved playbook for installing the Bio::Perl module. The playbook deals with a number of complications. It installs gcc to be able to compile some of the Perl modules. It installs cpan and cpanm to make it easier to install Perl modules. Further, Bio::Perl has some implicit dependencies that are not taken care of automatically when installing it using cpanm, so the playbook installs these dependencies first.

---
- hosts: all
  sudo: True

  tasks:
    - name: install gcc required to build some Perl modules
      yum: name=gcc
           state=present

    - name: install cpan and perl-devel
      yum: name={{ item }}
           state=present
      with_items:
        - perl-devel
        - perl-CPAN

    - name: download cpanm
      get_url: url=https://cpanmin.us/
               dest=/tmp/cpanm.pl
               mode=755

    - name: install cpanm so that we can use the ansible cpanm module
      command: perl cpanm.pl App::cpanminus
      args:
        chdir: /tmp/
        creates: /usr/local/bin/cpanm

    - name: add cpanm symbolic link to /usr/bin/
      file: src=/usr/local/bin/cpanm
            dest=/usr/bin/cpanm
            state=link

    - name: install implicit Bio::Perl dependencies
      cpanm: name={{ item }}
      with_items:
        - Time::HiRes
        - LWP::UserAgent

    - name: install BioPerl
      cpanm: name=Bio::Perl

We can now try out this Ansible playbook on the testing virtual machine.

$ vagrant provision
==> default: Running provisioner: ansible...

PLAY [all] ********************************************************************

GATHERING FACTS ***************************************************************
ok: [default]

TASK: [install gcc required to build some Perl modules] ***********************
changed: [default]

TASK: [install cpan and perl-devel] *******************************************
changed: [default] => (item=perl-devel,perl-CPAN)

TASK: [download cpanm] ******************************************************** 
changed: [default]

TASK: [install cpanm so that we can use the ansible cpanm module] ************* 
changed: [default]

TASK: [add cpanm symbolic link to /usr/bin/] ********************************** 
changed: [default]

TASK: [install implicit Bio::Perl dependencies] ******************************* 
ok: [default] => (item=Time::HiRes)
ok: [default] => (item=LWP::UserAgent)

TASK: [install BioPerl] ******************************************************* 
ok: [default]

PLAY RECAP ******************************************************************** 
default                    : ok=8    changed=5    unreachable=0    failed=0   

Great, it works! Almost time to deploy to the production machine. However, first let us commit our scripts to version control.

Git - tracking what you are doing

One of the beauties of Ansible is that it uses the human readable YAML file format. This means that you get descriptive configuration files that can be used directly to configure your machines.

Another beauty of text files is that they can be tracked in version control. This means that you can get an audit record of how the specification of the configuration evolved over time. Furthermore, you can use the ability to add comments to your commits to specify the reason why particular changes needed to be made.

Let us commit our work to version control.

$ git init
$ git add Vagrantfile
$ git commit -m "Vagrant file with CentOS 6.5 configured by playbook.yml"
$ git add playbook.yml
$ git commit -m "Playbook for installing Bio::Perl"

Configuring your production machine

Now that we have built up our Ansible configuration script and committed it to version control we can use it to configure the production machine.

In order to achieve this we need to put our public ssh key on the production server.

If you have not already created an ssh key pair you can do so using ssh-keygen. You can then append the public key to the authorized_keys file in the .ssh directory on the production server. For more detail see, for example, Etel Sverdlov's blog post on How To Set Up SSH Keys.

Up until this point we have not used Ansible directly, we have only used it through Vagrant. We will remedy that now.

First of all Ansible needs to know about the machines that you want it to talk to. By default Ansible looks for these in /etc/ansible/hosts. Alternatively, you can specify a “hosts” file using the command line option -i. Suppose that your server’s host name was scicomp.example.com; you could then add this to a file named hosts.

scicomp.example.com

A simple way to check that everything is set up as it should be is to make use of Ansible’s ping module. If everything is working you will see something along the lines of the below.

$ ansible -i hosts -m ping scicomp.example.com
scicomp.example.com | success >> {
    "changed": false,
    "ping": "pong"
}

Otherwise, you will see something along the lines of the below.

$ ansible -i hosts -m ping scicomp.example.com
scicomp.example.com | FAILED => SSH encountered an unknown error during the connection. We recommend you re-run the command using -vvvv, which will enable SSH debugging output to help diagnose the issue

The ansible program can be an extremely effective way of issuing ad-hoc commands to remote machines. However, we have a playbook that we want to run so we want to use ansible-playbook.

$ ansible-playbook -i hosts playbook.yml

PLAY [all] ********************************************************************

GATHERING FACTS ***************************************************************
ok: [scicomp.example.com]

TASK: [install gcc required to build some Perl modules] ***********************
changed: [scicomp.example.com]

TASK: [install cpan and perl-devel] *******************************************
changed: [scicomp.example.com] => (item=perl-devel,perl-CPAN)

TASK: [download cpanm] ********************************************************
changed: [scicomp.example.com]

TASK: [install cpanm so that we can use the ansible cpanm module] *************
changed: [scicomp.example.com]

TASK: [add cpanm symbolic link to /usr/bin/] **********************************
changed: [scicomp.example.com]

TASK: [install implicit Bio::Perl dependencies] *******************************
ok: [scicomp.example.com] => (item=Time::HiRes)
ok: [scicomp.example.com] => (item=LWP::UserAgent)

TASK: [install BioPerl] *******************************************************
ok: [scicomp.example.com]

PLAY RECAP ********************************************************************
scicomp.example.com        : ok=8    changed=5    unreachable=0    failed=0

And now the production machine is configured with Bio::Perl!

A confession

I did not actually have the IT department create a production machine for me just for the purposes of this blog post. Instead I used Vagrant to create a virtual one for me by simply removing the provisioning section we added earlier and uncommenting the line for setting up the machine on a private network.

Vagrant.configure(2) do |config|

  config.vm.box = "chef/centos-6.5"

  config.vm.network "private_network", ip: "192.168.33.10"

end

To make sure I got the machine in a clean state I simply destroyed it and spun it up again.

$ vagrant destroy
    default: Are you sure you want to destroy the 'default' VM? [y/N] y
==> default: Forcing shutdown of VM...
==> default: Destroying VM and associated drives...
$ vagrant up

I then used the Ansible hosts file below, all in one long line, to specify how to connect to the machine.

scicomp.example.com ansible_ssh_host=192.168.33.10 ansible_ssh_user=vagrant ansible_ssh_private_key_file=.vagrant/machines/default/virtualbox/private_key

Using the above we have pretty much created a staging virtual machine in a couple of minutes. Pretty cool!

Summary

As a computational scientist you are likely to get exposed to systems administration to some extent, in particular when installing scientific software.

In an ideal world you should try to make the installation of your software as reproducible and automated as possible, because your machine will fall over at one point or another. When this happens you want to be in a position where you simply need to press a button to get your new machine configured with all the software that you need to work effectively.

Vagrant is a tool for spinning up virtual machines from the command line. Virtual machines are great for testing scripts that you create to configure your machines.

Ansible is a wonderful tool for scripting the configuration of your machines. It is very powerful, yet easy to use. Make it your friend!

Finally, I highly recommend that you keep your Vagrant and Ansible files under version control. It will give you more confidence when experimenting with new setups and it provides a way for you to track the progression of your machines’ configurations.

In the next post we will learn how to convert the playbook created in this post into reusable components.

Object-oriented programming for scientists

2015-03-22T00:00:00+00:00

Using a bucket to create sandcastles. That is what object-oriented programming is all about.

Introduction

For anyone not familiar with object-oriented programming it can sometimes come across as something mysterious that is used by expert coders. Indeed, any respectable text book on object-oriented programming will try to overwhelm the reader with concepts such as “abstraction”, “encapsulation”, “inheritance” and “polymorphism”.

However, object-oriented programming is not that difficult and can be very useful when dealing with complex data structures. In this post I will illustrate some object-oriented principles using a bioinformatics example, the parsing of FASTA files.

The code will be written in Python because I like it, it has built-in support for object-oriented programming, and its syntax is relatively easy to understand.

An example using procedural programming

To set the scene let us write some code using procedural programming to parse the example.fasta file below.

>sp|O76074|PDE5A_HUMAN cGMP-specific 3',5'-cyclic phosphodiesterase OS=Homo sapiens GN=PDE5A PE=1 SV=2
MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE
RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG
TVSFLSDSEKKEQMPLTPPRFDHDEGDQCSRLLELVKDISSHLDVTALCHKIFLHIHGLI
SADRYSLFLVCEDSSNDKFLISRLFDVAEGSTLEEVSNNCIRLEWNKGIVGHVAALGEPL
NIKDAYEDPRFNAEVDQITGYKTQSILCMPIKNHREEVVGVAQAINKKSGNGGTFTEKDE
KDFAAYLAFCGIVLHNAQLYETSLLENKRNQVLLDLASLIFEEQQSLEVILKKIAATIIS
FMQVQKCTIFIVDEDCSDSFSSVFHMECEELEKSSDTLTREHDANKINYMYAQYVKNTME
PLNIPDVSKDKRFPWTTENTGNVNQQCIRSLLCTPIKNGKKNKVIGVCQLVNKMEENTGK
VKPFNRNDEQFLEAFVIFCGLGIQNTQMYEAVERAMAKQMVTLEVLSYHASAAEEETREL
QSLAAAVVPSAQTLKITDFSFSDFELSDLETALCTIRMFTDLNLVQNFQMKHEVLCRWIL
SVKKNYRKNVAYHNWRHAFNTAQCMFAALKAGKIQNKLTDLEILALLIAALSHDLDHRGV
NNSYIQRSEHPLAQLYCHSIMEHHHFDQCLMILNSPGNQILSGLSIEEYKTTLKIIKQAI
LATDLALYIKRRGEFFELIRKNQFNLEDPHQKELFLAMLMTACDLSAITKPWPIQQRIAE
LVATEFFDQGDRERKELNIEPTDLMNREKKNKIPSMQVGFIDAICLQLYEALTHVSEDCF
PLLDGCRKNRQKWQALAEQQEKMLINGESGQAKRN
>sp|Q9Y233|PDE10_HUMAN cAMP and cAMP-inhibited cGMP 3',5'-cyclic phosphodiesterase 10A OS=Homo sapiens GN=PDE10A PE=1 SV=1
MRIEERKSQHLTGLTDEKVKAYLSLHPQVLDEFVSESVSAETVEKWLKRKNNKSEDESAP
KEVSRYQDTNMQGVVYELNSYIEQRLDTGGDNQLLLYELSSIIKIATKADGFALYFLGEC
NNSLCIFTPPGIKEGKPRLIPAGPITQGTTVSAYVAKSRKTLLVEDILGDERFPRGTGLE
SGTRIQSVLCLPIVTAIGDLIGILELYRHWGKEAFCLSHQEVATANLAWASVAIHQVQVC
RGLAKQTELNDFLLDVSKTYFDNIVAIDSLLEHIMIYAKNLVNADRCALFQVDHKNKELY
SDLFDIGEEKEGKPVFKKTKEIRFSIEKGIAGQVARTGEVLNIPDAYADPRFNREVDLYT
GYTTRNILCMPIVSRGSVIGVVQMVNKISGSAFSKTDENNFKMFAVFCALALHCANMYHR
IRHSECIYRVTMEKLSYHSICTSEEWQGLMQFTLPVRLCKEIELFHFDIGPFENMWPGIF
VYMVHRSCGTSCFELEKLCRFIMSVKKNYRRVPYHNWKHAVTVAHCMYAILQNNHTLFTD
LERKGLLIACLCHDLDHRGFSNSYLQKFDHPLAALYSTSTMEQHHFSQTVSILQLEGHNI
FSTLSSSEYEQVLEIIRKAIIATDLALYFGNRKQLEEMYQTGSLNLNNQSHRDRVIGLMM
TACDLCSVTKLWPVTKLTANDIYAEFWAEGDEMKKLGIQPIPMMDRDKKDEVPQGQLGFY
NAVAIPCYTTLTQILPPTEPLLKACRDNLSQWEKVIRGEETATWISSPSVAQKAAASED

The aim is to find and print out the FASTA record with the UniProt identifier Q9Y233 (the second entry). The code below achieves this using procedural programming.

with open('example.fasta') as fh:
    match = False
    for line in fh:
        line = line.strip()  # Remove newline at the end of the line.
        if line.startswith('>'):
            # We have encountered a description line.
            # That means the start of a new FASTA record.
            if line.find('Q9Y233') != -1:
                # We have matched our search criteria.
                match = True
            else:
                # We have encountered a new entry and it does
                # not match the search criteria.
                match = False
        if match:
            # We are currently in a section of the FASTA file
            # that matches our search criteria.
            print(line)

It is worth noting that I felt the need to add quite a few comments to explain what was going on, a sign that everything is not as clear as it should be. However, on the whole the code does a job and it works.

Now imagine that you wanted to add a filter based on the length of the sequence. Is it immediately obvious what you would do? How can you ensure that the code remains understandable?

Object-oriented programming to the rescue

Object-oriented programming is all about grouping data and functionality together. This allows one to abstract away some of the complexities of the processing logic and to encapsulate the data.

Let us start by creating an object representing a FASTA record. Save the code below to a file named fasta.py.

class FastaRecord(object):
    """Class representing a FASTA record."""

    def __init__(self, description_line):
        """Initialise an instance of the FastaRecord class."""
        self.description = description_line.strip()
        self.sequences = []

    def add_sequence_line(self, sequence_line):
        """
        Add a sequence line to the FastaRecord instance.
        This function can be called more than once.
        """
        self.sequences.append( sequence_line.strip() )

    def __repr__(self):
        """Representation of the FastaRecord instance."""
        lines = [self.description,]
        lines.extend(self.sequences)
        return '\n'.join(lines)

There are a few things to note in the code above, particularly if you are new to object-oriented programming and/or Python.

First of all we inherit functionality from the base class object (the first line). This is kind of historical: “new style” classes were added in Python 2.2. To remain backwards compatible with the “classic” or “old style” classes it was decided that one would have to inherit from object to access the goodness of the new style class. In Python 3 all classes are new style, so inheriting from object is optional there. There are more details on the Python wiki.

Secondly, we make use of the “magic” method __init__. This is used to initialise a newly created instance of a class.

Classes, objects, instances, what is up with all this terminology? What does it all mean?

Okay, let us take a slight detour. You can think of classes as moulds, for example a plastic bucket that you bring to the beach to make a sand castle. You fill the bucket with sand and tip it upside down, pat it on the top and lift it up. What remains is a tower made out of sand. This sand castle is an “instance” of your bucket “class”. Finally, the term “object”, as in object-oriented programming, tends to be used to refer to classes and instances interchangeably.
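To make the analogy a little more concrete, here is a minimal sketch. The SandCastle class and its height_cm attribute are made up purely for illustration; the class plays the role of the bucket and each call to it tips out a new, independent sand castle.

```python
class SandCastle(object):
    """Hypothetical class: the mould that every sand castle comes from."""

    def __init__(self, height_cm):
        # Each instance stores its own value for this attribute.
        self.height_cm = height_cm


# Two sand castles made from the same mould.
small_castle = SandCastle(10)
tall_castle = SandCastle(25)

# Both are instances of the same class...
print(isinstance(small_castle, SandCastle))
print(isinstance(tall_castle, SandCastle))

# ...but they are separate objects with their own data.
print(small_castle is tall_castle)
print(small_castle.height_cm)
print(tall_castle.height_cm)
```

Changing the height of one castle would leave the other untouched, which is the sense in which instances are independent of each other.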

Back to the __init__ method, which is used to initialise an instance of the class. The instance created is accessible via the self argument. During the initialisation of the FastaRecord class we also provide the description line.

>>> from fasta import FastaRecord
>>> fasta_record = FastaRecord('>sp|O76074|PDE5A_HUMAN')

Note that the fasta_record variable above is an instance of the FastaRecord class. We can access the description attribute of the FastaRecord instance directly.

>>> fasta_record.description
'>sp|O76074|PDE5A_HUMAN'

The add_sequence_line method simply adds a sequence line to the sequences (list) attribute.

>>> fasta_record.add_sequence_line('MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE')
>>> fasta_record.add_sequence_line('RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG')

Finally, we have the “magic” __repr__ method. At this point you are probably screaming out loud, what is a “magic” method? A “magic” method is basically a way to make an object behave like a built-in Python object. For example the __repr__ method is used to describe how the instance should be represented. Let us illustrate this below.

>>> fasta_record
>sp|O76074|PDE5A_HUMAN
MERAGPSFGQQRQQQQPQQQKQQQRDQDSVEAWLDDHWDFTFSYFVRKATREMVNAWFAE
RVHTIPVCKEGIRGHTESCSCPLQQSPRADNSAPGTPTRKISASEFDRPLRPIVVKDSEG

For more information on “magic” methods have a look at Rafe Kettler’s blog post A Guide to Python’s Magic Methods.
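As a further sketch of the idea, the made-up SequenceLines class below (it is not part of fasta.py) defines __len__ and __repr__. Python calls these methods when the built-in len() and repr() functions are applied to an instance.

```python
class SequenceLines(object):
    """Hypothetical container of sequence lines, for illustration only."""

    def __init__(self, lines):
        self.lines = list(lines)

    def __len__(self):
        # Called by the built-in len() function.
        return sum(len(line) for line in self.lines)

    def __repr__(self):
        # Called by repr() and by the interactive prompt.
        return 'SequenceLines({!r})'.format(self.lines)


seq = SequenceLines(['MERAG', 'PSF'])
print(len(seq))   # 8, the total number of residues
print(repr(seq))  # SequenceLines(['MERAG', 'PSF'])
```

One thing to be aware of: an object that defines __len__ is treated as false when its length is zero, so a check such as "if seq:" would behave differently for an empty instance.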

A FASTA parser object

Now that we have a basic class for working with FASTA records let us create another class for parsing FASTA files.

class FastaParser(object):
    """Class for parsing FASTA files."""

    def __init__(self, fpath):
        """Initialise an instance of the FastaParser."""
        self.fpath = fpath

    def __iter__(self):
        """Yield FastaRecord instances."""
        fasta_record = None
        with open(self.fpath, 'r') as fh:
            for line in fh:
                if line.startswith('>'):
                    if fasta_record is not None:
                        yield fasta_record
                    fasta_record = FastaRecord(line)
                elif fasta_record is not None:
                    # Ignore any lines before the first description line.
                    fasta_record.add_sequence_line(line)
        if fasta_record is not None:
            # Yield the final record; an empty file yields nothing.
            yield fasta_record

In the example above I have used the __iter__ magic method. This basically defines the behaviour the class should display when it is iterated over, for example in a for loop. In this particular case we want it to yield FastaRecord instances as the FASTA file is parsed.

>>> from fasta import FastaParser
>>> fasta_parser = FastaParser('example.fasta')
>>> for fasta_record in fasta_parser:
...     print(fasta_record.description)
...
>sp|O76074|PDE5A_HUMAN cGMP-specific 3',5'-cyclic phosphodiesterase OS=Homo sapiens GN=PDE5A PE=1 SV=2
>sp|Q9Y233|PDE10_HUMAN cAMP and cAMP-inhibited cGMP 3',5'-cyclic phosphodiesterase 10A OS=Homo sapiens GN=PDE10A PE=1 SV=1
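If the combination of __iter__ and yield is new to you, the stripped-down sketch below may help. The Countdown class is hypothetical and has nothing to do with FASTA files; it only shows that a for loop (or list()) calls __iter__ and then consumes whatever the method yields.

```python
class Countdown(object):
    """Hypothetical class, used only to illustrate __iter__."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        # Because this method contains yield it is a generator function,
        # so every call returns a fresh iterator.
        current = self.start
        while current > 0:
            yield current
            current -= 1


countdown = Countdown(3)
print(list(countdown))  # [3, 2, 1]
print(list(countdown))  # [3, 2, 1] again, as iteration starts afresh
```

The FastaParser behaves in the same way: each new loop over the parser reopens the file and starts from the first record.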

Back to grouping data and functionality

At this point we could write a simple script to loop over the FASTA records and find the hits of interest. However, where should we add the logic for finding hits of interest?

I would argue that this is a great opportunity for abstracting away the logic of identifying a hit by putting it in the FastaRecord class itself. Let us extend the class to do this.

class FastaRecord(object):
    """Class representing a FASTA record."""

    def __init__(self, description_line):
        """Initialise an instance of the FastaRecord class."""
        self.description = description_line.strip()
        self.sequences = []

    def add_sequence_line(self, sequence_line):
        """
        Add a sequence line to the FastaRecord instance.
        This function can be called more than once.
        """
        self.sequences.append( sequence_line.strip() )

    def matches(self, search_term):
        """Return True if the search_term is in the description."""
        return self.description.find(search_term) != -1

    def __repr__(self):
        """Representation of the FastaRecord instance."""
        lines = [self.description,]
        lines.extend(self.sequences)
        return '\n'.join(lines)

Note the addition of the matches method above. Also, note that the addition of more functionality did not make the code any more difficult to understand.

It is now trivial to write a script to do the analysis that we want.

from fasta import FastaParser

for fasta_record in FastaParser('example.fasta'):
    if fasta_record.matches('Q9Y233'):
        print(fasta_record)

Compare the descriptiveness of this code to that of the procedural example at the beginning of this post.

But you had to write so much more code to get to this point, is it really worth it?

I go back to the scenario outlined earlier in this post. Imagine that you had to extend the logic of the code to be able to filter based on the length of the sequence. Which code base would you rather use as a starting point? If you are unsure, try adding this functionality to both code bases to find out which one is more extensible.
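For what it is worth, here is one possible sketch of the object-oriented version. It uses a trimmed-down copy of the FastaRecord class and adds a sequence_length method; the method name and the min_length threshold are my own choices for illustration and not the only way to do it.

```python
class FastaRecord(object):
    """Trimmed-down copy of the FastaRecord class from this post."""

    def __init__(self, description_line):
        self.description = description_line.strip()
        self.sequences = []

    def add_sequence_line(self, sequence_line):
        self.sequences.append(sequence_line.strip())

    def matches(self, search_term):
        """Return True if the search_term is in the description."""
        return self.description.find(search_term) != -1

    def sequence_length(self):
        """Return the total number of residues (hypothetical addition)."""
        return sum(len(line) for line in self.sequences)


record = FastaRecord('>sp|Q9Y233|PDE10_HUMAN')
record.add_sequence_line('MRIEERKSQH\n')
record.add_sequence_line('LTGLT\n')

min_length = 10  # Arbitrary threshold chosen for this example.
if record.matches('Q9Y233') and record.sequence_length() >= min_length:
    print(record.description)
```

The filtering logic in the script stays a single readable line, and the knowledge of how to compute the length lives next to the data it operates on.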

Try to avoid re-inventing the wheel

The point of this post was to illustrate object-oriented programming, not to re-invent the wheel. I used the example of parsing FASTA files in this post as they are widely used in biological research and are conceptually easy to understand. However, if you are serious about using Python for bioinformatics I suggest that you check out Biopython.

Conclusion

Object-oriented programming can be very useful when dealing with complex data structures. In particular it can be used to hide complexity by grouping data and functionality together.

Furthermore, it can make your code more understandable and extensible.

Finally, do not let your lack of knowledge about “polymorphism” and “inheritance” hold you back from making use of objects. Yes, these are interesting topics, and please do read up on them. However, they are not essential to your use of object-oriented programming (at least not in Python).

I hope you find this post useful and that it has encouraged you to try out object-oriented programming. Send me a message if you need any help.

Strategies to access content from Python functions that write to disk

2015-03-12T00:00:00+00:00

Have you ever worked with an API that has some sort of “save to file” function only to find yourself wanting a function that returns the content as a string? For example the Python image module skimage.io has a function named imsave that takes fname and arr as arguments and writes an image to disk. However, what I wanted was a function that returned the image as a byte string. In other words I wanted the behaviour of the Python Imaging Library’s PIL.Image.tobytes function. However, I could not find one in scikit-image.

Strategy 1: make use of `StringIO`

In these types of circumstances one can often make use of the StringIO module from Python’s standard library (the examples in this post use Python 2). Let’s illustrate this using PIL.

>>> import numpy as np
>>> from PIL import Image
>>> from StringIO import StringIO
>>> ar = np.zeros((50,50), dtype=np.uint8)  # The array we want to get a png byte string for.
>>> img = Image.fromarray(ar)
>>> img = img.convert('RGB')  # Need to convert to RGB to save as PNG.
>>> output = StringIO()
>>> img.save(output, format="PNG")
>>> contents = output.getvalue()
>>> output.close()
>>> assert(isinstance(contents, bytes))
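The StringIO module only exists in Python 2. As a rough sketch of the same idea for Python 3, assuming that the “save” function accepts a file-like object, one can use io.BytesIO instead. To keep the example free of third-party dependencies it uses a made-up save_pgm function, which writes a tiny greyscale image in the binary PGM format, as a stand-in for a real image library.

```python
import io


def save_pgm(fh, rows):
    """Hypothetical "save to file" function.

    Writes rows of grey values (0-255) to fh as a binary PGM image.
    """
    height = len(rows)
    width = len(rows[0])
    header = "P5\n{} {}\n255\n".format(width, height)
    fh.write(header.encode("ascii"))
    for row in rows:
        fh.write(bytes(bytearray(row)))


rows = [[0] * 4 for _ in range(3)]  # A 4x3 image of black pixels.

output = io.BytesIO()  # In-memory stand-in for a file opened in 'wb' mode.
save_pgm(output, rows)
contents = output.getvalue()
output.close()

assert isinstance(contents, bytes)
```

The in-memory buffer supports the same write calls as a real file, so the saving function does not need to know that nothing is written to disk.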

Strategy 2: write, read, delete

However, one cannot use the approach above with skimage.io.imsave as it does not provide a means to specify the format (the format seems to be “automagically” determined from the file name). So we are forced to save the image to disk and then read the contents of the file.

>>> import os
>>> from skimage.io import imsave
>>> imsave('tmp.png', ar)
>>> contents = open('tmp.png', 'rb').read()
>>> os.unlink('tmp.png')
>>> assert(isinstance(contents, bytes))

Strategy 3: create a context manager

The code above is really ugly. What we want is something that can give us a relatively safe temporary file path and delete it once we are done with it. This is what Python’s context managers are for. Context managers are what let you use the with statement for opening files etc. Jeff Preshing has written a nice tutorial on context managers, The Python “with” Statement by Example.

Here I will use a test driven development (TDD) approach to illustrate how we can implement a context manager to help us work more safely with temporary file paths. So, before we start working on an implementation let us specify the desired behaviour as a test. Add the code below to a file named tempfilepath.py.

if __name__ == "__main__":
    import os.path
    fpath = None
    with TemporaryFilePath() as tmp:
        assert(os.path.isfile(tmp.fpath))
        with open(tmp.fpath, 'w') as fh:
            fh.write('Testing opening and writing...')
        fpath = tmp.fpath
    assert(not os.path.isfile(fpath))

The code above will raise a NameError stating that the TemporaryFilePath is not defined. Great, now we can start adding an implementation to make the tests pass. I will do this incrementally as it is a useful illustration of some of the aspects of TDD.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

if __name__ == "__main__":
    import os.path
    fpath = None
    with TemporaryFilePath() as tmp:
        assert(os.path.isfile(tmp.fpath))
        with open(tmp.fpath, 'w') as fh:
            fh.write('Testing opening and writing...')
        fpath = tmp.fpath
    assert(not os.path.isfile(fpath))

We now get the error message below.

Traceback (most recent call last):
  File "tempfilepath.py", line 7, in <module>
    with TemporaryFilePath() as tmp:
AttributeError: __exit__

In true TDD style let us add a minimal implementation to make the test pass.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __exit__(self, type, value, tb):
        pass

if __name__ == "__main__":
    import os.path
    fpath = None
    with TemporaryFilePath() as tmp:
        assert(os.path.isfile(tmp.fpath))
        with open(tmp.fpath, 'w') as fh:
            fh.write('Testing opening and writing...')
        fpath = tmp.fpath
    assert(not os.path.isfile(fpath))

The implementation now gives the error below.

Traceback (most recent call last):
  File "tempfilepath.py", line 10, in <module>
    with TemporaryFilePath() as tmp:
AttributeError: __enter__

Let us add the minimal implementation to fix this.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __enter__(self):
        pass

    def __exit__(self, type, value, tb):
        pass

Which reveals the error below.

Traceback (most recent call last):
  File "tempfilepath.py", line 14, in <module>
    assert(os.path.isfile(tmp.fpath))
AttributeError: 'NoneType' object has no attribute 'fpath'

Can you work out what we need to do to fix this? This is a bit subtle, and it caught me out. The clue is that the tmp variable is NoneType, whereas it should have been TemporaryFilePath. This is due to the __enter__ function not returning anything. Let us fix it.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        pass

Now the context manager returns the object type we expect.

Traceback (most recent call last):
  File "tempfilepath.py", line 14, in <module>
    assert(os.path.isfile(tmp.fpath))
AttributeError: 'TemporaryFilePath' object has no attribute 'fpath'

Time to add the fpath attribute.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __init__(self):
        self.fpath = 'tmp.txt'

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        pass

Now we are starting to get to the centre of the desired functionality.

Traceback (most recent call last):
  File "tempfilepath.py", line 17, in <module>
    assert(os.path.isfile(tmp.fpath))
AssertionError

At this stage we just want to get the tests to pass so we add a “dumb” implementation.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __init__(self):
        self.fpath = 'tmp.txt'
        with open(self.fpath, 'w') as fh:
            pass

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        pass

Which gets us to the second assertion statement.

Traceback (most recent call last):
  File "tempfilepath.py", line 23, in <module>
    assert(not os.path.isfile(fpath))
AssertionError

Basically, we need to add some clean up functionality to the __exit__ function.

import os

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __init__(self):
        self.fpath = 'tmp.txt'
        with open(self.fpath, 'w') as fh:
            pass

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        os.unlink(self.fpath)

And that makes all the tests pass. However, this code still has the ugly side-effect of hijacking the tmp.txt file. It is time to refactor the code to make it less nasty. Let us make use of the tempfile module.

import os
import tempfile

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __init__(self):
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            self.fpath = tmp.name

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        os.unlink(self.fpath)

Great, everything is working nicely. The only problem is that in order to be able to save an image in png file format we need to be able to specify the suffix of the file name.

At this stage it is very tempting to simply add the desired functionality and it is where test driven development really requires discipline. Let us be good practitioners of TDD and add a test specifying the desired behaviour first.

if __name__ == "__main__":
    import os.path
    fpath = None
    with TemporaryFilePath() as tmp:
        assert(os.path.isfile(tmp.fpath))
        with open(tmp.fpath, 'w') as fh:
            fh.write('Testing opening and writing...')
        fpath = tmp.fpath
    assert(not os.path.isfile(fpath))
        
    with TemporaryFilePath(suffix='.png') as tmp:
        assert(tmp.fpath.endswith('.png'))
        

Great, we now have a failing test.

Traceback (most recent call last):
  File "tempfilepath.py", line 27, in <module>
    with TemporaryFilePath(suffix='.png') as tmp:
TypeError: __init__() got an unexpected keyword argument 'suffix'

Let us continue to work incrementally and only fix the error reported.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __init__(self, suffix=None):
        with tempfile.NamedTemporaryFile(delete=False) as tmp:
            self.fpath = tmp.name

This moves us on to the actual assertion that we wanted to test.

$ python tempfilepath.py
Traceback (most recent call last):
  File "tempfilepath.py", line 28, in <module>
    assert(tmp.fpath.endswith('.png'))
AssertionError

Let us try to fix it.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __init__(self, suffix=None):
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            self.fpath = tmp.name

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        os.unlink(self.fpath)

However, this results in a horrible error message.

Traceback (most recent call last):
  File "tempfilepath.py", line 20, in <module>
    with TemporaryFilePath() as tmp:
  File "tempfilepath.py", line 8, in __init__
    with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tempfile.py", line 462, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/tempfile.py", line 237, in _mkstemp_inner
    file = _os.path.join(dir, pre + name + suf)
TypeError: cannot concatenate 'str' and 'NoneType' objects

The error message above is basically trying to tell us that the suffix argument should not be None by default. We can verify this by looking at the tempfile.NamedTemporaryFile documentation, which states that it should be an empty string. Let us fix our code.

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __init__(self, suffix=''):
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            self.fpath = tmp.name

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        os.unlink(self.fpath)

And now all the tests pass. Below is the code in all its glory.

import os
import tempfile

class TemporaryFilePath(object):
    """Context manager for handling temporary file paths."""

    def __init__(self, suffix=''):
        with tempfile.NamedTemporaryFile(suffix=suffix, delete=False) as tmp:
            self.fpath = tmp.name

    def __enter__(self):
        return self

    def __exit__(self, type, value, tb):
        os.unlink(self.fpath)

if __name__ == "__main__":
    import os.path
    fpath = None
    with TemporaryFilePath() as tmp:
        assert(os.path.isfile(tmp.fpath))
        with open(tmp.fpath, 'w') as fh:
            fh.write('Testing opening and writing...')
        fpath = tmp.fpath
    assert(not os.path.isfile(fpath))
        
    with TemporaryFilePath(suffix='.png') as tmp:
        assert(tmp.fpath.endswith('.png'))

We can now use this to get the content of our numpy array as an image in byte string representation.

>>> from tempfilepath import TemporaryFilePath
>>> with TemporaryFilePath(suffix='.png') as tmp:
...     imsave(tmp.fpath, ar)
...     content = open(tmp.fpath, 'rb').read()
...
>>> assert(isinstance(content, bytes))
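As an aside, a similar helper can be sketched as a function using contextlib.contextmanager from the standard library. This is an alternative to the class developed above, not a replacement for it; the function name temporary_file_path is my own, and the try/finally block is there so that the file is also removed if the body of the with statement raises an exception.

```python
import contextlib
import os
import tempfile


@contextlib.contextmanager
def temporary_file_path(suffix=''):
    """Yield the path of a temporary file and delete it afterwards."""
    fd, fpath = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # We only need the path, not the open file descriptor.
    try:
        yield fpath
    finally:
        # Runs whether or not the body of the with statement raised.
        os.unlink(fpath)


with temporary_file_path(suffix='.png') as fpath:
    assert os.path.isfile(fpath)
    assert fpath.endswith('.png')
    with open(fpath, 'wb') as fh:
        fh.write(b'Testing opening and writing...')
    remembered_path = fpath

assert not os.path.isfile(remembered_path)
```

Note that here the with statement binds the path itself rather than an object with an fpath attribute.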

How to display objects as images in IPython

2015-03-08T00:00:00+00:00

IPython has some neat functionality for displaying objects in ways that can be more informative than the standard __repr__ representation. Both the IPython notebook and qtconsole support the display of png, jpeg and svg images. Furthermore, the IPython notebook can also display html, javascript, json and latex.

If you simply want to display an image you can achieve this using the IPython.display.Image class.

>>> from IPython.display import Image
>>> image = Image('tiny_tjelvar.png')
>>> image

The last image call would result in the image below being displayed in the IPython qtconsole/notebook.

However, suppose that you wanted to create an image representation of your own class. Let us illustrate this with the hypothetical example of an ImageFile class that simply stores the location of an image.

class ImageFile(object):
    """Class for storing an image location."""

    def __init__(self, fpath):
        self.fpath = fpath

    def _repr_png_(self):
        with open(self.fpath, 'rb') as fh:  # PNG data is binary.
            return fh.read()

The usage of the class above would be along the lines of the below.

>>> im_file = ImageFile('tiny_tjelvar.png')
>>> im_file

The example above would fall over if the file was not in png format. Let us make the code a little bit more robust by adding a naive file format check.

class ImageFile(object):
    """Class for storing an image location."""

    def __init__(self, fpath):
        self.fpath = fpath
        self.format = fpath.split('.')[-1]

    def _repr_png_(self):
        if self.format == 'png':
            with open(self.fpath, 'rb') as fh:  # PNG data is binary.
                return fh.read()

Finally, let us extend the class to be able to deal with jpeg and svg images as well.

class ImageFile(object):
    """Class for storing an image location."""

    def __init__(self, fpath):
        self.fpath = fpath
        self.format = fpath.split('.')[-1]

    def _repr_png_(self):
        if self.format == 'png':
            with open(self.fpath, 'rb') as fh:  # PNG data is binary.
                return fh.read()

    def _repr_jpeg_(self):
        if self.format == 'jpeg' or self.format == 'jpg':
            with open(self.fpath, 'rb') as fh:  # JPEG data is binary.
                return fh.read()

    def _repr_svg_(self):
        if self.format == 'svg':
            with open(self.fpath, 'r') as fh:  # SVG is plain text (XML).
                return fh.read()

You now have a class that stores image file paths and has the capability to interactively display the images using either IPython qtconsole or IPython notebook.
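The same hooks can also be used by classes that generate their image on the fly rather than reading it from disk. The sketch below uses a made-up ColourSwatch class whose _repr_svg_ method builds a small SVG document as a string. It does not need IPython to run; outside of the notebook or qtconsole you can simply call the method yourself to inspect the markup.

```python
class ColourSwatch(object):
    """Hypothetical class that draws itself as a coloured square."""

    def __init__(self, colour, size=50):
        self.colour = colour
        self.size = size

    def _repr_svg_(self):
        # The SVG markup is returned as a text string.
        return (
            '<svg xmlns="http://www.w3.org/2000/svg" '
            'width="{size}" height="{size}">'
            '<rect width="{size}" height="{size}" fill="{colour}"/>'
            '</svg>'
        ).format(size=self.size, colour=self.colour)


swatch = ColourSwatch('steelblue')
print(swatch._repr_svg_())
```

In an IPython notebook, evaluating swatch on its own line should then display the square instead of the default text representation.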

NorDevCon a day filled with passion!

2015-03-06T00:00:00+00:00

This year’s Norfolk Developer Conference (NorDevCon) took place on the 27th of February and what a feast it was! The conference, which traditionally has a heavy focus on Tech and Agile, was this year bolstered by a new business track as well as more design sessions.

The day started off with an introduction by Paul Grenyer (@pjgrenyer) who organised the event, followed by a short speech by Huw Sayer (@HuwSayer) congratulating Norwich on now being an official #TechCluster on the (@TechCityUK) map. This is a great achievement and it was interesting to hear about the journey that led to the official recognition.

The event was then officially opened by Jon Skeet (@jonskeet) giving the opening keynote, which was all about passion! Jon started out on the premise that conferences are useful for connecting and inspiring people and this comes from passion. In fact Jon went as far as saying that conferences are all about inspiration and not about learning.

One take home message from Jon’s talk was that we cannot always work on things that we are passionate about. However, we can always strive to find something interesting in the dullest of tasks, for example by trying to understand the problem space or by working out how to do something as well as possible. In other words be curious. If nothing else it will make you a more interesting person to work with.

How do you share and grow passion with people that you meet and work with? Jon suggested that you should listen, encourage, nurture and feed other people’s passion. However, which ideas spread will be partly down to charisma. This means that we all have a role to make sure that this does not become too unbalanced. In other words we all have a role in echoing what other people have done; to amplify other people’s message without taking them over. One of the great benefits of nurturing passion is that a team with shared passion can really “motor along” and achieve great things.

The dangers of passion were also dealt with. Jon identified three types of destructive behaviour and outlined ways in which to deal with them.

The first scenario was when two people on a team disagree. It happens a lot. It happens a lot with good people. It is inevitable as more often than not there is more than one solution. What to do? The team needs to pick one way forward. However, it is at this point that the whole team needs to look after the person who “lost”. Otherwise the dynamics of the team can get unbalanced. In particular if the same person “loses” several times there is the danger that that person will feel that his/her opinions and suggestions are not appreciated and he/she may well start looking for a different place to work and the team will lose diversity.

The second danger scenario identified was inter-team disagreement. For example the fast and furious team having to work with the really careful team. In this case you get positive feedback loops in both camps, where all the negative views are constantly being reinforced by the people you trust, i.e. the people on your team. In this scenario there will be a need for compromise and the teams will need to talk face-to-face.

The third danger scenario was a team with no disagreement at all. Although a team with shared passion can really “motor along”, what happens if the team was running in the wrong direction from the start? To avoid this scenario Jon’s suggested solution was to take a step back and make sure that you think about the business value. What is the ultimate goal of the project? Will it make people happy?

After describing the dangers Jon did add the caveat that clearly the easy solutions outlined above are not at all easy in real life.

Finally, Jon gave us a challenge. The challenge was to bathe in passion! If a speaker did not inspire us with passion we should leave the session and find passion somewhere else. When we introduced ourselves to each other in the breaks we should do so with passion!

With those thoughts embedded in our minds we then went out to seek our fortune in the rest of the conference.

I had a great time! I met up with old friends and I made new friends. I had my appetite whetted for Docker by Dom Davis (@idomdavis). I learnt about browser APIs from Ruth John (@Rumyra); via the medium of VJing! I was inspired by Seb Rose (@sebrose) to write more and better tests. I had minor revelations on how to improve my coding style during the talk by Kevlin Henney (@KevlinHenney). It was great!

The day was rounded off by a fast and furious closing keynote by Harry Harrold (@harryharrold) and Rupert Redington (@rupertredington). It was an audio-visual bonanza, shining a light on the Agile manifesto, ending up in fireworks!

It was a day filled with passion and I am already looking forward to next year’s instalment of NorDevCon!

Three essential tips for improving your scientific code

2015-02-28T00:00:00+00:00

Writing scientific code is not dissimilar to writing any other type of code. What is different is that many people who end up coding during their PhD do not have or get any formal training in software development best practices.

For me it was very much a case of trial and error and picking things up as I went along. In the research group that I was in, my peers were doing lab work culturing cells and characterising proteins, so there was no one to discuss programming with. So I learnt by reading; reading blogs; reading magazines; reading books; reading other people’s code. However, it was a slow process sorting the wheat from the chaff. Furthermore many of the things that I read seemed a bit over the top for a one man band.

With hindsight I realise that another difficulty in learning software development best practices is that sometimes the most fundamental aspects of software development are not explicitly stated as they are taken for granted by everyone that uses them.

Here are the three most valuable things that I have learnt both from my own trial and error and by working with great software developers since finishing my PhD. For anyone developing software professionally, none of this will be new. However, if you are a scientist who has drifted into programming, these are the three most important things that you can do to improve your productivity and the quality of your code.

Use version control

Using version control is one of the simplest ways of increasing your productivity. The reason is it reduces your fear of changing existing code as you can always roll back to a previously working state. One of the tell-tale signs that you need to use version control is if your project directory contains files named along the lines of the below (as you can tell I used to do this before I saw the light).

new_simulation.py
older_simulation.py
old_simulation.py
simulation.py
simumlation_100606.py
test_simulation.py

When I started using version control Subversion was the best open source tool available. However, it was difficult to set up and I was never sure I got it right. These days you have a choice of two largely equivalent systems: Git or Mercurial. These are very easy to set up and use.

Here I will illustrate how to use Git from the command line. To start a new project:

git init simulation
cd simulation

This creates a directory named simulation. The files in this directory can now be managed by Git. Suppose that we create a README file in this directory and want to add it to version control.

git add README

When a file is added it is staged to be committed. Let’s commit it as a snapshot to version control.

git commit -m "Added README file."

That’s it. You can keep using git add and git commit to add incremental changes to your code base until you find that you need to use some more powerful features of Git, at which point you can learn more about it.

If you already have a project that you want to start tracking using Git you can use the commands below.

cd my_existing_project
git init
git add "*"
git commit -m "Initial file import."

Note that the command above will put all files in the project directory under version control so do not use it if your project directory contains machine generated files, for example output files from your program.

Once you have got a little bit of familiarity with Git or Mercurial I would strongly recommend that you set up an account with BitBucket or GitHub and host your code there. This has several advantages: you can stop worrying about your computer crashing and losing all your work, you can access your code from any machine with an internet connection, and you can collaborate with other people on your code.

Write code so that it can be understood by someone else

One thing that is different when developing code in an academic environment is that it is not unusual to start a project from scratch. This is quite uncommon when working for a company that develops software. In the latter case one of the first tasks on the job is to get familiar with the company’s code base. This means reading code, which is when one learns to appreciate:

Explicit names of variables, functions and classes
Comments that explain the intent of code whenever it is not immediately intuitive
Documentation describing the overall architecture of the system
Consistent coding style

However, when one starts coding by oneself on a new project none of the above matters as the logic behind every decision and poorly named variable is immediately obvious to oneself. Anyway, one tells oneself, “I will deal with those niceties once the program is working”.

However, once the program is working those things which were immediately obvious are now obscure and anyway the program is working so one can use it to generate some results, which is much more interesting than cleaning up code. Then one realises that the results are not quite as expected and something is not quite right about the logic of the program. However, the logic of the program is not immediately clear…

So go on, name your variable temperature_increase instead of temp_inc or ti. You have to type a few more letters but you will gain so much more. By the way, does temp stand for temporary or temperature and does inc stand for increment or increase? Also, if you find that typing out long explicit names for variables, functions and classes is causing you frustration then you should go on the hunt for a better text editor (I use vim) or an integrated development environment, that understands code and offers to complete names for you.
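
As a small illustration of the point above, the two functions below do exactly the same thing; only the names differ. The function and variable names are made up for this example.

```python
# Terse names: the reader has to guess what t, ti and n stand for.
def upd(t, ti, n):
    return t + ti * n


# Explicit names: the intent is readable without any comments.
def final_temperature(initial_temperature, temperature_increase, num_steps):
    return initial_temperature + temperature_increase * num_steps


print(final_temperature(initial_temperature=20.0,
                        temperature_increase=0.5,
                        num_steps=10))
```

The second version costs a few more key strokes, but the call at the bottom can be read without looking up the function definition.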

In terms of commenting your code, the key is to realise that you should document the intent not the actual code. In other words, I can read your code so I don’t need it re-iterated using plain English. However, I cannot read your mind so please tell me what the intention was.
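
The difference between the two kinds of comment can be sketched as follows. The data and the threshold of 100 are invented for the sake of the example.

```python
measurements = [4.1, 3.9, 250.0, 4.0]

# A comment that restates the code adds nothing the reader cannot already see:
# keep values less than 100.
filtered = [m for m in measurements if m < 100]

# A comment that explains the intent tells the reader what they cannot work out
# alone: in this made-up example readings of 100 or more are sensor glitches,
# so they are dropped before averaging.
cleaned = [m for m in measurements if m < 100]
mean_reading = sum(cleaned) / len(cleaned)
```

Both list comprehensions are identical; only the second comment would help someone decide whether the threshold is still appropriate.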

Describing the architecture of the system is just a fancy way of saying that you should describe how the components of your software interact with each other. Suppose for example that you were faced with a relatively simple code base that contained the files:

parser.py
database.py
simulation.py
experiment.py

Can you describe how these Python modules interact with and depend on each other? What is the difference between a simulation and an experiment?

Now suppose that the author of this hypothetical Python package had been kind and spent five minutes including the lines below in the README file, would it enable you to answer the questions above?

README
======

parser.py     - module for parsing parameter files
database.py   - module for storing results
simulation.py - module for running simulations
experiment.py - template for creating a new experiment

The ``experiment.py`` template uses the parser to read
in the parameters for the experiment. The parameters
are then passed on to the simulation (``simulation.py``).
Note that when you instantiate the ``simulation.Simulation``
class you need to provide it with a ``database.Database``
instance. The latter will be used to write the simulation
results to your database of choice.
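
To make the README above a little more concrete, here is a rough sketch of how such an experiment script might wire the pieces together. Everything in it is hypothetical: the package does not exist, so the classes, the methods and the parameter names are simple in-memory stand-ins invented for illustration.

```python
class Database:
    """Hypothetical stand-in for database.Database: keeps results in memory."""

    def __init__(self):
        self.results = []

    def write(self, result):
        self.results.append(result)


class Simulation:
    """Hypothetical stand-in for simulation.Simulation."""

    def __init__(self, database):
        # The simulation is handed a database rather than creating its own.
        self.database = database

    def run(self, parameters):
        result = parameters["initial_value"] * parameters["growth_factor"]
        self.database.write(result)


def parse_parameters(text):
    """Hypothetical stand-in for the parser: reads 'name = value' lines."""
    parameters = {}
    for line in text.strip().splitlines():
        name, value = line.split("=")
        parameters[name.strip()] = float(value)
    return parameters


# What an experiment script based on the template might look like.
parameters = parse_parameters("initial_value = 2.0\ngrowth_factor = 1.5")
database = Database()
Simulation(database).run(parameters)
print(database.results)
```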

I won’t dwell too long on coding style. Basically be consistent and try to use the standard one for your language; i.e. if you code in Python use PEP8, if you write C code use K&R style, and so forth. If coding style interests you, please read Style is Substance by Ken Arnold.

Write tests

This is the most difficult of the tips outlined in this post. Writing good tests is hard, and continuing to write tests as your code base grows requires discipline. Furthermore, many scientific algorithms have a stochastic nature to them, which further compounds the situation.
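
One common way of taming stochastic code in tests, sketched below, is to pass the random number generator into the function so that a test can supply one with a fixed seed. The random_walk function is a toy example invented for this purpose, not a recommendation for any particular algorithm.

```python
import random


def random_walk(num_steps, rng):
    """Return the final position of a 1D random walk using the supplied generator."""
    position = 0
    for _ in range(num_steps):
        position += rng.choice([-1, 1])
    return position


# Two generators created with the same seed produce the same sequence,
# so the "random" result is reproducible and can be checked in a test.
first = random_walk(100, random.Random(42))
second = random_walk(100, random.Random(42))
assert first == second
```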

First of all, before you start writing tests, make sure that you find a suitable testing framework so that you do not reinvent the wheel. For example if you are coding in Python you could use unittest.
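
A minimal sketch of what a test written with Python’s unittest module can look like is given below. The gc_content function is a hypothetical example of code under test; in a real project it would live in its own module and be imported by the test file.

```python
import unittest


def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence (hypothetical example)."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


class GCContentTest(unittest.TestCase):

    def test_half_gc(self):
        self.assertEqual(gc_content("ATGC"), 0.5)

    def test_is_case_insensitive(self):
        self.assertEqual(gc_content("atgc"), gc_content("ATGC"))


# Load and run the tests explicitly so that the result can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(GCContentTest)
result = unittest.TextTestRunner().run(suite)
```

In day to day use you would normally not build the suite by hand; running python -m unittest in the project directory discovers and runs the tests for you.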

If you already have code that is working but have no tests, start by adding some integration tests. In other words treat your software as a black box that, given a set of inputs, produces a known set of outputs. Write an automated test that checks that this is true.
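
The black box idea can be sketched as follows. The three small functions stand in for an existing program and are invented for this example; the point is that the test at the bottom only ever calls run_pipeline and compares its output with a result that was checked by hand once.

```python
def parse(text):
    """Turn lines such as 'sample_a 1.5' into (name, value) pairs."""
    rows = []
    for line in text.strip().splitlines():
        name, value = line.split()
        rows.append((name, float(value)))
    return rows


def normalise(rows):
    """Scale the values so that the largest one becomes 1.0."""
    largest = max(value for _, value in rows)
    return [(name, value / largest) for name, value in rows]


def report(rows):
    """Format the rows as text, one 'name value' pair per line."""
    return "\n".join("{} {:.2f}".format(name, value) for name, value in rows)


def run_pipeline(text):
    """The whole program: the integration test only ever calls this."""
    return report(normalise(parse(text)))


# A known input and the output that was verified manually.
known_input = "sample_a 2.0\nsample_b 4.0\n"
expected_output = "sample_a 0.50\nsample_b 1.00"

assert run_pipeline(known_input) == expected_output
```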

Now once you go in and work on a particular unit of your code, make sure that you write a test for that particular unit first, then make the change that you wanted to make.

When the unit that you are testing is so isolated that it does not depend on any other code or systems (e.g. a database running in the background) then the test is referred to as a unit test.

There are two advantages to unit tests over integration tests. They make it easier to identify which part of your code is broken when they fail. Secondly they run quicker than integration tests so you can have more of them.

Why does the speed of the tests matter? Speed matters because once you have automated tests in place you need to run them often, at a minimum before every commit to version control.

At this point I recommend that you get a copy of Martin Fowler’s book Refactoring: Improving the Design of Existing Code. As the title suggests it is about refactoring rather than testing. However, refactoring requires tests and the book gives loads of practical advice on how to improve existing code by writing tests and refactoring.

If you are starting out with a clean slate (i.e. no existing code), I highly recommend that you start writing tests from the start. You could even go to the extreme and use Test Driven Development, where you write a test before you write any code. Initially the test will fail and then you implement the code to make the test pass.

Test driven development is a bit more complicated than what I outlined above, notably it includes a step of refactoring. However I will not go into more detail here. If test driven development sounds interesting and you are interested in web development as well I highly recommend Harry Percival’s book Test-Driven Development with Python.

This all sounds like a lot of hard work, why do I need tests anyway? I won’t dwell on this too much. However, if you don’t have tests how can you have any confidence that your code is doing what it is supposed to do? Okay, so you have done manual testing and the results are as expected. Fine, now suppose that you want to add another feature how can you be sure that you will not introduce a bug somewhere else? Do you want to do all that manual testing again? If you do not have tests you will get to the stage where you are afraid to touch the code for fear of breaking it.

Summary

This post turned out a bit longer than I initially thought. However the take home message is simple:

Use version control
Write code so that it can be understood by someone else
Write tests

Using version control is easy: do it!

Another person that is likely to need to get familiar with your code is you in six months’ time, so be kind and make your code easy to understand.

Writing good tests is initially hard, and the only way to learn is by practice (I’m still learning). However, do write them, otherwise your code will hold you to ransom.

If you already do all of the above, great, I’m preaching to the converted, please forward this post to someone less experienced than yourself.

Acknowledgements

I’d like to thank Clare Macrae (@ClareMacraeUK) for helpful discussions and feedback.

How to save RGB images using PyLibTiff

2015-02-18T00:00:00+00:00

In the previous post I showed how to read and write tiff files in Python using PyLibTiff. Here I will illustrate how to use PyLibTiff to create an RGB tiff file.

Figure illustrating the Canny edge detection algorithm followed by a binary filling of holes. The red, green and blue channels represent the initial edges, the segments identified and the raw data respectively.

The PyLibTiff on-line documentation is minimal, so I started off by simply trying to save a list containing three numpy.arrays. This was a guess based upon how I would have liked the package to work.

>>> import numpy as np
>>> from libtiff import TIFF
>>> r = np.ones((50,50), dtype=np.uint8) * 30
>>> g = np.ones((50,50), dtype=np.uint8) * 90
>>> b = np.ones((50,50), dtype=np.uint8) * 120
>>> tiff = TIFF.open('initial-test.tiff', 'w')
>>> tiff.write_image([r, g, b])
>>> tiff.close()

To my surprise this created a tiff file without complaining. However, inspecting the tiff file using exiftool revealed that the file only had one sample (channel) per pixel.

$ exiftool initial-test.tiff 
...
Bits Per Sample                 : 8
Compression                     : Uncompressed
Photometric Interpretation      : BlackIsZero
...

I had in fact produced a multi-page tiff file. After some head scratching I started digging around in PyLibTiff’s built-in documentation using pydoc. This was very informative. It revealed that the write_image() function has an argument named write_rgb, which by default is set to False; so I set it to True.

>>> tiff = TIFF.open('rgb-test.tiff', 'w')
>>> tiff.write_image([r, g, b], write_rgb=True)
>>> tiff.close()

Inspecting the new file revealed that it was indeed an RGB tiff file!

$ exiftool rgb-test.tiff
...
Bits Per Sample                 : 8 8 8
Compression                     : Uncompressed
Photometric Interpretation      : RGB
...

How is this useful?

Microscopy data often contains several channels of information, red and green fluorescence are common, so it is useful to be able to save these to the red and green channels respectively.
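
As a sketch of the two channel case, the snippet below builds the list of three arrays that an RGB image needs from a red and a green channel only. The arrays are filled with made-up constant values standing in for real fluorescence data, and the variable names are my own.

```python
import numpy as np

# Made-up 8-bit data standing in for two fluorescence channels.
red_fluorescence = np.full((50, 50), 200, dtype=np.uint8)
green_fluorescence = np.full((50, 50), 80, dtype=np.uint8)

# No blue signal was recorded, so use an all zero channel of the same
# shape and dtype to fill the third slot.
empty_channel = np.zeros_like(red_fluorescence)

channels = [red_fluorescence, green_fluorescence, empty_channel]
```

The resulting list has the same form as the [r, g, b] list used earlier, so it could be handed to write_image() with write_rgb=True in the same way.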

Furthermore, it can be a quick and dirty way of annotating regions of interest. Say for example that we wanted to visualise how a segmentation using the Canny edge detection algorithm followed by a binary filling of holes works in the context of the raw data. This can be achieved using the code snippet below, which was used to produce the image at the top of this post.

>>> from skimage import data
>>> from skimage.feature import canny  # older scikit-image releases exposed this as skimage.filter.canny
>>> from scipy.ndimage import binary_fill_holes
>>> coins = data.coins()
>>> edges = np.array(canny(coins), dtype=np.uint8) * 255
>>> filled = np.array(binary_fill_holes(edges), dtype=np.uint8) * 255
>>> tiff = TIFF.open('canny-fill-holes-segmentation.tiff', 'w')
>>> tiff.write_image([edges, filled, coins], write_rgb=True)
>>> tiff.close()

Saving 16-bit tiff files using Python

2015-02-13T00:00:00+00:00

When dealing with microscopy data it is not uncommon to be dealing with image files that have 16-bit channels. This presents a difficulty when working with Python as many imaging libraries struggle to save numpy.uint16 arrays.

To illustrate the problem let us create a white 50x50 pixel 16-bit image using numpy.

>>> import numpy as np
>>> ar = np.ones((50,50), dtype=np.uint16)
>>> ar = ar * np.iinfo(np.uint16).max

PIL/Pillow simply, and helpfully, raises a TypeError.

>>> from PIL import Image
>>> img = Image.fromarray(ar)
Traceback (most recent call last):
...
TypeError: Cannot handle this data type

SciPy does save the file, but it converts it to 8-bit. Personally I do not like this behaviour as it has caused me confusion on several occasions when subsequent steps of the analysis have read the file and tried to extract meaningful information from it.

>>> import scipy.misc
>>> scipy.misc.imsave('scipy.tiff', ar)
>>> ar2 = scipy.misc.imread('scipy.tiff')
>>> ar2.dtype
dtype('uint8')
>>> np.max(ar2)
0
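
To see why a silent conversion to 8-bit matters, the sketch below squeezes a few 16-bit values into 8 bits by keeping only the most significant byte. This is not the scaling that SciPy applies; it is just a simple illustration, with made-up values, of how much resolution is lost.

```python
import numpy as np

# Four 16-bit intensities: two dim pixels, one slightly brighter, one saturated.
ar16 = np.array([[0, 255, 256, 65535]], dtype=np.uint16)

# Keep the 8 most significant bits of each value.
ar8 = (ar16 >> 8).astype(np.uint8)

print(ar8)
```

The values 0 and 255, which are distinct in the 16-bit data, both end up as 0 in the 8-bit version.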

PyLibTiff to the rescue

PyLibTiff is a package that provides a wrapper to the libtiff library. To use it simply make sure that you have the libtiff library installed on your system and then you can use pip to install PyLibTiff. On a Debian-based system:

sudo apt-get install libtiff-dev
sudo pip install libtiff

Now let us look at how to save a file using PyLibTiff.

>>> from libtiff import TIFF
>>> tiff = TIFF.open('libtiff.tiff', mode='w')
>>> tiff.write_image(ar)
>>> tiff.close()

To show that everything is working as expected let us open the tiff file and read in the image from it.

>>> tiff = TIFF.open('libtiff.tiff', mode='r')
>>> ar = tiff.read_image()
>>> tiff.close()
>>> ar.dtype
dtype('uint16')
>>> np.max(ar)
65535

Other options

Another option for working with 16-bit tiff files is OpenCV-Python. I also believe that tifffile.py can handle them, although I have not tested this myself. The reason I prefer PyLibTiff over these is that it can be installed into a virtual environment using pip.
