Biologist's Guide to Python string manipulation
Because information about DNA and proteins is often stored in plain text files, many aspects of biological data processing involve manipulating text. In computing, text is often referred to as strings of characters. String manipulation is therefore a common task, both for processing biological sequences and for interpreting sequence identifiers.
This post provides a quick summary of how Python can be used for such string manipulation, using the FASTA description line as an example.
The Python string object
When reading in strings from a text file one often has to deal with
lines that have leading and/or trailing white spaces. Commonly one wants
to get rid of them. This can be achieved using the strip() method
built into the Python string object.
>>> " text with leading/trailing spaces ".strip()
'text with leading/trailing spaces'
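Closely related are the lstrip() and rstrip() methods, which strip only one side of the string. rstrip() is particularly handy when reading a file line by line, because each line usually ends with a newline character. A minimal sketch (the line below is a made-up example):

```python
# A line as it might come back from iterating over a file object:
# note the leading spaces and the trailing newline.
line = "  >MySeq1|description line\n"

# rstrip() removes trailing whitespace (including the newline) but
# leaves the leading spaces intact.
print(repr(line.rstrip()))  # '  >MySeq1|description line'
```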
Another common use case is to replace a word in a line. For example,
when we strip out the leading and trailing white spaces one might want
to update the word “with” to “without” to make the resulting string
reflect its current state. This can be achieved using the replace() method.
>>> " text with leading/trailing spaces ".strip().replace("with", "without")
'text without leading/trailing spaces'
In the example above we chain the strip() and
replace() methods together.
In practice this means that the
replace() method acts on the return value of the strip() method.
Python’s string object also comes with a
startswith() method. This can,
for example, be used to identify FASTA description lines.
>>> ">MySeq1|description line".startswith(">")
True
The endswith() method complements the
startswith() method and is
often used to examine file extensions.
>>> "/home/olsson/images/profile.png".endswith("png")
True
The example above only works if the file extension is in lower case.
>>> "/home/olsson/images/profile.PNG".endswith("png")
False
However, we can overcome this issue by adding a call to the lower()
method, which converts the string to lower case.
>>> "/home/olsson/images/profile.PNG".lower().endswith("png")
True
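The startswith() method combines naturally with a list comprehension when one wants to pick out all the description lines from a FASTA file. A minimal sketch, using made-up lines rather than a real file:

```python
# Made-up FASTA lines: two description lines and two sequence lines.
lines = [">MySeq1|description line",
         "ATCGGC",
         ">MySeq2|another line",
         "GGCTA"]

# Keep only the lines that start with the FASTA ">" marker.
descriptions = [line for line in lines if line.startswith(">")]
print(descriptions)  # ['>MySeq1|description line', '>MySeq2|another line']
```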
Another common use case is to search for a particular string
within another string. For example, one might want to find out if the
UniProt identifier “Q6GZX4” is present in a FASTA description line. To
achieve this one can use the
find() method, which returns the index
position (zero-based) where the search term was first identified.
>>> ">sp|Q6GZX4|001R_FRG3G".find("Q6GZX4")
4
If the search term is not identified
find() returns -1.
>>> ">sp|P31946|1433B_HUMAN".find("Q6GZX4")
-1
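When all one needs is a yes/no answer, rather than the index itself, Python's in operator reads more clearly than comparing the return value of find() against -1:

```python
fasta_desc = ">sp|Q6GZX4|001R_FRG3G"

# The "in" operator returns True/False directly, with no index bookkeeping.
print("Q6GZX4" in fasta_desc)  # True
print("P31946" in fasta_desc)  # False
```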
When iterating over lines in a file one often wants to split the line
based on a delimiter. This can be achieved using the split() method.
By default this splits on white space characters and returns a list of strings.
>>> "text without leading/trailing spaces".split()
['text', 'without', 'leading/trailing', 'spaces']
A different delimiter can be used by providing it as an argument to the split() method.
>>> ">sp|Q6GZX4|001R_FRG3G".split("|")
['>sp', 'Q6GZX4', '001R_FRG3G']
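For SwissProt-style description lines like the one above, split() alone is often enough to pull out the accession, since the accession sits in the second pipe-delimited field. A sketch, assuming that field layout:

```python
fasta_desc = ">sp|Q6GZX4|001R_FRG3G"

# Split on the pipe delimiter; the accession is the second field
# (index 1) in this particular description-line layout.
fields = fasta_desc.split("|")
accession = fields[1]
print(accession)  # Q6GZX4
```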
There are many variations on the string operators described above. It is useful to familiarise yourself with the Python documentation on strings.
Regular expressions
Regular expressions can be defined as a series of characters that define a search pattern.
Regular expressions can be very powerful. However, they can be difficult to build up. Often it is a process of trial and error. This means that once they have been created, and the trial and error process has been forgotten, it can be extremely difficult to understand what the regular expression does and why it is constructed the way it is.
Warning: only use regular expressions as a last resort!
A good rule of thumb is to always try to use string operations to implement the desired functionality, and only switch to regular expressions when the code implemented using them becomes more difficult to understand than the equivalent regular expression.
To use regular expressions in Python we need to import the
re module, which is part of Python’s standard library. Importing modules in
Python is achieved using the import statement.
>>> import re
Let us store a FASTA description line in a variable.
>>> fasta_desc = ">sp|Q6GZX4|001R_FRG3G"
Now, let us search for the UniProt identifier
Q6GZX4 within the line.
>>> re.search(r"Q6GZX4", fasta_desc)
<re.Match object; span=(4, 10), match='Q6GZX4'>
There are two things to note here:
- We use a raw string to represent our regular expression, i.e. a
string prefixed with an r.
- The regular expression
search() method returns a match object (or None if no match is found).
What is a “raw” string? In Python “raw” strings differ from regular strings
in that the backslash
\ character is treated as a literal character. For example the
regular string equivalent of
r"\n" would be
"\\n" where the first backslash
is used to escape the effect of the second (remember that
\n represents a
newline). Raw strings were introduced in Python to make it easier to create
regular expressions that rely heavily on the use of literal backslashes.
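A quick way to convince oneself of the difference is to compare the lengths of the two strings:

```python
# The raw string r"\n" contains two characters: a backslash and an "n".
print(len(r"\n"))  # 2

# The regular string "\n" is a single character: a newline.
print(len("\n"))   # 1

# The raw string r"\n" and the escaped regular string "\\n" are the
# same string object value.
print(r"\n" == "\\n")  # True
```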
The index of the first matched character can be accessed using the match
object’s start() method. The match object also has an end() method
that returns the index of the last matched character + 1.
>>> match = re.search(r"Q6GZX4", fasta_desc)
>>> if match:
...     print(fasta_desc[match.start():match.end()])
...
Q6GZX4
In the above we make use of the fact that Python strings support
slicing. Slicing is a means to access a subsection of a sequence. The
[start:end] syntax is inclusive for the start index and exclusive for
the end index.
>>> "012345"[2:4]
'23'
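Either index can also be omitted, in which case the slice runs to the corresponding end of the string, and negative indices count from the end. This makes slicing a handy way to trim fixed markers, such as the leading ">" of a FASTA description line:

```python
fasta_desc = ">sp|Q6GZX4|001R_FRG3G"

# [1:] drops the first character (the ">" marker) and keeps the rest.
print(fasta_desc[1:])    # sp|Q6GZX4|001R_FRG3G

# Negative indices count from the end: the last ten characters here
# happen to be the entry name.
print(fasta_desc[-10:])  # 001R_FRG3G
```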
To see the merit of regular expressions we need to create one that
matches more than one thing. For example, consider a regular expression that
could match any of the identifiers id0, id1, ..., id9.
Now suppose that we had a list containing FASTA description lines with these types of identifiers.
>>> fasta_desc_list = [">id0 match this",
...                    ">id9 and this",
...                    ">id100 but not this (initially)",
...                    "AATCG"]
Note that the list above also contains a sequence line that we never want to match.
Let us loop over the items in this list and print out the lines that match our identifier regular expression.
>>> for line in fasta_desc_list:
...     if re.search(r">id[0-9]\s", line):
...         print(line)
...
>id0 match this
>id9 and this
There are two noteworthy aspects of the regular expression. Firstly, the
[0-9] syntax means match any digit. Secondly, the
\s regular expression
meta character means match any white space character.
If one wanted to create a regular expression to match an identifier with
an arbitrary number of digits one can make use of the *
character, which causes the regular expression to match the preceding
expression zero or more times.
>>> for line in fasta_desc_list:
...     if re.search(r">id[0-9]*\s", line):
...         print(line)
...
>id0 match this
>id9 and this
>id100 but not this (initially)
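One subtlety worth noting: because * also matches zero occurrences, the pattern above would match a line where ">id" is followed immediately by white space, with no digits at all. When at least one digit is required, the + meta character (match one or more times) is the safer choice. A small sketch with a made-up line:

```python
import re

# A hypothetical line with no digits after ">id".
line = ">id no digits here"

# The * quantifier happily matches zero digits...
print(bool(re.search(r">id[0-9]*\s", line)))  # True

# ...whereas + requires at least one digit.
print(bool(re.search(r">id[0-9]+\s", line)))  # False
print(bool(re.search(r">id[0-9]+\s", ">id100 but not this (initially)")))  # True
```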
It is possible to extract specific pieces of information from a line using regular expressions. This uses a concept known as “groups”, which are indicated using parenthesis. Let us try to extract the UniProt identifier from a FASTA description line.
>>> print(fasta_desc)
>sp|Q6GZX4|001R_FRG3G
>>> match = re.search(r">sp\|([A-Z0-9]*)\|", fasta_desc)
Note how horrible and incomprehensible the regular expression is!
It took me a couple of attempts to get this regular expression right, as
I forgot that
| is a regular expression meta character that needs to
be escaped using a backslash (\|).
The regular expression representing the UniProt identifier matches
capital letters (A-Z) and digits (0-9) zero or more times (*).
The UniProt regular expression is enclosed in parentheses. The parentheses
denote that the UniProt identifier is a group that we would like access to. In
other words, the purpose of a group is to give the user access to a section of
interest within the regular expression.
>>> match.groups()
('Q6GZX4',)
>>> match.group(0)  # Everything matched by the regular expression.
'>sp|Q6GZX4|'
>>> match.group(1)
'Q6GZX4'
Note that there is a difference between the
groups() and group()
methods. The former returns a tuple containing all the groups defined in the
regular expression. The latter takes an integer as input and returns a specific
group. However, confusingly,
group(0) returns everything matched by the
regular expression and
group(1) returns the first group; making the group()
method appear as if it used a one-based indexing scheme.
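For regular expressions with several groups, named groups can make the pattern easier to read back later. The (?P&lt;name&gt;...) syntax attaches a name to a group, which can then be retrieved with group("name"). A sketch, reusing the accession pattern from above (without the comma, so it only matches letters and digits):

```python
import re

fasta_desc = ">sp|Q6GZX4|001R_FRG3G"

# The group is named "accession" via the (?P<name>...) syntax.
match = re.search(r">sp\|(?P<accession>[A-Z0-9]*)\|", fasta_desc)
if match:
    print(match.group("accession"))  # Q6GZX4
```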
Finally, let us have a look at a common pitfall when using regular
expressions in Python: the difference between the search() and match() methods.
>>> print(re.search(r"cat", "my cat has a hat"))
<re.Match object; span=(3, 6), match='cat'>
>>> print(re.match(r"cat", "my cat has a hat"))
None
The match() method only looks for a match at the beginning of the
string to be searched. For more information see the search() vs.
match() section in the Python documentation.
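Another way to think about it: re.match() behaves like re.search() with the pattern anchored to the start of the string using the ^ meta character.

```python
import re

# match() anchors at the start, so "cat" mid-string does not match.
print(bool(re.match(r"cat", "my cat has a hat")))    # False

# search() with an explicit ^ anchor behaves the same way.
print(bool(re.search(r"^cat", "my cat has a hat")))  # False
print(bool(re.search(r"^cat", "cat in the hat")))    # True
```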
There is a lot more to regular expressions in particular all the meta characters. For more information have a look at the regular expressions operations section in the Python documentation.
This blog post was adapted from a section in the book that I am working on: The Biologist’s Guide to Computing. Please check it out if you found this post useful!