External Modules
Importing Packages
Packages are just Python modules that offer additional functionality. There are many packages in the wild; take a look at the Python Package Index for an overview.
In order to make use of a package, you have to first import it:
import somepackage
Once imported, you can perform the usual operations on the package, e.g.:
>>> help(somepackage)
>>> print somepackage.__version__
>>> somepackage.function()
The Python Standard Library is installed by default together with Python 2 (and Python 3). It provides a lot of packages for dealing with many different tasks, e.g. regular expressions (more on that later).
Other specific packages do not come along with Python 2 (or 3) by default, so they have to be installed separately.
Warning
If you try to import a package, and get:
>>> import iamnotinstalled
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ImportError: No module named iamnotinstalled
it means that the package is not installed on your machine.
Luckily, it is easy to install a package with Python. The process is very streamlined. If you want to install a new package, just type (from the shell):
$ pip install thepackageyouwanttoinstall --user
This will install the package named thepackageyouwanttoinstall into your home directory (more specifically, inside the ~/.local/lib/python2.7/ directory).
After the installation, Python will automatically pick up the package and allow you to import it.
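If a script has to cope with a package that may not be installed yet, the ImportError shown above can be caught instead of letting the program stop. The following is only a minimal sketch: the package name iamnotinstalled is hypothetical and is assumed not to exist on your machine.

```python
try:
    import iamnotinstalled  # hypothetical package name, assumed not to exist
    installed = True
except ImportError:
    # Raised when Python cannot find the package.
    installed = False

if not installed:
    print("the package is missing: install it with pip and try again")
```

The same pattern works in Python 3, where the error raised is a subclass of ImportError.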
Warning
These instructions assume that you have the pip installer. This is the case for most GNU/Linux distributions and for Mac OS X versions from El Capitan onwards.
If pip is not installed, install it with your package manager on GNU/Linux or with brew on Mac OS X, or use the generic setup script provided by the Python Packaging Authority.
Example. Let’s install the biopython package:
$ pip install biopython --user
The pip command will download the package (and all its dependencies) from the internet, and install them in the correct order.
After pip is done, open a Python interpreter and try to import the biopython package:
>>> import Bio
Warning
The pip installer and the python interpreter may use different names to refer to the same package.
In the example above, the Biopython package is called biopython by the pip installer, and simply Bio from within Python.
Please refer to the package documentation (i.e. its website) to figure out what the package is called in the two settings.
Example. Let’s say that you want to import the numpy module. Just write:
import numpy
Once imported, you can take a peek at the various functionality supported by the numpy module by typing:
help(numpy)
For instance, to use the arccos function provided by the numpy module, type:
print numpy.arccos(0)
You can also abbreviate the name of the package with a shorthand, as follows:
>>> import numpy as np
>>> print np.__version__
1.11.2
>>> np.arccos(0)
1.5707963267948966
>>> np.arccos(1)
0.0
Example. You can import specific sub-modules using the following notation:
import module.submodule
Then you can call the functions available in the sub-module as:
module.submodule.function("stuff")
For instance, to import the linalg (linear algebra) submodule of the numpy package, write:
>>> import numpy.linalg
>>> help(numpy.linalg)
>>> help(numpy.linalg.eig)
Now, you can also import the submodule as a standalone package:
>>> from numpy import linalg
>>> help(linalg.eig)
as well as using the shorthand trick:
>>> import numpy.linalg as la
>>> help(la.eig)
or both:
>>> from numpy import linalg as la
>>> help(la.eig)
Finally, you can import individual functions:
>>> from numpy import arccos
>>> print arccos(0)
as well as multiple functions:
>>> from numpy import arccos, arcsin
>>> print arccos(0)
1.57079632679
>>> print arcsin(0)
0.0
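The import forms above are not specific to numpy. As a small sketch that runs without installing anything, the same notations are applied here to the os and math modules of the Standard Library (these two modules are my choice for illustration, they are not used elsewhere in this section):

```python
import os.path                   # import module.submodule
import os.path as osp            # the same sub-module under a shorthand
from os import path              # the sub-module as a standalone name
from math import sqrt, floor     # individual functions

# The three import forms give three names for one and the same module.
same_module = (os.path is osp) and (osp is path)
print("all three names refer to the same module: {}".format(same_module))

print("sqrt(16) = {}".format(sqrt(16)))
print("floor(2.7) = {}".format(floor(2.7)))
```

Whichever form you pick only changes the name you type afterwards; the module that gets loaded is the same.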
Warning
The __future__ “module” is a special module used to import Python 3 functionality into Python 2 programs. It can be useful for writing code compatible with both Python 2 and Python 3.
For instance, although print is a statement in Python 2, in Python 3 it becomes a function, meaning that our beloved:
print "stuff"
does not work anymore: in Python 3, print is a proper function and requires parentheses:
print("stuff")
In order to have a true print function in your Python 2 program, use the following import at the very beginning of your script:
from __future__ import print_function
Another useful feature to import is:
from __future__ import division
Once imported, the division feature makes division between integers with the / operator return a float, so that 1 / 2 gives 0.5 rather than 0.
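The two kinds of division can be told apart with operations that behave identically in Python 2 and Python 3. In this sketch, operator.truediv stands in for what the / operator does once the division feature is active, while // always rounds down; the numbers are arbitrary.

```python
import operator

# True division: what 7 / 2 evaluates to once `division` has been imported
# (and what it always evaluates to in Python 3).
true_quotient = operator.truediv(7, 2)

# Floor division: rounds down, with or without the __future__ import.
floor_quotient = 7 // 2

print("true division:  {}".format(true_quotient))
print("floor division: {}".format(floor_quotient))
```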
Exercises - Packages
Import some of the packages from the Python Standard Library. Some useful ones are:
- The math package. It offers a plethora of mathematical utility functions.
  Use its factorial() function to compute the factorial of all integers from 1 to 100, i.e. 1!, 2!, ...
  Remark. It is much faster than our naive implementation!
- The glob package. It includes a function glob() that takes a path pattern (for instance the path to a directory followed by /*) and returns the list of matching paths, pretty much like the ls shell command.
  Use the glob() function and print the list of the files in your home directory.
  Remark. Unfortunately glob() does not understand the ~ path.
- The time package. It provides a time() function that returns the number of seconds since the Epoch, i.e. midnight UTC of 1st January 1970.
  It also provides a sleep() function that takes a number of seconds, and makes your program pause for the specified amount of time (with reasonable approximation).
  After properly importing the time() and sleep() functions from the time package, check what this code does:
    t0 = time()
    for i in range(5):
        sleep(1)
        print "it took me {}s to get here".format(time() - t0)
    print "done in approx {}s".format(time() - t0)
- The pickle package. It provides facilities for storing arbitrary Python data structures to disk, and loading them back.
  It is a must-have package for saving, e.g., the results of data analysis or other complex objects that are difficult to encode into text.
  Study its dump(object, file_handle) and load(file_handle) functions. Then use them to save a dictionary to file, close the Python interpreter, and load the dictionary back from the file.
  Remark. Make sure that the file_handle is opened for writing when writing to the file, and for reading when reading from it.
Regular Expressions
A regular expression (or regex) is a string that encodes a string pattern. The pattern specifies which strings match the regex.
A regex consists of both normal and special characters:
- Normal characters match themselves.
- Special characters match sets of other characters.
A string matches a regex if it matches all of its characters, in the sequence in which they appear.
Example. A few simple examples:
- The regex "text" matches only the string "text".
- The regex ".*" matches all strings.
- The regex "beginning.*" matches all strings that start with "beginning".
- The regex ".*end" matches all strings that end with "end".
More formally, the contents of a regex are:
Character | Meaning |
---|---|
text | Matches itself |
(regex) | Matches the regex regex (i.e. parens don’t count) |
^ | Matches the start of the string |
$ | Matches the end of the string or just before the newline at the end of the string |
. | Matches any character except a newline |
regex? | Matches 0 or 1 repetitions of regex (longest possible) |
regex* | Matches 0 or more repetitions of regex (longest possible) |
regex+ | Matches 1 or more repetitions of regex (longest possible) |
regex{m,n} | Matches from m to n repetitions of regex (longest possible) |
[...] | Matches a set of characters |
[c1-c2] | Matches the characters “in between” c1 and c2 |
[^...] | Matches the complement of a set of characters |
r1|r2 | Matches either r1 or r2 |
Note
There are many more special symbols that can be used in a regex. We will restrict ourselves to the most common ones.
Example. Let’s start with the anchor characters ^ and $:
- The regex "^something" matches all strings that start with "something", for instance "something better".
- The regex "worse$" matches all strings that end with "worse", for instance "I am feeling worse".
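The two anchors can be tried out with the re module described later in this chapter. This is only a sketch: matches() is a hypothetical helper written for this example, and the test strings other than the two quoted above are made up.

```python
import re

def matches(regex, text):
    """Hypothetical helper: True if regex is found somewhere in text."""
    return re.search(regex, text) is not None

starts = matches("^something", "something better")      # anchored at the start
not_start = matches("^something", "not something")      # "something" is not first
ends = matches("worse$", "I am feeling worse")          # anchored at the end
not_end = matches("worse$", "worse and worse, they say")

print("starts={} not_start={} ends={} not_end={}".format(
    starts, not_start, ends, not_end))
```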
Example. The “anything goes” character . (the dot) matches all characters except the newline. For instance:
- "." matches all strings that contain at least one character.
- "..." matches all strings that contain at least three characters.
- "^.$" matches all those strings that contain exactly one character.
- "^...$" matches all those strings that contain exactly three characters.
Example. The “optional” character ? matches zero or one repetitions of the preceding regex. For instance:
- "words?" matches both "word" and "words": the last character of the "words" regex (that is, the "s") is now optional.
- "(optional)?" matches both "optional" and the empty string.
- "he is (our )?over(lord!)?" matches the following four strings: "he is over", "he is our over", "he is overlord!", and "he is our overlord!".
Here I used the parens (...) to specify the scope of the ?.
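To check which whole strings a regex with ? accepts, the regex can be wrapped in parens and followed by the $ anchor. The sketch below does this with a hypothetical helper, full_match(), built on re.match() from the re module described later; the strings "wordss" and "optionaloptional" are made-up counterexamples.

```python
import re

def full_match(regex, text):
    """Hypothetical helper: True if regex matches the whole of text."""
    # re.match() anchors at the beginning; the trailing $ anchors at the end.
    return re.match("(" + regex + ")$", text) is not None

word_ok = full_match("words?", "word")          # the "s" is absent
words_ok = full_match("words?", "words")        # the "s" is present once
wordss_ok = full_match("words?", "wordss")      # twice is too many

empty_ok = full_match("(optional)?", "")
once_ok = full_match("(optional)?", "optional")
twice_ok = full_match("(optional)?", "optionaloptional")

print([word_ok, words_ok, wordss_ok, empty_ok, once_ok, twice_ok])
```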
Example. The “zero or more” character "*" and the “one or more” character "+" match zero or more (respectively, one or more) repetitions of the preceding regex. The parens (...) grouping trick still applies. A few examples:
- "Python!*" matches all of the following strings: "Python", "Python!", "Python!!", "Python!!!!", etc.
- "(column )+" matches: "column ", "column column ", etc. but not the empty string "".
- "I think that (you think that (I think that )*)+this regex is cool" will match things like "I think that you think that this regex is cool", as well as "I think that you think that I think that you think that I think that this regex is cool", and so on.
Example. The “from n to m” regex {n,m} matches from n to m repetitions of the preceding regex.
For instance, "(column ){2,3}" matches "column column " and "column column column ".
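The repetition operators *, + and {n,m} can be compared side by side. In this sketch every regex is anchored with ^ and $ so that it has to cover the whole string; the table of test cases is made up for illustration and relies on re.search() from the re module described later.

```python
import re

# (regex, text, expected outcome)
tests = [
    ("^Python!*$", "Python", True),          # zero "!" is fine for *
    ("^Python!*$", "Python!!!", True),
    ("^Python!+$", "Python", False),         # + needs at least one "!"
    ("^(column )+$", "column column ", True),
    ("^(column )+$", "", False),
    ("^(column ){2,3}$", "column ", False),  # one repetition is too few
    ("^(column ){2,3}$", "column column column ", True),
    ("^(column ){2,3}$", "column column column column ", False),
]

results = [re.search(regex, text) is not None for regex, text, _ in tests]
expected = [outcome for _, _, outcome in tests]
print("all outcomes as expected: {}".format(results == expected))
```

Note that the operator only applies to what immediately precedes it: a single character such as "!" or a whole group in parens.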
Example. Regexes can also match entire sets of characters (or their complement); in other words, they match all strings containing at least one of the characters in the set. For instance:
- "[abc]" matches strings that contain "a", "b", or "c". It does not match the string "zzzz".
- "[^abc]" matches all characters except "a", "b", and "c".
- "[a-z]" matches all lowercase alphabetic characters.
- "[A-Z]" matches all uppercase alphabetic characters.
- "[0-9]" matches all numeric characters from 0 to 9 (included).
- "[2-6]" matches all numeric characters from 2 to 6 (included).
- "[^2-6]" matches all characters except the numeric characters from 2 to 6 (included).
- "[a-zA-Z0-9]" matches all alphanumeric characters.
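One way to see which characters a set picks out is to list them with findall() from the re module described later. This is only a sketch, and the sample string is made up:

```python
import re

text = "Chain A, residue 42"   # made-up sample string

lowercase = re.findall("[a-z]", text)
uppercase = re.findall("[A-Z]", text)
digits = re.findall("[0-9]", text)
others = re.findall("[^a-zA-Z0-9]", text)   # complement: spaces and the comma

print("lowercase: {}".format("".join(lowercase)))
print("uppercase: {}".format("".join(uppercase)))
print("digits:    {}".format("".join(digits)))
print("others:    {}".format(others))

# A string without any character of the set does not match at all.
no_abc = re.search("[abc]", "zzzz") is None
```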
Example. Perhaps the most powerful special character, the | character, lets you match either of two regexes. For instance:
- "I|you|he|she|it|we|they" will match any string containing at least one of the English personal pronouns.
- "(PRO)+|(SER)+" matches, in their entirety, strings like "PRO", "PROPRO", "SER" and "SERSER", but not strings like "PROSER" or "SERPROSER".
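Two details of | are easy to miss: an alternative also matches inside longer words, and anchors bind to a single alternative unless the whole alternation is wrapped in parens. The sketch below illustrates both with re.search() from the re module described later; the test strings are made up.

```python
import re

pronouns = "I|you|he|she|it|we|they"

# The regex looks for substrings, so "white" matches too (it contains "it").
found_in_sentence = re.search(pronouns, "we like regexes").group()
found_in_white = re.search(pronouns, "white").group()

# Outer parens make ^ and $ apply to both alternatives.
whole = "^((PRO)+|(SER)+)$"
pro_twice = re.search(whole, "PROPRO") is not None
ser_once = re.search(whole, "SER") is not None
mixed = re.search(whole, "PROSER") is not None

print(found_in_sentence, found_in_white)
print(pro_twice, ser_once, mixed)
```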
Example. As usual, if you want to “disable” a special character, you have to escape it, i.e. prefix it with a backslash \. An alternative is to insert the special character to be disabled in square brackets.
For instance, in order to match strings that contain at least one (literal) dot, you can use either "\." or "[.]".
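A short sketch of the difference between a plain dot and an escaped one, using re.search() from the re module described later. The escaped regex is written as a raw string (the r prefix), which keeps Python from interpreting the backslash before the regex engine sees it; the test strings are made up.

```python
import re

# An unescaped dot matches any character, so this succeeds without any dot.
plain_dot = re.search(".", "no dots here") is not None

# An escaped dot only matches a literal ".".
escaped_miss = re.search(r"\.", "no dots here") is None
escaped_hit = re.search(r"\.", "file.txt") is not None

# The square-bracket form is equivalent.
bracket_hit = re.search("[.]", "file.txt") is not None

print(plain_dot, escaped_miss, escaped_hit, bracket_hit)
```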
Example. Regexes can be combined to obtain powerful matching operations. A few examples:
- A regex that only matches strings that start with "ATOM", followed by one or more spaces, followed by three space-separated integers:
  ^ATOM[ ]+[0-9]+ [0-9]+ [0-9]+
  The following string matches:
  ATOM 30 42 12
- A regex that matches a floating-point number in dot-notation:
  [0-9]+(\.[0-9]+)?
  e.g. "123" or "2.71828".
- A regex that matches a floating-point number in mathematical format:
  [0-9]+(\.[0-9]+)?e[0-9]+
  e.g. "6.022e23". (It can be improved!)
- A regex that matches the following UBR-box sequence motif: zero or more methionines (M), followed by either a glutamic acid (E) or an aspartic acid (D). The motif only appears at the beginning of the sequence, and never at the end (i.e. it is followed by at least one more residue):
  ^M*([ED]).
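The first two combined regexes can be tested against a few strings. In this sketch the test strings other than "ATOM 30 42 12", "123" and "2.71828" are made up, and the floating-point regex is wrapped in ^ and $ so that it has to cover the whole string; re.search() comes from the re module described later.

```python
import re

atom_regex = "^ATOM[ ]+[0-9]+ [0-9]+ [0-9]+"
atom_lines = ["ATOM 30 42 12", "ATOM    30 42 12", "HETATM 30 42 12", "ATOM 30 42"]
atom_results = [re.search(atom_regex, line) is not None for line in atom_lines]

# Raw string, so that the backslash reaches the regex engine unchanged.
float_regex = r"^[0-9]+(\.[0-9]+)?$"
float_texts = ["123", "2.71828", "2.", "abc"]
float_results = [re.search(float_regex, text) is not None for text in float_texts]

print(atom_results)
print(float_results)
```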
The re Module
The re module of the Python Standard Library lets you work with regular expression matching, for instance checking whether a given string matches a regular expression, or how many times a regular expression occurs in a string.
The available methods are:
Returns | Method | Meaning |
---|---|---|
match | match(regex, str) | Match a regular expression regex to the beginning of a string |
match | search(regex, str) | Search a string for the presence of a regex |
list | findall(regex, str) | Find all occurrences of a regex in a string |
Example. match(regex, string) and search(regex, string) return the first match of the regex in the string string. The difference is that:
- match() requires regex to match at the beginning of string
- search() does not: regex can match anywhere inside string
If no match is found, they return None. Otherwise, a match object is returned. This makes it easy to see whether the regex matches the string.
To extract the matched text from the match object, call its group() method, as in the following examples.
For instance (make sure that the re module has been imported):
>>> re.match("nomatch", "some text")
>>> print re.match("nomatch", "some text")
None
searches for the regex "nomatch" in "some text", starting at the beginning of the target string. Of course, no match is found ("some text" does not start with "nomatch"), so the function returns None.
Clearly, the following doesn’t work either:
>>> re.match("text", "some text")
>>> print re.match("text", "some text")
None
because "some text" does not start with "text"! However:
>>> print re.search("text", "some text")
<_sre.SRE_Match object at 0x7fac7b5906b0>
>>> print re.search("text", "some text").group()
text
works, because search() matches the regex anywhere in string, not just at the beginning.
In order to see match() work, try:
>>> print re.match("some", "some text")
<_sre.SRE_Match object at 0x7fac7b5906b0>
>>> print re.match("some", "some text").group()
some
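Since both functions return None when nothing is found, calling group() directly on the result fails for a non-matching regex. A common way around this, sketched here with a hypothetical helper named describe(), is to test the result before using it:

```python
import re

def describe(regex, text):
    """Hypothetical helper: matched text found by match() and by search()."""
    at_start = re.match(regex, text)     # only looks at the beginning
    anywhere = re.search(regex, text)    # scans the whole string
    return (at_start.group() if at_start else None,
            anywhere.group() if anywhere else None)

text_result = describe("text", "some text")
some_result = describe("some", "some text")
none_result = describe("nomatch", "some text")

print(text_result)
print(some_result)
print(none_result)
```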
Example. findall(regex, text) returns the list of all matches of regex inside text. For instance, let:
fasta = """>1A3A:A|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
>1A3A:B|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
>1A3A:C|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
>1A3A:D|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK"""
then we can extract the headers using a single call to findall() (the first header is not returned, because no newline precedes it):
>>> from pprint import pprint
>>> pprint(re.findall("\n>.*\n", fasta))
['\n>1A3A:B|PDBID|CHAIN|SEQUENCE\n',
'\n>1A3A:C|PDBID|CHAIN|SEQUENCE\n',
'\n>1A3A:D|PDBID|CHAIN|SEQUENCE\n']
Warning
Note that the * operator is greedy: .* does not stop at the first | character, but keeps going as far as it can, here up to the end of the line.
To extract only the protein IDs and chain ID, write:
>>> pprint(re.findall("\n>[a-zA-Z0-9:]+[^|]", fasta))
['\n>1A3A:B', '\n>1A3A:C', '\n>1A3A:D']
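Because the dot never matches a newline, a regex that starts at the > character and drops the surrounding "\n" is enough to pick up every header, including the first one. The sketch below uses a shortened, made-up FASTA string with two records rather than the full text above:

```python
import re

# Shortened, made-up FASTA text used only for illustration.
fasta = """>1A3A:A|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRK
>1A3A:B|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRK"""

# ".*" stops at the end of each line, so one match per header line.
headers = re.findall(">.*", fasta)

# The character set excludes "|", so each match stops right before it.
ids = re.findall(">[a-zA-Z0-9:]+", fasta)

print(headers)
print(ids)
```

This relies on the > character appearing only in header lines, which holds for protein sequences written with the usual one-letter codes.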
Exercises - Regular Expressions
Note
Here you can find the sequences.fasta FASTA file.
In the following exercises, you can use all the Python tools you want to compute the answer. In other words, looping over the lines of the FASTA file is allowed.
Motifs can appear anywhere in the sequence, unless stated otherwise.
Compute how many sequences contain the following motifs:
- A phenylalanine (F); two arbitrary amino acids; another phenylalanine. The motif should occur at the end of the sequence.
- An arginine (R); a phenylalanine; an amino acid that is not a proline (P); an isoleucine (I) or a valine (V).
Compute also how many sequences include at least one of the two motifs.
Compute how many sequences contain the following motifs:
- Three tyrosines (Y); at most three arbitrary amino acids; a histidine (H).
- Contains non-standard or unknown amino acids.
Are there sequences satisfying both conditions?
Note
Standard amino acids: A R N D C E Q G H I L K M F P S T W Y V.
Compute how many sequences contain the following motifs:
- An arginine (R); a lysine (K). The motif should not occur at the beginning of the sequence.
- Two arginines followed by an amino acid that is neither an arginine nor a lysine.
- None of the previous two motifs.
Compute how many sequences contain the following motifs:
- A phenylalanine (F); an arbitrary amino acid; a phenylalanine or a tyrosine (Y); a proline (P).
- A proline; a threonine (T) or a serine (S); an alanine (A); another proline. The motif should be neither at the beginning nor at the end of the sequence.
- The first motif followed by the second, or vice versa.