External Modules

Importing Packages

Packages are just Python modules that offer additional functionality. There are many packages in the wild; take a look at the Python Package Index for an overview.


In order to make use of a package, you have to first import it:

import somepackage

Once imported, you can perform the usual operations on the package, e.g.:

>>> help(somepackage)
>>> print somepackage.__version__
>>> somepackage.function()

The Python Standard Library is installed by default together with Python 2 (and Python 3). It provides a lot of packages for dealing with many different tasks, e.g. regular expressions (more on that later).

Other specific packages do not come along with Python 2 (or 3) by default, so they have to be installed separately.

Warning

If you try to import a package, and get:

>>> import iamnotinstalled
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ImportError: No module named iamnotinstalled

it means that the package is not installed on your machine.
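A script can also check for a missing package itself instead of crashing. The following is only a minimal sketch: iamnotinstalled is the placeholder name from the warning above, and have_package is a name made up for this illustration.

```python
# print(...) with a single argument behaves the same on Python 2 and Python 3.
try:
    import iamnotinstalled      # placeholder name, assumed not to be installed
    have_package = True
except ImportError:
    # Raised when Python cannot find the package on this machine.
    have_package = False

if have_package:
    print("package found")
else:
    print("package missing: install it with pip first")
```

This pattern is handy for optional dependencies: the program keeps running and can tell the user what to install.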

Luckily, it is easy to install a package with Python. The process is very streamlined. If you want to install a new package, just type (from the shell):

$ pip install thepackageyouwanttoinstall --user

This will install the package named thepackageyouwanttoinstall into your home directory (more specifically, inside the ~/.local/lib/python2.7/ directory).

After the installation, Python will automatically pick up the package and allow you to import it.
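The exact directory depends on your Python version and operating system, so the path given above is just one possibility. As a small sketch, the standard site and sys modules can show where your own interpreter looks; the variable name user_site is chosen here for illustration only.

```python
import site
import sys

# Directory used by "pip install --user" for the interpreter running this code.
user_site = site.getusersitepackages()
print(user_site)

# sys.path lists the directories searched by "import". The user directory
# usually appears in it once it exists, but this may be False in some setups
# (for instance inside a virtual environment).
print(user_site in sys.path)
```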

Warning

These instructions assume that you have the pip installer. This is the case for most GNU/Linux distributions and MacOS X versions >= El Capitan.

If pip is not installed, install it with your package manager on GNU/Linux or with brew on MacOS X, or use the generic setup script provided by the Python Packaging Authority.


Example. Let’s install the biopython package:

$ pip install biopython --user

The pip command will download the package (and all its dependencies) from the internet, and install them in the correct order.

After pip is done, open a Python interpreter and try to import the biopython package:

>>> import Bio

Warning

The pip installer and the python interpreter may use different names to refer to the same package.

In the example above, the Biopython package is called biopython by the pip installer, and simply Bio from within Python.

Please refer to the package documentation (i.e. its website) to figure out what it is called in the two settings.


Example. Let’s say that you want to import the numpy module. Just write:

import numpy

Once imported, you can take a peek at the various functionality supported by the numpy module by typing:

help(numpy)

For instance, to use the arccos function provided by the numpy module, type:

print numpy.arccos(0)

You can also abbreviate the name of the package with a shorthand, as follows:

>>> import numpy as np
>>> print np.__version__
1.11.2
>>> np.arccos(0)
1.5707963267948966
>>> np.arccos(1)
0.0

Example. You can import specific sub-modules using the following notation:

import module.submodule

Then you can call the functions available in the sub-module as:

module.submodule.function("stuff")

For instance, to import the linalg (linear algebra) submodule of the numpy package, write:

>>> import numpy.linalg
>>> help(numpy.linalg)
>>> help(numpy.linalg.eig)

Now, you can also import the submodule as a standalone package:

>>> from numpy import linalg
>>> help(linalg.eig)

as well as using the shorthand trick:

>>> import numpy.linalg as la
>>> help(la.eig)

or both:

>>> from numpy import linalg as la
>>> help(la.eig)

Finally, you can import individual functions:

>>> from numpy import arccos
>>> print arccos(0)

as well as multiple functions:

>>> from numpy import arccos, arcsin
>>> print arccos(0)
1.57079632679
>>> print arcsin(0)
0.0
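The same import forms work for any package. Here is a short recap that uses only the math and os.path modules of the standard library, picked for this sketch simply because they need no installation:

```python
import math                   # plain import: call math.sqrt(...)
import os.path                # sub-module import: call os.path.join(...)
import os.path as osp         # sub-module under a shorthand
from os import path           # sub-module as a standalone name
from math import sqrt, cos    # individual functions

print(math.sqrt(16))                        # 4.0
print(sqrt(16) == math.sqrt(16))            # True: same function, shorter name
print(osp is os.path and path is os.path)   # True: three names, one module
print(cos(0))                               # 1.0
```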

Warning

The __future__ “module” is a special module used to import Python 3 functionality into Python 2 programs. It can be useful for writing code compatible with both Python 2 and Python 3.

For instance, although print is a statement in Python 2, in Python 3 it becomes a function, meaning that our beloved:

print "stuff"

does not work anymore: in Python 3, print is a proper function and requires parentheses:

print("stuff")

In order to have a true print function in your Python 2 program, use the following import at the very beginning of your script:

from __future__ import print_function

A useful feature to import is:

from __future__ import division

Once imported, the division feature makes the / operator between integers always return a float, e.g. 1 / 2 gives 0.5 instead of 0.
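A quick way to see the effect of the division import is the sketch below. It assumes the import is the first statement of the script, which Python requires for all __future__ imports.

```python
from __future__ import division   # must come before any other statement

print(7 / 2)    # 3.5: true division, on both Python 2 and Python 3
print(7 // 2)   # 3: floor division is still available through //
print(6 / 2)    # 3.0: the result is a float even when it is a whole number
```

Without the import, Python 2 would print 3 for the first line.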


Exercises - Packages

  1. Import some of the packages from the Python Standard Library. Some useful ones are:

    • The math package. It offers a plethora of mathematical utility functions. Use its factorial() function to compute the factorial of all integers from 1 to 100, i.e. 1!, 2!, ...

      Remark. It is much faster than our naive implementation!

    • The glob package. It includes a function glob() that takes a path pattern, possibly containing wildcards such as *, and returns the list of matching paths, pretty much like the ls shell command.

      Use the glob() function and print the list of the files in your home directory.

      Remark. Unfortunately glob() does not understand the ~ path.

    • The time package. It provides a time() function that returns the number of seconds since the Epoch, i.e. midnight UTC of 1st January 1970.

      It also provides a sleep() function that takes a number of seconds, and makes your program pause for the specified amount of time (with reasonable approximation).

      After properly importing the time() and sleep() functions from the time package, check what this code does:

      t0 = time()
      for i in range(5):
          sleep(1)
          print "it took me {}s to get here".format(time() - t0)
      print "done in approx {}s".format(time() - t0)
      
    • The pickle package. It provides facilities for storing arbitrary Python data structures to disk, and loading them back.

      It is a must-have package for saving, e.g., the results of data analysis or other complex objects that are difficult to encode into text.

      Study its dump(object, file_handle) and load(file_handle) functions. Then use them to save a dictionary to file, close the Python interpreter, and load the dictionary back from the file.

      Remark. Make sure that the file_handle is opened for writing when writing to the file, and for reading when reading from it.



Regular Expressions

A regular expression (or regex) is a string that encodes a string pattern. The pattern specifies which strings match the regex.

A regex consists of both normal and special characters:

  • Normal characters match themselves.
  • Special characters match sets of other characters.

A string matches a regex if it matches all of its characters, in the sequence in which they appear.


Example. A few simple examples:

  • The regex "text" matches only the string "text".
  • The regex ".*" matches all strings.
  • The regex "beginning.*" matches all strings that start with "beginning"
  • The regex ".*end" matches all strings that end with "end".
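These claims can be checked with the re module, which is covered in detail later in this page. The sketch below assumes that "matches" means "matches the whole string"; matches_whole is a helper written for this illustration, and (?:...) is a grouping construct that is not covered in this text.

```python
import re

def matches_whole(regex, text):
    """Return True if regex matches text from its first to its last character."""
    # re.match anchors the regex at the start; the added $ anchors it at the end.
    return re.match("(?:" + regex + ")$", text) is not None

print(matches_whole("text", "text"))                         # True
print(matches_whole("text", "texts"))                        # False
print(matches_whole(".*", "anything at all"))                # True
print(matches_whole("beginning.*", "beginning of the end"))  # True
print(matches_whole(".*end", "beginning of the end"))        # True
print(matches_whole(".*end", "endless"))                     # False
```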

More formally, the contents of a regex can be:

Character     Meaning
text          Matches itself
(regex)       Matches the regex regex (i.e. parens don’t count)
^             Matches the start of the string
$             Matches the end of the string or just before the newline at the end of the string
.             Matches any character except a newline
regex?        Matches 0 or 1 repetitions of regex (longest possible)
regex*        Matches 0 or more repetitions of regex (longest possible)
regex+        Matches 1 or more repetitions of regex (longest possible)
regex{m,n}    Matches from m to n repetitions of regex (longest possible)
[...]         Matches a set of characters
[c1-c2]       Matches the characters “in between” c1 and c2
[^...]        Matches the complement of a set of characters
r1|r2         Matches either r1 or r2

Note

There are many more special symbols that can be used in a regex. We will restrict ourselves to the most common ones.


Example. Let’s start with the anchor characters ^ and $:

  • The regex "^something" matches all strings that start with "something", for instance "something better".
  • The regex "worse$" matches all strings that end with "worse", for instance "I am feeling worse".
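To try the anchors out, you can use re.search(), described later in this page. This is only a sketch; the variable names are invented for the illustration.

```python
import re

# bool(...) turns the match object (or None) into True or False.
starts_ok = bool(re.search("^something", "something better"))
starts_ko = bool(re.search("^something", "not something"))
ends_ok = bool(re.search("worse$", "I am feeling worse"))
ends_ko = bool(re.search("worse$", "worse than ever"))

print(starts_ok)   # True
print(starts_ko)   # False: "something" is not at the start
print(ends_ok)     # True
print(ends_ko)     # False: "worse" is not at the end
```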

Example. The “anything goes” character . (the dot) matches all characters except the newline. For instance:

  • "." matches all strings that contain at least one character.
  • "..." matches all strings that contain at least three characters.
  • "^.$" matches all those strings that contain exactly one character.
  • "^...$" matches all those strings that contain exactly three characters.
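The difference between the anchored and non-anchored versions can be seen with a small experiment. The sketch below uses re.search(), introduced later, and a results dictionary that exists only for this illustration.

```python
import re

results = {}
for text in ["", "a", "abc", "abcd"]:
    at_least_one = bool(re.search(".", text))        # some character exists
    exactly_three = bool(re.search("^...$", text))   # three characters, nothing else
    results[text] = (at_least_one, exactly_three)
    print(repr(text) + " -> " + str(results[text]))
```

Only "abc" satisfies both patterns; the empty string satisfies neither.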

Example. The “optional” character ? matches zero or one repetition of the preceding regex. For instance:

  • "words?" matches both "word" and "words": the last character of the "words" regex (that is, the "s") is now optional.
  • "(optional)?" matches both "optional" and the empty string.
  • "he is (our )?over(lord!)?" matches the following four strings: "he is over", "he is our over", "he is overlord!", and "he is our overlord!".

Here I used the parens (...) to specify the scope of the ?.
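A quick check of the ? character, as a sketch: the patterns are anchored with ^ and $ so that the whole string has to match, and the variable names are made up for the illustration.

```python
import re

plural = "^words?$"
plural_hits = [t for t in ["word", "words", "wordss", "wor"] if re.search(plural, t)]
print(plural_hits)    # ['word', 'words']

# A group followed by ? makes the whole group optional.
optional_group = "^(optional)?$"
group_hits = [t for t in ["", "optional", "option"] if re.search(optional_group, t)]
print(group_hits)     # ['', 'optional']
```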


Example. The “zero or more” character "*" and the “one or more” character "+" match zero or more (respectively one or more) repetitions of the preceding regex. The parens (...) grouping trick still applies. A few examples:

  • "Python!*" matches all of the following strings: "Python", "Python!", "Python!!", "Python!!!!", etc.
  • "(column )+" matches: "column ", "column column ", etc. but not the empty string "".
  • "I think that (you think that (I think that )*)+this regex is cool" will match things like "I think that you think that this regex is cool", as well as "I think that you think that I think that you think that I think that this regex is cool", and so on.
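The first two patterns can be tested as follows. This is a sketch that anchors the patterns with ^ and $ so that the whole string must match; star_hits and plus_hits are names chosen for the illustration.

```python
import re

star = "^Python!*$"
plus = "^(column )+$"

star_hits = [t for t in ["Python", "Python!", "Python!!!!", "Python?"]
             if re.search(star, t)]
plus_hits = [t for t in ["", "column ", "column column ", "column"]
             if re.search(plus, t)]

print(star_hits)   # ['Python', 'Python!', 'Python!!!!']
print(plus_hits)   # ['column ', 'column column ']
```

Note that "column" without the trailing space is rejected, because the space is part of the repeated group.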

Example. The “from n to m” regex {n,m} matches from n to m repetitions of the previous regex.

For instance, "(column ){2,3}" matches "column column " and "column column column ".
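As a sketch, the bounds can be verified by generating strings with one to five repetitions and keeping the counts accepted by the anchored pattern; the variable names are illustrative.

```python
import re

bounded = "^(column ){2,3}$"

# "column " * n builds a string with n repetitions.
accepted = [n for n in range(1, 6) if re.search(bounded, "column " * n)]
print(accepted)   # [2, 3]
```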


Example. Regexes can also match entire sets of characters (or their complement); in other words, they match all strings containing at least one of the characters in the set. For instance:

  • "[abc]" matches strings that contain "a", "b", or "c". It does not match the string "zzzz".
  • "[^abc]" matches all characters except "a", "b", and "c".
  • "[a-z]" matches all lowercase alphabetic characters.
  • "[A-Z]" matches all uppercase alphabetic characters.
  • "[0-9]" matches all numeric characters from 0 to 9 (included).
  • "[2-6]" matches all numeric characters from 2 to 6 (included).
  • "[^2-6]" matches all characters except the numeric characters from 2 to 6 (included).
  • "[a-zA-Z0-9]" matches all alphanumeric characters.
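One way to see which characters a set selects is re.findall(), described later in this page, which lists every match. The sample text below is made up for this sketch.

```python
import re

text = "Chain A, residue 42"

digits = re.findall("[0-9]", text)
uppercase = re.findall("[A-Z]", text)
others = re.findall("[^a-zA-Z0-9]", text)

print(digits)      # ['4', '2']
print(uppercase)   # ['C', 'A']
print(others)      # [' ', ',', ' ', ' ']

# No character of "zzzz" belongs to the set, so there is no match at all.
print(re.search("[abc]", "zzzz"))   # None
```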

Example. Perhaps the most powerful special character, the | character, allows you to match either of two regexes. For instance:

  • "I|you|he|she|it|we|they" will match any string containing at least one of the English personal pronouns.
  • "(PRO)+|(SER)+" matches strings like "PRO", "PROPRO", "SER" and "SERSER", but not strings like "PROSER" or "SERPROSER".
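The pronoun regex can be tried out as in the sketch below, which also shows two caveats: matching is case-sensitive, and a match may fall inside a longer word. The gr(a|e)y pattern is an extra example invented here to show parens limiting the scope of |.

```python
import re

pronoun = "I|you|he|she|it|we|they"

print(bool(re.search(pronoun, "maybe you know")))   # True
print(bool(re.search(pronoun, "NOBODY KNOWS")))     # False: case matters

# Caveat: the regex is also found inside other words.
inside = re.search(pronoun, "white").group()
print(inside)                                       # it

# Parens restrict the alternatives to a part of the regex.
colour_hits = [t for t in ["gray", "grey", "groy"] if re.search("^gr(a|e)y$", t)]
print(colour_hits)                                  # ['gray', 'grey']
```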

Example. As usual, if you want to “disable” a special character, you have to escape it, i.e. prefix with a backslash \. An alternative is to insert the special character to be disabled in square brackets.

For instance, in order to match strings that contain at least one (literal) dot, you can use either "\." or "[.]".
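The effect of escaping is easy to verify. In this sketch the r prefix marks a raw string, so that Python passes the backslash to the regex untouched; the variable names are illustrative.

```python
import re

any_char = bool(re.search(".", "abc"))        # unescaped dot: any character
escaped_miss = bool(re.search(r"\.", "abc"))  # literal dot, none in "abc"
bracket_miss = bool(re.search("[.]", "abc"))  # same, using square brackets
escaped_hit = bool(re.search(r"\.", "a.c"))   # literal dot found

print(any_char)       # True
print(escaped_miss)   # False
print(bracket_miss)   # False
print(escaped_hit)    # True
```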


Example. Regexes can be combined to obtain powerful matching operations. A couple of examples:

  • A regex that only matches strings that start with "ATOM", followed by one or more spaces, followed by three space-separated integers:

    ^ATOM[ ]+[0-9]+ [0-9]+ [0-9]+
    

    The following string matches:

    ATOM  30 42 12
    
  • A regex that matches a floating-point number in dot-notation:

    [0-9]+(\.[0-9]+)?
    

    e.g. "123" or "2.71828".

  • A regex that matches a floating-point number in mathematical format:

    [0-9]+(\.[0-9]+)?e[0-9]+
    

    e.g. "6.022e23". (It can be improved!)

  • A regex that matches the following UBR-box sequence motif: zero or more methionines (M), followed by either a glutamic acid (E) or an aspartic acid (D). The motif only appears at the beginning of the sequence, and never at the end (i.e. it is followed by at least one more residue):

    ^M*([ED]).
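Combined patterns are best tested on a few strings that should match and a few that should not. The sketch below does so for the ATOM and number patterns; the number patterns are anchored with ^ and $ so that the whole string must be a number, and the scientific-notation pattern allows several digits after the dot so that "6.022e23" is accepted in full. All variable names are illustrative.

```python
import re

atom = "^ATOM[ ]+[0-9]+ [0-9]+ [0-9]+"
atom_hits = [s for s in ["ATOM  30 42 12", "HETATM 30 42 12", "ATOM 30 42"]
             if re.search(atom, s)]
print(atom_hits)      # ['ATOM  30 42 12']

number = r"^[0-9]+(\.[0-9]+)?$"
number_hits = [s for s in ["123", "2.71828", "2.", ".5"] if re.search(number, s)]
print(number_hits)    # ['123', '2.71828']

scientific = r"^[0-9]+(\.[0-9]+)?e[0-9]+$"
sci_hits = [s for s in ["6.022e23", "1e10", "6.022"] if re.search(scientific, s)]
print(sci_hits)       # ['6.022e23', '1e10']
```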
    

The re Module

The re module of the Python Standard Library lets you work with regular expressions, for instance to check whether a given string matches a regular expression, or how many times a regular expression occurs in a string.

The available methods are:

Returns   Method                 Meaning
match     match(regex, str)      Match a regular expression regex to the beginning of a string
match     search(regex, str)     Search a string for the presence of a regex
list      findall(regex, str)    Find all occurrences of a regex in a string

Example. match(regex, string) and search(regex, string) return the first match of the regex in the string string. The difference is that:

  • match() requires regex to match at the beginning of string
  • search() does not: regex can match anywhere inside string

If no match is found, they return None. Otherwise, a match object is returned. This makes it easy to see whether the regex matches the string.

To extract the matched text from the match object, call its group() method, as in the following examples.

For instance (make sure that the re module has been imported):

>>> re.match("nomatch", "some text")
>>> print re.match("nomatch", "some text")
None

searches for the regex "nomatch" in "some text", starting at the beginning of the target string. Of course, no match is found ("some text" does not start with "nomatch"), so the function returns None.

Clearly, the following doesn’t work either:

>>> re.match("text", "some text")
>>> print re.match("text", "some text")
None

because "some text" does not start with "text"! However:

>>> print re.search("text", "some text")
<_sre.SRE_Match object at 0x7fac7b5906b0>
>>> print re.search("text", "some text").group()
text

works, because search() matches the regex anywhere in string, not just at the beginning.

In order to see match() work, try:

>>> print re.match("some", "some text")
<_sre.SRE_Match object at 0x7fac7b5906b0>
>>> print re.match("some", "some text").group()
some
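Since both functions return None when nothing matches, calling group() directly on the result raises an AttributeError in that case. A safer pattern is sketched below; first_match is a helper name invented for this illustration.

```python
import re

def first_match(regex, text):
    """Return the text matched by regex inside text, or None if there is no match."""
    m = re.search(regex, text)
    if m is None:
        return None      # nothing to extract
    return m.group()     # the matched piece of text

print(first_match("text", "some text"))      # text
print(first_match("nomatch", "some text"))   # None
```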

Example. findall(regex, text) returns the list of all matches of regex inside text. For instance, let:

fasta = """>1A3A:A|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
>1A3A:B|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
>1A3A:C|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
>1A3A:D|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK"""

then we can extract the headers using a single call to findall(). Note that the first header is not returned, because it is not preceded by a newline:

>>> from pprint import pprint
>>> pprint(re.findall("\n>.*\n", fasta))
['\n>1A3A:B|PDBID|CHAIN|SEQUENCE\n',
 '\n>1A3A:C|PDBID|CHAIN|SEQUENCE\n',
 '\n>1A3A:D|PDBID|CHAIN|SEQUENCE\n']
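One possible way to capture every header, including the first, is the re.MULTILINE flag, which is not covered in this text: it makes ^ and $ match at the start and end of each line. The sketch below uses a shortened two-record version of the data to keep the example small.

```python
import re

# Shortened sample: the sequences are truncated for illustration.
fasta = """>1A3A:A|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRK
>1A3A:B|PDBID|CHAIN|SEQUENCE
MANLFKLGAENIFLGRK"""

# With re.MULTILINE, ^ and $ refer to single lines rather than the whole string.
headers = re.findall("^>.*$", fasta, re.MULTILINE)
for header in headers:
    print(header)
```

The results also come without the surrounding newline characters, since those are no longer part of the regex.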

Warning

Note that the * regex is greedy: .* does not stop at the first | character, but rather proceeds as far as it can, here up to the end of the line.

To extract only the protein IDs and chain ID, write:

>>> pprint(re.findall("\n>[a-zA-Z0-9:]+[^|]", fasta))
['\n>1A3A:B', '\n>1A3A:C', '\n>1A3A:D']
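Greediness is easier to see on a single header. In the sketch below, the first regex asks for "anything followed by a |" and ends up at the last |, while the second one uses a complement set to stop before the first |. The variable names are illustrative.

```python
import re

header = ">1A3A:A|PDBID|CHAIN|SEQUENCE"

# .* grabs as much as possible, then gives back just enough to find a "|".
greedy = re.search(">.*[|]", header).group()
print(greedy)        # >1A3A:A|PDBID|CHAIN|

# [^|]* cannot cross a "|", so it stops right before the first one.
up_to_first = re.search(">[^|]*", header).group()
print(up_to_first)   # >1A3A:A
```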


Exercises - Regular Expressions

Note

Here you can find the sequences.fasta FASTA file.

In the following exercises, you can use all the Python tools you want to compute the answer. In other words, looping over the lines of the FASTA file is allowed.

Motifs can appear anywhere in the sequence, unless stated otherwise.
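A possible starting point is sketched below. It is not a solution to any of the exercises: the data and the motif are made up, and read_fasta is a helper name invented for this illustration. Each sequence is joined into a single string first, so that a motif spanning two lines of the file is still found.

```python
import re

def read_fasta(lines):
    """Return a dict mapping each header line to its full sequence."""
    sequences = {}
    header = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line              # a new record starts here
            sequences[header] = ""
        elif header is not None and line:
            sequences[header] += line  # glue the sequence lines together
    return sequences

# Made-up toy data, not taken from sequences.fasta.
toy = """>seq1
MKTAYIAK
QRQISFVK
>seq2
MGSSHHHH
>seq3
AAWAAWAA"""

sequences = read_fasta(toy.split("\n"))

# Hypothetical motif: a tryptophan (W), two arbitrary residues, another W.
motif = "W..W"
count = sum(1 for seq in sequences.values() if re.search(motif, seq))
print(count)   # 1
```

To work on a real file, the lines of an open file can be passed instead, e.g. read_fasta(open("sequences.fasta")).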

  1. Compute how many sequences contain the following motifs:

    • A phenylalanine (F); two arbitrary amino acids; another phenylalanine. The motif should occur at the end of the sequence.
    • An arginine (R); a phenylalanine; an amino acid that is not a proline (P); an isoleucine (I) or a valine (V).

    Compute also how many sequences include at least one of the two motifs.

  2. Compute how many sequences contain the following motifs:

    • Three tyrosines (Y); at most three arbitrary amino acids; a histidine (H).
    • Contains non-standard or unknown amino acids.

    Are there sequences satisfying both conditions?

    Note

    Standard amino acids: A R N D C E Q G H I L K M F P S T W Y V.

  3. Compute how many sequences contain the following motifs:

    • An arginine (R); a lysine (K). The motif should not occur at the beginning of the sequence.
    • Two arginines followed by an amino acid that is neither an arginine nor a lysine.
    • None of the previous two motifs.
  4. Compute how many sequences contain the following motifs:

    • A phenylalanine (F); an arbitrary amino acid; a phenylalanine or a tyrosine (Y); a proline (P).
    • A proline; a threonine (T) or a serine (S); an alanine (A); another proline. The motif should be neither at the beginning nor at the end of the sequence.
    • The first motif followed by the second, or vice versa.