================
External Modules
================

Importing Packages
------------------

Packages are just Python modules that offer additional functionality. There
are many packages in the wild; take a look at the
`Python Package Index `_ for an overview.

****

In order to make use of a package, you first have to import it::

    import somepackage

Once imported, you can perform the usual operations on the package, e.g.::

    >>> help(somepackage)
    >>> print somepackage.__version__
    >>> somepackage.function()

****

The `Python Standard Library `_ is installed by default together with
Python 2 (and Python 3). It provides a lot of packages for dealing with many
different tasks, e.g. regular expressions (more on that later).

Other specific packages do not come along with Python 2 (or 3) by default, so
they have to be installed separately.

.. warning::

    If you try to import a package, and get::

        >>> import iamnotinstalled
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        ImportError: No module named iamnotinstalled

    it means that the package is not installed on your machine.

Luckily, it is easy to **install a package** with Python. The process is very
streamlined.

If you want to install a new package, just type (from the shell)::

    $ pip install thepackageyouwanttoinstall --user

This will install the package named ``thepackageyouwanttoinstall`` into your
home directory (more specifically, inside the ``~/.local/lib/python2.7/``
directory).

After the installation, Python will automatically pick up the package and
allow you to import it.

.. warning::

    These instructions assume that you have the ``pip`` installer. This is
    the case for most GNU/Linux distributions and MacOS X versions >= El
    Capitan.
    If ``pip`` is not installed, either install it with your package manager
    on GNU/Linux or with ``brew`` on MacOS X, or use the generic setup script
    provided by the `Python Packaging Authority `_.

****

**Example.** Let's install the ``biopython`` package::

    $ pip install biopython --user

The ``pip`` command will download the package (and all its dependencies)
from the internet, and install them in the correct order.

After ``pip`` is done, open a Python interpreter and try to import the
``biopython`` package::

    >>> import Bio

.. warning::

    The ``pip`` installer and the ``python`` interpreter may use different
    names to refer to the same package.

    In the example above, the Biopython package is called ``biopython`` by
    the ``pip`` installer, and simply ``Bio`` from within Python.

    Please refer to the package documentation (i.e. its website) to figure
    out what it is called in the two different settings.

****

**Example**. Let's say that you want to import the ``numpy`` module. Just
write::

    import numpy

Once imported, you can take a peek at the various functionality supported by
the ``numpy`` module by typing::

    help(numpy)

For instance, to use the ``arccos`` function provided by the ``numpy``
module, type::

    print numpy.arccos(0)

You can also abbreviate the name of the package with a shorthand, as
follows::

    >>> import numpy as np
    >>> print np.__version__
    1.11.2
    >>> np.arccos(0)
    1.5707963267948966
    >>> np.arccos(1)
    0.0

****

**Example**.
You can import specific sub-modules using the following notation::

    import module.submodule

Then you can call the functions available in the sub-module as::

    module.submodule.function("stuff")

For instance, to import the ``linalg`` (linear algebra) submodule of the
``numpy`` package, write::

    >>> import numpy.linalg
    >>> help(numpy.linalg)
    >>> help(numpy.linalg.eig)

Now, you can also import the submodule as a standalone package::

    >>> from numpy import linalg
    >>> help(linalg.eig)

as well as using the shorthand trick::

    >>> import numpy.linalg as la
    >>> help(la.eig)

or both::

    >>> from numpy import linalg as la
    >>> help(la.eig)

Finally, you can import individual functions::

    >>> from numpy import arccos
    >>> print arccos(0)

as well as multiple functions::

    >>> from numpy import arccos, arcsin
    >>> print arccos(0)
    1.57079632679
    >>> print arcsin(0)
    0.0

****

.. warning::

    The ``__future__`` "module" is a special module used to import Python 3
    functionality into Python 2 programs. It can be useful for writing code
    compatible with both Python 2 and Python 3.

    For instance, although ``print`` is a *statement* in Python 2, in
    Python 3 it becomes a *function*, meaning that our beloved::

        print "stuff"

    does not work anymore: in Python 3, ``print`` is a proper function and
    requires brackets::

        print("stuff")

    In order to have a true ``print`` function in your Python 2 program, use
    the following import at the very beginning of your script::

        from __future__ import print_function

    Another useful feature to import is::

        from __future__ import division

    Once imported, the ``division`` feature makes division between integers
    with ``/`` return a ``float`` (so that ``1 / 2`` gives ``0.5`` rather
    than ``0``).

|

Exercises - Packages
--------------------

#. Import some of the packages from the Python Standard Library. Some useful
   ones are:

   - The ``math`` package. It offers a plethora of mathematical utility
     functions.

     Use its ``factorial()`` function to compute the factorial of all
     integers from ``1`` to ``100``, i.e. ``1!``, ``2!``, ...

     *Remark*.
     It is much faster than our naive implementation!

   - The ``glob`` package. It includes a function ``glob()`` that takes a
     path pattern (for instance ``"/some/directory/*"``) and returns the
     list of the paths that match it, pretty much like the ``ls`` shell
     command.

     Use the ``glob()`` function and print the list of the files in your
     home directory.

     *Remark*. Unfortunately ``glob()`` does not understand the ``~`` path.

   - The ``time`` package. It provides a ``time()`` function that returns
     the number of seconds since `the Epoch `_, i.e. midnight UTC of 1st
     January 1970. It also provides a ``sleep()`` function that takes a
     number of seconds, and makes your program pause for the specified
     amount of time (with reasonable approximation).

     After properly importing the ``time()`` and ``sleep()`` functions from
     the ``time`` package, check what this code does::

         t0 = time()
         for i in range(5):
             sleep(1)
             print "it took me {}s to get here".format(time() - t0)
         print "done in approx {}s".format(time() - t0)

   - The ``pickle`` package. It provides facilities for storing arbitrary
     Python data structures to disk, and loading them back. It is a
     must-have package for saving, e.g., the results of data analysis or
     other complex objects that are difficult to encode into text.

     Study its ``dump(object, file_handle)`` and ``load(file_handle)``
     functions. Then use them to save a dictionary to file, close the
     Python interpreter, and load the dictionary back from the file.

     *Remark*. Make sure that the ``file_handle`` is opened for writing when
     writing to the file, and for reading when reading from it.

|

|

Regular Expressions
-------------------

A **regular expression** (or **regex**) is a string that encodes a *string
pattern*. The pattern specifies which strings *match* the regex.

A regex consists of both *normal* and *special* characters:

- Normal characters match themselves.
- Special characters match sets of other characters.

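
As a minimal sketch of the distinction between normal and special characters,
the snippet below uses the ``search()`` function of the ``re`` module, which
is covered further down in this chapter. The word list and the variable names
are made up for illustration only.

```python
import re

# "cat" consists only of normal characters: each one matches itself.
normal = "cat"

# In "c.t" the dot is a special character: it stands for any single
# character (except a newline).
special = "c.t"

# Made-up test strings, for illustration only.
words = ["cat", "cot", "cut", "ct"]

# Keep the words in which each regex occurs at least once.
matches_normal = [w for w in words if re.search(normal, w)]
matches_special = [w for w in words if re.search(special, w)]

print(matches_normal)
print(matches_special)
```

Only ``"cat"`` contains the normal regex, while the regex with the dot is
also found in ``"cot"`` and ``"cut"``. The string ``"ct"`` contains neither,
because the dot requires exactly one character between ``c`` and ``t``.
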

A string matches a regex if it matches all of its characters, in the sequence
in which they appear.

****

**Example**. A few simple examples:

- The regex ``"text"`` matches only the string ``"text"``.
- The regex ``".*"`` matches all strings.
- The regex ``"beginning.*"`` matches all strings that start with
  ``"beginning"``.
- The regex ``".*end"`` matches all strings that end with ``"end"``.

****

More formally, the contents of a regex can be:

=============== =================================================================================
Character       Meaning
=============== =================================================================================
``text``        Matches itself
``(regex)``     Matches the regex ``regex`` (i.e. parens don't count)
``^``           Matches the start of the string
``$``           Matches the end of the string or just before the newline at the end of the string
``.``           Matches any character except a newline
``regex?``      Matches 0 or 1 repetitions of ``regex`` (longest possible)
``regex*``      Matches 0 or more repetitions of ``regex`` (longest possible)
``regex+``      Matches 1 or more repetitions of ``regex`` (longest possible)
``regex{m,n}``  Matches from m to n repetitions of ``regex`` (longest possible)
``[...]``       Matches a set of characters
``[c1-c2]``     Matches the characters "in between" ``c1`` and ``c2``
``[^...]``      Matches the complement of a set of characters
``r1|r2``       Matches either ``r1`` or ``r2``
=============== =================================================================================

.. note::

    There are many more special symbols that can be used in a regex. We
    will restrict ourselves to the most common ones.

****

**Example**. Let's start with the anchor characters ``^`` and ``$``:

- The regex ``"^something"`` matches all strings that start with
  ``"something"``, for instance ``"something better"``.

- The regex ``"worse$"`` matches all strings that end with ``"worse"``, for
  instance ``"I am feeling worse"``.

****

**Example**.
The "anything goes" character ``.`` (the dot) matches all characters except
the newline. For instance:

- ``"."`` matches all strings that contain *at least* one character.
- ``"..."`` matches all strings that contain *at least* three characters.
- ``"^.$"`` matches all those strings that contain *exactly* one character.
- ``"^...$"`` matches all those strings that contain *exactly* three
  characters.

****

**Example**. The "optional" character ``?`` matches zero or one repetitions
of the preceding regex. For instance:

- ``"words?"`` matches both ``"word"`` and ``"words"``: the last character of
  the ``"words"`` regex (that is, the ``"s"``) is now optional.

- ``"(optional)?"`` matches both ``"optional"`` and the empty string.

- ``"he is (our )?over(lord!)?"`` matches the following four strings:
  ``"he is over"``, ``"he is our over"``, ``"he is overlord!"``, and
  ``"he is our overlord!"``. Here I used the parens ``(...)`` to specify the
  scope of the ``?``. Note that the space after ``our`` is inside the
  parens, so that it is optional too.

****

**Example**. The "zero or more" character ``"*"`` and the "one or more"
character ``"+"`` match zero or more (or one or more) repetitions of the
preceding regex. The parens ``(...)`` grouping trick still applies.

A few examples:

- ``"Python!*"`` matches all of the following strings: ``"Python"``,
  ``"Python!"``, ``"Python!!"``, ``"Python!!!!"``, etc.

- ``"(column )+"`` matches: ``"column "``, ``"column column "``, etc. but not
  the empty string ``""``.

- ``"I think that (you think that (I think that )*)+this regex is cool"``
  will match things like ``"I think that you think that this regex is
  cool"``, as well as ``"I think that you think that I think that you think
  that I think that this regex is cool"``, and so on.

****

**Example**. The "from m to n" regex ``{m,n}`` matches from ``m`` to ``n``
repetitions of the previous regex. For instance, ``"(column ){2,3}"`` matches
``"column column "`` and ``"column column column "``.

****

**Example**.
Regexes can also match entire sets of characters (or their complement); in
other words, they match all strings containing at least one of the characters
in the set. For instance:

- ``"[abc]"`` matches strings that contain ``"a"``, ``"b"``, or ``"c"``. It
  does not match the string ``"zzzz"``.
- ``"[^abc]"`` matches all characters *except* ``"a"``, ``"b"``, and ``"c"``.
- ``"[a-z]"`` matches all lowercase alphabetic characters.
- ``"[A-Z]"`` matches all uppercase alphabetic characters.
- ``"[0-9]"`` matches all numeric characters from ``0`` to ``9`` (included).
- ``"[2-6]"`` matches all numeric characters from ``2`` to ``6`` (included).
- ``"[^2-6]"`` matches all characters *except* the numeric characters from
  ``2`` to ``6`` (included).
- ``"[a-zA-Z0-9]"`` matches all alphanumeric characters.

****

**Example**. Perhaps the most powerful special character, the ``|``
character, allows you to match either of two regexes. For instance:

- ``"I|you|he|she|it|we|they"`` will match any string containing at least one
  of the English personal pronouns.

- ``"^((PRO)+|(SER)+)$"`` matches strings like ``"PRO"``, ``"PROPRO"``,
  ``"SER"`` and ``"SERSER"``, but not strings like ``"PROSER"`` or
  ``"SERPROSER"``.

****

**Example**. As usual, if you want to "disable" a special character, you have
to *escape* it, i.e. prefix it with a backslash ``\``. An alternative is to
put the special character to be disabled inside square brackets.

For instance, in order to match strings that contain at least one (literal)
dot, you can use either ``"\."`` or ``"[.]"``.

****

**Example**. Regexes can be combined to obtain powerful matching operations.
A couple of examples:

- A regex that only matches strings that start with ``"ATOM"``, followed by
  one or more spaces, followed by three space-separated integers::

      ^ATOM[ ]+[0-9]+ [0-9]+ [0-9]+

  The following string matches::

      ATOM 30 42 12

- A regex that matches a floating-point number in dot-notation::

      [0-9]+(\.[0-9]+)?

  e.g. ``"123"`` or ``"2.71828"``.

- A regex that matches a floating-point number in mathematical format::

      [0-9]+(\.[0-9]+)?e[0-9]+

  e.g. ``"6.022e23"``. (It can be improved!)

- A regex that matches the following UBR-box sequence motif: an optional
  methionine (M), followed by either a glutamic acid (E) or an aspartic acid
  (D). The motif only appears at the beginning of the sequence, and never at
  the end (i.e. it is followed by at least one more residue)::

      ^M?([ED]).

|

The ``re`` Module
-----------------

The ``re`` module of the Python Standard Library deals with regular
expression matching, for instance checking whether a given string matches a
regular expression, or how many times a regular expression occurs in a
string.

The available functions are:

======= ===================== ===============================================================
Returns Method                Meaning
======= ===================== ===============================================================
match   match(regex, str)     Match a regular expression regex to the beginning of a string
match   search(regex, str)    Search a string for the presence of a regex
list    findall(regex, str)   Find all occurrences of a regex in a string
======= ===================== ===============================================================

****

**Example**. ``match(regex, string)`` and ``search(regex, string)`` return
the first match of the ``regex`` in the string ``string``. The difference is
that:

- ``match()`` requires ``regex`` to match at the **beginning** of ``string``
- ``search()`` does not: ``regex`` can match **anywhere** inside ``string``

If no match is found, they return ``None``. Otherwise, a ``match`` object is
returned. This makes it easy to see *whether* the regex matches the string.

To extract the matched text from the ``match`` object, call its ``group()``
method, as in the following examples.

For instance (make sure that the ``re`` module has been imported)::

    >>> re.match("nomatch", "some text")
    >>> print re.match("nomatch", "some text")
    None

searches for the regex ``"nomatch"`` in ``"some text"``, starting at the
beginning of the target string. Of course, no match is found (``"some text"``
does not start with ``"nomatch"``), so the function returns ``None``.

Clearly, the following doesn't work either::

    >>> re.match("text", "some text")
    >>> print re.match("text", "some text")
    None

because ``"some text"`` does not *start with* ``"text"``! However::

    >>> print re.search("text", "some text")
    <_sre.SRE_Match object at 0x7fac7b5906b0>
    >>> print re.search("text", "some text").group()
    text

works, because ``search()`` matches the ``regex`` anywhere in ``string``, not
just at the beginning.

In order to see ``match()`` work, try::

    >>> print re.match("some", "some text")
    <_sre.SRE_Match object at 0x7fac7b5906b0>
    >>> print re.match("some", "some text").group()
    some

****

**Example**. ``findall(regex, text)`` returns the list of all matches of
``regex`` inside ``text``.
For instance, let::

    fasta = """>1A3A:A|PDBID|CHAIN|SEQUENCE
    MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
    FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
    >1A3A:B|PDBID|CHAIN|SEQUENCE
    MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
    FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
    >1A3A:C|PDBID|CHAIN|SEQUENCE
    MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
    FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK
    >1A3A:D|PDBID|CHAIN|SEQUENCE
    MANLFKLGAENIFLGRKAATKEEAIRFAGEQLVKGGYVEPEYVQAMLDREKLTPTYLGESIAVPHGTVEAKDRVLKTGVV
    FCQYPEGVRFGEEEDDIARLVIGIAARNNEHIQVITSLTNALDDESVIERLAHTTSVDEVLELLAGRK"""

then we can extract the headers using a single call to ``findall()``::

    >>> from pprint import pprint
    >>> pprint(re.findall("\n>.*\n", fasta))
    ['\n>1A3A:B|PDBID|CHAIN|SEQUENCE\n',
     '\n>1A3A:C|PDBID|CHAIN|SEQUENCE\n',
     '\n>1A3A:D|PDBID|CHAIN|SEQUENCE\n']

Note that the first header is not returned: it sits at the very beginning of
the string, so it is not preceded by a newline.

.. warning::

    Note that the ``*`` regex is greedy: it does not stop at the first match
    of the ``|`` character, but rather greedily proceeds until the last.

To extract only the protein IDs and chain IDs, write::

    >>> pprint(re.findall("\n>[a-zA-Z0-9:]+[^|]", fasta))
    ['\n>1A3A:B', '\n>1A3A:C', '\n>1A3A:D']

****

|

Exercises - Regular Expressions
-------------------------------

.. note::

    Here you can find the `sequences.fasta `_ FASTA file.

    In the following exercises, you can use all the Python tools you want to
    compute the answer. In other words, looping over the lines of the FASTA
    file is allowed.

    Motifs can appear anywhere in the sequence, unless stated otherwise.

#. Compute how many sequences contain the following motifs:

   - A phenylalanine (F); two arbitrary amino acids; another phenylalanine.
     The motif should occur at the end of the sequence.

   - An arginine (R); a phenylalanine; an amino acid that is not a proline
     (P); an isoleucine (I) or a valine (V).

   Compute also how many sequences include at least one of the two motifs.

#. Compute how many sequences contain the following motifs:

   - Three tyrosines (Y); at most three arbitrary amino acids; a histidine
     (H).

   - At least one non-standard or unknown amino acid.

   Are there sequences satisfying both conditions?

   .. note::

       Standard amino acids: A R N D C E Q G H I L K M F P S T W Y V.

#. Compute how many sequences contain the following motifs:

   - An arginine (R); a lysine (K). The motif should not occur at the
     beginning of the sequence.

   - Two arginines followed by an amino acid that is neither an arginine nor
     a lysine.

   - None of the previous two motifs.

#. Compute how many sequences contain the following motifs:

   - A phenylalanine (F); an arbitrary amino acid; a phenylalanine or a
     tyrosine (Y); a proline (P).

   - A proline; a threonine (T) or a serine (S); an alanine (A); another
     proline. The motif should be neither at the beginning nor at the end of
     the sequence.

   - The first motif followed by the second, or vice versa.
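
A possible starting point for exercises of this kind is sketched below, under
a few assumptions: the FASTA text is already available as a string, the
helper names ``parse_fasta()`` and ``count_matching()`` are hypothetical, and
the toy sequences and motifs are made up (they are not taken from the
``sequences.fasta`` file, and they are not the motifs requested above).

```python
import re


def parse_fasta(text):
    """Map each header to its full sequence (hypothetical helper)."""
    sequences = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new record starts here: remember its header.
            header = line[1:]
            sequences[header] = ""
        elif header is not None:
            # A sequence may span several lines: glue them together.
            sequences[header] += line
    return sequences


def count_matching(sequences, regex):
    """Count the sequences in which the regex occurs at least once."""
    return sum(1 for seq in sequences.values() if re.search(regex, seq))


# Made-up FASTA text, for illustration only.
toy_fasta = """>seq1
MKTAYIAK
QRQISFVK
>seq2
GGGGAK
>seq3
MAAAAW
"""

toy = parse_fasta(toy_fasta)

# Made-up motif: an alanine (A) followed by a lysine (K), at the end.
n_end_ak = count_matching(toy, "AK$")

# Made-up motif: a methionine (M) at the beginning of the sequence.
n_start_m = count_matching(toy, "^M")

print(n_end_ak)
print(n_start_m)
```

Joining the lines of each record before matching is a deliberate choice in
this sketch: a motif may straddle a line break in the file, and the anchors
``^`` and ``$`` should refer to the whole sequence rather than to a single
line of it.
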