=========
Biopython
=========

What is Biopython?
------------------

Biopython is the largest and most common bioinformatics package for Python. It
contains a number of different sub-modules for common bioinformatics tasks.

It is freely available here:

    `<http://biopython.org/wiki/Biopython>`_

while the entire source code is hosted here:

    `<https://github.com/biopython/biopython>`_

.. note::

    The Biopython code is distributed under the `Biopython License <https://github.com/biopython/biopython/blob/master/LICENSE>`_.

|

Biopython sports several features. It provides:

- Data structures for **biological sequences** and features thereof, as well
  as a multitude of manipulation functions for performing common tasks, such
  as translation, transcription, weight computations, ...

- Functions for loading **biological data** into ``dict``-like data structures.

  Supported formats include, for instance: FASTA, PDB, GenBank, Blast, SCOP,
  PubMed/Medline, ExPASy-related formats

- Functions for **data analysis**, including basic data mining and machine
  learning algorithm: *k*-Nearest Neighbors, Naive Bayes, and Support Vector
  Machines

- Access to **online services** and database, including NCBI services (Blast,
  Entrez, PubMed) and ExPASY services (SwissProt, Prosite)

- Access to **local services**, including Blast, Clustalw, EMBOSS.

|

.. warning::
    In order to use local services (such as Blast), the latter must be
    properly installed and configured on the local machine.

    Setting up the local services is beyond the scope of this course.

|

.. note::
    In order to install Biopython:

    - On a Debian-based distributions (for instance Ubuntu), just type::

        $ sudo apt-get install python-biopython

    - On a generic GNU/Linux distribution, just type::

        $ pip install biopython --user

    Please refer to the `official documentation <http://biopython.org/DIST/docs/install/Installation.html>`_ for further instructions.

|

.. note::

    While Biopython is the main player in the field, it is not the only one.
    Other interesting packages are:

    `ETE <http://etetoolkit.org/>`_ and `DendroPy <http://dendropy.org/>`_,
    dedicated to computation and visualization of phylogenetic trees.

    `CSB <https://csb.codeplex.com/documentation>`_ for dealing with
    sequences and structures, computing alignments and profiles (with
    profile HMMs), and Monte Carlo sampling.

    `biskit <http://biskit.pasteur.fr/>`_ is designed for structural
    bioinformatics tasks, such as loading 3D protein structures in
    Protein Data Bank (PDB) format, homology modelling, molecular
    dynamics simulations (using external simulators) and trajectory
    computations, and docking, among others.

    You are free to mix and match all the packages you need.

|

Biopython: Sequences
--------------------

Sequences lay at the core of bioinformatics: although DNA, RNA and proteins
are molecules with specific structures and dynamical behaviors, their basic
building block is their sequence.

Biopython encodes sequences using objects of type ``Seq``, provided by the
``Bio.Seq`` sub-module. Take a look at their manual::

    from Bio.Seq import Seq
    help(Seq)

.. note::
    The official documentation of the Biopython ``Seq`` class can be found on
    the `Biopython wiki <http://biopython.org/wiki/Seq>`_.

|

Each ``Seq`` object is associated to a formal alphabet.

- Each sequence has its own **alphabet**

- Conversions between different sequence types (transcription, translation)
  can be seen as a change of alphabet.

- Some operations (e.g. concatenation) can not be done between sequences
  with incompatible alphabets.

|

The various alphabets supported by Biopython can be found in ``Bio.Alphabet``,
e.g. ``Bio.Alphabet.IUPAC``.

The most basic alphabets are:

- ``IUPAC.unambiguous_dna``: a sequence of DNA
- ``IUPAC.unambiguous_rna``: a sequence of RNA
- ``IUPAC.protein``: a protein sequence

****

**Example**. To create a generic sequence, i.e. a sequence not bound to any
specific alphabet, write::

    >>> from Bio.Seq import Seq
    >>> s = Seq("ACCGTTTAAC") # no alphabet specified
    >>> print type(s)
    <class 'Bio.Seq.Seq'>
    >>> s
    Seq('ACCGTTTAAC', Alphabet())
    >>> print s
    ACCGTTTAAC

To create a DNA sequence, specify the corresponding alphabet when creating
the ``Seq`` object::

    >>> from Bio.Seq import Seq
    >>> from Bio.Alphabet import IUPAC
    >>> s = Seq("ACCGTTTAAC", IUPAC.unambiguous_dna)
    >>> s
    Seq('ACCGTTTAAC', IUPACUnambiguousDNA())

****

``Seq`` objects act mostly like standard Python strings ``str``. For instance,
you can write::

    >>> from Bio.Seq import Seq
    >>> s = Seq("ACTG")
    >>> len(s)
    4
    >>> s[2]
    'T'
    >>> s[1:-1]
    Seq('CT', Alphabet())
    >>> s.count("C")
    3

However, some operations are restricted to ``Seq`` objects with the same
alphabet (in order to prevent the user from, say, concatenating nucleotides
and amino acids)::

    >>> s = Seq("ACTG", IUPAC.unambiguous_dna)
    >>> s + s
    Seq('ACTGACTG', IUPACUnambiguousDNA())

    >>> t = Seq("AAADGHHHTEWW", IUPAC.protein)
    >>> s
    Seq('ACTG', IUPACUnambiguousDNA())
    >>> t
    Seq('AAADGHHHTEWW', IUPACProtein())
    >>> s + t
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/stefano/.local/lib/python2.7/site-packages/Bio/Seq.py", line 292, in __add__
        self.alphabet, other.alphabet))
    TypeError: Incompatible alphabets IUPACUnambiguousDNA() and IUPACProtein()

To convert back a ``Seq`` object into a normal ``str`` object, just use the
normal type-casting operator ``str()``::

    >>> s = Seq("ACTG", IUPAC.unambiguous_dna)
    >>> str(s)

.. warning::
    Just like strings, ``Seq`` objects are **immutable**::

        >>> s = Seq("ACCGTTTAAC")
        >>> s[0] = "G"
        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        TypeError: 'Seq' object does not support item assignment

    Thanks to their immutability, ``Seq`` objects can be used as keys in
    dictionaries (like ``str`` objects do).

.. note::
    Biopython also provides a mutable sequence type ``MutableSeq``.

    You can convert back and forth between ``Seq`` and ``MutableSeq``::

        >>> s = Seq("ACTG", IUPAC.unambiguous_dna)
        >>> muts = s.tomutable()
        >>> muts[0] = "C"
        >>> muts
        MutableSeq('CCTG', IUPACUnambiguousDNA())
        >>> muts.toseq()
        Seq('CCTG', IUPACUnambiguousDNA())

****

|

Some operations are **only** available when specific alphabets are used. Let's
see a few examples.

|

**Example**. The ``complement()`` method allows to complement a DNA or RNA
sequence::

  >>> s = Seq("CCGTTAAAAC", IUPAC.unambiguous_dna)
  >>> s.complement()
  Seq('GGCAATTTTG', IUPACUnambiguousDNA())

while the ``reverse_complement()`` also reverses the result from left
to right::

  >>> s.reverse_complement()
  Seq('GTTTTAACGG', IUPACUnambiguousDNA())

.. warning::

    Neither of these methods is available for other alphabets::

      >>> p = Seq("ACLKQ", IUPAC.protein)
      >>> p.complement()
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/home/stefano/.local/lib/python2.7/site-packages/Bio/Seq.py", line 760, in complement
          raise ValueError("Proteins do not have complements!")
      ValueError: Proteins do not have complements!

|

**Example**. (Adapted from the Biopython tutorial.) Transcription is the
biological process by which a DNA sequence is transcribed into the
corresponding DNA.

A picture of the transcription process::

    5’  ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG 3’ (coding strand)
        |||||||||||||||||||||||||||||||||||||||
    3’  TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC 5’ (template strand)
                            |
                       Transcription
                            ↓
    5’  AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG 3’ (mRNA)

The same process can be "simulated" on ``Seq`` objects. Starting from the
coding strand (as is usually done in bioinformatics), it is only a matter
of converting T's into U's::

    >>> coding_dna
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())
    >>> mrna = coding_dna.transcribe()
    >>> mrna
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

From the template strand instead we first have to compute the reverse
complement, and then perform the character substitution::

    >>> template_dna
    'TACCGGTAACATTACCCGGCGACTTTCCCACGGGCTATC'
    >>> template_dna.reverse_complement().transcribe()
    Seq('AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG', IUPACUnambiguousRNA())

It is also easy to perform the reverse process, back-transcription::

    >>> mrna
    'AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG'
    >>> mrna.back_transcribe()
    Seq('ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG', IUPACUnambiguousDNA())

.. warning::
    These functions also take care of checking that the input alphabet
    is the right one (e.g. DNA) and apply the right alphabet to the
    result (e.g. RNA).

|

**Example**. (Adapted from the Biopython tutorial.) Let's not take a look at
**translation**, i.e. mRNA to protein.

The code is as follows::

    >>>from Bio.Seq import Seq
    >>>from Bio.Alphabet import IUPAC
    >>>mrna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", IUPAC.unambiguous_rna)
    >>>print mrna
    AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG
    >>>print mrna.translate()
    MAIVMGR*KGAR*

Here the stop codons are indicated with an asterisk ``"*"``.

Now, different coding tables are available, for instance::

    >>> coding_dna.translate(table="Vertebrate Mitochondrial")
    Seq('MAIVMGRWKGAR*', HasStopCodon(IUPACProtein(), '*'))

It's also possible to tell the ``translate()`` method to stop at the first
stop codon::

    >>> coding_dna.translate()
    Seq('MAIVMGR*KGAR*', HasStopCodon(IUPACProtein(), '*'))
    >>> coding_dna.translate(to_stop=True)
    Seq('MAIVMGR', IUPACProtein())

In this case the stop codon is not included in the resulting sequence.

.. note::
    The `Genetic Codes <https://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi>`_
    page of the NCBI provides the full list of translation tables used by
    Biopython.

    In order to visualize the various translation tables, you can do::

        from Bio.Data import CodonTable

        ttable = CodonTable.unambiguous_dna_by_name["Standard"]
        print ttable

        ttable = CodonTable.unambiguous_dna_by_name["Vertebrate Mitochondrial"]
        print ttable

    You can also visualize what are the encodings of stop and start codons::

        print ttable.stop_codons
        print ttable.start_codons

|

Let's take a look at some common sequence formats.

- FASTA is the most basic format, we have used it several times already::

    >sp|P25730|FMS1_ECOLI CS1 fimbrial subunit A precursor (CS1 pilin)
    MKLKKTIGAMALATLFATMGASAVEKTISVTASVDPTVDLLQSDGSALPNSVALTYSPAV
    NNFEAHTINTVVHTNDSDKGVVVKLSADPVLSNVLNPTLQIPVSVNFAGKPLSTTGITID
    SNDLNFASSGVNKVSSTQKLSIHADATRVTGGALTAGQYQGLVSIILTKST
    >sp|P15488|FMS3_ECOLI CS3 fimbrial subunit A precursor (CS3 pilin)
    MLKIKYLLIGLSLSAMSSYSLAAAGPTLTKELALNVLSPAALDATWAPQDNLTLSNTGVS
    NTLVGVLTLSNTSIDTVSIASTNVSDTSKNGTVTFAHETNNSASFATTISTDNANITLDK
    NAGNTIVKTTNGSQLPTNLPLKFITTEGNEHLVSGNYRANITITSTIK

- GenBank is a richer sequence format for genes, it includes
  fields for various kinds of **annotations** (e.g. source organism,
  bibliographical references, database-specific accession number).

  A GenBank file looks like this::

    LOCUS       5I3U_F                    39 bp    DNA     linear   SYN 31-AUG-2016
    DEFINITION  Chain F, Structure Of Hiv-1 Reverse Transcriptase N-site Complex;
                Catalytic Incorporation Of Aztmp To A Dna Aptamer In Crystal.
    ACCESSION   5I3U_F
    VERSION     5I3U_F
    KEYWORDS    .
    SOURCE      synthetic construct
      ORGANISM  synthetic construct
                other sequences; artificial sequences.
    REFERENCE   1  (bases 1 to 39)
      AUTHORS   Tu,X., Das,K., Han,Q., Bauman,J.D., Clark AD,Jr., Hou,X.,
                Frenkel,Y.V., Gaffney,B.L., Jones,R.A., Boyer,P.L., Hughes,S.H.,
                Sarafianos,S.G. and Arnold,E.
      TITLE     Structural basis of HIV-1 resistance to AZT by excision
      JOURNAL   Nat. Struct. Mol. Biol. 17 (10), 1202-1209 (2010)
       PUBMED   20852643
    ...
    FEATURES             Location/Qualifiers
         source          1..39
                         /organism="synthetic construct"
                         /mol_type="other DNA"
                         /db_xref="taxon:32630"
    ORIGIN
            1 taatacncnc ccttcggtgc tttgcaccga agggggggn
    //

- UniProt (obtained by combining SwissProt, TrEMBL and PIR) is the GenBank
  equivalent for protein sequences::

    ID  143B_HUMAN  STANDARD; PRT; 245 AA.
    AC  P31946;
    DT  01-JUL-1993 (Rel. 26, Created)
    DT  01-FEB-1996 (Rel. 33, Last sequence update)
    DT  15-SEP-2003 (Rel. 42, Last annotation update)
    DE  14-3-3 protein beta/alpha (Protein kinase C inhibitor protein-1)
    DE  (KCIP-1) (Protein 1054).
    GN  YWHAB.
    OS  Homo sapiens (Human).
    OC  Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
    OC  Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo.
    OX  NCBI_TaxID=9606;
    RN  [1]
    RP  SEQUENCE FROM N.A.
    RC  TISSUE=Keratinocytes;
    RX  MEDLINE=93294871; PubMed=8515476;
    RA  Leffers H., Madsen P., Rasmussen H.H., Honore B., Andersen A.H.,
    RA  Walbum E., Vandekerckhove J., Celis J.E.;
    RT  "Molecular cloning and expression of the transformation sensitive
    RT  epithelial marker stratifin. A member of a protein family that has
    RT  been involved in the protein kinase C signalling pathway.";

The ``Bio.SeqIO`` module provides a functions to parse all of the above
(among others).

In particular, the ``parse()`` function takes a file handle and the format
of the file (as a string) and returns an *iterable* of ``SeqRecord`` objects::

    from Bio.SeqIO import parse
    help(parse)

    f = open("test.fasta")
    for record in parse(f, "fasta"):
        print record.id, record.seq

.. warning::
    An *iterable* is a function that allows to iterate over a collection
    of objects (similar to a list).

    When given an iterator, you can either use it as in the code above::

        iterable = get_it_from_somewhere() # e.g. parse()
        for element in iterable:
            do_something(element)

    or extract one element at a time, with the ``next()`` method::

        iterable = get_it_from_somewhere() # e.g. parse()
        first_element = iterable.next()
        do_something(first_element)
        second_element = iterable.next()
        do_something(second_element)
        # ...

The iterable allows to iterate over a list of ``SeqRecord`` objects. The
``SeqRecord`` objects associate a ``Seq`` sequence with additional meta-data,
such as human-readable descriptions, annotations, and sequence features::

    >>> from Bio.SeqRecord import SeqRecord
    >>> help(SeqRecord)

Each ``SeqRecord`` object provides these attributes:

#. ``.seq``: the sequence itself, typically a ``Seq`` object.

#. ``.id``: the primary identifier of the sequence -- a string.

#. ``.name``: a "common" identifier for the sequence, also a string.

#. ``.description``: a human readable description of the sequence.

#. ``.annotations``: a ``dict`` of additional information.

among others.

|

In order to read a ``SeqRecord`` object from a file::

    >>> from Bio import SeqIO
    >>> record = SeqIO.parse(open("sequences.fasta"), "fasta").next()
    >>> record
    SeqRecord(seq=Seq('METNCRKLVSACVQLGVQPAAVECLFSKDSEIKKVEFTDSPESRKEAASSKFFP...RPV', SingleLetterAlphabet()), ...
    >>> record.id
    'Q99697'
    >>> record.name
    'Q99697'
    >>> record.annotations
    {}

Here ``record.annotations`` is empty because the FASTA format does not really
support sequence annotations.

The **very** nice thing about the ``SeqIO`` and ``SeqRecord`` mechanism is that
they allow to process all of the different formats described above (FASTA,
GenBank, UniProt) using the *very same code*!

Let's see what happens with a GenBank file::

    >>> record = SeqIO.parse(open("data.genbank"), "genbank").next()
    >>> record
    SeqRecord(seq=Seq('TAATACNCNCCCTTCGGTGCTTTGCACCGAAGGGGGGGN', IUPACAmbiguousDNA()), ...
    >>> from pprint import pprint
    >>> pprint(record.annotations)
    {'accessions': ['5I3U_F'],
     'data_file_division': 'SYN',
     'date': '31-AUG-2016',
     'keywords': [''],
     'organism': 'synthetic construct',
     'references': [Reference(title='Structural basis of HIV-1 resistance to AZT by excision', ...)],
     'source': 'synthetic construct',
     'taxonomy': ['other sequences', 'artificial sequences'],
     'topology': 'linear'}

This way, you can (almost) forget about the format of the sequence data you
are working with -- all the parsing is done automatically by Biopython.

****

Finally, the ``SeqIO.write()`` method allows to write a ``SeqRecord`` objects
or a **list** of ``SeqRecord`` objects to disk, in a given format::

    fr = open("data.genbank", "r")
    record = SeqIO.parse(fr, "genbank").next()

    fw = open("seq1.fasta", "w)
    SeqIO.write(record, fw, "fasta")

|

Biopython: Structures
---------------------

Biopython allows to manipulate polypeptide structures via the ``Bio.PDB`` module.

We will focus on data coming from the **Protein Data Bank** (PDB):

    http://www.rcsb.org/pdb/home/home.do

The PDB is by far the largest protein structure resource available online. At
the time of writing, it hosts more thank 120k distinct protein structures,
including protein-protein, protein-DNA, protein-RNA complexes.

In order to load the required code, just type::

    from Bio.PDB import *

.. note::
    For the documentation of the ``Bio.PDB`` module, see:

        http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

|

.. note::
    In the following we will use the Zaire Ebolavirus glicoprotein structure
    ``5JQ3``, available here:

       http://www.rcsb.org/pdb/explore/explore.do?structureId=5JQ3

    This is what this structure looks like:

    .. image:: figures/5JQ3.jpg
        :width: 80%

|

Protein Structure File Formats
------------------------------

The PDB distributes protein structures in **two** different formats:

#. The XML-based file format. It is *not* supported by Biopython.

#. The ``mmcif`` file format. (The actual file extension is ``.cif``.) See for
   instance:

      https://files.rcsb.org/view/5JQ3.cif

   This is the most *recent* and "parsable" file format.

#. The ``pdb`` file format, which is a specially formatted text file.
   See for instance:

      https://files.rcsb.org/view/5JQ3.pdb

   This is the *historical* file format. It is (mostly) still distributed
   because of legacy tools that do not (yet) understand the newer and
   cleaner ``mmcif`` files.

.. warning::

    Some (many?) PDB files distributed by the Protein Data Bank contain
    **formatting errors** that make them ambiguous or difficult to parse.

    The ``Bio.PDB`` module attempts to deal with these errors automatically.

    Of course, Biopython is not perfect, and some formatting errors may still
    make it do the wrong thing, or raise an exception.

    Depending on your goal, you can either:

    - manually correct the structure, e.g. if you have to analyze a single
      structure

    - ignore the error, e.g. when you are just computing global statistics
      about a large number of structures, and a few errors do not really impact
      the overall numbers

    We will see how to ignore these "spurious" errors soon.

The ``Bio.PDB`` module implements two different parsers, one for the ``pdb``
format and one for the ``mmcif`` format. Let's try them out.

Starting with::

   >>> from Bio.PDB import *

To load a ``cif`` file, use the ``Bio.MMCIF.MMCIFParser`` module::

    >>> parser = MMCIFParser()
    >>> struct = parser.get_structure("5jq3", "5jq3.cif")
    # four warnings
    >>> struct
    <Structure id=5jq3>
    >>> print type(struct)
    <class 'Bio.PDB.Structure.Structure'>

To load a ``pdb`` file, use the ``Bio.PDB.PDBParser`` module::

   >>> parser = PDBParser()
   >>> struct = parser.get_structure("5jq3", "5jq3.pdb")
   # four warnings
   >>> struct
   <Structure id=5jq3>
   >>> print type(struct)
   <class 'Bio.PDB.Structure.Structure'>

As you can see, both parsers give you the same kind of object, a ``Structure``.

The *header* of the structure is a ``dict`` which stores meta-data about the
protein structure, including its resolution (in Angstroms), name, author,
release date, etc.

To access it, write::

    >>> print struct.header.keys()
    ['structure_method', 'head', 'journal', 'journal_reference',
    'compound', 'keywords', 'name', 'author', 'deposition_date',
    'release_date', 'source', 'resolution', 'structure_reference']
    >>> print struct.header["name"]
    crystal structure of ebola glycoprotein
    >>> print struct.header["release_date"]
    2016-06-29
    >>> print struct.header["resolution"]
    2.23   # Å

.. warning::
    The header dictionary may be an empty, depending on the structure and
    the parser. Try it out with the ``mmcif`` parser!

.. note::
    You can download the required files directly from the PBD server with::

        >>> pdbl = PDBList()
        >>> pdbl.retrieve_pdb_file('1FAT')
        Downloading PDB structure '1FAT'...
        '/home/stefano/scratch/fa/pdb1fat.ent'

    The file will be put into the current working directory.

    The ``PDBList`` object also allows you to retrieve a full copy of the PDB
    database, if you wish to do so. Note that the database takes quite a bit of
    disk space.

|

Structures as Hierarchies
-------------------------

Structural information is natively hierarchical.

The hierarchy is represented by ``Bio.PDB`` as follows (image taken from the
Biopython Structural Bioinformatics FAQ):

.. image:: http://biopython.org/wiki/Smcra.png
    :width: 80%

**Example**. For instance, the ``5JQ3`` PDB structure is composed of:

- A single **model**, containing

  - Two **chains**:

    - chain A, containing 338 residues

    - chain B, containing 186 residues

    - Each **residue** is composed by multiple **atoms**, each having a 3D
      position represented as by (x, y, z) coordinates.

|

More specifically, here are the related ``Bio.PDB`` classes:

- A ``Structure`` describes one or more *models*, i.e. different 3D
  conformations of the very same structure (for NMR).

  The ``Structure.get_models()`` method returns an *iterator* over the models::

     >>> it = struct.get_models()
     >>> it
     <generator object get_models at 0x7f318f0b3fa0>
     >>> models = list(it)
     >>> models
     [<Model id=0>]
     >>> type(models[0])
     <class 'Bio.PDB.Model.Model'>

  Here I turned the iterator into a proper list with the ``list()`` function.

  A ``Structure`` also acts as a ``dict``: given a model ID, it returns the
  corresponding ``Model`` object::

     >>> struct[0]
     <Model id=0>

- A ``Model`` describes exactly one 3D conformation. It contains one or more
  chains.

  The ``Model.get_chain()`` method returns an *iterator* over the chains::

     >>> chains = list(models[0].get_chains())
     >>> chains
     [<Chain id=A>, <Chain id=B>]
     >>> type(chains[0])
     <class 'Bio.PDB.Chain.Chain'>
     >>> type(chains[1])
     <class 'Bio.PDB.Chain.Chain'>

  A ``Model`` also acts as a ``dict`` mapping from chain IDs to ``Chain``
  objects::

     >>> model = models[0]
     >>> model
     <Model id=0>
     >>> model["B"]
     <Chain id=B>

- A ``Chain`` describes a proper polypeptide structure, i.e. a consecutive
  sequence of bound residues.

  The ``Chain.get_residues()`` method returns an *iterator* over the residues::

     >>> residues = list(chains[0].get_residues())
     >>> len(residues)
     338
     >>> type(residues[0])
     <class 'Bio.PDB.Residue.Residue'>

- A ``Residue`` holds the atoms that belong to an amino acid.

  The ``Residue.get_atom()`` (unfortunately, "atom" is singular) returns an
  *iterator* over the atoms::

     >>> atoms = list(residues[0].get_atom())
     >>> atoms
     [<Atom N>, <Atom CA>, <Atom C>, <Atom O>, <Atom CB>, <Atom OG>]
     >>> type(atoms[0])
     <class 'Bio.PDB.Atom.Atom'>

- An ``Atom`` holds the 3D coordinate of an atom, as a ``Vector``::

     >>> atoms[0].get_vector()
     <Vector -68.82, 17.89, -5.51>

  The x, y, z coordinates in a ``Vector`` can be accessed with::

     >>> vector = atoms[0].get_vector()
     <Vector -68.82, 17.89, -5.51>
     >>> coords = list(vector.get_array())
     [-68.81500244, 17.88899994, -5.51300001]

  The other methods of the ``Vector`` class allow to compute the ``angle()``
  with another vector, compute its ``norm()``, map it to the unit sphere
  with ``normalize()``, and multiply the coordinates by a matrix with
  ``left_multiply()``.

**Example**. To iterate over all atoms in a structure, type::

    parser = PDBParser()
    struct = parser.get_structure("target", "5jq3.pdb")
    for model in struct:
        for chain in model:
            for res in chain:
                for atom in res:
                    print atom

and to print a single atom::

    >>> struct[0]["B"][502]["CA"].get_vector()
    <Vector -64.45, 8.02, 1.62>

There are also shortcuts to skip the middle levels::

    for res in model.get_residues():
        print residue

and you can go back "up" the hierarchy::

    model = chain.get_parent()

Finally, you can inspect the properties of atoms and residues using the
methods described here:

    http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ#analysis

The meaning of residue IDs are not obvious. From the Biopython Structural
Bioinformatics FAQ:

    **What is a residue id?**

    This is a bit more complicated, due to the clumsy PDB format. A residue id
    is a tuple with three elements:

    - The **hetero-flag**: this is ``"H_"`` plus the name of the hetero-residue
      (e.g. 'H_GLC' in the case of a glucose molecule), or 'W' in the case of a
      water molecule.

    - The **sequence identifier** in the chain, e.g. 100

    - The **insertion code**, e.g. 'A'. The insertion code is sometimes used to
      preserve a certain desirable residue numbering scheme. A Ser 80 insertion
      mutant (inserted e.g. between a Thr 80 and an Asn 81 residue) could e.g.
      have sequence identifiers and insertion codes as follows: Thr 80 A, Ser
      80 B, Asn 81. In this way the residue numbering scheme stays in tune with
      that of the wild type structure.  The id of the above glucose residue
      would thus be ('H_GLC', 100, 'A'). If the hetero-flag and insertion code
      are blank, the sequence identifier alone can be used:

    ::

        # Full id
        residue = chain[(' ', 100, ' ')]
        # Shortcut id
        residue = chain[100]

    The reason for the hetero-flag is that many, many PDB files use the same
    sequence identifier for an amino acid and a hetero-residue or a water,
    which would create obvious problems if the hetero-flag was not used.

|

Writing a ``pdb`` file
----------------------

Given a ``Structure`` object, it is easy to write it to disk in ``pdb`` format::

    struct = ... # a Structure

    io = PDBIO()
    io.set_structure(struct)
    io.save("output.pdb")

And that's it!

|

Structure Alignment
-------------------

In order to be comparable, structures must first be *aligned*: **superimposed**
on top of each other.

.. note::
    Structure alignment is very common problem.

    Say you have a wildtype protein and an SNP mutant, and you would like to
    compare the structural consequences of the mutation. Then in order to
    compare the structure, first you want them to match "as closely as
    possible".

    It also occurs in protein structure prediction: in order to evaluate the
    quality of a structural predictor (a software that takes the sequence of a
    protein and attempts to figure out how the structure looks like), the
    predictions produced by the software must be compared to the real
    corresponding protein structures. The same goes, of course, for dynamical
    molecular simulation software.

For a quick overview of several superposition methods, see:

    https://en.wikipedia.org/wiki/Structural_alignment

The most basic superposition algorithms work by **rototranslating** the two
proteins so to maximize their alignment.

In particular, the quality of the alignment is measured by the Root Mean
Squared Error (RMSD), defined as:

.. math::
    RMSD(s,s') = \sqrt{\frac{1}{n} \sum_{i=1}^n \| \textbf{x}_i - \textbf{x'}_i \|^2  }

This is a least-squares minimization problem, and is solved using standard
decomposition techniques.

.. warning::
    This is a useful, but naive interpretation of the structural alignment
    problem. It may not be suitable to the specific task.

    For instance, it is well known that there is a correlation between the RMSD
    values and the length of the protein, which makes average RMSD values
    computed over different proteins very biased.

    Different models of structure alignment avoid this issue, such as TM-score:

        http://zhanglab.ccmb.med.umich.edu/TM-score/

    We will not cover them here.

|

The ``Bio.PDB`` module supports RMSD-based structural alignment via the
``Superimposer`` class, let's see how.

First, let's extract the list of atoms from the two chains of our structure::

    >>> atoms_a = list(struct[0]["A"].get_atoms())
    >>> atoms_b = list(struct[0]["B"].get_atoms())

We make sure that the two lists of atoms are of the same length; the
``Superimposer`` does not work otherwise::

    >>> atoms_a = atoms_a[:len(atoms_b)]

Now we create a ``Superimposer`` object and specify the two lists of atoms that
should be aligned::

    >>> si = Superimposer()
    >>> si.set_atoms(atoms_a, atoms_b)

Here the atoms ``atoms_a`` are fixed, while the atoms ``atoms_b`` are movable,
i.e. the second structure is aligned on top of the first one.

The superposition occurs **automatically**, behind the scene. We can access
the minimum RMSD value from the ``si`` object with::

    >>> print si.rms
    26.463458137

as well as the rotation matrix and the translation vector::

    >>> rotmatrix = si.rotran[0]
    >>> rotmatrix
    array([[-0.43465167,  0.59888164,  0.67262078],
           [ 0.63784155,  0.73196744, -0.23954504],
           [-0.63579563,  0.32490683, -0.70014246]])
    >>> transvector = si.rotran[1]
    >>> transvector

Finally, we can apply the optimal rototranslation to the movable atoms::

    array([-92.66668863,  31.1358793 ,  17.84958153])
    >>> si.apply(atoms_b)