====================
Python: Dictionaries
====================

A dictionary represents a *map* between objects: it maps from a *key* to the
corresponding *value*.

.. warning::

    Just like lists, dictionaries are **mutable**!

The abstract syntax for defining a dictionary is::

    { key1: value1, key2: value2, ... }

|

****

**Example**. In order to define a dictionary implementing the *genetic code*,
we write::

    genetic_code = {
        "UUU": "F",     # phenylalanine
        "UCU": "S",     # serine
        "UAU": "Y",     # tyrosine
        "UGU": "C",     # cysteine
        "UUC": "F",     # phenylalanine
        "UCC": "S",     # serine
        "UAC": "Y",     # tyrosine
        # etc.
    }

Here ``genetic_code`` maps from three-letter nucleotide strings (the keys) to
the corresponding amino acid character (the value).

To use a dictionary, we resort to the usual *extraction* operator, as follows::

    >>> aa = genetic_code["UUU"]
    >>> print aa                    # phenylalanine
    F

    >>> aa = genetic_code["UCU"]
    >>> print aa                    # serine
    S

I can use the ``genetic_code`` dictionary to "simulate" the process of
translation and convert an RNA sequence into the corresponding amino acid
sequence. For instance, starting from the following mRNA sequence::

    rna = "UUUUCUUAUUGUUUCUCC"

I can split it into triples::

    >>> triples = [rna[i:i+3] for i in range(0, len(rna), 3)]
    >>> print triples
    ['UUU', 'UCU', 'UAU', 'UGU', 'UUC', 'UCC']

At this point, I can translate the triples with::

    >>> aminoacids = [genetic_code[triple] for triple in triples]
    >>> protein = "".join(aminoacids)
    >>> print protein
    FSYCFS

Of course, this is a very simple functional model of translation. The
most obvious difference is that the above Python code does *not* care about
termination codons. Support for them will be added in due time.
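
As a recap, the steps above can be put together in one self-contained sketch.
Only the seven codons listed earlier are included, so any other codon would
raise a ``KeyError``; ``print(...)`` is written with parentheses so that the
snippet runs on both Python 2.7 and Python 3.

```python
# Partial genetic code: only the codons listed in the example above
genetic_code = {
    "UUU": "F", "UCU": "S", "UAU": "Y", "UGU": "C",
    "UUC": "F", "UCC": "S", "UAC": "Y",
}

rna = "UUUUCUUAUUGUUUCUCC"

# Split the sequence into consecutive, non-overlapping triples
triples = [rna[i:i+3] for i in range(0, len(rna), 3)]

# Look up every triple in the dictionary and glue the results together
aminoacids = [genetic_code[triple] for triple in triples]
protein = "".join(aminoacids)

print(protein)      # FSYCFS
```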

.. warning::

    Keys are unique: the same key cannot be used more than once.

    Values are *not* unique: different keys can map to the same value.

    In the genetic code example, each key is a unique three-letter string,
    which is associated with a given value; the same value (for instance
    ``"S"``, serine) is associated with multiple keys::

        print genetic_code["UCU"]           # "S", serine
        print genetic_code["UCC"]           # "S", serine
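
To see what "keys are unique" means in practice, here is a small sketch in
which the same key is (deliberately) written twice in a dictionary definition.
Python does not complain: it simply keeps the last value.

```python
code = {
    "UCU": "M",     # first attempt (wrong amino acid)
    "UCU": "S",     # same key again: this value replaces the previous one
}

print(len(code))        # 1
print(code["UCU"])      # S
```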

****

**Example**. Let's build a dictionary that maps from amino acids to their
(approximate) volume in cubic ångströms::

    volume_of = {
        "A":  67.0, "C":  86.0, "D":  91.0,
        "E": 109.0, "F": 135.0, "G":  48.0,
        "H": 118.0, "I": 124.0, "K": 135.0,
        "L": 124.0, "M": 124.0, "N":  96.0,
        "P":  90.0, "Q": 114.0, "R": 148.0,
        "S":  73.0, "T":  93.0, "V": 105.0,
        "W": 163.0, "Y": 141.0,
    }

    # Print the volume of a cysteine
    print volume_of["C"]                    # 86.0
    print type(volume_of["C"])              # float

Here the keys are strings and the values are floats.
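
As a possible use of such a dictionary, the sketch below sums the volumes of
the residues of a short peptide. The peptide ``"FSYCFS"`` is the one obtained
in the translation example; only the four amino acids it contains are copied
into ``volume_of`` here, to keep the snippet short.

```python
# Subset of the volumes listed above (cubic angstroms)
volume_of = {"F": 135.0, "S": 73.0, "Y": 141.0, "C": 86.0}

peptide = "FSYCFS"

# One lookup per residue, then add everything up
total_volume = sum([volume_of[aa] for aa in peptide])

print(total_volume)     # 643.0
```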

****

|

.. warning::

    There are no restrictions on the kinds of objects that can appear as
    *values*.

    The *keys*, however, **must** be immutable (more precisely, *hashable*)
    objects. This means that ``list`` and ``dict`` objects cannot be used as
    keys.

    More formally, here is what object types you can use where:

    ===== =========== ===========
    Type  As keys     As values
    ===== =========== ===========
    bool  ✓           ✓
    int   ✓           ✓
    float ✓           ✓
    str   ✓           ✓
    list  **NO**      ✓
    tuple ✓           ✓
    dict  **NO**      ✓
    ===== =========== ===========

    This restriction is due to how dictionaries are implemented in Python
    (and most other programming languages, really).
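
The table can be checked experimentally. The sketch below tries to use one
object of each type as a key and records whether Python accepts it; the
variable names are made up for this illustration. It also shows a corner
case: a tuple is only usable as a key if everything inside it is hashable
too.

```python
candidates = [True, 42, 3.14, "UUU", (1, 2), [1, 2], {"a": 1}]

can_be_key = {}
for obj in candidates:
    type_name = type(obj).__name__
    try:
        {obj: "some value"}             # try to build a one-element dict
        can_be_key[type_name] = True
    except TypeError:                   # raised for unhashable objects
        can_be_key[type_name] = False

for type_name in sorted(can_be_key):
    print(type_name + ": " + str(can_be_key[type_name]))

# A tuple that contains a list is not hashable either
try:
    {(1, [2, 3]): "some value"}
    nested_ok = True
except TypeError:
    nested_ok = False

print(nested_ok)        # False
```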

****

**Example**. Let's create a dictionary that maps from amino acids to a list
of two properties, mass and volume::

    properties_of = {
        "A": [ 89.09,  67.0],
        "C": [121.15,  86.0],
        "D": [133.10,  91.0],
        "E": [147.13, 109.0],
        # ...
    }

    # Print the properties of alanine
    print properties_of["A"]                # [89.09, 67.0]
    print type(properties_of["A"])          # list

Here the keys are ``str`` (immutable) and the values are ``list`` (mutable).
Can I create the *inverse* dictionary (from property list to amino acids)?

No. Let's write the very first key-value pair::

    aa_of = { [89.09, 67.0]: "A" }

Ideally, this dictionary would map from the list::

    [89.09, 67.0]

to ``"A"``. However Python raises an error::

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'list'
    #          ^^^^^^^^^^^^^^^^^^^^^^^
    #           read: list is mutable!

To solve the problem, I can use tuples in place of lists::

    aa_of = {
        ( 89.09,  67.0): "A",
        (121.15,  86.0): "C",
        (133.10,  91.0): "D",
        (147.13, 109.0): "E",
        # ...
    }

    aa = aa_of[(133.10,  91.0)]
    print aa                        # "D"
    print type(aa)                  # str

Now that the keys are immutable, everything works.
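
Rather than typing the inverse dictionary by hand, it can also be derived from
``properties_of``. The sketch below assumes Python 2.7 or later (it uses a
dictionary comprehension) and converts each property list to a tuple on the
fly.

```python
properties_of = {
    "A": [ 89.09,  67.0],
    "C": [121.15,  86.0],
    "D": [133.10,  91.0],
    "E": [147.13, 109.0],
}

# tuple(props) turns the (mutable) list into an (immutable) tuple,
# which is allowed as a key
aa_of = {tuple(props): aa for aa, props in properties_of.items()}

print(aa_of[(133.10, 91.0)])    # D
```

Note that this only works because no two amino acids share the same property
list: otherwise one of them would silently overwrite the other.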

****

|

Operations
----------

=========== ======================= ================================================
Returns     Operator                Meaning
=========== ======================= ================================================
``int``     ``len(dict)``           Returns the number of key-value pairs
``object``  ``dict[object]``        Extracts the value associated with a key
--          ``dict[object]=object`` Inserts or replaces a key-value pair
=========== ======================= ================================================

The only major behavioral difference with respect to lists lies in the
assignment operator (the last one). The syntax and the meaning are the same as
for lists; however, for dictionaries it can also be used to add entirely *new*
key-value pairs. Let's see a few examples.

****

**Example**. Starting from an empty dictionary::

    code = {}

    print code
    print len(code)

I want to build (this time, incrementally) the genetic code dictionary I
introduced in the very first example of this chapter. Let's add the key-value
pairs one by one with the assignment operator::

    code["UUU"] = "F"                 # phenylalanine
    code["UCU"] = "M"                 # methionine
    code["UAU"] = "Y"                 # tyrosine
    # ...

    print code
    print len(code)

Here I am adding *new* key-value pairs to a dictionary.

Whoops, I made a mistake! ``"UCU"`` should map to an ``"S"``, not to an
``"M"``! I can solve the problem by *replacing* the value associated to the key
``"UCU"``, using the same syntax as above::

    code["UCU"] = "S"                 # serine

    print code
    print len(code)

So, if the *key* was already there, the assignment operator simply replaces
the *value* it is associated with.

Here I associate a new value, ``"S"`` (serine), with the key ``"UCU"``, which
was already in the dictionary.
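
The difference between inserting and replacing shows up in the length of the
dictionary. Here is the same example again as a compact sketch, with the two
lengths stored in (made-up) variables so they can be compared:

```python
code = {}

code["UUU"] = "F"                   # new key: the dictionary grows
code["UCU"] = "M"                   # new key: the dictionary grows
size_after_insert = len(code)

code["UCU"] = "S"                   # existing key: only the value changes
size_after_replace = len(code)

print(size_after_insert)            # 2
print(size_after_replace)           # 2
print(code["UCU"])                  # S
```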

****

.. warning::

    It does not make sense, however, to *extract* values associated to keys
    not present in the dictionary. For instance::

        >>> code[":-("]

    makes Python raise an error::

        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        KeyError: ':-('
        #         ^^^^^
        #          the dictionary contains no such key
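
If a missing key is a real possibility, the error can be handled instead of
letting the program stop. The sketch below shows two common options: catching
the ``KeyError``, and the ``get`` method of dictionaries, which returns a
fallback value (here ``"?"``, an arbitrary choice) when the key is absent.

```python
code = {"UUU": "F", "UCU": "S"}

# Option 1: catch the error
try:
    aa = code[":-("]
except KeyError:
    aa = None
    print("no such key")

# Option 2: ask for a fallback value instead of an error
aa_or_unknown = code.get(":-(", "?")
print(aa_or_unknown)    # ?
```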

|

Methods
-------

=================== =========================== =============================================
Returns             Method                      Meaning
=================== =========================== =============================================
``bool``            ``dict.has_key(object)``    ``True`` if the object is a key of the dict
``list``            ``dict.keys()``             Get the keys as a list
``list``            ``dict.values()``           Get the values as a list
``list-of-tuples``  ``dict.items()``            Get the key-value pairs as a list of pairs
=================== =========================== =============================================

****

**Example**. Starting from ``code``::

    code = {
        "UUU": "F",     # phenylalanine
        "UCU": "S",     # serine
        "UAU": "Y",     # tyrosine
        "UGU": "C",     # cysteine
        "UUC": "F",     # phenylalanine
        "UCC": "S",     # serine
        "UAC": "Y",     # tyrosine
        # ...
    }

I can get the list of keys::

    >>> keys = code.keys()
    >>> print keys
    ["UUU", "UCU", "UAU", ...]

and the list of values::

    >>> values = code.values()
    >>> print values
    ["F", "S", "Y", "C", ...]

and the list of both keys and values::

    >>> key_value_pairs = code.items()
    >>> print key_value_pairs
    [("UUU", "F"), ("UCU", "S"), ...]

Now that I have a bunch of lists, I can apply any of the list
operations/methods to perform the tasks I need!

Finally, to check whether a given object appears as a key, I can write::

    >>> print code.has_key("UUU")
    True
    >>> print code.has_key(":-(")
    False
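
A word of caution for readers on Python 3: there ``has_key`` no longer exists,
and ``keys()``, ``values()`` and ``items()`` return list-like *views* rather
than actual lists. The sketch below sticks to forms that behave the same way
in Python 2.7 and Python 3: the ``in`` operator for membership, and
``sorted(...)`` to obtain real lists.

```python
code = {"UUU": "F", "UCU": "S", "UAU": "Y"}

# Membership test: works in every Python version
has_uuu = "UUU" in code
has_smiley = ":-(" in code
print(has_uuu)          # True
print(has_smiley)       # False

# sorted() always returns a list, whatever keys()/items() return
keys = sorted(code.keys())
pairs = sorted(code.items())
print(keys)             # ['UAU', 'UCU', 'UUU']
print(pairs)            # [('UAU', 'Y'), ('UCU', 'S'), ('UUU', 'F')]
```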

****

.. warning::

    Key-value pairs are stored in the dictionary in a seemingly arbitrary
    order.

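    In other words, Python 2 (the version used in this chapter) does **not**
    guarantee that the order in which the key-value pairs are inserted is
    preserved. (Starting with Python 3.7, dictionaries do preserve insertion
    order.)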

    For instance::

        >>> d = {}
        >>> d["z"] = "zeta"
        >>> d["a"] = "a"
        >>> d
        {'a': 'a', 'z': 'zeta'}
        >>> d.keys()
        ['a', 'z']
        >>> d.values()
        ['a', 'zeta']
        >>> d.items()
        [('a', 'a'), ('z', 'zeta')]

    Here I inserted ``("z", "zeta")`` first, and ``("a", "a")`` second.
    However, the order in which they are *stored* in the dictionary is the
    exact opposite!
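
When a predictable order is needed, one simple option is to sort the keys
explicitly rather than relying on how the dictionary stores them. A minimal
sketch:

```python
d = {}
d["z"] = "zeta"
d["a"] = "a"

# sorted(d) returns the keys of d as a sorted list
ordered_pairs = [(key, d[key]) for key in sorted(d)]

print(ordered_pairs)    # [('a', 'a'), ('z', 'zeta')]
```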

****

**Example**. We can use a dictionary to represent a complex structured
object, for instance the properties of a protein chain::

    chain = {
        "name": "1A3A",
        "chain": "B",
        "sequence": "MANLFKLGAENIFLGRKAATK...",
        "num_scop_domains": 4,
        "num_pfam_domains": 1,
    }

    print chain["name"]
    print chain["sequence"]
    print chain["num_scop_domains"]

Of course, writing a dictionary like this by hand is inconvenient. We will
later see how such dictionaries can be created by reading the data
automatically out of some biological database.

(Imagine having 100k such dictionaries to analyze!)
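
To give an idea of what such an analysis could look like, here is a sketch
with a list of just two chain dictionaries. The first one repeats (part of)
the example above; the second one, named ``"XXXX"``, is entirely made up for
illustration.

```python
chains = [
    {"name": "1A3A", "chain": "B",
     "num_scop_domains": 4, "num_pfam_domains": 1},
    # Made-up entry, NOT a real protein chain
    {"name": "XXXX", "chain": "A",
     "num_scop_domains": 0, "num_pfam_domains": 2},
]

# Names of the chains that have at least one SCOP domain
names_with_scop = [c["name"] for c in chains if c["num_scop_domains"] > 0]

# Total number of Pfam domains over all chains
total_pfam = sum([c["num_pfam_domains"] for c in chains])

print(names_with_scop)      # ['1A3A']
print(total_pfam)           # 3
```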

****

**Example**. Given the following FASTA sequence (cut down to a reasonable
size) describing the primary sequence of the HIV-1 `reverse transcriptase
<http://www.rcsb.org/pdb/101/motm.do?momID=33>`_ protein (taken from
the `PDB <http://www.rcsb.org/pdb/explore/explore.do?structureId=2hmi>`_)::

    >2HMI:A|PDBID|CHAIN|SEQUENCE
    PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
    NKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAF
    QSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLL
    VQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVRQLSKLLRGTKALT
    PSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTE
    WWTEYWQATWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRET
    AIYLALQDSGLEVNIVTDSQYALGIIQAQPDKSESELVNQIIEQLIKKEK
    >2HMI:B|PDBID|CHAIN|SEQUENCE
    PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
    NKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAF
    QSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLL
    VQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVRQLSKLLRGTKALT
    PSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTE
    WWTEYWQATWIPEWEFVNTPPLVKLWYQLE
    >2HMI:C|PDBID|CHAIN|SEQUENCE
    DIQMTQTTSSLSASLGDRVTISCSASQDISSYLNWYQQKPEGTVKLLIYY
    EDFATYYCQQYSKFPWTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGA
    NSWTDQDSKDSTYSMSSTLTLTADEYEAANSYTCAATHKTSTSPIVKSFN
    >2HMI:D|PDBID|CHAIN|SEQUENCE
    QITLKESGPGIVQPSQPFRLTCTFSGFSLSTSGIGVTWIRQPSGKGLEWL
    FLNMMTVETADTAIYYCAQSAITSVTDSAMDHWGQGTSVTVSSAATTPPS
    TVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSTWPSETVTCNVAHP
    >2HMI:E|PDBID|CHAIN|SEQUENCE
    ATGGCGCCCGAACAGGGAC
    >2HMI:F|PDBID|CHAIN|SEQUENCE
    GTCCCTGTTCGGGCGCCA

We'd like to compile a dictionary with the same information::

    sequences_2HMI = {
        "A": "PISPIETVPVKLKPGMDGPKVKQWPLTEEKI...",
        "B": "PISPIETVPVKLKPGMDGPKVKQWPLTEEKI...",
        "C": "DIQMTQTTSSLSASLGDRVTISCSASQDISS...",
        "D": "QITLKESGPGIVQPSQPFRLTCTFSGFSLST...",
        "E": "ATGGCGCCCGAACAGGGAC",
        "F": "GTCCCTGTTCGGGCGCCA",
    }

From this dictionary, it's very easy to extract the sequence of every
individual chain::

    >>> print sequences_2HMI["F"]
    "GTCCCTGTTCGGGCGCCA"

as well as computing how many chains there are::

    num_catene = len(sequences_2HMI)
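
Such a dictionary does not have to be typed by hand. The sketch below shows
one possible way of building it from FASTA text. The function name
``parse_fasta`` is made up, and the code assumes that every header looks like
``>2HMI:E|PDBID|CHAIN|SEQUENCE``, as in the listing above; to keep the snippet
short, only chains E and F are included.

```python
fasta_text = """>2HMI:E|PDBID|CHAIN|SEQUENCE
ATGGCGCCCGAACAGGGAC
>2HMI:F|PDBID|CHAIN|SEQUENCE
GTCCCTGTTCGGGCGCCA
"""

def parse_fasta(text):
    """Map each chain letter to its sequence (hypothetical helper)."""
    sequences = {}
    chain = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # ">2HMI:E|PDBID|CHAIN|SEQUENCE" -> "2HMI:E" -> "E"
            chain = line[1:].split("|")[0].split(":")[1]
            sequences[chain] = ""
        else:
            # A sequence may span several lines: append each piece
            sequences[chain] += line
    return sequences

sequences = parse_fasta(fasta_text)
print(sequences["F"])       # GTCCCTGTTCGGGCGCCA
print(len(sequences))       # 2
```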

****

**Example**. Dictionaries can be used to describe *histograms*. For instance,
let's take a sequence::

    seq = "GTCCCTGTTCGGGCGCCA"

We compute the number of the various nucleotides::

    num_A = seq.count("A")                          # 1
    num_T = seq.count("T")                          # 4
    num_C = seq.count("C")                          # 7
    num_G = seq.count("G")                          # 6

It is easy to write down a corresponding histogram into a dictionary::

    histogram = {
        "A": float(num_A) / len(seq),               # 1 / 18 ~ 0.06
        "T": float(num_T) / len(seq),               # 4 / 18 ~ 0.22
        "C": float(num_C) / len(seq),               # 7 / 18 ~ 0.39
        "G": float(num_G) / len(seq),               # 6 / 18 ~ 0.33
    }

Let's say we are now interested in the proportion of adenine; we can write::

    prop_A = histogram["A"]
    print prop_A

We can also check whether the histogram represents a "true" multinomial
distribution by checking whether the sum of the probabilities is
(approximately) 1::

    print histogram["A"] + histogram["T"] + histogram["C"] + histogram["G"]
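
The same histogram can be built more compactly, without the intermediate
``num_*`` variables. The sketch below assumes Python 2.7 or later (dictionary
comprehension) and compares the sum to 1 with a small tolerance, since
floating-point sums are rarely exact.

```python
seq = "GTCCCTGTTCGGGCGCCA"

# One entry per nucleotide: count / total length
histogram = {nt: float(seq.count(nt)) / len(seq) for nt in "ATCG"}

# The proportions should add up to (approximately) 1
total = sum(histogram.values())
print(abs(total - 1.0) < 1e-9)      # True
```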

****

**Example**. Dictionaries are also very useful for describing protein
interaction networks (physical, functional, genomic, you name it)::

    partners_of = {
        "2JWD": ("1A3A",),
        "1A3A": ("2JWD", "3BLU", "2ZTI"),
        "2ZTI": ("1A3A", "3BLF"),
        "3BLU": ("1A3A", "3BLF"),
        "3BLF": ("3BLU", "2ZTI"),
    }

which represents the following network::

    2JWD ----- 1A3A ----- 2ZTI
                 |          |
                 |          |
               3BLU ----- 3BLF

Here ``partners_of["1A3A"]`` is a tuple with all of the binding partners of
protein 1A3A.

We can use the dictionary to compute the n-step neighborhood of 1A3A::

    # 1-step neighborhood of 1A3A
    neighborhood_at_1 = partners_of["1A3A"]

    # 2-step neighborhood of 1A3A (it may include repeated elements):
    # collect the partners q of every protein p found at step 1
    neighborhood_at_2 = \
        [q for p in neighborhood_at_1 for q in partners_of[p]]

    # 3-step neighborhood of 1A3A (again, may include repeated elements)
    neighborhood_at_3 = \
        [q for p in neighborhood_at_2 for q in partners_of[p]]
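
Repeated elements can be avoided by storing each neighborhood in a ``set``,
a built-in collection that keeps every element only once. The following is a
sketch of that idea, not the only way to do it:

```python
partners_of = {
    "2JWD": ("1A3A",),
    "1A3A": ("2JWD", "3BLU", "2ZTI"),
    "2ZTI": ("1A3A", "3BLF"),
    "3BLU": ("1A3A", "3BLF"),
    "3BLF": ("3BLU", "2ZTI"),
}

# 1-step neighborhood: the direct partners of 1A3A
neighborhood_at_1 = set(partners_of["1A3A"])

# 2-step neighborhood: partners of the partners, without repetitions
neighborhood_at_2 = set(
    [q for p in neighborhood_at_1 for q in partners_of[p]])

# 3-step neighborhood: one more step along the network
neighborhood_at_3 = set(
    [q for p in neighborhood_at_2 for q in partners_of[p]])

print(sorted(neighborhood_at_1))    # ['2JWD', '2ZTI', '3BLU']
print(sorted(neighborhood_at_2))    # ['1A3A', '3BLF']
print(sorted(neighborhood_at_3))    # ['2JWD', '2ZTI', '3BLU']
```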

Note that the very same idea can be used to encode *social networks* (Facebook,
Twitter, Google+), to see who-is-friends-with-who, and to discover communities
in said networks.

.. note::

    Dictionaries are just one of the many ways to encode a network (or graph).
    An alternative is to use an `adjacency matrix <https://en.wikipedia.org/wiki/Adjacency_matrix>`_,
    which can be implemented (as usual) as a list of lists.
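
For comparison, here is a sketch of how the adjacency matrix of the same
network could be derived from ``partners_of``. Rows and columns follow the
alphabetical order of the protein names; the helper dictionary ``index_of``
is made up for this illustration.

```python
partners_of = {
    "2JWD": ("1A3A",),
    "1A3A": ("2JWD", "3BLU", "2ZTI"),
    "2ZTI": ("1A3A", "3BLF"),
    "3BLU": ("1A3A", "3BLF"),
    "3BLF": ("3BLU", "2ZTI"),
}

# Fix an order for the proteins and remember the position of each one
proteins = sorted(partners_of.keys())
index_of = {}
for i in range(len(proteins)):
    index_of[proteins[i]] = i

# Start from an n-by-n matrix of zeros, then mark every interaction with 1
n = len(proteins)
matrix = [[0] * n for _ in range(n)]
for protein in proteins:
    for partner in partners_of[protein]:
        matrix[index_of[protein]][index_of[partner]] = 1

print(proteins)     # ['1A3A', '2JWD', '2ZTI', '3BLF', '3BLU']
for row in matrix:
    print(row)
```

Since the interactions are symmetric, so is the matrix: the entry in row *i*,
column *j* is the same as the one in row *j*, column *i*.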