Python: Dictionaries

A dictionary represents a map between objects: it maps from a key to the corresponding value.

Warning

Just like lists, dictionaries are mutable!

The abstract syntax for defining a dictionary is:

{ key1: value1, key2: value2, ... }


Example. In order to define a dictionary implementing the genetic code, we write:

genetic_code = {
    "UUU": "F",     # phenylalanine
    "UCU": "S",     # serine
    "UAU": "Y",     # tyrosine
    "UGU": "C",     # cysteine
    "UUC": "F",     # phenylalanine
    "UCC": "S",     # serine
    "UAC": "Y",     # tyrosine
    # etc.
}

Here genetic_code maps from three-letter nucleotide strings (the keys) to the corresponding amino acid character (the value).

To use a dictionary, we resort to the usual extraction operator, as follows:

>>> aa = genetic_code["UUU"]
>>> print aa
F

>>> aa = genetic_code["UCU"]
>>> print aa
S

I can use the genetic_code dictionary to “simulate” the process of translation and convert an RNA sequence into the corresponding amino acid sequence. For instance, starting from the following mRNA sequence:

rna = "UUUUCUUAUUGUUUCUCC"

I can split it into triples:

>>> triples = [rna[i:i+3] for i in range(0, len(rna), 3)]
>>> print triples
['UUU', 'UCU', 'UAU', 'UGU', 'UUC', 'UCC']

At this point, I can translate the triples with:

>>> aminoacids = [genetic_code[triple] for triple in triples]
>>> protein = "".join(aminoacids)
>>> print protein
FSYCFS

Of course, this is a very simple functional model of translation. The most obvious difference from the real process is that the above Python code ignores termination codons. Support for them will be added in due time.
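As a minimal sketch of how termination codons might be handled, the code below stops translating at the first stop codon. UAA, UAG and UGA are the stop codons of the standard genetic code; the stop_codons tuple, the example mRNA and the loop itself are illustrative assumptions, not necessarily the approach taken later on. print is written with parentheses so that the snippet runs on both Python 2 and Python 3.

```python
# Partial genetic code, as in the example above
genetic_code = {
    "UUU": "F", "UCU": "S", "UAU": "Y", "UGU": "C",
    "UUC": "F", "UCC": "S", "UAC": "Y",
}

# Termination codons of the standard genetic code
stop_codons = ("UAA", "UAG", "UGA")

# Hypothetical mRNA: three coding triples, a stop codon, two more triples
rna = "UUUUCUUAUUAAUGUUUC"

triples = [rna[i:i+3] for i in range(0, len(rna), 3)]

aminoacids = []
for triple in triples:
    if triple in stop_codons:
        break                   # translation ends at the first stop codon
    aminoacids.append(genetic_code[triple])

protein = "".join(aminoacids)
print(protein)                  # FSY
```

The triples after "UAA" are never looked up, so the result only contains the amino acids encoded before the stop codon.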

Warning

Keys are unique: the same key cannot be used more than once.

Values are not unique: different keys can map to the same value.

In the genetic code example, each key is a unique three-letter string associated with a given value; the same value (for instance "S", serine) is associated with multiple keys:

print genetic_code["UCU"]           # "S", serine
print genetic_code["UCC"]           # "S", serine
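To make the two rules concrete, here is a small sketch (the three-entry dictionary is a hypothetical fragment used only for illustration). When the same key is written twice in a dictionary literal, Python keeps only the last value, while a value can be shared by any number of keys.

```python
# The key "UCU" appears twice in the literal: only the last value survives
code = {"UCU": "M", "UCC": "S", "UCU": "S"}
print(len(code))                # 2
print(code["UCU"])              # S

# Different keys may share a value: collect every codon mapping to "S"
serine_codons = sorted([codon for codon in code if code[codon] == "S"])
print(serine_codons)            # ['UCC', 'UCU']
```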

Example. Let’s build a dictionary that maps from amino acids to their (approximate) volume in cubic ångströms:

volume_of = {
    "A":  67.0, "C":  86.0, "D":  91.0,
    "E": 109.0, "F": 135.0, "G":  48.0,
    "H": 118.0, "I": 124.0, "K": 135.0,
    "L": 124.0, "M": 124.0, "N":  96.0,
    "P":  90.0, "Q": 114.0, "R": 148.0,
    "S":  73.0, "T":  93.0, "V": 105.0,
    "W": 163.0, "Y": 141.0,
}

# Print the volume of a cysteine
print volume_of["C"]                    # 86.0
print type(volume_of["C"])              # float

Here the keys are strings and the values are floats.



Warning

There are no restrictions on the kinds of objects that can appear as values.

The keys, however, must be immutable objects (more precisely, hashable ones). This means that list and dict objects cannot be used as keys.

More formally, here is what object types you can use where:

Type     As keys   As values
bool     yes       yes
int      yes       yes
float    yes       yes
str      yes       yes
list     NO        yes
tuple    yes       yes
dict     NO        yes

This restriction is due to how dictionaries are implemented in Python (and most other programming languages, really).
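The table can be checked empirically. The following sketch tries to use one object of each type as a key and records whether Python accepts it; the candidates list and the usable_as_key dictionary are names made up for this illustration.

```python
candidates = [True, 42, 3.14, "UUU", (89.09, 67.0), [89.09, 67.0], {"A": 1}]

usable_as_key = {}
for obj in candidates:
    type_name = type(obj).__name__
    try:
        {obj: "some value"}     # try to build a one-entry dictionary
        usable_as_key[type_name] = True
    except TypeError:           # raised when the object is unhashable
        usable_as_key[type_name] = False

for type_name in sorted(usable_as_key):
    print(type_name + " " + str(usable_as_key[type_name]))
```

Only list and dict are rejected. Keep in mind that a tuple is accepted only if all of its elements are themselves hashable: a tuple that contains a list cannot be used as a key.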


Example. Let’s create a dictionary that maps from amino acids to a list of two properties, mass and volume:

properties_of = {
    "A": [ 89.09,  67.0],
    "C": [121.15,  86.0],
    "D": [133.10,  91.0],
    "E": [147.13, 109.0],
    # ...
}

# Print the properties of alanine
print properties_of["A"]                # [89.09, 67.0]
print type(properties_of["A"])          # list

Here the keys are str (immutable) and the values are list (mutable). Can I create the inverse dictionary (from property list to amino acids)?

No. Let’s write the very first key-value pair:

aa_of = { [89.09, 67.0]: "A" }

Ideally, this dictionary would map from the list:

[89.09, 67.0]

to "A". However, Python raises an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: unhashable type: 'list'
#          ^^^^^^^^^^^^^^^^^^^^^^^
#           read: list is mutable!

To solve the problem, I can use tuples in place of lists:

aa_of = {
    ( 89.09,  67.0): "A",
    (121.15,  86.0): "C",
    (133.10,  91.0): "D",
    (147.13, 109.0): "E",
    # ...
}

aa = aa_of[(133.10,  91.0)]
print aa                        # "D"
print type(aa)                  # str

Now that the keys are immutable, everything works.



Operations

Returns   Operator               Meaning
int       len(dict)              Return the number of key-value pairs
object    dict[object]           Extracts the value associated to a key
-         dict[object]=object    Inserts or replaces a key-value pair

These operators work much like their list counterparts. The only major behavioral difference lies in the assignment operator (the last one): the syntax is the same as for lists, and so is the meaning, but for dictionaries it can also be used to add entirely new key-value pairs. Let’s see a few examples.


Example. Starting from an empty dictionary:

code = {}

print code
print len(code)

I want to build (this time, incrementally) the genetic code dictionary I introduced in the very first example of this chapter. Let’s add the key-value pairs one by one with the assignment operator:

code["UUU"] = "F"                 # phenylalanine
code["UCU"] = "M"                 # methionine
code["UAU"] = "Y"                 # tyrosine
# ...

print code
print len(code)

Here I am adding new key-value pairs to a dictionary.

Whoops, I made a mistake! "UCU" should map to an "S", not to an "M"! I can solve the problem by replacing the value associated to the key "UCU", using the same syntax as above:

code["UCU"] = "S"                 # serine

print code
print len(code)

So, if the key was already there, the assignment operator simply replaces the value it is associated with.

Here the key "UCU", which was already in the dictionary, is associated with the new value "S" (serine).


Warning

It does not make sense, however, to extract values associated to keys not present in the dictionary. For instance:

>>> code[":-("]

makes Python raise an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: ":-("
#         ^^^^^
#          the dictionary contains no such key
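When I am not sure that a key is present, there are a few standard ways to avoid the KeyError. The sketch below shows three of them: the in operator, the get() method with a fallback value, and an explicit try/except. The three-entry code dictionary and the "?" fallback are assumptions chosen for the example.

```python
code = {"UUU": "F", "UCU": "S", "UAU": "Y"}

# 1. Test for the key first with the "in" operator
if ":-(" in code:
    found = code[":-("]
else:
    found = "no such key"
print(found)                    # no such key

# 2. Use get(), which returns a fallback value instead of raising an error
fallback = code.get(":-(", "?")
print(fallback)                 # ?

# 3. Catch the KeyError explicitly
try:
    aa = code[":-("]
except KeyError:
    aa = None
print(aa)                       # None
```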

Methods

Returns          Method                  Meaning
bool             dict.has_key(object)    True if the object is a key of the dict
list             dict.keys()             Get the keys as a list
list             dict.values()           Get the values as a list
list-of-tuples   dict.items()            Get the key-value pairs as a list of pairs

Example. Starting from code:

code = {
    "UUU": "F",     # phenylalanine
    "UCU": "S",     # serine
    "UAU": "Y",     # tyrosine
    "UGU": "C",     # cysteine
    "UUC": "F",     # phenylalanine
    "UCC": "S",     # serine
    "UAC": "Y",     # tyrosine
    # ...
}

I can get the list of keys:

>>> keys = code.keys()
>>> print keys
["UUU", "UCU", "UAU", ...]

and the list of values:

>>> values = code.values()
>>> print values
["F", "S", "Y", "C", ...]

and the list of both keys and values:

>>> key_value_pairs = code.items()
>>> print key_value_pairs
[("UUU", "F"), ("UCU", "S"), ...]

Now that I have a bunch of lists, I can apply any of the list operations/methods to perform the tasks I need!
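For instance, here is a short sketch of a few list operations applied to the keys and values of the dictionary. The list(...) calls are an addition of mine: they are redundant in Python 2, but make the snippet behave the same way in Python 3, where keys() and values() no longer return plain lists.

```python
code = {
    "UUU": "F", "UCU": "S", "UAU": "Y", "UGU": "C",
    "UUC": "F", "UCC": "S", "UAC": "Y",
}

keys = list(code.keys())
values = list(code.values())

# Sort the codons alphabetically (list method)
keys.sort()
print(keys[0])                          # UAC

# Count how many codons encode serine (list method)
print(values.count("S"))                # 2

# Select the codons that start with "UU" (list comprehension)
uu_codons = [k for k in keys if k.startswith("UU")]
print(uu_codons)                        # ['UUC', 'UUU']
```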

Finally, to check whether a given object appears as a key, I can write:

>>> print code.has_key("UUU")
True
>>> print code.has_key(":-(")
False

Warning

Key-value pairs are stored in the dictionary in a seemingly arbitrary order.

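In other words, Python does not guarantee that the order in which the key-value pairs are inserted is preserved. (This applies to the Python 2 interpreter used in this chapter; starting with Python 3.7, dictionaries do preserve insertion order.)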

For instance:

>>> d = {}
>>> d["z"] = "zeta"
>>> d["a"] = "a"
>>> d
{'a': 'a', 'z': 'zeta'}
>>> d.keys()
['a', 'z']
>>> d.values()
['a', 'zeta']
>>> d.items()
[('a', 'a'), ('z', 'zeta')]

Here I inserted ("z", "zeta") first, and ("a", "a") second. However, the order in which they are stored in the dictionary is the exact opposite!
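If the order matters, a simple way out is to sort the keys (or the pairs) explicitly rather than relying on how the dictionary stores them. The following is a minimal sketch; the extra "m" entry is made up for the example.

```python
d = {}
d["z"] = "zeta"
d["a"] = "a"
d["m"] = "em"

# Visit the keys in alphabetical order, whatever the internal order is
for key in sorted(d.keys()):
    print(key + " -> " + d[key])

# Sorting the pairs gives a reproducible list, ordered by key
pairs = sorted(d.items())
print(pairs)        # [('a', 'a'), ('m', 'em'), ('z', 'zeta')]
```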


Example. We can use a dictionary to represent a complex structured object, for instance the properties of a protein chain:

chain = {
    "name": "1A3A",
    "chain": "B",
    "sequence": "MANLFKLGAENIFLGRKAATK...",
    "num_scop_domains": 4,
    "num_pfam_domains": 1,
}

print chain["name"]
print chain["sequence"]
print chain["num_scop_domains"]

Of course, writing a dictionary like this by hand is inconvenient. We will later see how such dictionaries can be created by reading the data automatically out of some biological database.

(Imagine having 100k such dictionaries to analyze!)


Example. Given the following FASTA sequence (cut down to a reasonable size) describing the primary sequence of the HIV-1 reverse transcriptase protein (taken from the PDB):

>2HMI:A|PDBID|CHAIN|SEQUENCE
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
NKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAF
QSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLL
VQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVRQLSKLLRGTKALT
PSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTE
WWTEYWQATWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRET
AIYLALQDSGLEVNIVTDSQYALGIIQAQPDKSESELVNQIIEQLIKKEK
>2HMI:B|PDBID|CHAIN|SEQUENCE
PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
NKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAF
QSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLL
VQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVRQLSKLLRGTKALT
PSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTE
WWTEYWQATWIPEWEFVNTPPLVKLWYQLE
>2HMI:C|PDBID|CHAIN|SEQUENCE
DIQMTQTTSSLSASLGDRVTISCSASQDISSYLNWYQQKPEGTVKLLIYY
EDFATYYCQQYSKFPWTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGA
NSWTDQDSKDSTYSMSSTLTLTADEYEAANSYTCAATHKTSTSPIVKSFN
>2HMI:D|PDBID|CHAIN|SEQUENCE
QITLKESGPGIVQPSQPFRLTCTFSGFSLSTSGIGVTWIRQPSGKGLEWL
FLNMMTVETADTAIYYCAQSAITSVTDSAMDHWGQGTSVTVSSAATTPPS
TVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSTWPSETVTCNVAHP
>2HMI:E|PDBID|CHAIN|SEQUENCE
ATGGCGCCCGAACAGGGAC
>2HMI:F|PDBID|CHAIN|SEQUENCE
GTCCCTGTTCGGGCGCCA

We’d like to compile a dictionary with the same information:

sequences_2HMI = {
    "A": "PISPIETVPVKLKPGMDGPKVKQWPLTEEKI...",
    "B": "PISPIETVPVKLKPGMDGPKVKQWPLTEEKI...",
    "C": "DIQMTQTTSSLSASLGDRVTISCSASQDISS...",
    "D": "QITLKESGPGIVQPSQPFRLTCTFSGFSLST...",
    "E": "ATGGCGCCCGAACAGGGAC",
    "F": "GTCCCTGTTCGGGCGCCA",
}

From this dictionary, it’s very easy to extract the sequence of every individual chain:

>>> print sequences_2HMI["F"]
"GTCCCTGTTCGGGCGCCA"

as well as to compute how many chains there are:

num_catene = len(sequences_2HMI)
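The dictionary above was typed in by hand. As a rough sketch of how it could be filled in from the FASTA text instead, the code below walks over the lines, starts a new entry at every header line and appends the sequence lines to the current chain. The fasta string is a shortened excerpt (chains E and F only, with chain E split over two lines on purpose), and the header parsing assumes the ">2HMI:X|PDBID|CHAIN|SEQUENCE" layout shown above.

```python
# Shortened excerpt in the same format as the FASTA record above
fasta = """>2HMI:E|PDBID|CHAIN|SEQUENCE
ATGGCGCCCG
AACAGGGAC
>2HMI:F|PDBID|CHAIN|SEQUENCE
GTCCCTGTTCGGGCGCCA"""

sequences = {}
chain = None
for line in fasta.split("\n"):
    line = line.strip()
    if line.startswith(">"):
        # Header ">2HMI:E|PDBID|..." -> chain identifier "E"
        chain = line[1:].split("|")[0].split(":")[1]
        sequences[chain] = ""
    elif line and chain is not None:
        # Sequence lines are appended to the current chain
        sequences[chain] = sequences[chain] + line

print(sequences["E"])           # ATGGCGCCCGAACAGGGAC
print(len(sequences))           # 2
```

Note how the assignment operator is used both to add a new key (at each header) and to replace the value of an existing key (at each sequence line).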

Example. Dictionaries can be used to describe histograms. For instance, let’s take a sequence:

seq = "GTCCCTGTTCGGGCGCCA"

We compute the number of the various nucleotides:

num_A = seq.count("A")                          # 1
num_T = seq.count("T")                          # 4
num_C = seq.count("C")                          # 7
num_G = seq.count("G")                          # 6

It is easy to write down a corresponding histogram into a dictionary:

histogram = {
    "A": float(num_A) / len(seq),               # 1 / 18 ~ 0.06
    "T": float(num_T) / len(seq),               # 4 / 18 ~ 0.22
    "C": float(num_C) / len(seq),               # 7 / 18 ~ 0.39
    "G": float(num_G) / len(seq),               # 6 / 18 ~ 0.33
}

Let’s say we are now interested in the proportion of adenine; we can write:

prop_A = histogram["A"]
print prop_A

We can also check whether the histogram represents a “true” multinomial distribution by checking whether the sum of the probabilities is (approximately) 1:

print histogram["A"] + histogram["T"] + histogram["C"] + histogram["G"]
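The same histogram can also be built without naming each nucleotide by hand. The sketch below is one possible way to do it, assuming a plain loop over the characters of the sequence: new keys are added to counts the first time a nucleotide is met, and existing keys are updated afterwards. The 1e-9 tolerance used in the final check is an arbitrary choice.

```python
seq = "GTCCCTGTTCGGGCGCCA"

# Count each nucleotide, adding new keys as they are first encountered
counts = {}
for nucleotide in seq:
    if nucleotide in counts:
        counts[nucleotide] = counts[nucleotide] + 1
    else:
        counts[nucleotide] = 1

# Turn the counts into proportions
histogram = {}
for nucleotide in counts:
    histogram[nucleotide] = float(counts[nucleotide]) / len(seq)

print(counts["C"])                      # 7

# The proportions should sum to (approximately) 1
total = sum(histogram.values())
print(abs(total - 1.0) < 1e-9)          # True
```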

Example. Dictionaries are also very useful for describing protein interaction networks (physical, functional, genomic, you name it):

partners_of = {
    "2JWD": ("1A3A",),
    "1A3A": ("2JWD", "3BLU", "2ZTI"),
    "2ZTI": ("1A3A", "3BLF"),
    "3BLU": ("1A3A", "3BLF"),
    "3BLF": ("3BLU", "2ZTI"),
}

which represents the following network:

2JWD ----- 1A3A ----- 2ZTI
             |          |
             |          |
           3BLU ----- 3BLF

Here partners_of["1A3A"] is a tuple with all of the binding partners of protein 1A3A.

We can use the dictionary to compute the n-step neighborhood of 1A3A:

# 1-step neighborhood of 1A3A
neighborhood_at_1 = partners_of["1A3A"]

# 2-step neighborhood of 1A3A (it may include repeated elements)
neighborhood_at_2 = \
    [q for p in neighborhood_at_1 for q in partners_of[p]]

# 3-step neighborhood of 1A3A (again, may include repeated elements)
neighborhood_at_3 = \
    [q for p in neighborhood_at_2 for q in partners_of[p]]
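The computation can be generalized to any number of steps. The following is a sketch under a few assumptions of mine: the helper function neighborhood is hypothetical, "n-step neighborhood" is taken to mean the proteins reachable in exactly n steps, and a set is used to discard repeated elements.

```python
partners_of = {
    "2JWD": ("1A3A",),
    "1A3A": ("2JWD", "3BLU", "2ZTI"),
    "2ZTI": ("1A3A", "3BLF"),
    "3BLU": ("1A3A", "3BLF"),
    "3BLF": ("3BLU", "2ZTI"),
}

def neighborhood(protein, num_steps):
    """Proteins reachable from `protein` in exactly `num_steps` steps."""
    current = set([protein])
    for step in range(num_steps):
        reached = set()
        for p in current:
            reached.update(partners_of[p])  # add all partners of p
        current = reached
    return current

print(sorted(neighborhood("1A3A", 1)))  # ['2JWD', '2ZTI', '3BLU']
print(sorted(neighborhood("1A3A", 2)))  # ['1A3A', '3BLF']
```

Note that 1A3A itself appears in its own 2-step neighborhood, since I can walk to a partner and back.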

Note that the very same idea can be used to encode social networks (Facebook, Twitter, Google+), to see who-is-friends-with-who, and to discover communities in said networks.

Note

Dictionaries are just one of the many ways to encode a network (or graph). An alternative is to use an adjacency matrix, which can be implemented (as usual) as a list of lists.
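As an illustration, the partners_of dictionary can be converted into an adjacency matrix as sketched below. The alphabetical ordering of the proteins and the helper dictionary index_of are choices made for this example; any fixed ordering would do.

```python
partners_of = {
    "2JWD": ("1A3A",),
    "1A3A": ("2JWD", "3BLU", "2ZTI"),
    "2ZTI": ("1A3A", "3BLF"),
    "3BLU": ("1A3A", "3BLF"),
    "3BLF": ("3BLU", "2ZTI"),
}

# Fix an order for the proteins: row/column i corresponds to proteins[i]
proteins = sorted(partners_of.keys())
index_of = {}
for i in range(len(proteins)):
    index_of[proteins[i]] = i

# Start from a matrix of zeros, then mark every interaction with a 1
n = len(proteins)
matrix = [[0] * n for _ in range(n)]
for protein in partners_of:
    for partner in partners_of[protein]:
        matrix[index_of[protein]][index_of[partner]] = 1

print(proteins)     # ['1A3A', '2JWD', '2ZTI', '3BLF', '3BLU']
for row in matrix:
    print(row)
```

Since the interactions are symmetric, so is the matrix: matrix[i][j] equals matrix[j][i] for every pair of proteins.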