====================
Python: Dictionaries
====================

A dictionary represents a *map* between objects: it maps from a *key* to the
corresponding *value*.

.. warning::

    Just like lists, dictionaries are **mutable**!

The abstract syntax for defining a dictionary is::

    { key1: value1, key2: value2, ... }

|

****

**Example**. In order to define a dictionary implementing the *genetic code*,
we write::

    genetic_code = {
        "UUU": "F",     # phenylalanine
        "UCU": "S",     # serine
        "UAU": "Y",     # tyrosine
        "UGU": "C",     # cysteine
        "UUC": "F",     # phenylalanine
        "UCC": "S",     # serine
        "UAC": "Y",     # tyrosine
        # etc.
    }

Here ``genetic_code`` maps from three-letter nucleotide strings (the keys) to
the corresponding one-letter amino acid codes (the values).

To use a dictionary, we resort to the usual *extraction* operator, as follows::

    >>> aa = genetic_code["UUU"]
    >>> print aa
    F
    >>> aa = genetic_code["UCU"]
    >>> print aa
    S

I can use the ``genetic_code`` dictionary to "simulate" the process of
translation and convert an RNA sequence into the corresponding amino acid
sequence. For instance, starting from the following mRNA sequence::

    rna = "UUUUCUUAUUGUUUCUCC"

I can split it into triples::

    >>> triples = [rna[i:i+3] for i in range(0, len(rna), 3)]
    >>> print triples
    ['UUU', 'UCU', 'UAU', 'UGU', 'UUC', 'UCC']

At this point, I can translate the triples with::

    >>> aminoacids = [genetic_code[triple] for triple in triples]
    >>> protein = "".join(aminoacids)
    >>> print protein
    FSYCFS

Of course, this is a very simple functional model of translation. The most
obvious difference from the real process is that the above Python code does
*not* handle termination codons. Support for them will be added in due time.

.. warning::

    Keys are unique: the same key can not be used more than once.

    Values are *not* unique: different keys can map to the same value.
    In the genetic code example, each key is a unique three-letter string,
    which is associated with a given value; the same value (for instance
    ``"S"``, serine) is associated with multiple keys::

        print genetic_code["UCU"]           # "S", serine
        print genetic_code["UCC"]           # "S", serine

****

**Example**. Let's build a dictionary that maps from amino acids to their
(approximate) volume in cubic ångströms::

    volume_of = {
        "A":  67.0, "C":  86.0, "D":  91.0,
        "E": 109.0, "F": 135.0, "G":  48.0,
        "H": 118.0, "I": 124.0, "K": 135.0,
        "L": 124.0, "M": 124.0, "N":  96.0,
        "P":  90.0, "Q": 114.0, "R": 148.0,
        "S":  73.0, "T":  93.0, "V": 105.0,
        "W": 163.0, "Y": 141.0,
    }

    # Print the volume of a cysteine
    print volume_of["C"]                    # 86.0
    print type(volume_of["C"])              # float

Here the keys are strings and the values are floats.

****

|

.. warning::

    There are no restrictions on the kinds of objects that can appear as
    *values*.

    The *keys*, however, **must** be immutable objects. This means that
    ``list`` and ``dict`` objects can not be used as keys.

    More formally, here is what object types you can use where:

    ===== =========== ===========
    Type  As keys     As values
    ===== =========== ===========
    bool  ✓           ✓
    int   ✓           ✓
    float ✓           ✓
    str   ✓           ✓
    list  **NO**      ✓
    tuple ✓           ✓
    dict  **NO**      ✓
    ===== =========== ===========

    A ``tuple`` can be used as a key only if all of its elements are
    themselves immutable.

    This restriction is due to how dictionaries are implemented (as hash
    tables) in Python (and in most other programming languages, really).

****

**Example**. Let's create a dictionary that maps from amino acids to a list of
two properties, mass and volume::

    properties_of = {
        "A": [ 89.09,  67.0],
        "C": [121.15,  86.0],
        "D": [133.10,  91.0],
        "E": [147.13, 109.0],
        # ...
    }

    # Print the properties of alanine
    print properties_of["A"]                # [89.09, 67.0]
    print type(properties_of["A"])          # list

Here the keys are ``str`` (immutable) and the values are ``list`` (mutable).

Can I create the *inverse* dictionary (from property list to amino acids)? No.
Let's write the very first key-value pair::

    aa_of = { [89.09, 67.0]: "A" }

Ideally, this dictionary would map from the list::

    [89.09, 67.0]

to ``"A"``. However, Python raises an error::

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    TypeError: unhashable type: 'list'
    #          ^^^^^^^^^^^^^^^^^^^^^^^
    #          read: list is mutable!

To solve the problem, I can use tuples in place of lists::

    aa_of = {
        ( 89.09,  67.0): "A",
        (121.15,  86.0): "C",
        (133.10,  91.0): "D",
        (147.13, 109.0): "E",
        # ...
    }

    aa = aa_of[(133.10, 91.0)]
    print aa                                # "D"
    print type(aa)                          # str

Now that the keys are immutable, everything works.

****

|

Operations
----------

=========== ======================= ================================================
Returns     Operator                Meaning
=========== ======================= ================================================
``int``     ``len(dict)``           Returns the number of key-value pairs
``object``  ``dict[object]``        Extracts the value associated with a key
--          ``dict[object]=object`` Inserts or replaces a key-value pair
=========== ======================= ================================================

The only major behavioral difference with respect to lists lies in the
assignment operator (the last one). The syntax is the same as for lists, and
so is the meaning; however, for dictionaries it can also be used to add
entirely *new* key-value pairs. Let's see a few examples.

****

**Example**. Starting from an empty dictionary::

    code = {}

    print code
    print len(code)

I want to build (this time, incrementally) the genetic code dictionary I
introduced in the very first example of this chapter. Let's add the key-value
pairs one by one with the assignment operator::

    code["UUU"] = "F"                       # phenylalanine
    code["UCU"] = "M"                       # methionine
    code["UAU"] = "Y"                       # tyrosine
    # ...

    print code
    print len(code)

Here I am adding *new* key-value pairs to a dictionary.

Whoops, I made a mistake! ``"UCU"`` should map to an ``"S"``, not to an
``"M"``!
I can solve the problem by *replacing* the value associated with the key
``"UCU"``, using the same syntax as above::

    code["UCU"] = "S"                       # serine

    print code
    print len(code)

So, if the *key* was already there, the assignment operator simply replaces
the *value* it is associated with. Here the key ``"UCU"``, which was already
in the dictionary, gets the new value ``"S"`` (serine).

****

.. warning::

    It does not make sense, however, to *extract* values associated with keys
    not present in the dictionary. For instance::

        >>> code[":-("]

    makes Python raise an error::

        Traceback (most recent call last):
          File "<stdin>", line 1, in <module>
        KeyError: ':-('
        #         ^^^^^
        #         the dictionary contains no such key

|

Methods
-------

=================== =========================== =============================================
Returns             Method                      Meaning
=================== =========================== =============================================
``bool``            ``dict.has_key(object)``    ``True`` if the object is a key of the dict
``list``            ``dict.keys()``             Get the keys as a list
``list``            ``dict.values()``           Get the values as a list
``list-of-tuples``  ``dict.items()``            Get the key-value pairs as a list of pairs
=================== =========================== =============================================

****

**Example**. Starting from ``code``::

    code = {
        "UUU": "F",     # phenylalanine
        "UCU": "S",     # serine
        "UAU": "Y",     # tyrosine
        "UGU": "C",     # cysteine
        "UUC": "F",     # phenylalanine
        "UCC": "S",     # serine
        "UAC": "Y",     # tyrosine
        # ...
    }

I can get the list of keys::

    >>> keys = code.keys()
    >>> print keys
    ["UUU", "UCU", "UAU", ...]

and the list of values::

    >>> values = code.values()
    >>> print values
    ["F", "S", "Y", "C", ...]

and the list of both keys and values::

    >>> key_value_pairs = code.items()
    >>> print key_value_pairs
    [("UUU", "F"), ("UCU", "S"), ...]

Now that I have a bunch of lists, I can apply any of the list
operations/methods to perform the tasks I need!
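As a minimal sketch of what "applying list operations" can look like, the
snippet below sorts the codons, counts the codons of one amino acid, and
filters the key-value pairs. The three tasks are illustrative choices, not
part of the original example, and ``code`` is restricted to the seven codons
listed above. The sketch uses ``print(...)`` with parentheses and wraps the
results in ``list(...)`` so that it should run under both Python 2 and
Python 3.

```python
# Assumed sample data: the seven codons listed in the example above.
code = {
    "UUU": "F", "UCU": "S", "UAU": "Y", "UGU": "C",
    "UUC": "F", "UCC": "S", "UAC": "Y",
}

# list(...) leaves a Python 2 list unchanged and converts a Python 3 view
keys = list(code.keys())
values = list(code.values())

# List operation: sort the codons alphabetically
sorted_codons = sorted(keys)

# List method: count how many codons map to serine
num_serine_codons = values.count("S")

# Filter the key-value pairs: which codons encode phenylalanine?
phe_codons = sorted([codon for codon, aa in code.items() if aa == "F"])

print(sorted_codons)
print(num_serine_codons)    # 2
print(phe_codons)           # ['UUC', 'UUU']
```

Sorting the keys explicitly is a deliberate choice here: it makes the result
independent of the order in which the dictionary happens to store its pairs.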
Finally, to check whether a given object appears as a key, I can write::

    >>> print code.has_key("UUU")
    True
    >>> print code.has_key(":-(")
    False

The ``in`` operator performs the same test (``"UUU" in code``); it is the only
form available in Python 3, where ``has_key`` was removed.

****

.. warning::

    Key-value pairs are stored in the dictionary in a seemingly arbitrary
    order. In other words, Python does **not** guarantee that the order in
    which the key-value pairs are inserted is preserved. (This is the
    behavior of Python 2, which is used throughout this chapter; since
    Python 3.7, dictionaries do preserve insertion order.)

    For instance::

        >>> d = {}
        >>> d["z"] = "zeta"
        >>> d["a"] = "a"
        >>> d
        {'a': 'a', 'z': 'zeta'}
        >>> d.keys()
        ['a', 'z']
        >>> d.values()
        ['a', 'zeta']
        >>> d.items()
        [('a', 'a'), ('z', 'zeta')]

    Here I inserted ``("z", "zeta")`` first, and ``("a", "a")`` second.
    However, the order in which they are *stored* in the dictionary is the
    exact opposite!

****

**Example**. We can use a dictionary to represent a complex structured
object, for instance the properties of a protein chain::

    chain = {
        "name": "1A3A",
        "chain": "B",
        "sequence": "MANLFKLGAENIFLGRKAATK...",
        "num_scop_domains": 4,
        "num_pfam_domains": 1,
    }

    print chain["name"]
    print chain["sequence"]
    print chain["num_scop_domains"]

Of course, writing a dictionary like this by hand is inconvenient. We will
later see how such dictionaries can be created by reading the data
automatically out of some biological database. (Imagine having 100k such
dictionaries to analyze!)

****

**Example**.
Given the following FASTA file (cut down to a reasonable size) describing
the primary sequence of the HIV-1 reverse transcriptase protein (taken from
the PDB)::

    >2HMI:A|PDBID|CHAIN|SEQUENCE
    PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
    NKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAF
    QSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLL
    VQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVRQLSKLLRGTKALT
    PSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTE
    WWTEYWQATWIPEWEFVNTPPLVKLWYQLEKEPIVGAETFYVDGAANRET
    AIYLALQDSGLEVNIVTDSQYALGIIQAQPDKSESELVNQIIEQLIKKEK
    >2HMI:B|PDBID|CHAIN|SEQUENCE
    PISPIETVPVKLKPGMDGPKVKQWPLTEEKIKALVEICTEMEKEGKISKI
    NKRTQDFWEVQLGIPHPAGLKKKKSVTVLDVGDAYFSVPLDEDFRKYTAF
    QSSMTKILEPFKKQNPDIVIYQYMDDLYVGSDLEIGQHRTKIEELRQHLL
    VQPIVLPEKDSWTVNDIQKLVGKLNWASQIYPGIKVRQLSKLLRGTKALT
    PSKDLIAEIQKQGQGQWTYQIYQEPFKNLKTGKYARMRGAHTNDVKQLTE
    WWTEYWQATWIPEWEFVNTPPLVKLWYQLE
    >2HMI:C|PDBID|CHAIN|SEQUENCE
    DIQMTQTTSSLSASLGDRVTISCSASQDISSYLNWYQQKPEGTVKLLIYY
    EDFATYYCQQYSKFPWTFGGGTKLEIKRADAAPTVSIFPPSSEQLTSGGA
    NSWTDQDSKDSTYSMSSTLTLTADEYEAANSYTCAATHKTSTSPIVKSFN
    >2HMI:D|PDBID|CHAIN|SEQUENCE
    QITLKESGPGIVQPSQPFRLTCTFSGFSLSTSGIGVTWIRQPSGKGLEWL
    FLNMMTVETADTAIYYCAQSAITSVTDSAMDHWGQGTSVTVSSAATTPPS
    TVTWNSGSLSSGVHTFPAVLQSDLYTLSSSVTVPSSTWPSETVTCNVAHP
    >2HMI:E|PDBID|CHAIN|SEQUENCE
    ATGGCGCCCGAACAGGGAC
    >2HMI:F|PDBID|CHAIN|SEQUENCE
    GTCCCTGTTCGGGCGCCA

We'd like to compile a dictionary with the same information::

    sequences_2HMI = {
        "A": "PISPIETVPVKLKPGMDGPKVKQWPLTEEKI...",
        "B": "PISPIETVPVKLKPGMDGPKVKQWPLTEEKI...",
        "C": "DIQMTQTTSSLSASLGDRVTISCSASQDISS...",
        "D": "QITLKESGPGIVQPSQPFRLTCTFSGFSLST...",
        "E": "ATGGCGCCCGAACAGGGAC",
        "F": "GTCCCTGTTCGGGCGCCA",
    }

From this dictionary, it's very easy to extract the sequence of any
individual chain::

    >>> print sequences_2HMI["F"]
    GTCCCTGTTCGGGCGCCA

as well as to compute how many chains there are::

    num_catene = len(sequences_2HMI)        # number of chains

****

**Example**. Dictionaries can be used to describe *histograms*.
For instance, let's take a sequence::

    seq = "GTCCCTGTTCGGGCGCCA"

We compute the number of occurrences of each nucleotide::

    num_A = seq.count("A")                  # 1
    num_T = seq.count("T")                  # 4
    num_C = seq.count("C")                  # 7
    num_G = seq.count("G")                  # 6

It is easy to write down the corresponding histogram as a dictionary::

    histogram = {
        "A": float(num_A) / len(seq),       # 1 / 18 ~ 0.06
        "T": float(num_T) / len(seq),       # 4 / 18 ~ 0.22
        "C": float(num_C) / len(seq),       # 7 / 18 ~ 0.39
        "G": float(num_G) / len(seq),       # 6 / 18 ~ 0.33
    }

Let's say we are now interested in the proportion of adenine; we can write::

    prop_A = histogram["A"]
    print prop_A

We can also check whether the histogram represents a "true" multinomial
distribution by checking whether the sum of the probabilities is
(approximately) 1::

    print histogram["A"] + histogram["T"] + histogram["C"] + histogram["G"]

****

**Example**. Dictionaries are also very useful for describing protein
interaction networks (physical, functional, genomic, you name it)::

    partners_of = {
        "2JWD": ("1A3A",),
        "1A3A": ("2JWD", "3BLU", "2ZTI"),
        "2ZTI": ("1A3A", "3BLF"),
        "3BLU": ("1A3A", "3BLF"),
        "3BLF": ("3BLU", "2ZTI"),
    }

which represents the following network::

    2JWD ----- 1A3A ----- 2ZTI
                |          |
                |          |
               3BLU ----- 3BLF

Here ``partners_of["1A3A"]`` is a tuple with all of the binding partners of
protein 1A3A.

We can use the dictionary to compute the n-step neighborhood of 1A3A::

    # 1-step neighborhood of 1A3A
    neighborhood_at_1 = partners_of["1A3A"]

    # 2-step neighborhood of 1A3A (it may include repeated elements)
    neighborhood_at_2 = \
        [q for p in neighborhood_at_1 for q in partners_of[p]]

    # 3-step neighborhood of 1A3A (again, may include repeated elements)
    neighborhood_at_3 = \
        [q for p in neighborhood_at_2 for q in partners_of[p]]

Note that the very same idea can be used to encode *social networks*
(Facebook, Twitter, Google+), to see who is friends with whom, and to
discover communities in said networks.

.. note::

    Dictionaries are just one of the many ways to encode a network (or graph).
    An alternative is to use an adjacency matrix, which can be implemented
    (as usual) as a list of lists.
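To make the adjacency-matrix alternative concrete, here is a minimal sketch
that converts the ``partners_of`` dictionary from the example above into a
list of lists. The names ``proteins``, ``index_of`` and ``adjacency`` are
hypothetical helpers introduced only for this illustration, and sorting the
protein names is just one possible way to fix the row order. The sketch uses
``print(...)`` with parentheses so that it should run under both Python 2
and Python 3.

```python
# The interaction network from the example above
partners_of = {
    "2JWD": ("1A3A",),
    "1A3A": ("2JWD", "3BLU", "2ZTI"),
    "2ZTI": ("1A3A", "3BLF"),
    "3BLU": ("1A3A", "3BLF"),
    "3BLF": ("3BLU", "2ZTI"),
}

# Fix an ordering of the proteins (assumption: alphabetical order)
proteins = sorted(partners_of.keys())

# index_of maps each protein name to its row/column in the matrix
index_of = {}
for i, name in enumerate(proteins):
    index_of[name] = i

# Start from an all-zero square matrix, one row per protein
n = len(proteins)
adjacency = [[0] * n for _ in range(n)]

# Put a 1 wherever two proteins interact
for protein, partners in partners_of.items():
    for partner in partners:
        adjacency[index_of[protein]][index_of[partner]] = 1

# Print one row per protein, labelled with the protein name
for name, row in zip(proteins, adjacency):
    print(name + " " + str(row))
```

Because every interaction in ``partners_of`` is listed from both sides, the
resulting matrix is symmetric. The dictionary only stores the partners that
actually exist, while the matrix stores an entry for every pair of proteins,
so the matrix takes more space when the network has few edges.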