=============
Python: Lists
=============

Lists are ordered sequences of arbitrary elements (objects).

Lists are defined using **square** brackets, as follows::

    # A list of integers (notice that the 1 appears twice)
    integers = [1, 2, 3, 1]

    # A list of strings
    uniprot_proteins = ["Y08501", "Q95747"]

    # A list of heterogeneous objects
    things = ["Y08501", 0.13, "Q95747", 0.96]

    # A list of lists
    two_level_list = [
        ["Y08501", 120, 520],
        ["Q95747", 550, 920],
    ]

    # An empty list
    empty = []

    # A list containing two empty lists
    a_weird_list = [ [], [] ]

Operations
----------

======== ====================== ====================================================
Returns  Operator               Meaning
======== ====================== ====================================================
bool     ==                     Check whether two lists are identical
bool     !=                     Check whether two lists are different
int      len(list)              Compute the length of a list
list     list + list            Concatenate two lists (returns a new list)
list     list * int             Replicate a list multiple times
bool     element in list        Check whether an element appears in the list
list     list[int:int]          Extract a sub-list
list     list[int] = object     Assign a new value to an element
list     range(int, [int])      Compute the integers in a given range
======== ====================== ====================================================

Lists offer almost the same operators as strings, with a couple of additions.

****

**Example**. ``range()`` returns a list of integers in a given range::

    >>> numbers = range(5)
    >>> print numbers
    [0, 1, 2, 3, 4]

    >>> numbers = range(0, 5)
    >>> print numbers
    [0, 1, 2, 3, 4]

    >>> numbers = range(2, 4)
    >>> print numbers
    [2, 3]

    >>> numbers = range(4, 2)     # the range is backwards!
    >>> print numbers
    []

****

**Example**.
Just like with strings, you can extract an element or a range of elements
from a list::

    >>> numbers = range(10)

    >>> first_element = numbers[0]
    >>> print first_element
    0

    >>> last_element = numbers[-1]
    >>> print last_element
    9

    >>> the_other_elements = numbers[1:-1]
    >>> print the_other_elements
    [1, 2, 3, 4, 5, 6, 7, 8]

****

**Example**. The assignment operator assigns a new value to an *already
existing* list element. It is more or less the opposite of the extraction
operator.

A couple of examples::

    >>> numbers = range(5)
    >>> print numbers
    [0, 1, 2, 3, 4]

    >>> numbers[0] = "first"
    >>> print numbers
    ["first", 1, 2, 3, 4]

    >>> numbers[-1] = "last"
    >>> print numbers
    ["first", 1, 2, 3, "last"]

    >>> numbers[len(numbers)/2] = "middle"
    >>> print numbers
    ["first", 1, "middle", 3, "last"]

If the index is out-of-bounds, the assignment raises an error::

    >>> numbers = range(5)
    >>> numbers[100] = "out-of-bounds"
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    IndexError: list assignment index out of range

****

.. warning::

    The assignment operator does not change the length of the list! It
    modifies an *existing* element (at a given position); it does **not**
    add a new element.

.. warning::

    Lists are ordered: the order of the elements matters::

        [1, 2, 3] != [3, 2, 1]

    Lists are not sets: objects may appear more than once::

        [3, 3, "a", "a"] != [3, "a"]

|

Exercises
---------

#. Create an empty list using the bracket notation. Check whether it is
   really empty using ``len()``.

#. Create a list with the first five non-negative integers using
   ``range()``.

#. Create a list with one hundred ``0`` elements.

   *Hint*: note the replication operator.

#. Given::

       list_1 = range(10)
       list_2 = range(10, 20)

   concatenate the two lists, and assign the result to a new variable
   ``full_list``. Use the equality comparison operator ``==`` to check
   whether it matches the result of ``range(20)``.

#. Create a list of three strings: ``"I am"``, ``"a"``, ``"list"``. Then
   print the type and length *of the three elements* (manually, one by one).

#. Given the list::

       list = [0.0, "b", [3], [4, 5]]

   #. How long is it?

   #. What is the type of the first element of ``list``?

   #. How long is the second element of ``list``?

   #. How long is the third element of ``list``?

   #. What is the value of the last element of ``list``? How long is it?

   #. Does the list contain an element ``"b"``?

   #. Does the list contain an element ``4``?

#. What is the difference between these "lists"?::

       list_1 = [1, 2, 3]
       list_2 = ["1", "2", "3"]
       list_3 = "[1, 2, 3]"

   *Hint*: is the third one actually a list?

#. Which of the following code fragments are wrong?

   #. ``list = []``

   #. ``list = [}``

   #. ``list = [[]]``

   #. ``list = [1 2 3]``

   #. ``list = range(3)``, ``element = list[3]``

   #. ``list = range(3)``, ``element = list[-1]``

   #. ``list = range(3)``, ``sublist = list[0:2]``

   #. ``list = range(3)``, ``sublist = list[0:3]``

   #. ``list = range(3)``, ``sublist = list[0:-1]``

   #. ``list = range(3)``, ``list[2] = "two"``

   #. ``list = range(3)``, ``list[3] = "three"``

   #. ``list = range(3)``, ``list[-1] = "three"``

   #. ``list = range(3)``, ``list[1.2] = "one point two"``

   #. ``list = range(3)``, ``list[1] = ["protein1", "protein2"]``

#. Given the list::

       matrix = [
           [1, 2, 3],          # <-- 1st row
           [4, 5, 6],          # <-- 2nd row
           [7, 8, 9],          # <-- 3rd row
       ]
       #    ^  ^  ^
       #    |  |  |
       #    |  |  +-- 3rd column
       #    |  +----- 2nd column
       #    +-------- 1st column

   How can I:

   #. Extract the first row?

   #. Extract the second element of the first row?

   #. Sum all elements of the first row? (Perform the sum manually.)

   #. Create a new list containing the elements of the second column?
      (Manually.)

   #. Create a new list containing the elements of the diagonal? (Manually.)

   #. Create a list by concatenating the first, second, and third rows?

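
As a recap of the operators in the table of this section, here is a minimal
sketch. It is not part of the original examples: ``range()`` is wrapped in
``list()`` and ``print`` is called with parentheses only so that the same code
also runs under Python 3 (an assumption about the reader's setup), and the
names ``longer``, ``repeated`` and ``second_protein`` are illustrative.

```python
# Recap of the list operators (sketch; variable names are illustrative).
# Assumption: list(range(...)) is used so this also runs on Python 3,
# where range() no longer returns a list.
numbers = list(range(5))            # [0, 1, 2, 3, 4]

# Concatenation and replication build NEW lists; numbers is left untouched.
longer = numbers + [5, 6]
repeated = ["NA"] * 3

# Assignment replaces an existing element; the length stays the same.
length_before = len(numbers)
numbers[0] = "first"
length_after = len(numbers)

# Indexing twice reaches inside a list of lists.
two_level_list = [
    ["Y08501", 120, 520],
    ["Q95747", 550, 920],
]
second_protein = two_level_list[1][0]

print(longer)
print(repeated)
print(length_before == length_after)
print(3 in numbers)
print(second_protein)
```

Note that ``longer`` still starts with ``0``: it was built before the
assignment, and concatenation copied the elements into a new list.
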

Methods
-------

======== =========================== ===================================================
Returns  Method                      Meaning
======== =========================== ===================================================
None     list.append(object)         Add a **new** element at the end of the list
None     list.extend(list)           Add several **new** elements at the end of the list
None     list.insert(int,object)     Add a **new** element at some given position
None     list.remove(object)         Remove the first occurrence of an element
None     list.reverse()              Invert the order of the elements
None     list.sort()                 Sort the elements
int      list.count(object)          Count the occurrences of an element
======== =========================== ===================================================

.. warning::

    All list methods (except ``count()``):

    - Modify the input list
    - Do not have a return value (they return ``None``)

    In other words, they behave in exactly the *opposite* way to string
    methods!

As a consequence, if I do::

    list = range(10)
    print list

    result = list.append(10)
    print list
    print result

the list is modified in the process and ``result`` will be ``None``. The same
is true for all other methods (except ``count()``).

This may look a bit surprising, and it is easy to mistakenly expect
``append()`` or ``reverse()`` to return a list. They do not.

By the way, this is the reason why we cannot write code like::

    list = []
    list.append(1).append(2).append(3)

The first ``append()`` does **not** return a list: it returns ``None``, which
has no ``append()`` method, so the second ``append()`` always fails and
Python raises an error.

|

****

**Example**. ``append()`` adds a **new** element at the end of a list::

    list = range(10)
    print list                      # [0, 1, 2, ..., 9]
    print len(list)                 # 10

    list.append(10)
    print list                      # [0, 1, 2, ..., 9, 10]
    print len(list)                 # 11

    list.append(11)
    print list                      # [0, 1, 2, ..., 9, 10, 11]
    print len(list)                 # 12

Note how ``list`` changes in the process.

****

**Example**.
``extend()`` adds several **new** elements at the end of a list::

    list = range(10)
    list.extend(range(10,20))
    print list                      # [0, 1, 2, ..., 19]
    print len(list)                 # 20

****

**Example**. ``insert()`` adds a **new** element at an arbitrary position::

    list = range(10)
    print list                      # [0, 1, ..., 9]
    print len(list)                 # 10

    list.insert(2, "marker")
    print list                      # [0, 1, "marker", 2, 3, ..., 9]
    print len(list)                 # 11

****

**Example**. Contrary to ``append()``, ``insert()`` and ``extend()``, list
concatenation does **not** modify the original list. You get a **new** list
instead::

    list_1 = range(0, 10)
    list_2 = range(10, 20)

    # let's use the concatenation operator
    full_list = list_1 + list_2
    print list_1, "+", list_2, "->", full_list

    # now let's use extend() instead
    full_list = list_1.extend(list_2)
    print list_1
    print list_2
    print full_list

Note how with ``extend()``, ``list_1`` changes while ``full_list`` is
``None``, as expected.

****

**Example**. ``remove()`` removes the first occurrence of a given value::

    list = ["a", "list", "not", "a", "string"]
    #       ^^^                 ^^^

    list.remove("a")
    print list                      # ["list", "not", "a", "string"]

****

**Example**. ``sort()`` and ``reverse()`` reorder the elements in the list::

    list = [3, 2, 1, 5, 4]

    list.reverse()
    print list                      # [4, 5, 1, 2, 3]

    list.sort()
    print list                      # [1, 2, 3, 4, 5]

Both also work with lists of strings (``sort()`` defaults to lexicographic
ordering). Check this out::

    list = ["AC", "GT"]
    print list

    list.reverse()
    print list                      # ["GT", "AC"]

    list.sort()
    print list                      # ["AC", "GT"]

****

**Example**. ``count()`` returns the number of occurrences of a value in a
list::

    list = ["a", "c", "g", "t", "a"]
    num_a = list.count("a")
    num_g = list.count("g")
    print num_a, num_g              # 2 1

Of course, ``count()`` does not modify the original list.

****

|

.. warning::

    A technical note. Recall that lists are *mutable*, and that (like all
    variables) they contain *references* to objects, not the objects
    themselves.

So what? Here are a few examples of the consequences of this observation.

****

**Example**.
This code::

    sublist = range(5)
    list = [sublist]
    print list

creates a new list ``list`` that "contains" another list ``sublist``. When I
modify ``sublist`` (which is mutable), I end up modifying ``list`` as well::

    sublist.append(5)
    print sublist
    print list

****

**Example**. Another interesting case::

    list = range(5)
    print list

    not_a_copy = list
    print not_a_copy

    # here both list and not_a_copy end up referring to the same
    # object, so if I change list, I also end up changing
    # not_a_copy!

    list.append(5)
    print list
    print not_a_copy

In order to create a real, independent copy of a list, I have to use the
extraction operator (or a list comprehension, as we will see), as follows::

    list = range(5)
    print list

    real_copy = list[:]
    # or: real_copy = [elem for elem in list]
    print real_copy

    list.append(5)
    print list
    print real_copy

****

|

Exercises
---------

#. Create a new empty list ``list``. Then add an integer, a string, and
   another list.

#. Starting from the list ``list = range(3)`` (reset it after every bullet
   point!), what happens when I do:

   #. ``list.append(3)``

   #. ``list.append([3])``

   #. ``list.extend([3])``

   #. ``list.extend(3)``

   #. ``list.insert(0, 3)``

   #. ``list.insert(3, 3)``

   #. ``list.insert(3, [3])``

   #. ``list.insert([3], 3)``

#. What is the difference between::

       list = []
       list.append(range(10))
       list.append(range(10, 20))

   and::

       list = []
       list.extend(range(10))
       list.extend(range(10, 20))

   How long is ``list`` in the two cases?

#. What does this code do?::

       list = [0, 0, 0, 0]
       list.remove(0)

#. What does this code do?::

       list = [1, 2, 3, 4, 5]
       list.reverse()
       list.sort()

   Is it equivalent to the following code?::

       list = [1, 2, 3, 4, 5]
       list.reverse().sort()

#. Given the list::

       list = range(10)

   create a new list ``reversed_list`` with the same elements as ``list``,
   but in reversed order, using ``reverse()``. ``list`` must not be changed
   in the process.

#. Given the list ``motifs``::

       motifs = [
           "KSYK",
           "SVALVV",
           "GVTGI",
           "VGSSLAEVLKLPD",
       ]

   create a new list ``sorted_motifs`` with the same elements as ``motifs``,
   but sorted alphanumerically using ``sort()``. ``motifs`` must not be
   changed in the process.

|

String-List Methods
-------------------

=============== =========================== =================================================
Returns         Method                      Meaning
=============== =========================== =================================================
list-of-str     str.split(str)              Split a string into a list of strings (words)
str             str.join(list-of-str)       Join a list of strings (words) into a string
=============== =========================== =================================================

****

**Example**. We have a multi-line string taken from a PDB structure file::

    structure_chain_a = """SER A 96 77.253 20.522 75.007
    VAL A 97 76.066 22.304 71.921
    PRO A 98 77.731 23.371 68.681
    SER A 99 80.136 26.246 68.973
    GLN A 100 79.039 29.534 67.364
    LYS A 101 81.787 32.022 68.157"""

Let's split the multi-line string into a list of single-line strings, for
ease of processing::

    lines = structure_chain_a.split("\n")
    print lines[0]
    print lines[1]
    # ...

Now we can extract all the words of, say, the second line::

    words = lines[1].split()
    print words

It is now pretty easy to extract the coordinates of the residue::

    coords = words[-3:]
    print coords

****

**Example**. ``join()`` is essentially the inverse operation::

    list_of_strings = [
        ">1A3A:A|PDBID|CHAIN|SEQUENCE",
        "MANLFKLGAENIFLGRKAATKEEAIRFA",
    ]

    multiline_string = "\n".join(list_of_strings)
    print multiline_string

It is not the most exciting method, but it is handy for printing nicely
formatted text.

.. warning::

    Note that ``join()`` takes a list of strings! This won't work::

        " ".join([1, 2, 3])

|

Exercises
---------

#. Given the text::

       text = """The Wellcome Trust Sanger Institute is a world leader in genome research."""

   create a list of strings that includes all of the words (i.e. substrings
   separated by spaces) of ``text``. Then print how many words there are.

#. The table below::

       tabella = [
           "protein | database | domain | start | end",
           "YNL275W | Pfam | PF00955 | 236 | 498",
           "YHR065C | SMART | SM00490 | 335 | 416",
           "YKL053C-A | Pfam | PF05254 | 5 | 72",
           "YOR349W | PANTHER | 353 | 414",
       ]

   (taken from the Saccharomyces Genome Database) lists *domains* that have
   been identified in yeast proteins. Each row except the first (the header)
   is a single domain instance.

   Use ``split()`` to obtain the list of titles of the various columns (from
   the first row), making sure that the column names contain no spurious
   space characters.

   *Hint*: ``strip()`` is not necessary; it is sufficient to use ``split()``
   correctly.

#. Given the list of strings::

       words = ["word_1", "word_2", "word_3"]

   build, using ``join()`` together with an appropriate delimiter, the
   following strings:

   #. ``"word_1 word_2 word_3"``

   #. ``"word_1,word_2,word_3"``

   #. ``"word_1 e word_2 e word_3"``

   #. ``"word_1word_2word_3"``

   #. ``r"word_1\word_2\word_3"``

#. Given the list of strings::

       random_sentences = [
           "Taci. Su le soglie",
           "del bosco non odo",
           "parole che dici",
           "umane; ma odo",
           "parole piu' nuove",
           "che parlano gocciole e foglie",
           "lontane."
       ]

   use ``join()`` to create a new multi-line string ``poem``. The expected
   result is::

       >>> print poem
       Taci. Su le soglie
       del bosco non odo
       parole che dici
       umane; ma odo
       parole piu' nuove
       che parlano gocciole e foglie
       lontane.

   *Hint*: what delimiter should I use?

|

List Comprehension
------------------

The *list comprehension* operator lets you **filter** or **transform** a
list.

.. warning::

    The original list is left unchanged. A new list is created instead.

**As a filter**. Given an arbitrary list ``original``, I can create a **new**
list that only contains those elements of ``original`` that satisfy a given
condition.

The abstract syntax is::

    filtered = [element
                for element in original
                if condition(element)]

Here ``condition()`` is arbitrary. Let's see a few examples.

****

**Example**.
Let's create a list with the even numbers in the range [0, 99]::

    numbers = range(100)
    even_numbers = [n for n in numbers if n % 2 == 0]
    print even_numbers

****

**Example**. Given a list of DNA sequences::

    sequences = ["ACTGG", "CCTGT", "ATTTA", "TATAGC"]

we keep only those sequences that contain at least one adenine::

    sequences_with_a = [sequence for sequence in sequences
                        if "A" in sequence]
    print sequences_with_a

If we want only those with *no* adenine, we can invert the condition::

    sequences_without_a = [sequence for sequence in sequences
                           if "A" not in sequence]
    print sequences_without_a

****

**Example**. When no condition is given, no filtering is performed::

    list = range(5)
    print list

    list_2 = [element for element in list]
    print list_2

The above code creates a copy of the original list ``list``.

****

**Example**. This list describes a gene regulation network::

    triples = [
        ["G1C2W9", "G1C2Q7", 0.2],
        ["G1C2W9", "G1C2Q4", 0.9],
        ["Q6NMS1", "G1C2W9", 0.8],
        # ^^^^^^    ^^^^^^    ^^^
        #  gene1     gene2    correlation
    ]

Each "triple" has three elements: two *A. thaliana* genes, and a measure of
the correlation of their expression (given by, say, a microarray
experiment).

I can use a *list comprehension* to keep only the pairs of genes with high
correlation::

    high_correlation_genes = \
        [triple[:-1] for triple in triples if triple[-1] > 0.75]

I can also keep only those genes that are highly correlated with the
``"G1C2W9"`` gene::

    threshold = 0.75
    interesting_genes = \
        [triple[0] for triple in triples
         if triple[1] == "G1C2W9" and triple[-1] >= threshold] + \
        [triple[1] for triple in triples
         if triple[0] == "G1C2W9" and triple[-1] >= threshold]

****

.. warning::

    The name of the "temporary" variable holding the current element (in the
    examples above, ``n``, ``sequence`` and ``triple``, respectively) is
    arbitrary.

    This code::

        list = range(10)
        print [x for x in list if x > 5]

    is *exactly equivalent* to this code::

        list = range(10)
        print [y for y in list if y > 5]

    The name of the variable, ``x`` or ``y``, does not make any difference.
    You are free to pick any name you like.

****

|

**As a transformation**. Given an arbitrary list ``original``, I can use a
*list comprehension* to also transform the elements in the list in some way.

The abstract syntax is::

    transformed = [transform(element)
                   for element in original]

The transformation ``transform()`` is arbitrary.

****

**Example**. Given the list::

    numbers = range(10)

let's create a new list with their doubles::

    doubles = [n * 2 for n in numbers]
    #          ^^^^^
    #      transformation
    print doubles

****

**Example**. Given a list of paths::

    paths = ["aatable", "fasta.1", "fasta.2"]

let's add a prefix ``"data/"`` to each and every element::

    prefixed_paths = ["data/" + path for path in paths]
    #                 ^^^^^^^^^^^^^^
    #                 transformation
    print prefixed_paths

****

**Example**. Given the list of primary sequences::

    sequences = [
        "MVLTIYPDELVQIVSDKIASNK",
        "GKITLNQLWDIS",
        "KYFDLSDKKVKQFVLSCVILKKDIE",
        "VYCDGAITTKNVTDIIGDANHSYS",
    ]

let's compute the length of *each* sequence, and save them in another list::

    lengths = [len(seq) for seq in sequences]
    print lengths

****

**Example**.
Given the list of strings::

    atoms = [
        "SER A 96 77.253 20.522 75.007",
        "VAL A 97 76.066 22.304 71.921",
        "PRO A 98 77.731 23.371 68.681",
        "SER A 99 80.136 26.246 68.973",
        "GLN A 100 79.039 29.534 67.364",
        "LYS A 101 81.787 32.022 68.157",
    ]

which represents (part of) the 3D structure of a protein chain, I want to
compute a list of lists which should hold, for each residue (that is, for
every row of ``atoms``), its coordinates::

    coords = [row.split()[-3:] for row in atoms]

The result is::

    >>> print coords
    [
        ["77.253", "20.522", "75.007"],
        ["76.066", "22.304", "71.921"],
        ["77.731", "23.371", "68.681"],
        ["80.136", "26.246", "68.973"],
        ["79.039", "29.534", "67.364"],
        ["81.787", "32.022", "68.157"],
    ]

****

|

**Jointly transforming and filtering.** Given a list ``original``, I can both
transform and filter its elements *jointly* using the complete version of the
*list comprehension* operator.

The abstract syntax is::

    new_list = [transform(element)
                for element in original
                if condition(element)]

****

**Example**. Given the integers from 0 to 99, I want to keep only the even
ones and divide them by 3::

    result = [n / 3.0 for n in range(100) if n % 2 == 0]
    print result

****

**Example**. Given the list of strings::

    atoms = [
        "SER A 96 77.253 20.522 75.007",
        "VAL A 97 76.066 22.304 71.921",
        "PRO A 98 77.731 23.371 68.681",
        "SER A 99 80.136 26.246 68.973",
        "GLN A 100 79.039 29.534 67.364",
        "LYS A 101 81.787 32.022 68.157",
    ]

we used::

    coords = [row.split()[-3:] for row in atoms]

to obtain::

    >>> print coords
    [
        ["77.253", "20.522", "75.007"],
        ["76.066", "22.304", "71.921"],
        ["77.731", "23.371", "68.681"],
        ["80.136", "26.246", "68.973"],
        ["79.039", "29.534", "67.364"],
        ["81.787", "32.022", "68.157"],
    ]

We now make things more complex: we only want the coordinates of the
serines. Let's write::

    coords = [row.split()[-3:] for row in atoms
              if row.split()[0] == "SER"]

****

|

Exercises
---------

#. Given the list::

       list = range(100)

   #. Create a new list ``list_plus_3`` that holds the elements of ``list``
      plus 3. The expected result is::

          [3, 4, 5, ...]

   #. Create a new list ``odds`` that holds only the odd elements in
      ``list``. The expected result is::

          [1, 3, 5, ...]

      *Hint*: adapt one of the previous examples.

   #. Create a new list ``opposites`` that holds the arithmetical opposites
      (the opposite of :math:`x` is :math:`-x`) of the elements in ``list``.
      The expected result is::

          [0, -1, -2, ...]

   #. Create a new list ``inverses`` that holds the arithmetical inverses
      (the inverse of :math:`x` is :math:`\frac{1}{x}`) of the elements in
      ``list``. Make sure to skip those elements that have no inverse (like
      0). The expected result is::

          [1.0, 0.5, 0.333..., ...]

      *Hint*: skip = filter out.

   #. Create a new list with only the first and last element of ``list``.
      The expected result is::

          [0, 99]

      *Hint*: is a *list comprehension* required?

   #. Create a new list with all elements of ``list``, except the first and
      the last. The expected result is::

          [1, 2, ..., 97, 98]

   #. Count how many odd numbers there are in ``list``. There should be 50.

      *Hint*: use a *list comprehension* plus... something else.

   #. Create a new list holding all elements of ``list``, divided by 5. The
      expected result is::

          [0.0, 0.2, 0.4, ...]

      *Hint*: make sure that the results are ``float``!

   #. Create a new list holding only the multiples of 5 appearing in
      ``list``; the multiples should be divided by 5. The expected result
      is::

          [0.0, 1.0, 2.0, ..., 19.0]

   #. Create a new list ``list_of_strings`` containing the same elements as
      ``list``, but converted into strings. The expected result is::

          ["0", "1", "2", ...]

   #. Count how many strings in ``list_of_strings`` represent an odd number.
      The expected result is, again, 50.

   #. Create a **string** that contains all the elements in ``list``,
      separated by spaces. The expected result is::

          "0 1 2 ..."

      *Hint*: use a *list comprehension* plus... something else.

#. For each of the following bullet points, write two *list comprehensions*
   to convert from ``list_1`` to ``list_2`` and vice versa.

   #. ::

          list_1 = [1, 2, 3]
          list_2 = ["1", "2", "3"]

   #. ::

          list_1 = ["name", "surname", "age"]
          list_2 = [["name"], ["surname"], ["age"]]

   #. ::

          list_1 = ["ACTC", "TTTGGG", "CT"]
          list_2 = [["actc", 4], ["tttggg", 6], ["ct", 2]]

#. Given the list::

       list = range(10)

   which of the following code fragments are correct? What do they compute?

   #. ``[x for x in list]``

   #. ``[y for y in list]``

   #. ``[y for x in list]``

   #. ``["x" for x in list]``

   #. ``[str(x) for x in list]``

   #. ``[x for str(x) in list]``

   #. ``[x + 1 for x in list]``

   #. ``[x + 1 for x in list if x == 2]``

#. Let's consider the string::

       clusters = """\
       >Cluster 0
       0 >YLR106C at 100.00%
       >Cluster 50
       0 >YPL082C at 100.00%
       >Cluster 54
       0 >YHL009W-A at 90.80%
       1 >YHL009W-B at 100.00%
       2 >YJL113W at 98.77%
       3 >YJL114W at 97.35%
       >Cluster 52
       0 >YBR208C at 100.00%
       """

   extracted from the output of a clustering algorithm (CD-HIT) applied to
   the *S. cerevisiae* genome (taken from SGD).

   ``clusters`` encodes information about protein clusters; the proteins
   were clustered together based on their sequence similarity. (The details
   are not important.)

   A cluster begins with the line::

       >Cluster N

   where ``N`` is the cluster ID. The contents of the cluster are given in
   the following lines, for instance::

       >Cluster 54
       0 >YHL009W-A at 90.80%
       1 >YHL009W-B at 100.00%
       2 >YJL113W at 98.77%
       3 >YJL114W at 97.35%

   represents a cluster (with ID 54) of four sequences. Four proteins belong
   to the cluster: protein ``"YHL009W-A"`` (with 90.80% similarity to the
   cluster center), protein ``"YHL009W-B"`` (with 100.00% similarity),
   *etc.*

   Given the string ``clusters``, use a *list comprehension* (together with
   other operations on strings, such as ``split()``) for:

   #. Extracting the IDs of the various clusters. The expected result is::

          >>> print cluster_ids
          ["0", "50", "54", "52"]

   #. Extracting the names of *all* proteins, including duplicates. The
      expected result is::

          >>> print protein_names
          ["YLR106C", "YPL082C", "YHL009W-A", ...]

   #. Extracting all protein-similarity pairs for *all* proteins. The
      expected result is::

          >>> print protein_similarity_pairs
          [["YLR106C", 100.0],
           ["YPL082C", 100.0],
           ["YHL009W-A", 90.8],
           # ...
          ]

#. Given the :math:`3\times 3` matrix (list of lists)::

       matrix = [range(0,3), range(3,6), range(6,9)]

   #. Create a matrix ``upside_down`` containing all rows of ``matrix`` from
      bottom to top.

   #. (Hard.) Create a matrix ``palindrome`` containing all columns of
      ``matrix`` from left to right.

   #. (Hard.) Re-create ``matrix`` from scratch using a *list
      comprehension*.

#. (Hard.) Given the list::

       list = range(100)

   Create a list ``squares`` containing the squares of all the elements in
   ``list``. The expected result is::

       [0, 1, 4, 9, ...]

   Next, create a list ``difference_of_squares`` containing, in every
   position ``i``, the value of::

       squares[i+1] - squares[i]

   making sure to avoid the case ``i == len(list) - 1`` (because in that
   case ``squares[i+1]`` is undefined). Feel free to use auxiliary variables
   as required.

   (Blame this page for inspiring this exercise.)

#. Given the following list of mouse gene symbols::

       mouse_genes = ["Fus", "Tdp43", "Sod1", "Ighmbp2", "Srsf2"]

   #. Sort the list alphabetically.

   #. In the sorted list, convert mouse symbols into human gene symbols.

      *Hint*: in human gene symbols all letters are in upper-case.
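
As a closing recap, the sketch below chains the three tools of this chapter:
``split()``, a list comprehension that filters and transforms jointly, and
``join()``. It is only an illustration, not one of the original examples: the
names ``rows``, ``serine_coords``, ``serine_numbers`` and ``report`` are made
up, the nested comprehension goes slightly beyond the examples above, and
``print`` is called with parentheses only so that the code also runs under
Python 3.

```python
# Recap sketch: split(), a list comprehension, and join() working together.
# All names except atoms are illustrative, not part of the original examples.
atoms = [
    "SER A 96 77.253 20.522 75.007",
    "VAL A 97 76.066 22.304 71.921",
    "PRO A 98 77.731 23.371 68.681",
    "SER A 99 80.136 26.246 68.973",
]

# Split every row once, so that the words can be reused below.
rows = [row.split() for row in atoms]

# Filter (keep the serines) and transform (coordinates as floats) jointly.
# The inner comprehension converts the three coordinate strings of one row.
serine_coords = [[float(word) for word in words[-3:]]
                 for words in rows if words[0] == "SER"]

# join() needs strings, so the residue numbers are kept as text.
serine_numbers = [words[2] for words in rows if words[0] == "SER"]
report = ",".join(serine_numbers)

print(serine_coords)
print(report)
```

Splitting each row once and storing the result in ``rows`` avoids calling
``split()`` twice per row, as the serine example above does.
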