Python: Lists

Lists are ordered sequences of arbitrary elements (objects).

Lists are defined using square brackets, as follows:

# A list of integers (notice that the 1 appears twice)
integers = [1, 2, 3, 1]

# A list of strings
uniprot_proteins = ["Y08501", "Q95747"]

# A list of heterogeneous objects
things = ["Y08501", 0.13, "Q95747", 0.96]

# A list of lists
two_level_list = [
    ["Y08501", 120, 520],
    ["Q95747", 550, 920],
]

# An empty list
empty = []

# A list containing two empty lists
a_weird_list = [ [], [] ]

Operations

Returns  Operator             Meaning
bool     ==                   Check whether two lists are identical
bool     !=                   Check whether two lists are different
int      len(list)            Compute the length of a list
list     list + list          Concatenate two lists (returns a new list)
list     list * int           Replicate a list multiple times
bool     element in list      Check whether an element appears in the list
list     list[int:int]        Extract a sub-list
list     list[int] = object   Assign a new value to an element
list     range(int, [int])    Compute the integers in a given range

Lists offer almost the same operators as strings, with a couple of additions.
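
As a quick illustration of the operators in the table, here is a minimal sketch. The variable names and values are made up for the example, and print() is called with a single argument so that the snippet behaves the same under Python 2 and Python 3.

```python
# Hypothetical example data, not taken from the text above.
first = [1, 2, 3]
second = [4, 5]

print(first == [1, 2, 3])      # True: same elements, same order
print(first != second)         # True
print(len(first))              # 3

joined = first + second        # concatenation returns a new list
print(joined)                  # [1, 2, 3, 4, 5]
print(first)                   # [1, 2, 3] (unchanged)

repeated = second * 2          # replication
print(repeated)                # [4, 5, 4, 5]

print(2 in first)              # True
print(joined[1:3])             # [2, 3]
```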


Example. range() returns a list of integers in a given range:

>>> numbers = range(5)
>>> print numbers
[0, 1, 2, 3, 4]

>>> numbers = range(0, 5)
>>> print numbers
[0, 1, 2, 3, 4]

>>> numbers = range(2, 4)
>>> print numbers
[2, 3]

>>> numbers = range(4, 2) # the range is backwards!
>>> print numbers
[]
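
The outputs above are those of Python 2, where range() returns a list. Under Python 3, range() returns a lazy range object instead, so printing it does not show the individual elements. Wrapping the call in list() is one way to obtain the same list on both versions; the sketch below relies only on built-ins.

```python
# list(range(...)) yields a real list on both Python 2 and Python 3.
numbers = list(range(5))
print(numbers)                 # [0, 1, 2, 3, 4]

backwards = list(range(4, 2))  # start >= stop gives an empty list
print(backwards)               # []
```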

Example. Just like with strings, you can extract an element or a range of elements from a list:

>>> numbers = range(10)
>>> first_element = numbers[0]
>>> print first_element
0

>>> last_element = numbers[-1]
>>> print last_element
9

>>> the_other_elements = numbers[1:-1]
>>> print the_other_elements
[1, 2, 3, 4, 5, 6, 7, 8]

Example. The assignment operator assigns a new value to an already existing list element. It is more or less the opposite of the extraction operator. A couple of examples:

>>> numbers = range(5)
>>> print numbers
[0, 1, 2, 3, 4]

>>> numbers[0] = "first"
>>> print numbers
["first", 1, 2, 3, 4]

>>> numbers[-1] = "last"
>>> print numbers
["first", 1, 2, 3, "last"]

>>> numbers[len(numbers)/2] = "middle"
>>> print numbers
["first", 1, "middle", 3, "last"]

If the index is out-of-bounds, the assignment raises an error:

>>> numbers = range(5)
>>> numbers[100] = "out-of-bounds"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
IndexError: list assignment index out of range

Warning

The assignment operator does not change the length of the list!

It modifies an existing element (at a given position); it does not add a new element.

Warning

Lists are ordered: the order of the elements matters:

[1, 2, 3] != [3, 2, 1]

Lists are not sets: objects may appear more than once:

[3, 3, "a", "a"] != [3, "a"]
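
To make the two points concrete, the sketch below compares lists directly and then compares the corresponding sets. set() is a Python built-in that is not covered in this section; it appears here only for contrast, and the variable names are illustrative.

```python
ordered_a = [1, 2, 3]
ordered_b = [3, 2, 1]
print(ordered_a == ordered_b)             # False: order matters for lists
print(set(ordered_a) == set(ordered_b))   # True: sets ignore order

with_repeats = [3, 3, "a", "a"]
print(len(with_repeats))                  # 4: duplicates are kept in a list
print(len(set(with_repeats)))             # 2: a set keeps one copy of each
```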

Exercises

  1. Create an empty list using the bracket notation. Check whether it is really empty using len().

  2. Create a list with the first five non-negative integers using range().

  3. Create a list with one hundred 0 elements.

    Hint: note the replication operator.

  4. Given:

    list_1 = range(10)
    list_2 = range(10, 20)
    

    concatenate the two lists, and assign the result to a new variable full_list. Use the equality comparison operator == to check whether it matches the result of range(20).

  5. Create a list of three strings: "I am", "a", "list". Then print the type and length of the three elements (manually, one by one).

  6. Given the list:

    list = [0.0, "b", [3], [4, 5]]
    
    1. How long is it?
    2. What is the type of the first element of list?
    3. How long is the second element of list?
    4. How long is the third element of list?
    5. What is the value of the last element of list? How long is it?
    6. Does the list contain an element "b"?
    7. Does the list contain an element 4?
  7. What is the difference between these “lists”?

    list_1 = [1, 2, 3]
    list_2 = ["1", "2", "3"]
    list_3 = "[1, 2, 3]"
    

    Hint: is the third one actually a list?

  8. Which of the following code fragments are wrong?

    1. list = []
    2. list = [}
    3. list = [[]]
    4. list = [1 2 3]
    5. list = range(3), element = list[3]
    6. list = range(3), element = list[-1]
    7. list = range(3), sublist = list[0:2]
    8. list = range(3), sublist = list[0:3]
    9. list = range(3), sublist = list[0:-1]
    10. list = range(3), list[2] = "two"
    11. list = range(3), list[3] = "three"
    12. list = range(3), list[-1] = "three"
    13. list = range(3), list[1.2] = "one point two"
    14. list = range(3), list[1] = ["protein1", "protein2"]
  9. Given the list:

    matrix = [
        [1, 2, 3],          # <-- 1st row
        [4, 5, 6],          # <-- 2nd row
        [7, 8, 9],          # <-- 3rd row
    ]
    #    ^  ^  ^
    #    |  |  |
    #    |  |  +-- 3rd column
    #    |  +----- 2nd column
    #    +-------- 1st column
    

    How can I:

    1. Extract the first row?
    2. Extract the second element of the first row?
    3. Sum all elements of the first row? (Perform the sum manually)
    4. Create a new list containing the elements of the second column? (Manually)
    5. Create a new list containing the elements of the diagonal? (Manually)
    6. Create a list by concatenating the first, second, and third rows?

Methods

Returns  Method                    Meaning
None     list.append(object)       Add a new element at the end of the list
None     list.extend(list)         Add several new elements at the end of the list
None     list.insert(int, object)  Add a new element at some given position
None     list.remove(object)       Remove the first occurrence of an element
None     list.reverse()            Invert the order of the elements
None     list.sort()               Sort the elements
int      list.count(object)        Count the occurrences of an element

Warning

All list methods (except count()):

  • Modify the input list
  • Do not have a return value (they return None)

In other words, they behave in exactly the opposite way to string methods!

As a consequence, if I do:

list = range(10)
print list

result = list.append(10)

print list
print result

the list is modified in the process and result will be None. The same is true for all other methods (except count()).

This may look a bit surprising, and it is easy to mistakenly expect append() or reverse() to return a list. They do not.

By the way, this is the reason why we cannot write code like:

list = []
list.append(1).append(2).append(3)

because the first append() does not return a list: it returns None. None, of course, does not support the append() method, so the second append() always fails and Python raises an error.
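
The failure can be observed directly with the minimal sketch below. The try/except construct is not introduced in this section; it is used here only so that the script keeps running after the error, and the variable name values is illustrative.

```python
values = []

try:
    values.append(1).append(2)
except AttributeError as error:
    # append() returned None, and None has no append() method
    print(error)

print(values)        # [1]: the first append() still took effect

# Writing one call per line avoids the problem.
values.append(2)
values.append(3)
print(values)        # [1, 2, 3]
```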



Example. append() adds a new element at the end of a list:

list = range(10)
print list                          # [0, 1, 2, ..., 9]
print len(list)                     # 10

list.append(10)
print list                          # [0, 1, 2, ..., 9, 10]
print len(list)                     # 11

list.append(11)
print list                          # [0, 1, 2, ..., 9, 10, 11]
print len(list)                     # 12

Note how list changes in the process.


Example. extend() adds several new elements at the end of a list:

list = range(10)
list.extend(range(10,20))
print list                          # [0, 1, 2, ..., 19]
print len(list)                     # 20

Example. insert() adds a new element at an arbitrary position:

list = range(10)
print list                          # [0, 1, ..., 9]
print len(list)                     # 10

list.insert(2, "marker")
print list                          # [0, 1, "marker", 3, ..., 9]
print len(list)                     # 11

Example. Contrary to append(), insert() and extend(), list concatenation does not modify the original list. You get a new list instead:

list_1 = range(0, 10)
list_2 = range(10, 20)

# let's use the concatenation operator
full_list = list_1 + list_2
print list_1, "+", list_2, "->", full_list

# now let's use extend() instead
full_list = list_1.extend(list_2)
print list_1
print list_2
print full_list

Note how with extend(), list_1 changes while full_list is None, as expected.
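
Here is a shorter variant of the same comparison with the outputs written out. It is only a sketch: it uses smaller lists than the example above and wraps range() in list() so that it also runs under Python 3.

```python
list_1 = list(range(0, 3))
list_2 = list(range(3, 6))

full_list = list_1 + list_2
print(list_1)        # [0, 1, 2] (unchanged by +)
print(full_list)     # [0, 1, 2, 3, 4, 5]

result = list_1.extend(list_2)
print(list_1)        # [0, 1, 2, 3, 4, 5] (modified in place)
print(list_2)        # [3, 4, 5]
print(result)        # None
```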


Example. remove() removes the first occurrence of a given value:

list = ["a", "list", "not", "a", "string"]
#       ^^^                 ^^^

list.remove("a")
print list                          # ["list", "not", "a", "string"]

Example. sort() and reverse() reorder the elements in the list:

list = [3, 2, 1, 5, 4]

list.reverse()
print list                          # [4, 5, 1, 2, 3]

list.sort()
print list                          # [1, 2, 3, 4, 5]

Both methods also work with strings (sort() defaults to lexicographic ordering). Check this out:

list = ["AC", "GT"]
print list

list.reverse()
print list                          # ["GT", "AC"]

list.sort()
print list                          # ["AC", "GT"]

Example. count() returns the number of occurrences of a value in a list:

list = ["a", "c", "g", "t", "a"]
num_a = list.count("a")
num_g = list.count("g")
print num_a, num_g                  # 2 1

Of course, count() does not modify the original list.



Warning

A technical note. Recall that lists are mutable, and that (like all variables) they contain references to objects, not the objects themselves.

So what? Here are a few examples showing the consequences of the previous observation.


Example. This code:

sublist = range(5)
list = [sublist]
print list

creates a new list list that “contains” another list sublist. When I modify sublist (which is mutable), I end up modifying list:

sublist.append(5)
print sublist

print list

Example. Another interesting case:

list = range(5)
print list

not_a_copy = list
print not_a_copy

# here both list and not_a_copy end up referring to the same
# object, so if I change list, I also end up changing
# not_a_copy!

list.append(5)
print list

print not_a_copy

In order to create a real, independent copy of a list, I have to use the extraction operator (or list comprehension, as we will see), as follows:

list = range(5)
print list

real_copy = list[:]
# or: real_copy = [elem for elem in list]
print real_copy

list.append(5)
print list

print real_copy
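
One way to check whether two names refer to the same object is the is operator, which this section does not cover. The sketch below is only an illustration under that assumption; the names original, alias and real_copy are made up for the example.

```python
original = list(range(5))
alias = original             # a second name for the same object
real_copy = original[:]      # a new list with the same elements

print(alias is original)     # True: same object
print(real_copy is original) # False: a different object...
print(real_copy == original) # True: ...with equal contents

original.append(5)
print(alias)                 # [0, 1, 2, 3, 4, 5]
print(real_copy)             # [0, 1, 2, 3, 4]
```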


Exercises

  1. Create a new empty list list. Then add an integer, a string, and another list.

  2. Starting from the list list = range(3) (reset it after every bullet point!), what happens when I do:

    1. list.append(3)
    2. list.append([3])
    3. list.extend([3])
    4. list.extend(3)
    5. list.insert(0, 3)
    6. list.insert(3, 3)
    7. list.insert(3, [3])
    8. list.insert([3], 3)
    
  3. What is the difference between:

    list = []
    list.append(range(10))
    list.append(range(10, 20))
    

    and:

    list = []
    list.extend(range(10))
    list.extend(range(10, 20))
    

    How long is list in the two cases?

  4. What does this code do?

    list = [0, 0, 0, 0]
    list.remove(0)
    
  5. What does this code do?

    list = [1, 2, 3, 4, 5]
    list.reverse()
    list.sort()
    

    Is it equivalent to the following code?

    list = [1, 2, 3, 4, 5]
    list.reverse().sort()
    
  6. Given the list:

    list = range(10)
    

    create a new list reversed_list with the same elements as list, but in reversed order using reverse().

    list must not be changed in the process.

  7. Given the list motifs:

    motifs = [
        "KSYK",
        "SVALVV",
        "GVTGI",
        "VGSSLAEVLKLPD",
    ]
    

    create a new list sorted_motifs with the same elements as motifs, but sorted alphanumerically using sort().

    motifs must not be changed in the process.


String-List Methods

Returns      Method                 Meaning
list-of-str  str.split(str)         Split a string into a list of strings (words)
str          str.join(list-of-str)  Join a list of strings (words) into a string

Example. We have a multi-line string taken from a PDB structure file:

structure_chain_a = """SER A 96 77.253 20.522 75.007
VAL A 97 76.066 22.304 71.921
PRO A 98 77.731 23.371 68.681
SER A 99 80.136 26.246 68.973
GLN A 100 79.039 29.534 67.364
LYS A 101 81.787 32.022 68.157"""

Let’s split the multi-line string into a list of single-line strings, for ease of processing:

lines = structure_chain_a.split("\n")
print lines[0]
print lines[1]
# ...

Now we can extract all the words of, say, the second line:

words = lines[1].split()
print words

It is now pretty easy to extract the coordinates of the residue:

coords = words[-3:]
print coords

Example. join() is essentially the inverse operation:

list_of_strings = [
    ">1A3A:A|PDBID|CHAIN|SEQUENCE",
    "MANLFKLGAENIFLGRKAATKEEAIRFA",
]

multiline_string = "\n".join(list_of_strings)
print multiline_string

It is not a super-interesting method, but it can be useful for printing pretty formatted text.
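
To illustrate in what sense join() inverts split(), here is a minimal sketch that splits one line of the structure above and glues it back together. The second half uses a made-up string to show a case where the round trip does not reproduce the original.

```python
line = "VAL A 97 76.066 22.304 71.921"

words = line.split(" ")        # split on single spaces
rebuilt = " ".join(words)      # glue the words back together
print(rebuilt == line)         # True

# With no argument, split() collapses runs of whitespace, so the
# round trip is not guaranteed to reproduce the original string.
spaced = "VAL   A"
print(" ".join(spaced.split()) == spaced)   # False
```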

Warning

Note that join() takes a list of strings! This won’t work:

" ".join([1, 2, 3])

Exercises

  1. Given the text:

    text = """The Wellcome Trust Sanger Institute
    is a world leader in genome research."""
    

    create a list of strings that includes all of the words (i.e. substrings separated by spaces) of text.

    Then print how many words there are.

  2. The table below:

    tabella = [
        "protein | database | domain | start | end",
        "YNL275W | Pfam | PF00955 | 236 | 498",
        "YHR065C | SMART | SM00490 | 335 | 416",
        "YKL053C-A | Pfam | PF05254 | 5 | 72",
        "YOR349W | PANTHER | 353 | 414",
    ]
    

    (taken from Saccharomyces Genome Database) represents a list of domains that have been identified in a given yeast protein.

    Each row is a single domain instance (except the first).

    Use split() to obtain the list of titles of the various columns (from the first row), making sure that the column names contain no spurious space characters.

    Hint: strip() is not necessary; it is sufficient to use split() correctly.

  3. Given the list of strings:

    words = ["word_1", "word_2", "word_3"]
    

    build, using join() together with an appropriate delimiter, the following strings:

    1. "word_1 word_2 word_3"
    2. "word_1,word_2,word_3"
    3. "word_1 and word_2 and word_3"
    4. "word_1word_2word_3"
    5. r"word_1\word_2\word_3"
  4. Given the list of strings:

    random_sentences = [
        "Taci. Su le soglie",
        "del bosco non odo",
        "parole che dici",
        "umane; ma odo",
        "parole piu' nuove",
        "che parlano gocciole e foglie",
        "lontane."
    ]
    

    use join() to create a new multi-line string poem. The expected result is:

    >>> print poem
    Taci. Su le soglie
    del bosco non odo
    parole che dici
    umane; ma odo
    parole piu' nuove
    che parlano gocciole e foglie
    lontane.
    

    Hint: what delimiter should I use?


List Comprehension

The list comprehension operator lets you filter or transform a list.

Warning

The original list is left unchanged. A new list is created instead.

As a filter. Given an arbitrary list original, I can create a new list that only contains those elements of original that satisfy a given condition.

The abstract syntax is:

filtered = [element
            for element in original
            if condition(element)]

Here condition() is arbitrary. Let’s see a few examples.


Example. Let’s create a list with the even numbers in the range [0, 99]:

numbers = range(100)

even_numbers = [n
                for n in numbers
                if n % 2 == 0]
print even_numbers

Example. Given a list of DNA sequences:

sequences = ["ACTGG", "CCTGT", "ATTTA", "TATAGC"]

we keep only those sequences that contain at least one adenine:

sequences_with_a = [sequence
                    for sequence in sequences
                    if "A" in sequence]
print sequences_with_a

If we want only those with no adenine, we can invert the condition:

sequences_without_a = [sequence
                       for sequence in sequences
                       if not "A" in sequence]
print sequences_without_a

Example. When no condition is given, no filtering is performed:

list = range(5)
print list

list_2 = [element for element in list]
print list_2

The above code creates a copy of the original list list.

Example. This list describes a gene regulation network:

triples = [
    ["G1C2W9", "G1C2Q7", 0.2],
    ["G1C2W9", "G1C2Q4", 0.9],
    ["Q6NMS1", "G1C2W9", 0.8],
    # ^^^^^^    ^^^^^^   ^^^
    #  gene1     gene2   correlation
]

Each “triple” has three elements: two A. thaliana genes, and a measure of the correlation of their expression (given by, say, a microarray experiment).

I can use a list comprehension to keep only the pairs of genes with high correlation:

high_correlation_genes = \
    [triple[:-1] for triple in triples
     if triple[-1] > 0.75]

I can also keep only those genes that are highly correlated with the "G1C2W9" gene:

threshold = 0.75
interesting_genes = \
    [triple[0] for triple in triples
     if triple[1] == "G1C2W9" and triple[-1] >= threshold] + \
    [triple[1] for triple in triples
     if triple[0] == "G1C2W9" and triple[-1] >= threshold]
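
For reference, here is a self-contained sketch that runs both comprehensions on the three triples above and shows what they produce. It repeats the data so that it can be run on its own, and it uses parentheses instead of backslashes for line continuation, which is purely a stylistic choice.

```python
# Same toy network as above: [gene1, gene2, correlation]
triples = [
    ["G1C2W9", "G1C2Q7", 0.2],
    ["G1C2W9", "G1C2Q4", 0.9],
    ["Q6NMS1", "G1C2W9", 0.8],
]
threshold = 0.75

high_correlation_genes = [triple[:-1] for triple in triples
                          if triple[-1] > threshold]
print(high_correlation_genes)
# [['G1C2W9', 'G1C2Q4'], ['Q6NMS1', 'G1C2W9']]

interesting_genes = (
    [triple[0] for triple in triples
     if triple[1] == "G1C2W9" and triple[-1] >= threshold] +
    [triple[1] for triple in triples
     if triple[0] == "G1C2W9" and triple[-1] >= threshold]
)
print(interesting_genes)
# ['Q6NMS1', 'G1C2Q4']
```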

Warning

The name of the “temporary” variable holding the current element (in the examples above, n, sequence and triple, respectively) is arbitrary.

This code:

list = range(10)
print [x for x in list if x > 5]

is perfectly identical to this code:

list = range(10)
print [y for y in list if y > 5]

The name of the variable, x or y, does not make any difference. You are free to pick any name you like.



As a transformation. Given an arbitrary list original, I can use a list comprehension to also transform the elements in the list in some way.

The abstract syntax is:

transformed = [transform(element)
               for element in original]

The transformation transform() is arbitrary.


Example. Given the list:

numbers = range(10)

let’s create a new list with their doubles:

doubles = [n * 2 for n in numbers]
#          ^^^^^
#          transformation
print doubles

Example. Given a list of paths:

paths = ["aatable", "fasta.1", "fasta.2"]

let’s add a prefix "data/" to each and every element:

prefixed_paths = ["data/" + path for path in paths]
#                 ^^^^^^^
#                 transformation
print prefixed_paths

Example. Given the list of primary sequences:

sequences = [
    "MVLTIYPDELVQIVSDKIASNK",
    "GKITLNQLWDIS",
    "KYFDLSDKKVKQFVLSCVILKKDIE",
    "VYCDGAITTKNVTDIIGDANHSYS",
]

let’s compute the length of each sequence, and save them in another list:

lengths = [len(seq) for seq in sequences]
print lengths

Example. Given the list of strings:

atoms = [
    "SER A 96 77.253 20.522 75.007",
    "VAL A 97 76.066 22.304 71.921",
    "PRO A 98 77.731 23.371 68.681",
    "SER A 99 80.136 26.246 68.973",
    "GLN A 100 79.039 29.534 67.364",
    "LYS A 101 81.787 32.022 68.157",
]

which represents (part of) the 3D structure of a protein chain, I want to compute a list of lists which should hold, for each residue (that is, for every row of atoms), its coordinates:

coords = [row.split()[-3:] for row in atoms]

The result is:

>>> print coords
[
    ["77.253", "20.522", "75.007"],
    ["76.066", "22.304", "71.921"],
    ["77.731", "23.371", "68.681"],
    ["80.136", "26.246", "68.973"],
    ["79.039", "29.534", "67.364"],
    ["81.787", "32.022", "68.157"],
]
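
Note that the coordinates are still strings. If numbers are needed, one possible approach is a second comprehension that applies float() to every value. The sketch below nests one comprehension inside another, which this section does not cover explicitly, and uses only the first two rows of atoms to stay short; numeric_coords is an illustrative name.

```python
atoms = [
    "SER A 96 77.253 20.522 75.007",
    "VAL A 97 76.066 22.304 71.921",
]

coords = [row.split()[-3:] for row in atoms]

# Convert every coordinate of every residue from str to float.
numeric_coords = [[float(value) for value in triple]
                  for triple in coords]
print(numeric_coords[0])       # [77.253, 20.522, 75.007]
```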


Jointly transforming and filtering. Given a list original, I can both transform and filter its elements jointly using the complete version of the list comprehension operator.

The abstract syntax is:

new_list = [transform(element)
            for element in original
            if condition(element)]

Example. Given the integers from 0 to 99, I want to keep only the even ones and divide them by 3:

result = [n / 3.0
          for n in range(100)
          if n % 2 == 0]
print result

Example. Given the list of strings:

atoms = [
    "SER A 96 77.253 20.522 75.007",
    "VAL A 97 76.066 22.304 71.921",
    "PRO A 98 77.731 23.371 68.681",
    "SER A 99 80.136 26.246 68.973",
    "GLN A 100 79.039 29.534 67.364",
    "LYS A 101 81.787 32.022 68.157",
]

we used:

coords = [row.split()[-3:] for row in atoms]

to obtain:

>>> print coords
[
    ["77.253", "20.522", "75.007"],
    ["76.066", "22.304", "71.921"],
    ["77.731", "23.371", "68.681"],
    ["80.136", "26.246", "68.973"],
    ["79.039", "29.534", "67.364"],
    ["81.787", "32.022", "68.157"],
]

We now make things more complex: we only want the coordinates of the serines. Let’s write:

coords = [row.split()[-3:] for row in atoms
          if row.split()[0] == "SER"]
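
Run on the six rows above, the filtered comprehension keeps only the two serine residues. The self-contained sketch below repeats the data and stores the result under a different, illustrative name (serine_coords).

```python
atoms = [
    "SER A 96 77.253 20.522 75.007",
    "VAL A 97 76.066 22.304 71.921",
    "PRO A 98 77.731 23.371 68.681",
    "SER A 99 80.136 26.246 68.973",
    "GLN A 100 79.039 29.534 67.364",
    "LYS A 101 81.787 32.022 68.157",
]

# Keep the last three words of each row whose first word is "SER".
serine_coords = [row.split()[-3:] for row in atoms
                 if row.split()[0] == "SER"]
print(serine_coords)
# [['77.253', '20.522', '75.007'], ['80.136', '26.246', '68.973']]
```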


Exercises

  1. Given the list:

    list = range(100)
    
    1. Create a new list list_plus_3 that holds the elements of list plus 3. The expected result is:

      [3, 4, 5, ...]
      
    2. Create a new list odds that holds only the odd elements in list. The expected result is:

      [1, 3, 5, ...]
      

      Hint: adapt one of the previous examples.

    3. Create a new list opposites that holds the arithmetical opposites (the opposite of \(x\) is \(-x\)) of the elements in list. The expected result is:

      [0, -1, -2, ...]
      
    4. Create a new list inverses that holds the arithmetical inverse (the inverse of \(x\) is \(\frac{1}{x}\)) of the elements in list.

      Make sure to skip those elements that have no inverse (like 0).

      The expected result is:

      [1.0, 0.5, 0.3333, ...]
      

      Hint: skip = filter out.

    5. Create a new list with only the first and last element of list. The expected result is:

      [0, 99]
      

      Hint: is a list comprehension required?

    6. Create a new list with all elements of list, except the first and the last. The expected result is:

      [1, 2, ..., 97, 98]
      
    7. Count how many odd numbers there are in list. They should be 50.

      Hint: use a list comprehension plus... something else.

    8. Create a new list holding all elements of list, divided by 5. The expected result is:

      [0.0, 0.2, 0.4, ...]
      

      Hint: make sure that the results are float!

    9. Create a new list holding only the multiples of 5 appearing in list; the multiples should be divided by 5. The expected result is:

      [0.0, 1.0, 2.0, ..., 19.0]
      
    10. Create a new list list_of_strings containing the same elements as list, but converted into strings. The expected result is:

      ["0", "1", "2", ...]
      
    11. Count how many strings in list_of_strings represent an odd number. The expected result is, again, 50.

    12. Create a string that contains all the elements in list, separated by spaces. The expected result is:

      "0 1 2 ..."
      

      Hint: use a list comprehension plus... something else.

  2. For each of the following bullet points, write two list comprehensions to convert from list_1 to list_2 and vice versa.

    1. list_1 = [1, 2, 3]
      list_2 = ["1", "2", "3"]
      
    2. list_1 = ["name", "surname", "age"]
      list_2 = [["name"], ["surname"], ["age"]]
      
    3. list_1 = ["ACTC", "TTTGGG", "CT"]
      list_2 = [["actc", 4], ["tttggg", 6], ["ct", 2]]
      
  3. Given the list:

    list = range(10)
    

    which of the following code fragments are correct? What do they compute?

    1. [x for x in list]
    2. [y for y in list]
    3. [y for x in list]
    4. ["x" for x in list]
    5. [str(x) for x in list]
    6. [x for str(x) in list]
    7. [x + 1 for x in list]
    8. [x + 1 for x in list if x == 2]
  4. Let’s consider the string:

    clusters = """\
    >Cluster 0
    0 >YLR106C at 100.00%
    >Cluster 50
    0 >YPL082C at 100.00%
    >Cluster 54
    0 >YHL009W-A at 90.80%
    1 >YHL009W-B at 100.00%
    2 >YJL113W at 98.77%
    3 >YJL114W at 97.35%
    >Cluster 52
    0 >YBR208C at 100.00%
    """
    

    extracted from the output of a clustering algorithm (CD-HIT) applied to the S. cerevisiae genome (taken from SGD).

    clusters encodes information about protein clusters; the proteins were clustered together based on their sequence similarity. (The details are not important.)

    A cluster begins with the line:

    >Cluster N
    

    where N is the cluster ID. The contents of the cluster are given in the following lines, for instance:

    >Cluster 54
    0 >YHL009W-A at 90.80%
    1 >YHL009W-B at 100.00%
    2 >YJL113W at 98.77%
    3 >YJL114W at 97.35%
    

    represents a cluster (with ID 54) of four sequences. Four proteins belong to the cluster: protein "YHL009W-A" (with 90.80% similarity to the cluster center), protein "YHL009W-B" (with 100.00% similarity), and so on.

    Given the string clusters, use a list comprehension (together with other operations on strings, such as split()) for:

    1. Extracting the IDs of the various clusters. The expected result is:

      >>> print cluster_ids
      ["0", "50", "54", "52"]
      
    2. Extracting the names of all proteins, including duplicates. The expected result is:

      >>> print protein_names
      ["YLR106C", "YPL082C", "YHL009W-A", ...]
      
    3. Extracting all protein-similarity pairs for all proteins. The expected result is:

      >>> print protein_similarity_pairs
      [["YLR106C", 100.0],
       ["YPL082C", 100.0],
       ["YHL009W-A", 90.8],
       # ...
      ]
      
  5. Given the \(3\times 3\) matrix (list of lists):

    matrix = [range(0,3), range(3,6), range(6,9)]
    
    1. Create a matrix upside_down containing all rows of matrix from bottom to top.
    2. (Hard.) Create a matrix palindrome containing all columns of matrix from right to left.
    3. (Hard.) Re-create matrix from scratch using a list comprehension.
  6. (Hard.) Given the list:

    list = range(100)
    

    Create a list squares containing the squares of all the elements in list. The expected result is:

    [0, 1, 4, 9, ...]
    

    Next, create a list difference_of_squares containing, in every position i, the value of:

    squares[i+1] - squares[i]
    

    making sure to avoid the case i == len(list) - 1 (because in that case squares[i+1] does not exist).

    Feel free to use auxiliary variables as required.

    (Blame this page for inspiring this exercise.)

  7. Given the following list of mouse gene symbols:

    mouse_genes = ["Fus", "Tdp43", "Sod1", "Ighmbp2", "Srsf2"]
    
    1. Sort the list alphabetically.
    2. In the sorted list, convert mouse symbols into human gene symbols.

    Hint: in human gene symbols all letters are in upper-case.