Python: Lists
Lists are ordered sequences of arbitrary elements (objects).
Lists are defined using square brackets, as follows:
# A list of integers (notice that the 1 appears twice)
integers = [1, 2, 3, 1]
# A list of strings
uniprot_proteins = ["Y08501", "Q95747"]
# A list of heterogeneous objects
things = ["Y08501", 0.13, "Q95747", 0.96]
# A list of lists
two_level_list = [
["Y08501", 120, 520],
["Q95747", 550, 920],
]
# An empty list
empty = []
# A list containing two empty lists
a_weird_list = [ [], [] ]
Operations
Returns | Operator | Meaning |
---|---|---|
bool | == | Check whether two lists are identical |
bool | != | Check whether two lists are different |
int | len(list) | Compute the length of a list |
list | list + list | Concatenate two lists (returns a new list) |
list | list * int | Replicate a list multiple times |
bool | element in list | Check whether an element appears in the list |
list | list[int:int] | Extract a sub-list |
list | list[int] = object | Assign a new value to an element |
list | range(int, [int]) | Compute the integers in a given range |
Lists offer almost the same operators as strings, with a couple of additions.
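As a quick illustration of the operators in the table, here is a minimal sketch. The variable names and values are made up for this example, and the code is written so that it should run unchanged on both Python 2 and Python 3 (hence print with parentheses).

```python
# Hypothetical example data (not taken from the tutorial)
first = [1, 2, 3]
second = [4, 5]

same = (first == [1, 2, 3])   # True: same elements, same order
length = len(first)           # 3
joined = first + second       # a new list: [1, 2, 3, 4, 5]
repeated = second * 2         # a new list: [4, 5, 4, 5]
has_two = 2 in first          # True
middle = joined[1:4]          # [2, 3, 4]

print(joined)
print(repeated)
print(middle)

# Neither + nor * modifies the operands
print(first)
print(second)
```

Note that concatenation and replication both build new lists, which is why first and second print unchanged at the end.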
Example. range() returns a list of integers in a given range:
>>> numbers = range(5)
>>> print numbers
[0, 1, 2, 3, 4]
>>> numbers = range(0, 5)
>>> print numbers
[0, 1, 2, 3, 4]
>>> numbers = range(2, 4)
>>> print numbers
[2, 3]
>>> numbers = range(4, 2) # the range is backwards!
>>> print numbers
[]
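The sessions above rely on Python 2, where range() returns a list directly. Under Python 3, range() returns a lazy range object instead, so the sketch below wraps it in list() to obtain the same lists on either version. This is a portability aside, not part of the original example.

```python
# list(range(...)) gives a real list on both Python 2 and Python 3
numbers = list(range(5))
print(numbers)          # [0, 1, 2, 3, 4]

numbers = list(range(2, 4))
print(numbers)          # [2, 3]

# A backwards range is empty, exactly as in the session above
backwards = list(range(4, 2))
print(backwards)        # []
print(len(backwards))   # 0
```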
Example. Just like with strings, you can extract an element or a range of elements from a list:
>>> numbers = range(10)
>>> first_element = numbers[0]
>>> print first_element
0
>>> last_element = numbers[-1]
>>> print last_element
9
>>> the_other_elements = numbers[1:-1]
>>> print the_other_elements
[1, 2, 3, 4, 5, 6, 7, 8]
Example. The assignment operator assigns a new value to an already existing list element. It is more or less the opposite of the extraction operator. A couple of examples:
>>> numbers = range(5)
>>> print numbers
[0, 1, 2, 3, 4]
>>> numbers[0] = "first"
>>> print numbers
["first", 1, 2, 3, 4]
>>> numbers[-1] = "last"
>>> print numbers
["first", 1, 2, 3, "last"]
>>> numbers[len(numbers) // 2] = "middle"
>>> print numbers
["first", 1, "middle", 3, "last"]
If the index is out-of-bounds, the assignment raises an error:
>>> numbers = range(5)
>>> numbers[100] = "out-of-bounds"
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
IndexError: list assignment index out of range
Warning
The assignment operator does not change the length of the list!
It modifies an existing element (at a given position); it does not add a new element.
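To make the warning concrete, this small sketch checks the length before and after an assignment and contrasts it with append() (described in the Methods section), which does grow the list. The names and values are illustrative only.

```python
numbers = [0, 1, 2]
length_before = len(numbers)

numbers[1] = "one"              # overwrite an existing element
length_after_assignment = len(numbers)

numbers.append(3)               # add a brand new element at the end
length_after_append = len(numbers)

print(numbers)                  # [0, 'one', 2, 3]
print(length_before)            # 3
print(length_after_assignment)  # 3
print(length_after_append)      # 4
```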
Warning
Lists are ordered: the order of the elements matters:
[1, 2, 3] != [3, 2, 1]
Lists are not sets: objects may appear more than once:
[3, 3, "a", "a"] != [3, "a"]
Exercises
Create an empty list using the bracket notation. Check whether it is really empty using len().
Create a list with the first five non-negative integers using range().
Create a list with one hundred 0 elements.
Hint: note the replication operator.
Given:
list_1 = range(10)
list_2 = range(10, 20)
concatenate the two lists, and assign the result to a new variable full_list. Use the equality comparison operator == to check whether it matches the result of range(20).
Create a list of three strings: "I am", "a", "list". Then print the type and length of the three elements (manually, one by one).
Given the list:
list = [0.0, "b", [3], [4, 5]]
- How long is it?
- What is the type of the first element of list?
- How long is the second element of list?
- How long is the third element of list?
- What is the value of the last element of list? How long is it?
- Does the list contain an element "b"?
- Does the list contain an element 4?
What is the difference between these “lists”?:
list_1 = [1, 2, 3]
list_2 = ["1", "2", "3"]
list_3 = "[1, 2, 3]"
Hint: is the third one actually a list?
Which of the following code fragments are wrong?
list = []
list = [}
list = [[]]
list = [1 2 3]
list = range(3), element = list[3]
list = range(3), element = list[-1]
list = range(3), sublist = list[0:2]
list = range(3), sublist = list[0:3]
list = range(3), sublist = list[0:-1]
list = range(3), list[2] = "two"
list = range(3), list[3] = "three"
list = range(3), list[-1] = "three"
list = range(3), list[1.2] = "one point two"
list = range(3), list[1] = ["protein1", "protein2"]
Given the list:
matrix = [
    [1, 2, 3],  # <-- 1st row
    [4, 5, 6],  # <-- 2nd row
    [7, 8, 9],  # <-- 3rd row
]
#    ^  ^  ^
#    |  |  |
#    |  |  +-- 3rd column
#    |  +----- 2nd column
#    +-------- 1st column
How can I:
- Extract the first row?
- Extract the second element of the first row?
- Sum all elements of the first row? (Perform the sum manually)
- Create a new list containing the elements of the second column? (Manually)
- Create a new list containing the elements of the diagonal? (Manually)
- Create a list by concatenating the first, second, and third rows?
Methods
Returns | Method | Meaning |
---|---|---|
None | list.append(object) | Add a new element at the end of the list |
None | list.extend(list) | Add several new elements at the end of the list |
None | list.insert(int,object) | Add a new element at some given position |
None | list.remove(object) | Remove the first occurrence of an element |
None | list.reverse() | Invert the order of the elements |
None | list.sort() | Sort the elements |
int | list.count(object) | Count the occurrences of an element |
Warning
All list methods (except count()):
- Modify the input list
- Do not have a return value (they return None)
In other words, they behave in exactly the opposite way to string methods!
As a consequence, if I do:
list = range(10)
print list
result = list.append(10)
print list
print result
the list is modified in the process and result will be None. The same is true for all other methods (except count()).
This may look a bit surprising, and it is easy to mistakenly expect append() or reverse() to return a list. They do not.
By the way, this is the reason why we cannot write code like:
list = []
list.append(1).append(2).append(3)
because the first append() does not return a list: it returns None, and None of course does not support the append() method. So the second append() always fails, and Python raises an error.
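The failure can be observed directly. The sketch below catches the error so that the script keeps running; the variable names are hypothetical, and numbers is used instead of list only to avoid hiding the built-in of the same name.

```python
numbers = []
error_name = None
try:
    # append() returns None, so the second call is really None.append(2)
    numbers.append(1).append(2)
except AttributeError as error:
    error_name = type(error).__name__

print(error_name)   # AttributeError
print(numbers)      # [1]: the first append() did run

# The working alternative: one call per statement
numbers.append(2)
numbers.append(3)
print(numbers)      # [1, 2, 3]
```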
Example. append() adds a new element at the end of a list:
list = range(10)
print list # [0, 1, 2, ..., 9]
print len(list) # 10
list.append(10)
print list # [0, 1, 2, ..., 9, 10]
print len(list) # 11
list.append(11)
print list # [0, 1, 2, ..., 9, 10, 11]
print len(list) # 12
Note how list changes in the process.
Example. extend() adds several new elements at the end of a list:
list = range(10)
list.extend(range(10,20))
print list # [0, 1, 2, ..., 19]
print len(list) # 20
Example. insert() adds a new element at an arbitrary position:
list = range(10)
print list # [0, 1, ..., 9]
print len(list) # 10
list.insert(2, "marker")
print list # [0, 1, "marker", 3, ..., 9]
print len(list) # 11
Example. Unlike append(), insert() and extend(), list concatenation does not modify the original list. You get a new list instead:
list_1 = range(0, 10)
list_2 = range(10, 20)
# let's use the concatenation operator
full_list = list_1 + list_2
print list_1, "+", list_2, "->", full_list
# now let's use extend() instead
full_list = list_1.extend(list_2)
print list_1
print list_2
print full_list
Note how with extend(), list_1 changes while full_list is None, as expected.
Example. remove() removes the first occurrence of a given value:
list = ["a", "list", "not", "a", "string"]
# ^^^ ^^^
list.remove("a")
print list # ["list", "not", "a", "string"]
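Two details are worth checking by hand: only one occurrence is removed per call, and asking remove() to delete a value that is not in the list raises a ValueError. A minimal sketch, where "tuple" is just a made-up value that is absent from the list:

```python
words = ["a", "list", "not", "a", "string"]

words.remove("a")
print(words)                   # ['list', 'not', 'a', 'string']
remaining = words.count("a")   # the second "a" is still there
print(remaining)               # 1

missing_raised = False
try:
    words.remove("tuple")      # "tuple" is not in the list
except ValueError:
    missing_raised = True
print(missing_raised)          # True
```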
Example. sort() and reverse() reorder the elements in the list:
list = [3, 2, 1, 5, 4]
list.reverse()
print list # [4, 5, 1, 2, 3]
list.sort()
print list # [1, 2, 3, 4, 5]
sort() also works with strings, which are ordered lexicographically by default. Check this out:
list = ["AC", "GT"]
print list
list.reverse()
print list # ["GT", "AC"]
list.sort()
print list # ["AC", "GT"]
Example. count() returns the number of occurrences of a value in a list:
list = ["a", "c", "g", "t", "a"]
num_a = list.count("a")
num_g = list.count("g")
print num_a, num_g # 2 1
Of course, count() does not modify the original list.
Warning
A technical note. Recall that lists are mutable, and that (like all variables) they contain references to objects, not the objects themselves.
So what? Here are a few examples showing the consequences of this observation.
Example. This code:
sublist = range(5)
list = [sublist]
print list
creates a new list list that “contains” another list sublist. When I modify sublist (which is mutable), I end up modifying list as well:
sublist.append(5)
print sublist
print list
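One way to see that the outer list holds a reference to sublist rather than a copy is the is operator, which tests whether two names refer to the very same object. A sketch, using outer instead of list as the (hypothetical) variable name:

```python
sublist = list(range(5))
outer = [sublist]

# No copy was made: the element of outer IS sublist
same_object = outer[0] is sublist
print(same_object)   # True

sublist.append(5)
print(sublist)       # [0, 1, 2, 3, 4, 5]
print(outer)         # [[0, 1, 2, 3, 4, 5]]
```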
Example. Another interesting case:
list = range(5)
print list
not_a_copy = list
print not_a_copy
# here both list and not_a_copy end up referring to the same
# object, so if I change list, I also end up changing
# not_a_copy!
list.append(5)
print list
print not_a_copy
In order to create a real, independent copy of a list, I have to use the extraction operator (or list comprehension, as we will see), as follows:
list = range(5)
print list
real_copy = list[:]
# or: real_copy = [elem for elem in list]
print real_copy
list.append(5)
print list
print real_copy
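A caveat, sketched below under the assumption that the list itself contains other lists: the [:] operator copies only the outer list. The inner lists are still shared between the original and the copy. The data is made up for illustration.

```python
original = [[1, 2], [3, 4]]
shallow = original[:]

# The outer lists are independent...
original.append([5, 6])
print(len(original))   # 3
print(len(shallow))    # 2

# ...but the inner lists are the same objects
original[0].append("shared")
print(shallow[0])      # [1, 2, 'shared']
```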
Exercises
Create a new empty list list. Then add an integer, a string, and another list.
Starting from the list:
list = range(3)
(reset it after every bullet point!), what happens when I do:
- list.append(3)
- list.append([3])
- list.extend([3])
- list.extend(3)
- list.insert(0, 3)
- list.insert(3, 3)
- list.insert(3, [3])
- list.insert([3], 3)
What is the difference between:
list = []
list.append(range(10))
list.append(range(10, 20))
and:
list = []
list.extend(range(10))
list.extend(range(10, 20))
How long is list in the two cases?
What does this code do?:
list = [0, 0, 0, 0]
list.remove(0)
What does this code do?:
list = [1, 2, 3, 4, 5]
list.reverse()
list.sort()
Is it equivalent to the following code?:
list = [1, 2, 3, 4, 5]
list.reverse().sort()
Given the list:
list = range(10)
create a new list reversed_list with the same elements as list, but in reversed order, using reverse(). list must not be changed in the process.
Given the list:
motifs = [
    "KSYK",
    "SVALVV",
    "GVTGI",
    "VGSSLAEVLKLPD",
]
create a new list sorted_motifs with the same elements as motifs, but sorted alphanumerically using sort(). motifs must not be changed in the process.
String-List Methods
Returns | Method | Meaning |
---|---|---|
list-of-str | str.split(str) | Split a string into a list of strings (words) |
str | str.join(list-of-str) | Join a list of strings (words) into a string |
Example. We have a multi-line string taken from a PDB structure file:
structure_chain_a = """SER A 96 77.253 20.522 75.007
VAL A 97 76.066 22.304 71.921
PRO A 98 77.731 23.371 68.681
SER A 99 80.136 26.246 68.973
GLN A 100 79.039 29.534 67.364
LYS A 101 81.787 32.022 68.157"""
Let’s split the multi-line string into a list of single-line strings, for ease of processing:
lines = structure_chain_a.split("\n")
print lines[0]
print lines[1]
# ...
Now we can extract all the words of, say, the second line:
words = lines[1].split()
print words
It is now pretty easy to extract the coordinates of the residue:
coords = words[-3:]
print coords
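The coordinates extracted this way are still strings. If numbers are needed, each one can be converted with float(). A sketch for the second line of the structure, with the line retyped here (single spaces) so that the block is self-contained:

```python
line = "VAL A 97 76.066 22.304 71.921"
words = line.split()
coords = words[-3:]
print(coords)          # ['76.066', '22.304', '71.921']

# Convert the three strings to floats, one by one
x = float(coords[0])
y = float(coords[1])
z = float(coords[2])

# Now arithmetic works, which it would not on the strings
print(x + y + z)
```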
Example. join() is essentially the inverse operation:
list_of_strings = [
">1A3A:A|PDBID|CHAIN|SEQUENCE",
"MANLFKLGAENIFLGRKAATKEEAIRFA",
]
multiline_string = "\n".join(list_of_strings)
print multiline_string
It is not a particularly exciting method, but it can be useful for printing nicely formatted text.
Warning
Note that join() takes a list of strings! This won’t work:
" ".join([1, 2, 3])
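The failure, and one possible fix, can be sketched as follows. join() raises a TypeError when an element is not a string, and converting each element with str() beforehand avoids it. The conversion is spelled out element by element here purely for illustration.

```python
numbers = [1, 2, 3]

failed = False
try:
    " ".join(numbers)        # the elements are ints, not strings
except TypeError:
    failed = True
print(failed)                # True

# Convert every element to a string before joining
as_strings = [str(numbers[0]), str(numbers[1]), str(numbers[2])]
joined = " ".join(as_strings)
print(joined)                # 1 2 3
```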
Exercises
Given the text:
text = """The Wellcome Trust Sanger Institute is a world leader in genome research."""
create a list of strings that includes all of the words (i.e. substrings separated by spaces) of text. Then print how many words there are.
The table below:
tabella = [
    "protein | database | domain | start | end",
    "YNL275W | Pfam | PF00955 | 236 | 498",
    "YHR065C | SMART | SM00490 | 335 | 416",
    "YKL053C-A | Pfam | PF05254 | 5 | 72",
    "YOR349W | PANTHER | 353 | 414",
]
(taken from the Saccharomyces Genome Database) represents a list of domains that have been identified in a given yeast protein.
Each row (except the first) is a single domain instance.
Use split() to obtain the list of titles of the various columns (from the first row), making sure that the column names contain no spurious space characters.
Hint: strip() is not necessary; it is sufficient to use split() correctly.
Given the list of strings:
words = ["word_1", "word_2", "word_3"]
build, using join() together with an appropriate delimiter, the following strings:
"word_1 word_2 word_3"
"word_1,word_2,word_3"
"word_1 and word_2 and word_3"
"word_1word_2word_3"
r"word_1\word_2\word_3"
Given the list of strings:
random_sentences = [
    "Taci. Su le soglie",
    "del bosco non odo",
    "parole che dici",
    "umane; ma odo",
    "parole piu' nuove",
    "che parlano gocciole e foglie",
    "lontane."
]
use join() to create a new multi-line string poem. The expected result is:
>>> print poem
Taci. Su le soglie
del bosco non odo
parole che dici
umane; ma odo
parole piu' nuove
che parlano gocciole e foglie
lontane.
Hint: what delimiter should I use?
List Comprehension
The list comprehension operator lets you filter or transform a list.
Warning
The original list is left unchanged. A new list is created instead.
As a filter. Given an arbitrary list original, I can create a new list that only contains those elements of original that satisfy a given condition.
The abstract syntax is:
filtered = [element
for element in original
if condition(element)]
Here condition() is arbitrary. Let’s see a few examples.
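It may help to read the filter form as shorthand for an explicit loop. The sketch below builds the same list both ways; is_even() is a made-up condition, and the for loop is shown only for comparison.

```python
def is_even(n):
    """Hypothetical condition: True for even integers."""
    return n % 2 == 0

original = list(range(10))

# List comprehension used as a filter
filtered = [element
            for element in original
            if is_even(element)]

# Equivalent explicit loop
filtered_loop = []
for element in original:
    if is_even(element):
        filtered_loop.append(element)

print(filtered)                    # [0, 2, 4, 6, 8]
print(filtered == filtered_loop)   # True
print(len(original))               # 10: original is untouched
```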
Example. Let’s create a list with the even numbers in the range [0, 99]:
numbers = range(100)
even_numbers = [n
for n in numbers
if n % 2 == 0]
print even_numbers
Example. Given a list of DNA sequences:
sequences = ["ACTGG", "CCTGT", "ATTTA", "TATAGC"]
we keep only those sequences that contain at least one adenine:
sequences_with_a = [sequence
                    for sequence in sequences
                    if "A" in sequence]
print sequences_with_a
If we want only those with no adenine, we can invert the condition:
sequences_without_a = [sequence
                       for sequence in sequences
                       if "A" not in sequence]
print sequences_without_a
Example. When no condition is given, no filtering is performed:
list = range(5)
print list
list_2 = [element for element in list]
print list_2
The code above creates a copy of the original list list.
Example. This list describes a gene regulation network:
triples = [
["G1C2W9", "G1C2Q7", 0.2],
["G1C2W9", "G1C2Q4", 0.9],
["Q6NMS1", "G1C2W9", 0.8],
# ^^^^^^ ^^^^^^ ^^^
# gene1 gene2 correlation
]
Each “triple” has three elements: two A. thaliana genes, and a measure of the correlation of their expression (given by, say, a microarray experiment).
I can use a list comprehension to keep only the pairs of genes with high correlation:
high_correlation_genes = \
[triple[:-1] for triple in triples
if triple[-1] > 0.75]
I can also keep only those genes that are highly correlated with the "G1C2W9" gene:
threshold = 0.75
interesting_genes = \
[triple[0] for triple in triples
if triple[1] == "G1C2W9" and triple[-1] >= threshold] + \
[triple[1] for triple in triples
if triple[0] == "G1C2W9" and triple[-1] >= threshold]
Warning
The name of the “temporary” variable holding the current element (in the examples above, n, sequence and triple, respectively) is arbitrary.
This code:
list = range(10)
print [x for x in list if x > 5]
is exactly equivalent to this code:
list = range(10)
print [y for y in list if y > 5]
The name of the variable, x or y, does not make any difference. You are free to pick any name you like.
As a transformation. Given an arbitrary list original, I can also use a list comprehension to transform the elements in the list in some way.
The abstract syntax is:
transformed = [transform(element)
for element in original]
The transformation transform() is arbitrary.
Example. Given the list:
numbers = range(10)
let’s create a new list with their doubles:
doubles = [n * 2 for n in numbers]
# ^^^^^
# transformation
print doubles
Example. Given a list of paths:
paths = ["aatable", "fasta.1", "fasta.2"]
let’s add a prefix "data/" to each and every element:
prefixed_paths = ["data/" + path for path in paths]
# ^^^^^^^
# transformation
print prefixed_paths
Example. Given the list of primary sequences:
sequences = [
"MVLTIYPDELVQIVSDKIASNK",
"GKITLNQLWDIS",
"KYFDLSDKKVKQFVLSCVILKKDIE",
"VYCDGAITTKNVTDIIGDANHSYS",
]
let’s compute the length of each sequence, and save them in another list:
lengths = [len(seq) for seq in sequences]
print lengths
Example. Given the list of strings:
atoms = [
"SER A 96 77.253 20.522 75.007",
"VAL A 97 76.066 22.304 71.921",
"PRO A 98 77.731 23.371 68.681",
"SER A 99 80.136 26.246 68.973",
"GLN A 100 79.039 29.534 67.364",
"LYS A 101 81.787 32.022 68.157",
]
which represents (part of) the 3D structure of a protein chain, I want to compute a list of lists which should hold, for each residue (that is, for every row of atoms), its coordinates:
coords = [row.split()[-3:] for row in atoms]
The result is:
>>> print coords
[
["77.253", "20.522", "75.007"],
["76.066", "22.304", "71.921"],
["77.731", "23.371", "68.681"],
["80.136", "26.246", "68.973"],
["79.039", "29.534", "67.364"],
["81.787", "32.022", "68.157"],
]
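Note that the coordinates are still strings. If numeric values are needed, one option (a sketch, not part of the original example) is to nest a second comprehension that applies float() to each coordinate. Only the first two rows of atoms are repeated here to keep the block short.

```python
atoms = [
    "SER A 96 77.253 20.522 75.007",
    "VAL A 97 76.066 22.304 71.921",
]

# Outer comprehension: one entry per row
# Inner comprehension: convert the last three words to floats
coords = [[float(word) for word in row.split()[-3:]]
          for row in atoms]

print(coords[0])
print(coords[1])
```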
Jointly transforming and filtering. Given a list original, I can both transform and filter its elements jointly using the complete version of the list comprehension operator.
The abstract syntax is:
new_list = [transform(element)
for element in original
if condition(element)]
Example. Given the integers from 0 to 99, I want to keep only the even ones and divide them by 3:
result = [n / 3.0
for n in range(100)
if n % 2 == 0]
print result
Example. Given the list of strings:
atoms = [
"SER A 96 77.253 20.522 75.007",
"VAL A 97 76.066 22.304 71.921",
"PRO A 98 77.731 23.371 68.681",
"SER A 99 80.136 26.246 68.973",
"GLN A 100 79.039 29.534 67.364",
"LYS A 101 81.787 32.022 68.157",
]
we previously used:
coords = [row.split()[-3:] for row in atoms]
to obtain:
>>> print coords
[
["77.253", "20.522", "75.007"],
["76.066", "22.304", "71.921"],
["77.731", "23.371", "68.681"],
["80.136", "26.246", "68.973"],
["79.039", "29.534", "67.364"],
["81.787", "32.022", "68.157"],
]
We now make things more complex: we only want the coordinates of the serines. Let’s write:
coords = [row.split()[-3:] for row in atoms
if row.split()[0] == "SER"]
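Running this on the atoms list should keep only the two serine rows. The sketch below also shows a variant that stores the split words once per row, in a (hypothetical) intermediate list called rows, so that split() is not called twice. That is a design choice, not a requirement.

```python
atoms = [
    "SER A 96 77.253 20.522 75.007",
    "VAL A 97 76.066 22.304 71.921",
    "PRO A 98 77.731 23.371 68.681",
    "SER A 99 80.136 26.246 68.973",
    "GLN A 100 79.039 29.534 67.364",
    "LYS A 101 81.787 32.022 68.157",
]

# Version from the text: split() is called twice per row
coords = [row.split()[-3:] for row in atoms
          if row.split()[0] == "SER"]

# Variant: split each row once, then filter and transform
rows = [row.split() for row in atoms]
coords_once = [words[-3:] for words in rows
               if words[0] == "SER"]

print(coords)
print(coords == coords_once)   # True
```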
Exercises
Given the list:
list = range(100)
Create a new list list_plus_3 that holds the elements of list plus 3. The expected result is:
[3, 4, 5, ...]
Create a new list odds that holds only the odd elements in list. The expected result is:
[1, 3, 5, ...]
Hint: adapt one of the previous examples.
Create a new list opposites that holds the arithmetical opposites (the opposite of \(x\) is \(-x\)) of the elements in list. The expected result is:
[0, -1, -2, ...]
Create a new list inverses that holds the arithmetical inverses (the inverse of \(x\) is \(\frac{1}{x}\)) of the elements in list. Make sure to skip those elements that have no inverse (like 0). The expected result is:
[1.0, 0.5, 0.3333, ...]
Hint: skip = filter out.
Create a new list with only the first and last element of list. The expected result is:
[0, 99]
Hint: is a list comprehension required?
Create a new list with all elements of list, except the first and the last. The expected result is:
[1, 2, ..., 97, 98]
Count how many odd numbers there are in list. There should be 50.
Hint: use a list comprehension plus... something else.
Create a new list holding all elements of list, divided by 5. The expected result is:
[0.0, 0.2, 0.4, ...]
Hint: make sure that the results are float!
Create a new list holding only the multiples of 5 appearing in list; the multiples should be divided by 5. The expected result is:
[0.0, 1.0, 2.0, ..., 19.0]
Create a new list list_of_strings containing the same elements as list, but converted into strings. The expected result is:
["0", "1", "2", ...]
Count how many strings in list_of_strings represent an odd number. The expected result is, again, 50.
Create a string that contains all the elements in list, separated by spaces. The expected result is:
"0 1 2 ..."
Hint: use a list comprehension plus... something else.
For each of the following bullet points, write two list comprehensions to convert from list_1 to list_2 and vice versa.
- list_1 = [1, 2, 3]
  list_2 = ["1", "2", "3"]
- list_1 = ["name", "surname", "age"]
  list_2 = [["name"], ["surname"], ["age"]]
- list_1 = ["ACTC", "TTTGGG", "CT"]
  list_2 = [["actc", 4], ["tttggg", 6], ["ct", 2]]
Given the list:
list = range(10)
which of the following code fragments are correct? What do they compute?
[x for x in list]
[y for y in list]
[y for x in list]
["x" for x in list]
[str(x) for x in list]
[x for str(x) in list]
[x + 1 for x in list]
[x + 1 for x in list if x == 2]
Let’s consider the string:
clusters = """\
>Cluster 0
0 >YLR106C at 100.00%
>Cluster 50
0 >YPL082C at 100.00%
>Cluster 54
0 >YHL009W-A at 90.80%
1 >YHL009W-B at 100.00%
2 >YJL113W at 98.77%
3 >YJL114W at 97.35%
>Cluster 52
0 >YBR208C at 100.00%
"""
extracted from the output of a clustering algorithm (CD-HIT) applied to the S. cerevisiae genome (taken from SGD).
clusters encodes information about protein clusters; the proteins were clustered together based on their sequence similarity. (The details are not important.)
A cluster begins with the line:
>Cluster N
where N is the cluster ID. The contents of the cluster are given in the following lines. For instance:
>Cluster 54
0 >YHL009W-A at 90.80%
1 >YHL009W-B at 100.00%
2 >YJL113W at 98.77%
3 >YJL114W at 97.35%
represents a cluster (with ID 54) of four sequences. Four proteins belong to the cluster: protein "YHL009W-A" (with 90.80% similarity to the cluster center), protein "YHL009W-B" (with 100.00% similarity), etc.
Given the string clusters, use a list comprehension (together with other operations on strings, such as split()) for:
- Extracting the IDs of the various clusters. The expected result is:
>>> print cluster_ids
["0", "50", "54", "52"]
- Extracting the names of all proteins, including duplicates. The expected result is:
>>> print protein_names
["YLR106C", "YPL082C", "YHL009W-A", ...]
- Extracting all protein-similarity pairs for all proteins. The expected result is:
>>> print protein_similarity_pairs
[["YLR106C", 100.0], ["YPL082C", 100.0], ["YHL009W-A", 90.8], # ... ]
Given the \(3\times 3\) matrix (list of lists):
matrix = [range(0,3), range(3,6), range(6,9)]
- Create a matrix upside_down containing all rows of matrix from bottom to top.
- (Hard.) Create a matrix palindrome containing all columns of matrix from right to left.
- (Hard.) Re-create matrix from scratch using a list comprehension.
(Hard.) Given the list:
list = range(100)
Create a list squares containing the squares of all the elements in list. The expected result is:
[0, 1, 4, 9, ...]
Next, create a list difference_of_squares containing, in every position i, the value of:
squares[i+1] - squares[i]
making sure to avoid the case i == len(list) - 1 (because in that case squares[i+1] is undefined).
Feel free to use auxiliary variables as required.
(Blame this page for inspiring this exercise.)
Given the following list of mouse gene symbols:
mouse_genes = ["Fus", "Tdp43", "Sod1", "Ighmbp2", "Srsf2"]
- Sort the list alphabetically.
- In the sorted list, convert mouse symbols into human gene symbols.
Hint: in human gene symbols all letters are in upper-case.