Python: Complex Statements
Conditional code: if
The if/elif/else statements let you write code that is executed if and only if some condition is satisfied.
For instance:
    if condition:
        print "the condition is True"
executes the print statement if and only if condition evaluates to True.
We can also have multiple mutually exclusive alternatives:
    if condition:
        print "the condition is True"
    else:
        print "the condition is not True"
Here only one branch is executed, based on the value of condition.
The same is true here:
    if condition1:
        print "condition1 is True"
    elif condition2:
        print "condition1 is not True"
        print "but condition2 is!"
    elif condition3:
        print "condition1 and condition2 are not True"
        print "but condition3 is!"
    else:
        print "no condition is True"
The if, elif and else form a “chain”: only one of the branches is executed.
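As a concrete illustration of such a chain, here is a minimal sketch. The function name classify_temperature and its thresholds are hypothetical, chosen only for this example. print is called with parentheses and a single argument so that the sketch runs under both Python 2 and Python 3.

```python
def classify_temperature(celsius):
    """Return a label for a temperature; exactly one branch is taken."""
    if celsius < 0:
        label = "freezing"
    elif celsius < 15:
        # tested only if the first condition was False
        label = "cold"
    elif celsius < 25:
        # tested only if both previous conditions were False
        label = "mild"
    else:
        # reached only if no condition above was True
        label = "hot"
    return label

print(classify_temperature(-5))  # freezing
print(classify_temperature(10))  # cold
print(classify_temperature(20))  # mild
print(classify_temperature(30))  # hot
```

Note that 10 is also smaller than 25, yet only "cold" is returned: once a branch of the chain is taken, the remaining conditions are not even tested.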
Example. Suppose we have two Booleans c1 and c2. Let’s see in detail which lines are executed based on the value of the two variables:
    #                  c1   c2   | c1   c2    | c1    c2   | c1    c2
    #                  True True | True False | False True | False False
    #                  ----------+------------+------------+------------
    print "begin"    # yes       | yes        | yes        | yes
    if c1:           # yes       | yes        | yes        | yes
        print "1"    # yes       | yes        | no         | no
    elif c2:         # no        | no         | yes        | yes
        print "2"    # no        | no         | yes        | no
    else:            # no        | no         | no         | yes
        print "none" # no        | no         | no         | yes
    print "end "     # yes       | yes        | yes        | yes
Let’s break the above into pieces:
- if c1 is True, then the value of c2 does not matter: neither the elif c2 nor the else is executed.
- if c1 is False, c2 decides whether the elif c2 is executed or the else is.
In other words, c1 and c2 are tested sequentially: c2 is tested only if c1 turned out to be False.
Assume that, instead, we want to print "2" independently of whether c1 is True or not. The simplest way to do that is to replace the elif and the else with independent if statements:
    print "begin"
    if c1:
        print "1"
    if c2:
        print "2"
    if not c1 and not c2:
        print "0"
Here the if statements do not form a chain anymore: they are independent of one another!
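The difference between the two versions can be checked directly. The sketch below wraps each version in a function (the names chained and independent are hypothetical, introduced only for this comparison) that collects what would be printed instead of printing it. print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
def chained(c1, c2):
    """Collect what the if/elif/else chain would print."""
    printed = []
    if c1:
        printed.append("1")
    elif c2:
        printed.append("2")
    else:
        printed.append("none")
    return printed

def independent(c1, c2):
    """Collect what the three independent if statements would print."""
    printed = []
    if c1:
        printed.append("1")
    if c2:
        printed.append("2")
    if not c1 and not c2:
        printed.append("0")
    return printed

# the two versions only disagree when both conditions are True
print(chained(True, True))      # ['1']
print(independent(True, True))  # ['1', '2']
```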
Example. Python uses the indentation to decide which code is “inside” the if and which code is “outside”.
Let’s write a short program to check whether the user is a mentalist:
    print "I am thinking of a number between 1 and 10"
    is_mentalist = int(raw_input("which one? ")) == 72
    print "Computing..."
    if is_mentalist:
        print "CONGRATULATIONS!!!"
        print "you are a mentalist"
    else:
        print "thanks for playing"
        print "better luck next time"
    print "done"
In this example, the print statements that are indented after the if and the else are “inside”: they are conditional on is_mentalist.
The other print statements are “outside”: they are executed unconditionally.
Example. This code opens a file and checks whether it is a “valid” FASTA file. In order to do so, it checks whether (1) the file is empty, and (2) the file contains lines that start with the ">" character:
    lines = open("data/prot-fasta/1A3A.fasta").readlines()
    if len(lines) == 0:
        print "the file is empty"
    else:
        first_characters = [line[0] for line in lines]
        if ">" not in first_characters:
            print "not a fasta file"
        else:
            print "a fasta file"
    print "done"
Quiz:
- Can the code print both that the file is empty and that the file is valid?
- Can the code not print "done"?
- If the file is actually empty, what is the value of the variable first_characters at the end of the execution?
- Can I simplify the code by using an elif statement?
Iterative code: for
The for statement lets you write code that is executed multiple times.
In particular, the code inside the for is executed once for each and every element in a collection (e.g. string, list, tuple, dictionary).
The abstract syntax is:
    collection = range(10) # a list, for instance
    for element in collection:
        body(element)
This for iterates over all the elements element in the collection and executes the body(element) block on each of them.
Just like for list comprehensions, the element variable is defined by the for loop. At the Nth iteration, element will refer to the Nth element of collection.
The flow of the execution can be modified with the break and continue statements, see below for details.
Warning
If collection is a:
- str, the for iterates over the characters.
- list, the for iterates over the elements.
- tuple, the for iterates over the elements.
- dict, the for iterates over the keys.
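The four cases listed in the warning can be checked with a short sketch. The collections below are toy values chosen for illustration, and the seen_in_* names are hypothetical; print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
# each loop records what the loop variable refers to at every iteration

seen_in_str = []
for character in "ACG":
    seen_in_str.append(character)

seen_in_list = []
for element in [10, 20, 30]:
    seen_in_list.append(element)

seen_in_tuple = []
for element in ("a", 1):
    seen_in_tuple.append(element)

protein_start = {"YNL275W": 236}
seen_in_dict = []
for key in protein_start:
    # only the key is provided; the value is protein_start[key]
    seen_in_dict.append(key)

print(seen_in_str)    # ['A', 'C', 'G']
print(seen_in_list)   # [10, 20, 30]
print(seen_in_tuple)  # ['a', 1]
print(seen_in_dict)   # ['YNL275W']
```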
Example. This for:
    l = [1, 25, 6, 27, 57, 12]
    for number in l:
        print number
iterates over all elements of l, from beginning to end. At each iteration, the value of number changes (as shown by the print).
The above for is equivalent to this code:
    number = l[0]
    print number
    number = l[1]
    print number
    number = l[2]
    print number
    # etc.
except that it is a lot shorter!
Example. Let’s compute the sum of all elements of the previous list. We can do that by modifying the for as follows:
    l = [1, 25, 6, 27, 57, 12]
    s = 0
    for number in l:
        s = s + number # equiv. s += number
    print "the sum is", s
Here s plays the role of a support variable. It is initialized to 0 just before the loop. Then, each number in l is added, in turn, to s.
By the end of the for, s holds the sum of all the elements of l.
The above code is equivalent to:
    s = 0
    number = l[0]
    s += number
    number = l[1]
    s += number
    # etc.
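One way to convince ourselves that the loop is correct is to compare its result with Python's built-in sum function. This is only a sanity check, not part of the original example; print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
l = [1, 25, 6, 27, 57, 12]

# accumulate the total by hand, one element per iteration
s = 0
for number in l:
    s += number

print(s)            # 128
print(s == sum(l))  # True
```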
Example. Now let’s find the largest element in the list. The idea is:
- We use a support variable largest_so_far that always (at all iterations) holds the largest element found so far. It is initialized to some sensible value.
- We use a for to iterate over all elements of the list.
- If the current element is smaller than or equal to largest_so_far, the latter is left untouched.
- Otherwise, largest_so_far is updated to reference the current element.
Once the for is done, i.e. after iterating over the very last element of the list, largest_so_far will hold the largest element in the list.
Let’s write:
    l = [1, 25, 6, 27, 57, 12]
    # l[0] is a sensible initial value
    largest_so_far = l[0]
    for number in l[1:]:
        if number > largest_so_far:
            largest_so_far = number
    print "the maximum is", largest_so_far
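The same loop can be wrapped in a function and cross-checked against the built-in max function. The name largest_element is hypothetical, and the sketch assumes a non-empty list (with an empty list, values[0] would fail). print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
def largest_element(values):
    """Return the largest element of a non-empty list."""
    # the first element is a sensible initial value
    largest_so_far = values[0]
    for value in values[1:]:
        if value > largest_so_far:
            largest_so_far = value
    return largest_so_far

l = [1, 25, 6, 27, 57, 12]
print(largest_element(l))            # 57
print(largest_element(l) == max(l))  # True
```

Initializing with values[0] rather than with 0 matters: with a list of negative numbers, an initial value of 0 would never be replaced.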
Example. Given the following table (list of strings):
    table = [
        "protein domain start end",
        "YNL275W PF00955 236 498",
        "YHR065C SM00490 335 416",
        "YKL053C-A PF05254 5 72",
        "YOR349W PANTHER 353 414",
    ]
I want to convert it to a dictionary like this:
    data = {
        "YNL275W": ("PF00955", 236, 498),
        "YHR065C": ("SM00490", 335, 416),
        "YKL053C-A": ("PF05254", 5, 72),
        "YOR349W": ("PANTHER", 353, 414)
    }
The keys are taken from the first column, while the values are the remaining columns. Let’s write:
    # the dictionary is initially empty
    data = {}
    # for each line in the table (except the header)
    for line in table[1:]:
        words = line.split()
        protein = words[0]
        domain = words[1]
        pos0 = int(words[2])
        pos1 = int(words[3])
        # update the dictionary
        data[protein] = (domain, pos0, pos1)
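Once the dictionary is built, each value can be unpacked back into its three fields. The sketch below repeats the conversion inside a function (the name parse_table is hypothetical) and then looks up one protein; print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
def parse_table(table):
    """Map each protein to a (domain, start, end) tuple, skipping the header."""
    data = {}
    for line in table[1:]:
        words = line.split()
        data[words[0]] = (words[1], int(words[2]), int(words[3]))
    return data

table = [
    "protein domain start end",
    "YNL275W PF00955 236 498",
    "YHR065C SM00490 335 416",
    "YKL053C-A PF05254 5 72",
    "YOR349W PANTHER 353 414",
]

data = parse_table(table)

# unpack the tuple associated with one protein
domain, start, end = data["YHR065C"]
print(domain)       # SM00490
print(end - start)  # 81
```

Converting the positions with int is what makes the subtraction possible: the words produced by split are strings.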
Example. The break statement lets you interrupt the for. For instance:
    path = raw_input("write a path to a file: ")
    lines = open(path).readlines()
    for line in lines:
        line = line.strip()
        print "processing:", line
        # if the line is "STOP", we break out of the
        # for loop: the remaining lines are not
        # processed
        if line == "STOP":
            break
    # <--- when Python encounters the break statement,
    #      it "jumps" here
This code reads a text file and prints each line on screen. However, as soon as it finds a "STOP" line, it executes the break, which exits the for loop.
All the lines coming after the "STOP" line are not processed.
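To try the break without preparing a file, the same loop can be run over a list of strings. In this sketch the function name process_until_stop and the example lines are hypothetical stand-ins for the contents of a real file; print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
def process_until_stop(lines):
    """Return the lines processed before the loop is interrupted."""
    processed = []
    for line in lines:
        line = line.strip()
        processed.append(line)
        # the "STOP" line itself is processed; the lines after it are not
        if line == "STOP":
            break
    return processed

example_lines = ["first\n", "second\n", "STOP\n", "never reached\n"]
print(process_until_stop(example_lines))  # ['first', 'second', 'STOP']
```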
Example. The continue statement lets you jump to the next iteration, skipping the remainder of the code in the current iteration. For instance:
    path = raw_input("write a path to a file: ")
    lines = open(path).readlines()
    for line in lines:
        line = line.strip()
        print "processing:", line
        if line == "CONTINUE":
            continue
        # <--- if the continue is executed, the code from here...
        print "this is not a CONTINUE line"
        # <--- ... to here is not executed
This code reads a user-provided text file. It prints every line in turn. If the line is "CONTINUE", the continue statement skips over the second print.
The for loop then resumes from the next line.
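The effect of continue can also be observed on a list of strings instead of a file. The function name lines_reaching_second_print and the example lines below are hypothetical; the function collects the lines for which the second print would have been executed. print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
def lines_reaching_second_print(lines):
    """Return the lines that are not skipped by the continue."""
    reached = []
    for line in lines:
        line = line.strip()
        if line == "CONTINUE":
            # skip the rest of this iteration, move on to the next line
            continue
        reached.append(line)
    return reached

example_lines = ["first\n", "CONTINUE\n", "last\n"]
print(lines_reaching_second_print(example_lines))  # ['first', 'last']
```

Unlike break, continue does not end the loop: the line after "CONTINUE" is still processed.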
Iterative code: while
The while statement lets you write code that repeats as long as a certain condition is true. The condition is checked before each iteration, and the while stops iterating the first time it finds that the condition is not true anymore.
The abstract syntax is:
    while condition:
        do_stuff()
        condition = check_condition()
As with the for, the break and continue statements can be used to modify the flow of the execution.
Note
The big difference between the for and while statements is:
- for element in collection: executes N times, where N is the length of collection.
- while condition: executes an indefinite number of times, that is, as long as the condition is true.
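The difference can be made concrete with a small sketch. The for below runs exactly len(collection) times, while the number of iterations of the while depends on the starting value. The function name count_halvings is hypothetical, chosen only for this illustration; print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
# for: the number of iterations is fixed by the collection
collection = ["a", "b", "c"]
iterations = 0
for element in collection:
    iterations += 1
print(iterations == len(collection))  # True

# while: the number of iterations depends on the data
def count_halvings(value):
    """Count the halvings needed to bring value below 1."""
    steps = 0
    while value >= 1:
        value = value / 2.0
        steps += 1
    return steps

print(count_halvings(10))   # 4
print(count_halvings(0.5))  # 0
```

In the last call the condition is already false at the first check, so the body of the while is never executed.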
Example. The while statement is useful when the value of condition cannot be known beforehand, for instance when interacting with a user.
Let’s write a while that asks the user whether she wants to stop, and keeps asking as long as the user does not reply "yes":
    while raw_input("do you want me to stop? ") != "yes":
        print "Then I'll keep going!"
Example. Let’s see another simple example with a break:
    while True: # this is an infinite while!
        ans = raw_input("what is the capital of Italy? ")
        if ans.lower() == "rome":
            print "correct"
            break
        print "try again"
    # <--- the break jumps here
    print "done"
I cannot really do the same with a for loop!
Let’s make the code ask the user whether she actually wants to retry:
    while True:
        ans = raw_input("what is the capital of Italy? ")
        if ans.lower() == "rome":
            print "correct"
            break
        ans = raw_input("try again? ")
        if ans.lower() == "no":
            print "all right"
            break
Nested code
Now that we know what if, for and while do, we can combine them in arbitrary ways by properly nesting (that is, indenting) the statements.
Example. Let’s write a simulator of a two-hand clock (hours and minutes):
    for h in range(24):
        for m in range(60):
            print "time =", h, ":", m
Here the external for iterates over the 24 hours; for each hour, the inner for iterates over the 60 minutes.
Every time the internal for completes, the external for completes one iteration.
Let’s extend the simulator to an hours-minutes-seconds clock:
    for h in range(24):
        for m in range(60):
            for s in range(60):
                print "time =", h, ":", m, ":", s
Of course, it is possible to take days into consideration by adding one more external loop that iterates over range(1, 366).
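To see how many times the body of a nested loop runs, the two-hand clock can collect its output in a list instead of printing it. The list name times and the zero-padded "%02d:%02d" format are additions made for this sketch, not part of the original simulator; print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
times = []
for h in range(24):
    for m in range(60):
        # %02d pads each number with a leading zero, as on a digital clock
        times.append("%02d:%02d" % (h, m))

print(len(times))  # 1440
print(times[0])    # 00:00
print(times[61])   # 01:01
print(times[-1])   # 23:59
```

The body of the inner for is executed 24 * 60 = 1440 times: once for every combination of an hour and a minute.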
Example. I want to check whether a list contains repeated elements and, if it does, what their positions are. Starting from:
    numbers = [5, 9, 4, 4, 9, 2]
we can use two nested for statements to iterate over the pairs of elements of numbers.
For every element (let’s say the one in position i), I want to check whether any of the following elements (those in positions i+1 to len(numbers) - 1) match.
A picture is worth a thousand words:
    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
      ^
      i
         \__________________/
         the possible positions of the 2nd element

    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
          ^           ^
          i         MATCH!
             \______________/
             the possible positions of the 2nd element

    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
              ^   ^
              i MATCH!
                 \__________/
                 the possible positions of the 2nd element
Let’s write:
    matches = []
    for i in range(len(numbers)):
        # the number at position i
        number_at_i = numbers[i]
        for j in range(i + 1, len(numbers)):
            # the number at position j
            number_at_j = numbers[j]
            # do they match?
            if number_at_i == number_at_j:
                # they do! let's store their
                # positions
                matches.append((i, j))
    print matches
Let’s verify whether matches actually identifies pairs of identical elements:
    for pair in matches:
        number_at_i = numbers[pair[0]]
        number_at_j = numbers[pair[1]]
        print number_at_i == number_at_j
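Starting the inner loop at i + 1 means that every pair of positions is compared exactly once. The sketch below wraps the search in a function that also counts the comparisons; the name find_matches and the comparisons counter are hypothetical additions. print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
def find_matches(numbers):
    """Return the pairs of positions holding equal elements,
    together with the number of comparisons performed."""
    matches = []
    comparisons = 0
    for i in range(len(numbers)):
        # j starts after i, so a pair (i, j) is never visited as (j, i)
        for j in range(i + 1, len(numbers)):
            comparisons += 1
            if numbers[i] == numbers[j]:
                matches.append((i, j))
    return matches, comparisons

matches, comparisons = find_matches([5, 9, 4, 4, 9, 2])
print(matches)      # [(1, 4), (2, 3)]
print(comparisons)  # 15
```

For a list of 6 elements there are 5 + 4 + 3 + 2 + 1 = 15 pairs, which is the number of comparisons reported.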
Example. Given the contents of a FASTA file:
    >>> lines = open("data/prot-fasta/3J01.fasta").readlines()
    >>> print lines
    [
        ">3J01:0|PDBID|CHAIN|SEQUENCE",
        "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
        ">3J01:1|PDBID|CHAIN|SEQUENCE",
        "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
        ">3J01:2|PDBID|CHAIN|SEQUENCE",
        "MKRTFQPSVLKRNRSHGFRARMATKNGRQVLARRRAKGRARLTVSK",
        ">3J01:3|PDBID|CHAIN|SEQUENCE",
        # ...
    ]
I want to convert lines into a dictionary that maps from each header (key) to the corresponding sequence (value). Let’s write:
    sequence_of = {}
    for line in lines:
        # remove newlines and spaces around the line
        line = line.strip()
        if line.startswith(">"):
            # this is a header, store it for later use
            header = line
        else:
            # this is a sequence
            sequence = line
            # now let's use the header we read at the
            # *previous* iteration and the sequence we
            # got at the *current* iteration to update
            # the dictionary
            sequence_of[header] = sequence
    # we are done; print the dictionary
    print sequence_of
This code works as long as the sequences only span one line. However, this is not the case for the FASTA file we have. Looking closer, we see that lines includes these lines:
    lines = [
        # ...
        ">3J01:5|PDBID|CHAIN|SEQUENCE",
        "MAKLTKRMRVIREKVDATKQYDINEAIALLKELATAKFVESVDVAVNLGIDARKSDQNVRGATVLPHGTGRSVRVAVFTQ",
        "GANAEAAKAAGAELVGMEDLADQIKKGEMNFDVVIASPDAMRVVGQLGQVLGPRGLMPNPKVGTVTPNVAEAVKNAKAGQ",
        "VRYRNDKNGIIHTTIGKVDFDADKLKENLEALLVALKKAKPTQAKGVYIKKVSISTTMGAGVAVDQAGLSASVN",
        # ...
    ]
So there is one header, then a multi-line sequence.
Unfortunately with the above code, the first line of the sequence is overwritten by the second line of the sequence, which is then overwritten by the third line of the sequence. In other words, only the last line of a multi-line sequence makes it to the dictionary.
Let’s fix the code:
    sequence_of = {}
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line
        else:
            sequence = line
            # the first time we see a sequence line for
            # this header, we associate the header to an
            # empty string
            if header not in sequence_of:
                sequence_of[header] = ""
            # now we take whatever sequence is associated
            # to the header and concatenate it with the
            # current line
            old_sequence = sequence_of[header]
            new_sequence = old_sequence + sequence
            sequence_of[header] = new_sequence
A shorter version:
    sequence_of = {}
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line
        else:
            if header not in sequence_of:
                sequence_of[header] = line
            else:
                sequence_of[header] += line
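A possible variant of the same idea, sketched under assumptions: the dictionary entry is created as soon as the header is read, so the membership test is no longer needed. The function name parse_fasta is hypothetical, and toy_lines is made-up toy data (not taken from 3J01) used only to exercise the multi-line case. Unlike the code above, this variant also stores headers that are followed by no sequence line. print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
def parse_fasta(lines):
    """Map each header to its (possibly multi-line) sequence."""
    sequence_of = {}
    header = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # new header: start from an empty sequence
            header = line
            sequence_of[header] = ""
        elif header is not None:
            # sequence line: append it to the current header's sequence
            sequence_of[header] += line
    return sequence_of

# toy data: one two-line sequence and one single-line sequence
toy_lines = [">seq1\n", "AAAA\n", "CCCC\n", ">seq2\n", "GG\n"]
sequence_of = parse_fasta(toy_lines)
print(sequence_of[">seq1"])  # AAAACCCC
print(sequence_of[">seq2"])  # GG
```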
Example. Same setup as before. Some anonymous jester wrote the FASTA file wrong: sequences come before the corresponding headers. For instance:
    wrong_fasta = [
        # first sequence
        "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
        # first header
        ">3J01:0|PDBID|CHAIN|SEQUENCE",
        # second sequence
        "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
        # second header
        ">3J01:1|PDBID|CHAIN|SEQUENCE",
    ]
Our code of course relies on the header coming before the sequence, so it does not work in this case. How do we make it work again?
We have to rewrite the code based on the fact that the header is not known when we get the sequence. It is true however that we know the sequence when we get the header.
Let’s write:
    sequence_of = {}
    # this variable is used to hold the multi-line
    # sequence we have seen "so far"
    # it is initialized with an empty string, because
    # we have not seen any sequence yet!
    latest_sequence_seen = ""
    for line in wrong_fasta:
        line = line.strip()
        if line.startswith(">"):
            # this is a header line. at this point the
            # sequence is known, and we can update the
            # dictionary
            sequence_of[line] = latest_sequence_seen
            # reset the latest sequence seen (so as not to
            # mix the sequences of different proteins/genes)
            latest_sequence_seen = ""
        else:
            # this is a sequence line. we do not know
            # the header yet. let's just add this line
            # to the latest sequence seen
            latest_sequence_seen += line
    print sequence_of
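The approach can be checked on the wrong_fasta list shown earlier. In this sketch the loop is wrapped in a function whose name, parse_reversed_fasta, is hypothetical; print is called with parentheses and a single argument so that the code runs under both Python 2 and Python 3.

```python
def parse_reversed_fasta(lines):
    """Map each header to the sequence lines that come *before* it."""
    sequence_of = {}
    latest_sequence_seen = ""
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # the sequence accumulated so far belongs to this header
            sequence_of[line] = latest_sequence_seen
            latest_sequence_seen = ""
        else:
            latest_sequence_seen += line
    return sequence_of

wrong_fasta = [
    "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
    ">3J01:0|PDBID|CHAIN|SEQUENCE",
    "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
    ">3J01:1|PDBID|CHAIN|SEQUENCE",
]

sequence_of = parse_reversed_fasta(wrong_fasta)
print(sequence_of[">3J01:0|PDBID|CHAIN|SEQUENCE"] == wrong_fasta[0])  # True
print(sequence_of[">3J01:1|PDBID|CHAIN|SEQUENCE"] == wrong_fasta[2])  # True
```

One consequence of this design: sequence lines that come after the last header are accumulated but never stored, since no header follows them.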