==========================
Python: Complex Statements
==========================

Conditional code: ``if``
------------------------

The ``if``/``elif``/``else`` statements let you write code that is executed
**if and only if** some condition is satisfied. For instance::

    if condition:
        print "the condition is True"

executes the ``print`` statement if and only if ``condition`` evaluates to
``True``.

We can also have multiple *mutually exclusive* alternatives::

    if condition:
        print "the condition is True"
    else:
        print "the condition is not True"

Here only one *branch* is executed, based on the value of ``condition``. The
same is true here::

    if condition1:
        print "condition1 is True"
    elif condition2:
        print "condition1 is not True"
        print "but condition2 is!"
    elif condition3:
        print "condition1 and condition2 are not True"
        print "but condition3 is!"
    else:
        print "no condition is True"

The ``if``, ``elif`` and ``else`` form a "chain": only one of the branches is
executed.

****

**Example**. Suppose we have two Booleans ``c1`` and ``c2``. Let's see in
detail which lines are executed based on the value of the two variables::

                       #  c1   c2  |  c1   c2   |  c1    c2  |  c1    c2
                       # True True | True False | False True | False False
                       # ----------+------------+------------+------------
    print "begin"      #    yes    |    yes     |    yes     |    yes
    if c1:             #    yes    |    yes     |    yes     |    yes
        print "1"      #    yes    |    yes     |    no      |    no
    elif c2:           #    no     |    no      |    yes     |    yes
        print "2"      #    no     |    no      |    yes     |    no
    else:              #    no     |    no      |    no      |    yes
        print "none"   #    no     |    no      |    no      |    yes
    print "end"        #    yes    |    yes     |    yes     |    yes

Let's break the above into pieces:

- if ``c1`` is ``True``, then the value of ``c2`` does not matter: neither the
  ``elif c2`` nor the ``else`` is executed.

- if ``c1`` is ``False``, ``c2`` decides whether the ``elif c2`` branch or the
  ``else`` branch is executed.

In other words, ``c1`` and ``c2`` are evaluated sequentially.

|

Assume that, instead, we want to print ``"2"`` independently of whether ``c1``
is ``True`` or not.
The only way to do that is to drop the ``elif`` and use separate ``if``
statements::

    print "begin"
    if c1:
        print "1"
    if c2:
        print "2"
    if not c1 and not c2:
        print "0"

Here the ``if`` statements do not form a chain anymore: they are *independent*
of one another!

****

**Example**. Python uses the *indentation* to decide which code is "inside" the
``if`` and which code is "outside".

Let's write a short program to check whether the user is a mentalist::

    print "I am thinking of a number between 1 and 10"
    is_mentalist = int(raw_input("which one? ")) == 7

    print "Computing..."

    if is_mentalist:
        print "CONGRATULATIONS!!!"
        print "you are a mentalist"
    else:
        print "thanks for playing"
        print "better luck next time"

    print "done"

In this example, the ``print`` statements that are indented after the ``if``
and the ``else`` are "inside": they are conditional on ``is_mentalist``.

The other ``print`` statements are "outside": they are executed
unconditionally.

****

**Example**. This code opens a file and checks whether it is a "valid" FASTA
file. To do so, it checks that (1) the file is not empty, and (2) at least one
line starts with the ``">"`` character::

    lines = open("data/prot-fasta/1A3A.fasta").readlines()

    if len(lines) == 0:
        print "the file is empty"
    else:
        first_characters = [line[0] for line in lines]
        if ">" not in first_characters:
            print "not a fasta file"
        else:
            print "a fasta file"

    print "done"

**Quiz**:

#. Can the code print both that the file is empty *and* that the file is valid?

#. Can the code *not* print ``"done"``?

#. If the file is actually empty, what is the value of the variable
   ``first_characters`` at the end of the execution?

#. Can I simplify the code by using an ``elif`` statement?

|

Iterative code: ``for``
-----------------------

The ``for`` statement lets you write code that is executed multiple times. In
particular, the code inside the ``for`` is executed once for each and every
element in a collection (e.g. a string, list, tuple, or dictionary).
The abstract syntax is::

    collection = range(10) # a list, for instance

    for element in collection:
        body(element)

This ``for`` iterates over all the elements ``element`` in the collection and
executes the ``body(element)`` block on each of them. Just as in *list
comprehensions*, the ``element`` variable is defined by the ``for`` loop. At
the Nth iteration, ``element`` will refer to the Nth element of
``collection``.

The flow of the execution can be modified with the ``break`` and ``continue``
statements; see below for details.

.. warning::

    If ``collection`` is a:

    - ``str``, the ``for`` iterates over the characters.

    - ``list``, the ``for`` iterates over the elements.

    - ``tuple``, the ``for`` iterates over the elements.

    - ``dict``, the ``for`` iterates over the keys.

****

**Example**. This ``for``::

    l = [1, 25, 6, 27, 57, 12]

    for number in l:
        print number

iterates over all elements of ``l``, from beginning to end. At each iteration,
the value of ``number`` changes (as shown by the ``print``).

The above ``for`` is equivalent to this code::

    number = l[0]
    print number

    number = l[1]
    print number

    number = l[2]
    print number

    # etc.

except that it is a **lot** shorter!

****

**Example**. Let's compute the *sum* of all elements of the previous list. We
can do that by modifying the ``for`` as follows::

    l = [1, 25, 6, 27, 57, 12]

    s = 0
    for number in l:
        s = s + number # equiv. s += number

    print "the sum is", s

Here ``s`` plays the role of a support variable. It is initialized to ``0``
just before the loop. Then, each number in ``l`` is added, in turn, to ``s``.
By the end of the ``for``, ``s`` holds the sum of all the elements.

The above code is equivalent to::

    s = 0

    number = l[0]
    s += number

    number = l[1]
    s += number

    # etc.

****

**Example**. Now let's find the *largest* element in the list. The idea is:

* We use a support variable ``largest_so_far`` that always (at all iterations)
  holds the largest element found so far. It is initialized to some sensible
  value.
* We use a ``for`` to iterate over all elements of the list.

* If the current element is smaller than or equal to ``largest_so_far``, the
  latter is left untouched.

* Otherwise, ``largest_so_far`` is updated to reference the current element.

Once the ``for`` is done, i.e. after iterating over the very last element of
the list, ``largest_so_far`` will hold the largest element in the list.

Let's write::

    l = [1, 25, 6, 27, 57, 12]

    # l[0] is a sensible initial value
    largest_so_far = l[0]

    for number in l[1:]:
        if number > largest_so_far:
            largest_so_far = number

    print "the maximum is", largest_so_far

****

**Example**. Given the following table (list of strings)::

    table = [
        "protein domain start end",
        "YNL275W PF00955 236 498",
        "YHR065C SM00490 335 416",
        "YKL053C-A PF05254 5 72",
        "YOR349W PANTHER 353 414",
    ]

I want to convert it to a dictionary like this::

    data = {
        "YNL275W": ("PF00955", 236, 498),
        "YHR065C": ("SM00490", 335, 416),
        "YKL053C-A": ("PF05254", 5, 72),
        "YOR349W": ("PANTHER", 353, 414)
    }

The keys are taken from the first column, while the values are built from the
remaining columns.

Let's write::

    # the dictionary is initially empty
    data = {}

    # for each line in the table (except the header)
    for line in table[1:]:

        words = line.split()

        protein = words[0]
        domain = words[1]
        pos0 = int(words[2])
        pos1 = int(words[3])

        # update the dictionary
        data[protein] = (domain, pos0, pos1)

****

**Example**. The ``break`` statement interrupts the ``for``. For instance::

    path = raw_input("write a path to a file: ")
    lines = open(path).readlines()

    for line in lines:

        line = line.strip()
        print "processing:", line

        # if the line is "STOP", we break out of the
        # for loop: the remaining lines are not
        # processed
        if line == "STOP":
            break

    # <--- when Python encounters the break statement,
    #      it "jumps" here

This code reads a text file and prints each line on screen. However, as soon as
it finds a ``"STOP"`` line, it executes the ``break``, which exits the ``for``
loop.
All the lines coming after the ``"STOP"`` line are *not* processed.

**Example**. The ``continue`` statement jumps to the next iteration, skipping
the remainder of the code in the current iteration. For instance::

    path = raw_input("write a path to a file: ")
    lines = open(path).readlines()

    for line in lines:

        line = line.strip()
        print "processing:", line

        if line == "CONTINUE":
            continue

        # <--- if the continue is executed, the code from here...
        print "this is not a CONTINUE line"
        # <--- ... to here is not executed

This code reads a user-provided text file and prints every line in turn. If the
line is ``"CONTINUE"``, the ``continue`` statement skips over the second
``print``, and the ``for`` loop carries on with the *next* line.

|

Iterative code: ``while``
-------------------------

The ``while`` statement lets you write code that repeats as long as a certain
condition is true. The ``while`` stops iterating as soon as the condition is
not true anymore.

The abstract syntax is::

    while condition:
        do_stuff()
        condition = check_condition()

As with the ``for``, the ``break`` and ``continue`` statements can be used to
modify the flow of the execution.

.. note::

    The big difference between the ``for`` and ``while`` statements is:

    - ``for element in collection:`` executes **N times**, where N is the
      length of ``collection``.

    - ``while condition:`` executes an **indefinite** number of times, that
      is, as long as the condition is true.

****

**Example**. The ``while`` statement is useful when the value of ``condition``
cannot be known beforehand, for instance when interacting with a user.

Let's write a ``while`` that asks the user whether she wants to stop, and keeps
asking as long as the user does not reply ``"yes"``::

    while raw_input("do you want me to stop? ") != "yes":
        print "Then I'll keep going!"

****

**Example**. Let's see another simple example with a ``break``::

    while True: # this is an infinite while!
        ans = raw_input("what is the capital of Italy? ")
        if ans.lower() == "rome":
            print "correct"
            break
        print "try again"

    # <--- the break jumps here
    print "done"

This is hard to express with a ``for`` loop, because we do not know in advance
how many attempts the user will need!

Let's make the code ask the user whether she actually wants to retry::

    while True:
        ans = raw_input("what is the capital of Italy? ")
        if ans.lower() == "rome":
            print "correct"
            break

        ans = raw_input("try again? ")
        if ans.lower() == "no":
            print "all right"
            break

|

Nested code
-----------

Now that we know what ``if``, ``for`` and ``while`` do, we can combine them in
arbitrary ways by properly nesting (that is, indenting) the statements.

****

**Example**. Let's write a simulator of a two-hand clock (hours and minutes)::

    for h in range(24):
        for m in range(60):
            print "time =", h, ":", m

Here the external ``for`` iterates over the 24 hours; for each hour, the inner
``for`` iterates over the 60 minutes. Every time the internal ``for``
completes, the external ``for`` completes one iteration.

Let's extend the simulator to an hours-minutes-seconds clock::

    for h in range(24):
        for m in range(60):
            for s in range(60):
                print "time =", h, ":", m, ":", s

Of course, it is possible to take days into consideration by adding one more
external loop that iterates over ``range(1, 366)``.

****

**Example**. I want to check whether a list contains repeated elements, and if
it does, what their positions are. Starting from::

    numbers = [5, 9, 4, 4, 9, 2]

we can use two nested ``for`` statements to iterate over the *pairs* of
elements of ``numbers``.

For every element (let's say the one in position ``i``), I want to check
whether any of the following elements (those in positions ``i+1`` to
``len(numbers) - 1``) match it. A picture is worth a thousand words::

    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
      ^
      i
         \__________________/
         the possible positions of the 2nd element


    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
          ^           ^
          i         MATCH!
             \______________/
             the possible positions of the 2nd element


    +---+---+---+---+---+---+
    | 5 | 9 | 4 | 4 | 9 | 2 |
    +---+---+---+---+---+---+
              ^   ^
              i MATCH!
                 \__________/
                 the possible positions of the 2nd element

Let's write::

    matches = []

    for i in range(len(numbers)):
        # the number at position i
        number_at_i = numbers[i]

        for j in range(i + 1, len(numbers)):
            # the number at position j
            number_at_j = numbers[j]

            # do they match?
            if number_at_i == number_at_j:
                # they do! let's store their
                # positions
                matches.append((i, j))

    print matches

Let's verify whether ``matches`` actually identifies pairs of identical
elements::

    for pair in matches:
        number_at_i = numbers[pair[0]]
        number_at_j = numbers[pair[1]]
        print number_at_i == number_at_j

****

**Example**. Given the contents of a FASTA file::

    >>> lines = open("data/prot-fasta/3J01.fasta").readlines()
    >>> print lines
    [
        ">3J01:0|PDBID|CHAIN|SEQUENCE",
        "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
        ">3J01:1|PDBID|CHAIN|SEQUENCE",
        "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
        ">3J01:2|PDBID|CHAIN|SEQUENCE",
        "MKRTFQPSVLKRNRSHGFRARMATKNGRQVLARRRAKGRARLTVSK",
        ">3J01:3|PDBID|CHAIN|SEQUENCE",
        # ...
    ]

I want to convert ``lines`` into a dictionary that maps from each header (key)
to the corresponding sequence (value).

Let's write::

    sequence_of = {}

    for line in lines:

        # remove newlines and spaces around the line
        line = line.strip()

        if line.startswith(">"):
            # this is a header, store it for later use
            header = line

        else:
            # this is a sequence
            sequence = line

            # now let's use the header we read at the
            # *previous* iteration and the sequence we
            # got at the *current* iteration to update
            # the dictionary
            sequence_of[header] = sequence

    # we are done; print the dictionary
    print sequence_of

This code works as long as each sequence spans only one line. However, this is
not the case for the FASTA file we have. Looking closer, we see that ``lines``
includes these lines::

    lines = [
        # ...
        ">3J01:5|PDBID|CHAIN|SEQUENCE",
        "MAKLTKRMRVIREKVDATKQYDINEAIALLKELATAKFVESVDVAVNLGIDARKSDQNVRGATVLPHGTGRSVRVAVFTQ",
        "GANAEAAKAAGAELVGMEDLADQIKKGEMNFDVVIASPDAMRVVGQLGQVLGPRGLMPNPKVGTVTPNVAEAVKNAKAGQ",
        "VRYRNDKNGIIHTTIGKVDFDADKLKENLEALLVALKKAKPTQAKGVYIKKVSISTTMGAGVAVDQAGLSASVN",
        # ...
    ]

So there is one header, then a multi-line sequence. Unfortunately, with the
above code, the first line of the sequence is overwritten by the second line of
the sequence, which is then overwritten by the third line of the sequence.

In other words, *only the last line of a multi-line sequence makes it to the
dictionary*.

Let's fix the code::

    sequence_of = {}

    for line in lines:

        line = line.strip()

        if line.startswith(">"):
            header = line

        else:
            sequence = line

            # the first time we encounter a header, we
            # associate it to an empty string
            if header not in sequence_of:
                sequence_of[header] = ""

            # now we take whatever sequence is associated
            # to the header and concatenate it with the
            # current line
            old_sequence = sequence_of[header]
            new_sequence = old_sequence + sequence
            sequence_of[header] = new_sequence

A shorter version::

    sequence_of = {}

    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            header = line
        else:
            if header not in sequence_of:
                sequence_of[header] = line
            else:
                sequence_of[header] += line

**Example**. Same setup as before. Some anonymous jester wrote the FASTA file
wrong: sequences come **before** the corresponding headers. For instance::

    wrong_fasta = [
        # first sequence
        "AVQQNKPTRSKRGMRRSHDALTAVTSLSVDKTSGEKHLRHHITADGYYRGRKVIAK",
        # first header
        ">3J01:0|PDBID|CHAIN|SEQUENCE",
        # second sequence
        "AKGIREKIKLVSSAGTGHFYTTTKNKRTKPEKLELKKFDPVVRQHVIYKEAKIK",
        # second header
        ">3J01:1|PDBID|CHAIN|SEQUENCE",
    ]

Our code of course relies on the header coming before the sequence, so it does
not work in this case. How do we make it work again?

We have to rewrite the code based on the fact that the header is *not* known
when we get the sequence.
However, we do know the full sequence by the time we get to its header.

Let's write::

    sequence_of = {}

    # this variable holds the (possibly multi-line)
    # sequence we have seen "so far"; it is initialized
    # with an empty string, because we have not seen
    # any sequence yet!
    latest_sequence_seen = ""

    for line in wrong_fasta:

        line = line.strip()

        if line.startswith(">"):
            # this is a header line. at this point the
            # sequence is known, and we can update the
            # dictionary
            sequence_of[line] = latest_sequence_seen

            # reset the latest sequence seen (so as not to
            # mix the sequences of different proteins/genes)
            latest_sequence_seen = ""

        else:
            # this is a sequence line. we do not know
            # the header yet. let's just add this line
            # to the sequence seen so far
            latest_sequence_seen += line

    print sequence_of
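
To convince ourselves that this strategy also copes with multi-line sequences,
here is a minimal sketch that wraps the same loop in a function and runs it on
some made-up data. The function name ``parse_reversed_fasta`` and the toy
headers and sequences are hypothetical, chosen only for illustration. Each
``print`` is given a single parenthesized argument so that the snippet should
run unchanged under both Python 2 and Python 3:

```python
def parse_reversed_fasta(lines):
    """Map each header to the sequence lines that come *before* it.

    Hypothetical helper: same logic as the loop above, wrapped
    in a function so that it can be reused and checked.
    """
    sequence_of = {}
    latest_sequence_seen = ""
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            # header line: the sequence collected so far belongs to it
            sequence_of[line] = latest_sequence_seen
            latest_sequence_seen = ""
        else:
            # sequence line: accumulate until the header shows up
            latest_sequence_seen += line
    return sequence_of


# made-up toy data: the first sequence spans two lines
toy_lines = [
    "AVQQ\n",
    "NKPT\n",
    ">toy:0\n",
    "AKGI\n",
    ">toy:1\n",
]

result = parse_reversed_fasta(toy_lines)
print(result[">toy:0"])
print(result[">toy:1"])
```

Note that any sequence lines appearing after the very last header are silently
dropped, since no header follows them. Whether that is the desired behavior
depends on how the malformed file was produced.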