================
Python: Strings
================

Strings are **immutable** objects representing text.

To define a string, we can write::

    var = "text"

or, equivalently::

    var = 'text'

To insert special characters, we need to perform *escaping* with a backslash
``\``::

    path = "data\\fasta"

or use the prefix ``r`` (*raw*)::

    path = r"data\fasta"

.. note::

    Here is a reference list of escape characters. You will probably only need
    the most obvious ones, like ``\\``, ``\n`` and ``\t``.

    ================ =====================================================
    Escape Character Meaning
    ================ =====================================================
    ``\\``           Backslash
    ``\'``           Single-quote
    ``\"``           Double-quote
    ``\a``           ASCII bell
    ``\b``           ASCII backspace
    ``\f``           ASCII formfeed
    ``\n``           ASCII linefeed (also known as newline)
    ``\r``           Carriage Return
    ``\t``           Horizontal Tab
    ``\v``           ASCII vertical tab
    ``\N{name}``     Unicode character name (Unicode only!)
    ``\uxxxx``       Unicode 16-bit hex value xxxx (u'' string only)
    ``\Uxxxxxxxx``   Unicode 32-bit hex value xxxxxxxx (u'' string only)
    ``\ooo``         Character with octal value ooo
    ``\xhh``         Character with hex value hh
    ================ =====================================================

To create a multi-line string, we can manually place the *newline* character
``\n`` at each line break::

    sad_joke = "Time flies like an arrow.\nFruit flies like a banana."

    print sad_joke

or we can use triple quotes::

    sad_joke = """Time flies like an arrow.
    Fruit flies like a banana."""

    print sad_joke

.. warning::

    ``print`` shows the characters that the string actually contains, while
    the echo of the interpreter shows them in escaped form. Try to write::

        print path

    and (from the interpreter)::

        path

    In the first case, we see one backslash (the string really contains only
    one), in the second case we see two (the echo writes the backslash in its
    escaped form, as in the source code).

    The same happens with ``sad_joke``: ``print`` shows two lines of text,
    while the echo shows a single line containing ``\n``.
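The difference between what a string contains and how the interpreter echoes
it can be checked directly. The following is only a minimal sketch: the names
``raw_path``, ``same``, ``length``, ``shown_by_print`` and ``shown_by_echo``
are introduced here for illustration, and ``print(...)`` is called with a
single argument so that the code runs unchanged under Python 2 and Python 3.

```python
# Two spellings of the same text: escaped and raw.
path = "data\\fasta"
raw_path = r"data\fasta"

# Both literals produce the same 10 characters; the backslash counts once.
same = (path == raw_path)
length = len(path)

# str() gives the text that print displays,
# repr() gives the text that the interpreter echo displays.
shown_by_print = str(path)
shown_by_echo = repr(path)

print(shown_by_print)   # one backslash
print(shown_by_echo)    # quotes added, backslash written in escaped form
```

Counting the backslashes with ``shown_by_print.count("\\")`` and
``shown_by_echo.count("\\")`` is a quick way to confirm what each form shows.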
String-Number conversion
------------------------

We can convert a number into a string with ``str()``::

    n = 10
    s = str(n)
    print n, type(n)
    print s, type(s)

``int()`` or ``float()`` perform the opposite conversion::

    n = int("123")
    q = float("1.23")
    print n, type(n)
    print q, type(q)

If the string doesn't contain the correct numeric type, Python will give an
error message::

    int("3.14")             # Not an int
    float("ribosome")       # Not a number
    int("1 2 3")            # Not a number
    int("fifteen")          # Not a number

Operations
----------

======== ================== ================================================
Result   Operator           Meaning
======== ================== ================================================
``bool`` ``==``             Check whether two strings are identical
``int``  ``len(str)``       Return the length of the string
``str``  ``str + str``      Concatenate two strings
``str``  ``str * int``      Replicate the string
``bool`` ``str in str``     Check if a string is present in another string
``str``  ``str[int:int]``   Extract a sub-string
======== ================== ================================================

**Example**. Let's concatenate some strings::

    string = "one" + " " + "string"
    length = len(string)
    print "the string:", string, "is", length, "characters long"

Another example::

    string = "Python is hell!" * 1000
    print "the string is", len(string), "characters long"

.. warning::

    We cannot concatenate strings with other types. For example::

        var = 123
        print "the value of var is" + var

    gives an error message. Two working alternatives::

        print "the value of var is" + str(var)

    or::

        print "the value of var is", var

    (In the first case we miss a space between ``is`` and ``123``; in the
    second case ``print`` inserts it for us.)

**Example**.
The operator ``substring in string`` checks if ``substring`` appears at least
once in ``string``, for example::

    string = "A beautiful journey"

    print "A" in string             # True
    print "beautiful" in string     # True
    print "BEAUTIFUL" in string     # False
    print "ul jour" in string       # True
    print "Gengis Khan" in string   # False
    print " " in string             # True
    print "  " in string            # False

The result is always ``True`` or ``False``.

**Example**. To extract a substring we can use indexes::

    #           0                       -1
    #           |1                     -2|
    #           ||2                   -3||
    #           |||        ...         |||
    alphabet = "abcdefghijklmnopqrstuvwxyz"

    print alphabet[0]               # "a"
    print alphabet[1]               # "b"
    print alphabet[len(alphabet)-1] # "z"
    print alphabet[len(alphabet)]   # Error
    print alphabet[10000]           # Error

    print alphabet[-1]              # "z"
    print alphabet[-2]              # "y"

    print alphabet[0:1]             # "a"
    print alphabet[0:2]             # "ab"
    print alphabet[0:5]             # "abcde"
    print alphabet[:5]              # "abcde"

    print alphabet[-5:-1]           # "vwxy"
    print alphabet[-5:]             # "vwxyz"

    print alphabet[10:-10]          # "klmnop"

.. warning::

    Extraction is inclusive with respect to the first index, but exclusive
    with respect to the second. In other words ``alphabet[i:j]`` corresponds
    to::

        alphabet[i] + alphabet[i+1] + ... + alphabet[j-1]

    Note that ``alphabet[j]`` is excluded.

.. warning::

    Extraction returns a *new* string, leaving the original unchanged::

        alphabet = "abcdefghijklmnopqrstuvwxyz"

        substring = alphabet[2:-2]
        print substring
        print alphabet                  # Is unchanged

Methods
-------

======== =========================== ===================================================
Result   Method                      Meaning
======== =========================== ===================================================
``str``  ``str.upper()``             Return the string in upper case
``str``  ``str.lower()``             Return the string in lower case
``str``  ``str.strip(str)``          Remove the given characters from both ends
``str``  ``str.lstrip(str)``         Remove the given characters from the left
``str``  ``str.rstrip(str)``         Remove the given characters from the right
``bool`` ``str.startswith(str)``     Check if the string starts with another
``bool`` ``str.endswith(str)``       Check if the string ends with another
``int``  ``str.find(str)``           Return the position of a substring
``int``  ``str.count(str)``          Count the number of occurrences of a substring
``str``  ``str.replace(str, str)``   Replace substrings
======== =========================== ===================================================

.. warning::

    Methods return a *new* string, leaving the original unchanged (as with
    extraction)::

        alphabet = "abcdefghijklmnopqrstuvwxyz"

        alphabet_upper = alphabet.upper()
        print alphabet_upper
        print alphabet                  # Is unchanged

**Example**. ``upper()`` and ``lower()`` are very simple::

    text = "No Yelling"

    result = text.upper()
    print result

    result = result.lower()
    print result

**Example**. ``strip()`` variants are also simple::

    text = " one example "

    print text.strip()              # equivalent to text.strip(" ")
    print text.lstrip()             # equivalent to text.lstrip(" ")
    print text.rstrip()             # equivalent to text.rstrip(" ")

    print text                      # text is unchanged

Note that the space between ``"one"`` and ``"example"`` is never removed. We
can specify more than one *character* to be removed; each of them is stripped
wherever it appears at the ends, in any order::

    "AAAA one example BBBB".strip("AB")     # " one example "

**Example**.
``startswith()`` and ``endswith()`` are just as simple::

    text = "123456789"

    print text.startswith("1")      # True
    print text.startswith("a")      # False

    print text.endswith("56789")    # True
    print text.endswith("5ABC9")    # False

**Example**. ``find()`` returns the position of the first occurrence of a
substring, or ``-1`` if the substring never occurs::

    text = "123456789"

    print text.find("1")            # 0
    print text.find("56789")        # 4
    print text.find("Q")            # -1

**Example**. ``replace()`` returns a copy of the string where every
occurrence of a substring is replaced with another::

    text = "if roses were rotten, then"

    print text.replace("ro", "gro")

**Example**. Given this unformatted string of amino acids::

    sequence = ">MAnlFKLgaENIFLGrKW "

To increase uniformity, we want to remove the ``">"`` character, remove spaces
and finally convert everything to upper case::

    s1 = sequence.lstrip(">")
    s2 = s1.rstrip(" ")
    s3 = s2.upper()

    print s3

Alternatively, all in one step::

    print sequence.lstrip(">").rstrip(" ").upper()

Why does it work? Let's write it with brackets::

    print ( ( sequence.lstrip(">") ).rstrip(" ") ).upper()
            \_____________________/
                      str
          \_____________________________________/
                            str
          \_____________________________________________/
                                str

As you can see, the result of each method is a string (as ``s1``, ``s2`` and
``s3`` in the example above), so we can invoke string methods on it.

Exercises
---------

#. How can I:

   #. Create a string consisting of five spaces only.

   #. Check whether a string contains at least one space.

   #. Check whether a string contains exactly five (arbitrary) characters.

   #. Create an empty string, and check whether it is really empty.

   #. Create a string that contains one hundred copies of
      ``"Python is great"``.

   #. Given the strings ``"but cell"``, ``"biology"`` and
      ``"is way better"``, compose them into the string
      ``"but cell biology is way better"``.

   #. Check whether the string ``"12345"`` begins with ``1`` (the character,
      not the number!).

   #. Create a string consisting of a single character ``\``. (Check whether
      the output matches using both the echo of the interpreter and
      ``print``, and possibly also with ``len()``.)

   #. Check whether the string ``"\\"`` contains one or two backslashes.

   #. Check whether a string (of choice) begins or ends with ``\``.

   #. Check whether a string (of choice) contains ``x`` at least three times
      at the beginning and/or at the end. For instance, the following strings
      satisfy the requirement::

        "x....xx"           # 1 + 2 >= 3
        "xx....x"           # 2 + 1 >= 3
        "xxxx..."           # 4 + 0 >= 3

      while these do not::

        "x.....x"           # 1 + 1 < 3
        "...x..."           # 0 + 0 < 3
        "......."           # 0 + 0 < 3

#. Given the string::

     s = "0123456789"

   which of the following extractions are correct?

   #. ``s[9]``

   #. ``s[10]``

   #. ``s[:10]``

   #. ``s[1000]``

   #. ``s[0]``

   #. ``s[-1]``

   #. ``s[1:5]``

   #. ``s[-1:-5]``

   #. ``s[-5:-1]``

   #. ``s[-1000]``

#. Create a two-line string that contains the two following lines of text
   **literally**, including all the special characters and the implicit
   newline character:

   *never say "never"!*

   *said the sad turtle*

#. Given the strings::

     string = "a 1 b 2"
     digit = "DIGIT"
     character = "CHARACTER"

   replace all the digits in the variable ``string`` with the text provided
   by the variable ``digit``, and all alphabetic characters with the content
   of the variable ``character``.

   The result should look like this::

     "CHARACTER DIGIT CHARACTER DIGIT"

   You are free to use auxiliary variables to hold any intermediate results,
   but do not need to.

#. Given the following multi-line sequence::

     chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
     FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
     RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
     HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
     IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
     EPHHELPPGSTKRALPNNT"""

   which represents the amino acid sequence of the DNA-binding domain of the
   Tumor Suppressor Protein TP53, answer the following questions.

   #. How many lines does it hold?

   #. How long is the sequence? (Do not forget to ignore the special
      characters!)

   #. Remove all newline characters, and put the result in a new variable
      ``sequence``.

   #. How many cysteines ``"C"`` are there in the sequence? How many
      histidines ``"H"``?

   #. Does the chain contain the sub-sequence ``"NLRVEYLDDRN"``? In what
      position?

   #. How can I use ``find()`` and the sub-string extraction ``[i:j]``
      operators to extract the first line from ``chain_a``?

#. Given (a small portion of) the tertiary structure of chain A of the TP53
   protein::

     structure_chain_a = """SER A 96 77.253 20.522 75.007
     VAL A 97 76.066 22.304 71.921
     PRO A 98 77.731 23.371 68.681
     SER A 99 80.136 26.246 68.973
     GLN A 100 79.039 29.534 67.364
     LYS A 101 81.787 32.022 68.157"""

   Each line represents a :math:`C_\alpha` atom of the backbone of the
   structure. Of each atom, we know:

   - the amino acid code of the residue
   - the chain (which is always ``"A"`` in this example)
   - the position of the residue within the chain (starting from the
     N-terminal)
   - and the :math:`x, y, z` coordinates of the atom

   #. Extract the second line using ``find()`` and the extraction operator.
      Put the line in a new variable ``line``.

   #. Extract the coordinates of the second residue, and put them into three
      variables ``x``, ``y``, and ``z``.

   #. Extract the coordinates of the *third* residue as well, putting them in
      different variables ``x_prime``, ``y_prime``, ``z_prime``.

   #. Compute the Euclidean distance between the two residues:

      :math:`d((x,y,z),(x',y',z')) = \sqrt{(x-x')^2 + (y-y')^2 + (z-z')^2}`

      *Hint*: make sure to use ``float`` numbers when computing the distance.

#. Given the following DNA sequence, part of the BRCA2 human gene::

     dna_seq = """GGGCTTGTGGCGCGAGCTTCTGAAACTAGGCGGCAGAGGCGGAGCCGCT
     GTGGCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTT
     GCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAG
     ATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCA
     AAAAAGAACTGCACCTCTGGAGCGG"""

   #. Calculate the **GC-content** of the sequence.

   #. Convert the **DNA** sequence into an **RNA** sequence.

   #. Assuming that this sequence contains an **intron** ranging from
      nucleotide *51* to nucleotide *156*, store the sequence of the intron
      in a string, and the sequence of the spliced transcript in another
      string.
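A note on positions for the last exercise: ranges such as "from nucleotide 51
to nucleotide 156" are usually counted from 1 with both ends included, whereas
Python extraction counts from 0 and excludes the second index. The sketch
below shows the conversion on a made-up toy sequence, not on the BRCA2
fragment. The 1-based, inclusive convention is an assumption, and the names
``toy``, ``start``, ``end``, ``inner`` and ``outer`` are hypothetical.

```python
# Toy sequence, invented for illustration (not the exercise data).
toy = "AAACCCGGGTTT"

# Assumption: positions are 1-based and inclusive, so "4 to 6" means
# the 4th, 5th and 6th characters.
start, end = 4, 6

# Shift only the first index: the second one is already excluded by Python.
inner = toy[start - 1:end]

# Everything before the range joined with everything after it.
outer = toy[:start - 1] + toy[end:]

print(inner)
print(outer)
```

A quick sanity check is that ``len(inner) + len(outer)`` equals ``len(toy)``.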