============================
Python: Strings (Solutions)
============================

.. note::

   Later on in the solutions, I will sometimes use the backslash character
   ``\`` at the end of a line. Used this way, ``\`` tells Python that the
   command continues on the following line, allowing you to break long
   commands over multiple lines.

#. Solutions:

   #. Solution::

         #        12345
         text = "     "
         print text
         print len(text)

   #. Solution::

         at_least_one_space = " " in text

         # check whether it works
         print " " in "nospaceatallhere"
         print " " in "onlyonespacehere--> <--"
         print " " in "more spaces in here"

   #. Solution::

         exactly_5_characters = len(text) == 5

         # check whether it works
         print len("1234") == 5
         print len("12345") == 5
         print len("123456") == 5

   #. Solution::

         empty_string = ""
         print len(empty_string) == 0

   #. Solution::

         base = "Python is great"
         repeats = base * 100

         # check whether the length is correct
         print len(repeats) == len(base) * 100

   #. Solution::

         part_1 = "but cell"
         part_2 = "biology"
         part_3 = "is way better"
         text = (part_1 + part_2 + part_3) * 1000

#. Let's try this::

      start_with_1 = "12345".startswith(1)

   but Python gives an error message::

      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: startswith first arg must be str, unicode, or tuple, not int
      #                     ^^^^^^^^^^^^^^^^^^^^^                     ^^^^^^^

   The error message (see the highlighted parts) says that ``startswith()``
   requires its argument to be a string, not an *int*. In our case the
   argument, ``1``, is an int. The solution is::

      start_with_1 = "12345".startswith("1")
      print start_with_1

   The value is ``True``, as expected.

#. Solution::

      string = "\\"
      string
      print string
      print len(string)                           # 1

   Note that the raw-string form ``r"\"`` is not a valid alternative: a raw
   string literal cannot end with a single backslash, so Python reports a
   syntax error. The escaped form ``"\\"`` is the one to use.

#. We already checked this above: the answer is no. Anyway::

      backslash = "\\"
      print backslash*2 in "\\"                   # False

#. First method::

      backslash = "\\"
      condition = text.startswith(backslash) or \
                  text.endswith(backslash)

   Second method::

      condition = (text[0] == backslash) or \
                  (text[-1] == backslash)

   The second method raises an ``IndexError`` if ``text`` is empty, while
   the first one simply evaluates to ``False``.

#. Solution::

      condition = \
          text.startswith("xxx") or \
          (text.startswith("xx") and text.endswith("x")) or \
          (text.startswith("x") and text.endswith("xx")) or \
          text.endswith("xxx")

   It is worth checking the condition against the examples provided in the
   exercise.

#. Solution::

      s = "0123456789"
      print len(s)                                # 10

   Which of the following extractions are correct?

   #. ``s[9]``: correct, extracts the last character.

   #. ``s[10]``: invalid.

   #. ``s[:10]``: correct, extracts all characters (remember that the
      second index, ``10`` in this case, is exclusive).

   #. ``s[1000]``: invalid.

   #. ``s[0]``: correct, extracts the first character.

   #. ``s[-1]``: correct, extracts the last character.

   #. ``s[1:5]``: correct, extracts from the 2nd to the 5th character.

   #. ``s[-1:-5]``: correct, but nothing is extracted (the indexes are
      inverted!).

   #. ``s[-5:-1]``: correct, extracts from the fifth-to-last to the
      second-to-last character.

   #. ``s[-1000]``: invalid.

#. Solution (one of two possible solutions)::

      text = """never say \"never!\"
      \said the sad turtle."""

#. Solution::

      string = "a 1 b 2 c 3"

      digit = "DIGIT"
      character = "CHARACTER"

      result = string.replace("1", digit)
      result = result.replace("2", digit)
      result = result.replace("3", digit)
      result = result.replace("a", character)
      result = result.replace("b", character)
      result = result.replace("c", character)

      print result                                # "CHARACTER DIGIT CHARACTER ..."

   In one line::

      print string.replace("1", digit).replace("2", digit) ...

#. Solution::

      chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
      FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
      RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
      HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
      IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
      EPHHELPPGSTKRALPNNT"""

      num_lines = chain_a.count("\n") + 1
      print num_lines                             # 6

      # NOTE: we want to know the length of the actual *sequence*,
      # not the length of the *string*
      length_sequence = len(chain_a) - chain_a.count("\n")
      print length_sequence                       # 219

      sequence = chain_a.replace("\n", "")
      print len(chain_a) - len(sequence)          # 5 (correct)
      print len(sequence)                         # 219

      num_cysteine = sequence.count("C")
      num_histidine = sequence.count("H")
      print num_cysteine, num_histidine           # 10, 9

      print "NLRVEYLDDRN" in sequence             # True
      print sequence.find("NLRVEYLDDRN")          # 106

      # let's check
      print sequence[106 : 106 + len("NLRVEYLDDRN")]  # "NLRVEYLDDRN"

      index_first_newline = chain_a.find("\n")
      first_line = chain_a[:index_first_newline]
      print first_line

#. Solution::

      structure_chain_a = """SER A 96 77.253 20.522 75.007
      VAL A 97 76.066 22.304 71.921
      PRO A 98 77.731 23.371 68.681
      SER A 99 80.136 26.246 68.973
      GLN A 100 79.039 29.534 67.364
      LYS A 101 81.787 32.022 68.157"""

      # I use a variable with a shorter name
      chain = structure_chain_a

      # the second argument of find() is the position where the search
      # starts, so every index refers to the whole string
      index_first_newline = chain.find("\n")
      index_second_newline = chain.find("\n", index_first_newline + 1)
      index_third_newline = chain.find("\n", index_second_newline + 1)
      print index_first_newline, index_second_newline, index_third_newline

      second_line = chain[index_first_newline + 1 : index_second_newline]
      print second_line
      # "VAL A 97 76.066 22.304 71.921"
      #           |    | |    | |    |
      #  01234567890123456789012345678
      #  0         1         2
      x = second_line[9:15]
      y = second_line[16:22]
      z = second_line[23:]
      print x, y, z
      # NOTE: they are all strings

      third_line = chain[index_second_newline + 1 : index_third_newline]
      print third_line
      # "PRO A 98 77.731 23.371 68.681"
      #           |    | |    | |    |
      #  01234567890123456789012345678
      #  0         1         2
      x_prime = third_line[9:15]
      y_prime = third_line[16:22]
      z_prime = third_line[23:]
      print x_prime, y_prime, z_prime
      # NOTE: they are all strings

      # we should convert all variables to floats, in order to
      # calculate distances
      x, y, z = float(x), float(y), float(z)
      x_prime, y_prime, z_prime = float(x_prime), float(y_prime), float(z_prime)

      diff_x = x - x_prime
      diff_y = y - y_prime
      diff_z = z - z_prime

      distance = (diff_x**2 + diff_y**2 + diff_z**2)**0.5
      print distance

   The solution is way simpler using ``split()``::

      lines = chain.split("\n")
      second_line = lines[1]
      third_line = lines[2]

      words = second_line.split()
      x, y, z = float(words[-3]), float(words[-2]), float(words[-1])

      words = third_line.split()
      x_prime, y_prime, z_prime = float(words[-3]), float(words[-2]), float(words[-1])

      distance = ((x - x_prime)**2 + (y - y_prime)**2 + (z - z_prime)**2)**0.5

#. Solutions:

   #. Solution::

         dna_seq = dna_seq.replace("\n", "")     # Remove newline characters
         length = len(dna_seq)                   # Calculate length
         ng = dna_seq.count("G")                 # Calculate the number of Gs
         nc = dna_seq.count("C")                 # Calculate the number of Cs
         gc_cont = (ng + nc)/float(length)       # Calculate the GC-content

   #. Solution::

         rna_seq = dna_seq.replace("T","U")

   #. Solution::

         intron = dna_seq[50:156]                # Careful with indexes
         exon1 = dna_seq[:50]                    # Careful with indexes
         exon2 = dna_seq[156:]                   # Careful with indexes
         spliced = exon1+exon2
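
The three DNA steps above can be tried end to end on a small input. The sketch below is only an illustration: the 12-letter ``dna_seq`` and the intron boundaries ``intron_start`` and ``intron_end`` are made up for this example and are not the data of the exercise. Each ``print`` is written with parentheses around a single argument so that the same code should run under both Python 2 and Python 3.

```python
# Hypothetical toy sequence, split over two lines like the exercise data.
dna_seq = """ATGGCC
TTAAGC"""

# Remove the newline, then measure the actual sequence.
dna_seq = dna_seq.replace("\n", "")
length = len(dna_seq)

# GC-content: fraction of positions holding G or C.
# float() keeps the division from truncating under Python 2.
gc_cont = (dna_seq.count("G") + dna_seq.count("C")) / float(length)

# Transcription: every T becomes a U.
rna_seq = dna_seq.replace("T", "U")

# Hypothetical intron boundaries (start inclusive, end exclusive).
intron_start, intron_end = 4, 8
exon1 = dna_seq[:intron_start]
intron = dna_seq[intron_start:intron_end]
exon2 = dna_seq[intron_end:]
spliced = exon1 + exon2

print("length: %d, GC-content: %.2f" % (length, gc_cont))
print(rna_seq)
print(spliced)
```

Because the end index of a slice is exclusive, ``exon1 + intron + exon2`` rebuilds ``dna_seq`` exactly. The same check is a quick way to catch off-by-one mistakes in the ``[:50]``, ``[50:156]`` and ``[156:]`` slices of the real exercise.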