Python: Strings
Strings are immutable objects representing text.
To define a string, we can write:
var = "text"
or, equivalently:
var = 'text'
To insert special characters, we need to escape them with a backslash \:
path = "data\\fasta"
or use the prefix r (raw):
path = r"data\fasta"
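As a quick check, the two spellings above produce exactly the same string. This is only an illustrative sketch: the names `escaped` and `raw` are not part of the original text, and `print(...)` is written with parentheses and a single argument so that it behaves the same under Python 2 and Python 3.

```python
# Two ways of writing the same 10-character string.
escaped = "data\\fasta"   # the backslash is escaped by another backslash
raw = r"data\fasta"       # raw string: backslashes are taken literally

print(escaped)            # data\fasta
print(escaped == raw)     # True
print(len(escaped))       # 10: the escaped backslash counts only once
```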
Note
Here is a reference list of escape characters. You will probably only need the most obvious ones, like \\, \n and \t.
Escape Character | Meaning |
---|---|
\\ | Backslash |
\' | Single-quote |
\" | Double-quote |
\a | ASCII bell |
\b | ASCII backspace |
\f | ASCII formfeed |
\n | ASCII linefeed (also known as newline) |
\r | Carriage Return |
\t | Horizontal Tab |
\v | ASCII vertical tab |
\N{name} | Unicode character name (Unicode only!) |
\uxxxx | Unicode 16-bit hex value xxxx (u'' strings only) |
\Uxxxxxxxx | Unicode 32-bit hex value xxxxxxxx (u'' strings only) |
\ooo | Character with octal value ooo |
\xhh | Character with hex value hh |
To create a multi-line string, we can manually place the newline character \n at the end of each line:
sad_joke = "Time flies like an arrow.\nFruit flies like a banana."
print sad_joke
or we can use triple quotes:
sad_joke = """Time flies like an arrow.
Fruit flies like a banana."""
print sad_joke
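The two definitions of the joke are interchangeable. The sketch below compares them under the hypothetical names `one_line` and `triple` (not used in the original text); `print(...)` is written with a single argument so it runs under both Python 2 and Python 3.

```python
# The same two-line text, written in the two styles shown above.
one_line = "Time flies like an arrow.\nFruit flies like a banana."
triple = """Time flies like an arrow.
Fruit flies like a banana."""

print(one_line == triple)      # True: the line break inside the triple quotes is a \n
print(one_line.count("\n"))    # 1: a single newline separates the two lines
```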
Warning
print interprets special characters, while the interpreter echo doesn't.
Try to write:
print path
and (from the interpreter):
path
In the first case, we see one backslash (the escaping backslash is interpreted by print); in the second case we see two backslashes (the escaping backslash is not interpreted).
The same happens with sad_joke: print shows two separate lines, while the echo shows the \n character.
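The interpreter echo displays the value returned by the built-in `repr()`, so the difference can also be reproduced inside a script. A minimal sketch, where `shown_by_print` and `shown_by_echo` are illustrative names only (and `print(...)` takes a single argument so that it works under Python 2 and 3 alike):

```python
path = "data\\fasta"

shown_by_print = str(path)    # the text that print displays
shown_by_echo = repr(path)    # the text that the interpreter echo displays

print(shown_by_print)         # data\fasta
print(shown_by_echo)          # 'data\\fasta'
print(len(path))              # 10: the string itself holds one backslash
```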
String-Number conversion
We can convert a number into a string with str():
n = 10
s = str(n)
print n, type(n)
print s, type(s)
int() or float() perform the opposite conversion:
n = int("123")
q = float("1.23")
print n, type(n)
print q, type(q)
If the string doesn't represent a number of the requested type, Python raises an error (a ValueError):
int("3.14") # Not an int
float("ribosome") # Not a number
int("1 2 3") # Not a number
int("fifteen") # Not a number
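When the input may not be a valid number, the error can be caught instead of stopping the program. The function `to_int_or_none` below is a hypothetical helper introduced only for this illustration; it relies on `int()` raising `ValueError` for strings that are not integers.

```python
def to_int_or_none(text):
    """Hypothetical helper: return int(text), or None if text is not an integer."""
    try:
        return int(text)
    except ValueError:
        # int() could not parse the string
        return None

print(to_int_or_none("123"))     # 123
print(to_int_or_none("3.14"))    # None
print(int(float("3.14")))        # 3: convert to float first, then truncate
```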
Operations
Result | Operator | Meaning |
---|---|---|
bool | str == str | Check whether two strings are identical |
int | len(str) | Return the length of the string |
str | str + str | Concatenate two strings |
str | str * int | Replicate the string |
bool | str in str | Check if a string is present in another string |
str | str[int:int] | Extract a sub-string |
Example. Let’s concatenate two strings:
string = "one" + " " + "string"
length = len(string)
print "the string:", string, "is", length, "characters long"
Another example:
string = "Python is hell!" * 1000
print "the string is", len(string), "characters long"
Warning
We cannot concatenate strings with other types. For example:
var = 123
print "the value of var is" + var
gives an error message. Two working alternatives:
print "the value of var is" + str(var)
or:
print "the value of var is", var
(In the first case there is no space between is and 123, because + joins the strings exactly as they are; in the second case print inserts the space for us.)
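The error raised by the failing concatenation is a `TypeError`. The sketch below triggers it on purpose and then builds the message with `str()`, adding the separating space by hand; the variable `message` is an illustrative name.

```python
var = 123

try:
    message = "the value of var is" + var         # str + int is not allowed
except TypeError:
    message = "the value of var is " + str(var)   # note the explicit trailing space

print(message)    # the value of var is 123
```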
Example. The operator substring in string checks whether substring appears one or more times in string, for example:
string = "A beautiful journey"
print "A" in string # True
print "beautiful" in string # True
print "BEAUTIFUL" in string # False
print "ul jour" in string # True
print "Gengis Khan" in string # False
print " " in string # True
print "  " in string # False (two consecutive spaces)
The result is always True or False.
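Since the check is case sensitive, one common way to ignore case is to convert both strings with `lower()` (a method described in the Methods section) before comparing. A small sketch, written with single-argument `print(...)` so that it runs under Python 2 and 3:

```python
string = "A beautiful journey"

print("BEAUTIFUL" in string)                    # False: the case differs
print("BEAUTIFUL".lower() in string.lower())    # True: both sides are lower case
print("Gengis Khan" not in string)              # True: "not in" negates the check
```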
Example. To extract a substring we can use indexes:
#           0                       -1
#           |1                     -2|
#           ||2                   -3||
#           |||        ...         |||
alphabet = "abcdefghijklmnopqrstuvwxyz"
print alphabet[0] # "a"
print alphabet[1] # "b"
print alphabet[len(alphabet)-1] # "z"
print alphabet[len(alphabet)] # Error
print alphabet[10000] # Error
print alphabet[-1] # "z"
print alphabet[-2] # "y"
print alphabet[0:1] # "a"
print alphabet[0:2] # "ab"
print alphabet[0:5] # "abcde"
print alphabet[:5] # "abcde"
print alphabet[-5:-1] # "vwxy"
print alphabet[-5:] # "vwxyz"
print alphabet[10:-10] # "klmnop"
Warning
Extraction is inclusive with respect to the first index, but exclusive with respect to the second. In other words alphabet[i:j] corresponds to:
alphabet[i] + alphabet[i+1] + ... + alphabet[j-1]
Note that alphabet[j] is excluded.
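A consequence of this rule is that, for valid indexes with i <= j, the extracted string has exactly j - i characters. The sketch below checks this with arbitrarily chosen values of `i` and `j`, and also shows that, unlike single indexes, out-of-range bounds in an extraction are silently clipped rather than raising an error.

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"
i, j = 2, 7                    # arbitrary example bounds

piece = alphabet[i:j]
print(piece)                   # cdefg
print(len(piece) == j - i)     # True
print(alphabet[20:1000])       # uvwxyz: the second bound is clipped to the end
print(alphabet[5:2] == "")     # True: an empty range gives an empty string
```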
Warning
Extraction returns a new string, leaving the original unchanged:
alphabet = "abcdefghijklmnopqrstuvwxyz"
substring = alphabet[2:-2]
print substring
print alphabet # Is unchanged
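This follows from strings being immutable: a character cannot be overwritten in place, and any "modified" version has to be built as a new string. A minimal sketch, where `changed_in_place` and `capitalized` are illustrative names:

```python
alphabet = "abcdefghijklmnopqrstuvwxyz"

try:
    alphabet[0] = "A"          # strings do not support item assignment
    changed_in_place = True
except TypeError:
    changed_in_place = False

# Build a new string instead of modifying the old one.
capitalized = "A" + alphabet[1:]

print(changed_in_place)        # False
print(capitalized)             # Abcdefghijklmnopqrstuvwxyz
print(alphabet)                # abcdefghijklmnopqrstuvwxyz
```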
Methods
Result | Method | Meaning |
---|---|---|
str | str.upper() | Return the string in upper case |
str | str.lower() | Return the string in lower case |
str | str.strip(str) | Remove the given characters from both ends |
str | str.lstrip(str) | Remove the given characters from the left |
str | str.rstrip(str) | Remove the given characters from the right |
bool | str.startswith(str) | Check if the string starts with another |
bool | str.endswith(str) | Check if the string ends with another |
int | str.find(str) | Return the position of the first occurrence of a substring (-1 if absent) |
int | str.count(str) | Count the number of occurrences of a substring |
str | str.replace(str, str) | Replace substrings |
Warning
Methods return a new string, leaving the original unchanged (as with extraction):
alphabet = "abcdefghijklmnopqrstuvwxyz"
alphabet_upper = alphabet.upper()
print alphabet_upper
print alphabet # Is unchanged
Example. upper() and lower() are very simple:
text = "No Yelling"
result = text.upper()
print result
result = result.lower()
print result
Example. The strip() variants are also simple:
text = " one example "
print text.strip() # same result here as text.strip(" ")
print text.lstrip() # same result here as text.lstrip(" ")
print text.rstrip() # same result here as text.rstrip(" ")
print text # text is unchanged
Note that the space between "one" and "example" is never removed. We can specify more than one character to be removed:
"AAAA one example BBBB".strip("AB")
Example. startswith() and endswith() are just as simple:
text = "123456789"
print text.startswith("1") # True
print text.startswith("a") # False
print text.endswith("56789") # True
print text.endswith("5ABC9") # False
Example. find() returns the position of the first occurrence of a substring, or -1 if the substring never occurs:
text = "123456789"
print text.find("1") # 0
print text.find("56789") # 4
print text.find("Q") # -1
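The result of find() is often combined with extraction. Because -1 is itself a valid index (it denotes the last character), it is worth checking for it before using the position. A sketch, where `position` and `tail` are illustrative names:

```python
text = "123456789"

position = text.find("56789")
if position != -1:
    tail = text[position:]     # everything from the match onwards
else:
    tail = ""                  # substring not found

print(position)                # 4
print(tail)                    # 56789

# Pitfall: using a failed search directly as an index does not raise an error.
print(text[text.find("Q")])    # 9, because text[-1] is the last character
```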
Example. replace() returns a copy of the string where a substring is replaced with another:
text = "if roses were rotten, then"
print text.replace("ro", "gro")
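By default every occurrence is replaced. replace() also accepts an optional third argument that limits the number of replacements; the sketch below shows both behaviours on the same text (the names `all_replaced` and `first_replaced` are illustrative).

```python
text = "if roses were rotten, then"

all_replaced = text.replace("ro", "gro")         # every occurrence
first_replaced = text.replace("ro", "gro", 1)    # only the first occurrence

print(all_replaced)      # if groses were grotten, then
print(first_replaced)    # if groses were rotten, then
print(text)              # if roses were rotten, then (the original is unchanged)
```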
Example. Given this unformatted string of amino acids:
sequence = ">MAnlFKLgaENIFLGrKW "
To increase uniformity, we want to remove the ">" character, remove spaces and finally convert everything to upper case:
s1 = sequence.lstrip(">")
s2 = s1.rstrip(" ")
s3 = s2.upper()
print s3
Alternatively, all in one step:
print sequence.lstrip(">").rstrip(" ").upper()
Why does it work? Let’s write it with brackets:
print ( ( sequence.lstrip(">") ).rstrip(" ") ).upper()
        \______________________/
                  str
      \______________________________________/
                        str
      \______________________________________________/
                            str
As you can see, the result of each method is a string (like s1, s2 and s3 in the example above), so we can invoke string methods on it.
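To confirm that the chained form and the step-by-step form are equivalent, the two can be compared directly. In this sketch `step_by_step` and `chained` are illustrative names, and `print(...)` takes a single argument so that it runs under Python 2 and 3.

```python
sequence = ">MAnlFKLgaENIFLGrKW "

# One method per line, reusing the intermediate result each time.
step_by_step = sequence.lstrip(">")
step_by_step = step_by_step.rstrip(" ")
step_by_step = step_by_step.upper()

# The same operations chained: each call acts on the string returned by the previous one.
chained = sequence.lstrip(">").rstrip(" ").upper()

print(chained)                    # MANLFKLGAENIFLGRKW
print(chained == step_by_step)    # True
```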
Exercises
How can I:
Create a string consisting of five spaces only.
Check whether a string contains at least one space.
Check whether a string contains exactly five (arbitrary) characters.
Create an empty string, and check whether it is really empty.
Create a string that contains one hundred copies of "Python is great".
Given the strings "but cell", "biology" and "is way better", compose them into the string "but cell biology is way better".
Check whether the string "12345" begins with 1 (the character, not the number!).
Create a string consisting of a single character \. (Check whether the output matches using both the echo of the interpreter and print, and possibly also with len().)
Check whether the string "\\" contains one or two backslashes.
Check whether a string (of choice) begins or ends with \.
Check whether a string (of choice) contains x at least three times at the beginning and/or at the end. For instance, the following strings satisfy the requirement:
"x....xx" # 1 + 2 >= 3
"xx....x" # 2 + 1 >= 3
"xxxx..." # 4 + 0 >= 3
while these do not:
"x.....x" # 1 + 1 < 3
"...x..." # 0 + 0 < 3
"......." # 0 + 0 < 3
Given the string:
s = "0123456789"
which of the following extractions are correct?
s[9]
s[10]
s[:10]
s[1000]
s[0]
s[-1]
s[1:5]
s[-1:-5]
s[-5:-1]
s[-1000]
Create a two-line string that contains the two following lines of text literally, including all the special characters and the implicit newline character:
never say “never”!
said the sad turtle
Given the strings:
string = "a 1 b 2"
digit = "DIGIT"
character = "CHARACTER"
replace all the digits in the variable string with the text provided by the variable digit, and all alphabetic characters with the content of the variable character.
The result should look like this:
"CHARACTER DIGIT CHARACTER DIGIT"
You are free to use auxiliary variables to hold any intermediate results, but do not need to.
Given the following multi-line sequence:
chain_a = """SSSVPSQKTYQGSYGFRLGFLHSGTAKSVTCTYSPALNKM
FCQLAKTCPVQLWVDSTPPPGTRVRAMAIYKQSQHMTEVV
RRCPHHERCSDSDGLAPPQHLIRVEGNLRVEYLDDRNTFR
HSVVVPYEPPEVGSDCTTIHYNYMCNSSCMGGMNRRPILT
IITLEDSSGNLLGRNSFEVRVCACPGRDRRTEEENLRKKG
EPHHELPPGSTKRALPNNT"""
which represents the amino acid sequence of the DNA-binding domain of the Tumor Suppressor Protein TP53, answer the following questions.
- How many lines does it hold?
- How long is the sequence? (Do not forget to ignore the special characters!)
- Remove all newline characters, and put the result in a new variable sequence.
- How many cysteines "C" are there in the sequence? How many histidines "H"?
- Does the chain contain the sub-sequence "NLRVEYLDDRN"? In what position?
- How can I use find() and the sub-string extraction [i:j] operators to extract the first line from chain_a?
Given (a small portion of) the tertiary structure of chain A of the TP53 protein:
structure_chain_a = """SER A 96 77.253 20.522 75.007
VAL A 97 76.066 22.304 71.921
PRO A 98 77.731 23.371 68.681
SER A 99 80.136 26.246 68.973
GLN A 100 79.039 29.534 67.364
LYS A 101 81.787 32.022 68.157"""
Each line represents a \(C_\alpha\) atom of the backbone of the structure. Of each atom, we know:
- the amino acid code of the residue
- the chain (which is always "A" in this example)
- the position of the residue within the chain (starting from the N-terminal)
- and the \(x, y, z\) coordinates of the atom
Extract the second line using find() and the extraction operator. Put the line in a new variable line.
Extract the coordinates of the second residue, and put them into three variables x, y, and z.
Extract the coordinates of the third residue as well, putting them in different variables x_prime, y_prime, z_prime.
Compute the Euclidean distance between the two residues:
\(d((x,y,z),(x',y',z')) = \sqrt{(x-x')^2 + (y-y')^2 + (z-z')^2}\)
Hint: make sure to use
float
numbers when computing the distance.
Given the following DNA sequence, part of the BRCA2 human gene:
dna_seq = """GGGCTTGTGGCGCGAGCTTCTGAAACTAGGCGGCAGAGGCGGAGCCGCT
GTGGCACTGCTGCGCCTCTGCTGCGCCTCGGGTGTCTTTT
GCGGCGGTGGGTCGCCGCCGGGAGAAGCGTGAGGGGACAG
ATTTGTGACCGGCGCGGTTTTTGTCAGCTTACTCCGGCCA
AAAAAGAACTGCACCTCTGGAGCGG"""
- Calculate the GC-content of the sequence
- Convert the DNA sequence into an RNA sequence
- Assuming that this sequence contains an intron ranging from nucleotide 51 to nucleotide 156, store the sequence of the intron in a string, and the sequence of the spliced transcript in another string.