Biopython (Exercises)
Biopython: Sequences
Note
The sequences.fasta file is available at:
Write a Python program that asks the user for two DNA sequences, and prints the reverse complement of their concatenation.
Write a Python program that asks the user for a DNA sequence, and prints both the corresponding mRNA sequence and protein sequence, including stop codons (according to the standard translation table).
Write a Python program that asks the user for a DNA sequence, and prints both the corresponding mRNA sequence and protein sequence, including stop codons (according to the Yeast Mitochondrial Code translation table).
Write a Python program that takes the sequence of the 1AI4 PDB protein (download the FASTA file manually), and writes a corresponding UniProt file.
Write a Python program that takes the sequences.fasta file and writes N single-sequence FASTA files, called sequence{number}.fasta, each one containing a single sequence of the original file.
Do the same, but create N GenBank files instead.
Write a Python program that takes the sequences.fasta file and writes a revcomp.fasta file with the reverse complements of the original sequences.
Hint. The SeqIO.write() function can write an entire list of SeqRecord objects.
Solve Exercise 3 of the Programs section using Biopython where appropriate.
Solve Exercise 2 of the Programs section using Biopython where appropriate.
Hint. Study carefully the .annotations of the SeqRecord obtained by parsing the UniProt file.
Find and download a single sequence record from GenBank. The GenBank identifier of the record is HE805982.1. This record contains information about the DNA region coding for HBx, a multifunctional hepatitis B viral protein that modulates several pathways by interacting directly or indirectly with host factors (protein degradation, apoptosis, transcription, signal transduction, cell cycle progression, and genetic stability). Write a Python program that, using the GenBank record, saves the corresponding protein sequence in FASTA format.
From UniProt, find and download the records for the four human ELAV proteins (ELAVL1: Q15717, ELAVL2: Q12926, ELAVL3: Q14576, ELAVL4: P26378). Download each record in text format and store the four records in a dedicated directory. Now write a Python program that takes all the UniProt files in the directory and appends all the sequences, in FASTA format, to the file created in the previous exercise. Can you print the sequences ordered by increasing length?
Hint. The os.listdir() function in the os module returns a list with the names of all the entries (files and subdirectories) in a directory.
Write a program that, given a FASTA file containing multiple protein sequences and a string specified by the user, writes to a new file only the sequences that contain at least one occurrence of the string (regular expressions are allowed). Test your program with the sequences.fasta file, writing out the sequences that contain a stretch of at least three glutamines.
Try to work through exercises 20.1.6 and 20.1.7 in the Biopython tutorial page:
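The os.listdir() hint, the ordering by length, and the glutamine-stretch test above rely only on the Python standard library, so they can be tried out before any Biopython parsing is involved. The following is a minimal sketch under stated assumptions: the function names, the (identifier, sequence) tuple representation, and the toy data are all invented for illustration and are not part of the exercises. In a real solution the sequences would come from parsed records rather than hard-coded strings.

```python
import os
import re
import tempfile


def files_with_suffix(directory, suffix):
    """Return the sorted full paths of entries in `directory` ending in `suffix`."""
    return sorted(
        os.path.join(directory, name)
        for name in os.listdir(directory)
        if name.endswith(suffix)
    )


def sort_by_length(records):
    """Sort (identifier, sequence) pairs by increasing sequence length."""
    return sorted(records, key=lambda record: len(record[1]))


def matching_records(records, pattern):
    """Keep the (identifier, sequence) pairs whose sequence matches `pattern`."""
    regex = re.compile(pattern)
    return [record for record in records if regex.search(record[1])]


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    records = [("seq1", "MKQQQLV"), ("seq2", "MQQA"), ("seq3", "QAQAQ")]
    print(sort_by_length(records))

    # "Q{3,}" means three or more consecutive glutamines (one-letter code Q).
    print(matching_records(records, r"Q{3,}"))

    # Create a throwaway directory with a few empty files to list.
    with tempfile.TemporaryDirectory() as directory:
        for name in ("Q15717.txt", "Q12926.txt", "notes.md"):
            open(os.path.join(directory, name), "w").close()
        paths = files_with_suffix(directory, ".txt")
        print([os.path.basename(path) for path in paths])
```

Filtering on a suffix is a design choice of this sketch: os.listdir() returns every entry in the directory, so some test is needed to skip files that are not UniProt records.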
Biopython: Structures
Note
The P53 files (FASTA and three PDB files) are available here:
Ask the user for the path to a pdb file. Print the header of the PDB file on screen, including the name, author, resolution and release date of the PDB structure.
Same as above, but for an mmcif file.
Ask the user for the path to a pdb file. Print the number of models it contains. For each model, print the name and length of each of its chains.
Same as above, but for an mmcif file.
Ask the user for the path to a pdb file, as well as the name of a chain. Print the number of atoms that each of its residues has.
Ask the user for the path to a pdb file, and convert its sequence to a FASTA file.
Note. You will need to convert from three-letter amino acid codes to one-letter codes.
Hint. You can use the SeqIO module, but it is not mandatory.
Ask the user for the ID of a PDB structure (e.g. "1A3A"), and download it somewhere using the PDBList class. Then print the hierarchy of the structure: how many models; for each model how many chains; for each chain how many residues.
Note. Try to make the output look pretty.
Ask the user for the paths to two pdb files, then perform structural alignment on them.
Ask the user for the paths to two pdb files and a float threshold. For each chain in the first structure, compare it to each chain in the second structure: if any residue in the first chain is close enough to a residue of the second chain (i.e. their distance is smaller than the threshold), print the chain ID and residue type of the two residues, as well as their distance.
Given the files 1TSR.pdb and wt_1tsr_B.pdb in the P53 package, make sure that they have exactly the same residues.
Given the files wt_1tsr_B.pdb and wt_1tsr_B_nowater.pdb in the P53 package, make sure that the only missing molecules are the "HOH" ones.
Hint. Use the hetero attribute of the residues.
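The note above about converting three-letter amino acid codes to one-letter codes can be illustrated without parsing any structure. This is a minimal sketch under stated assumptions: the lookup table is written by hand for the 20 standard amino acids, and the function names, the "X" placeholder for unrecognized residues, and the example residue list are choices made for this sketch only. Biopython also provides a seq1() function in Bio.SeqUtils for this kind of conversion, which may be preferable in a full solution.

```python
# Hand-written lookup table for the 20 standard amino acids (an assumption of
# this sketch; it is not taken from Biopython).
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def residues_to_sequence(residue_names, unknown="X"):
    """Convert three-letter residue names to a one-letter string.

    Water ("HOH") is skipped; any other unrecognized name becomes `unknown`.
    """
    letters = []
    for name in residue_names:
        name = name.strip().upper()
        if name == "HOH":
            continue
        letters.append(THREE_TO_ONE.get(name, unknown))
    return "".join(letters)


def to_fasta(identifier, sequence, width=60):
    """Format one sequence as a FASTA record with lines of `width` letters."""
    lines = [f">{identifier}"]
    for start in range(0, len(sequence), width):
        lines.append(sequence[start:start + width])
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Residue names as they might be read from one chain of a structure
    # (invented for illustration). "MSE" is not in the table, so it maps to "X".
    names = ["MET", "GLU", "GLU", "PRO", "GLN", "HOH", "MSE"]
    print(to_fasta("example_chain_A", residues_to_sequence(names)), end="")
```

Skipping water here mirrors the last exercise, where "HOH" molecules are the only entries that should differ between the two files; other hetero residues are kept as "X" so that they remain visible in the output.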