Pandas: Solutions
Solutions:
import pandas as pd
import matplotlib.pyplot as plt

df_iris = pd.read_csv("iris.csv")
Solution:
# number of rows and columns
print(df_iris.shape)
# only rows
print(len(df_iris.index))
# only columns
print(len(df_iris.columns))
Solution:
# average petal length
print(df_iris.PetalLength.mean())
Solution:
# average of all numeric columns (numeric_only skips the Name column)
print(df_iris.mean(numeric_only=True))
Solution:
# rows corresponding to petal length outliers
print(df_iris[df_iris.PetalLength > 1.5 * df_iris.PetalLength.mean()])
Solution:
# group-wise standard deviation
grouped = df_iris.groupby(df_iris.Name)
iris_std_by_name = grouped.std()
print(iris_std_by_name)

# or in a single line
print(df_iris.groupby(df_iris.Name).std())
Solution:
# grouped outliers
for name, dataframe in df_iris.groupby(df_iris.Name):
    print(dataframe[dataframe.PetalLength > 1.5 * df_iris.PetalLength.mean()])
Solution:
# WARNING: there are *no* group-wise outliers!
# group-wise outliers
for name, dataframe in df_iris.groupby(df_iris.Name):
    print(dataframe[dataframe.PetalLength > 1.5 * dataframe.PetalLength.mean()])

# using groupby-aggregate-merge
grouped = df_iris.groupby("Name", as_index=False)
petlen_mean_by_name = grouped.mean()
merged = pd.merge(df_iris, petlen_mean_by_name, on="Name")
# use `print(merged.columns)` to take a peek at the column names; the
# column_x columns come from the left table (i.e. the original df_iris
# DataFrame), the column_y ones come from the right table (i.e. the
# group-wise means)
print(merged[merged.PetalLength_x > 1.5 * merged.PetalLength_y])
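The `_x`/`_y` suffixes come from `pd.merge`'s default handling of overlapping column names. A minimal sketch on invented toy data:

```python
import pandas as pd

# invented toy tables that share the "PetalLength" column name
left = pd.DataFrame({"Name": ["setosa", "setosa", "virginica"],
                     "PetalLength": [1.4, 1.7, 6.0]})
right = pd.DataFrame({"Name": ["setosa", "virginica"],
                      "PetalLength": [1.55, 6.0]})

merged = pd.merge(left, right, on="Name")
# overlapping columns get "_x" (left) and "_y" (right) suffixes
print(merged.columns.tolist())  # ['Name', 'PetalLength_x', 'PetalLength_y']
```

The suffixes can be customized with the `suffixes` argument of `pd.merge` if clearer names are preferred.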
Solutions:
import pandas as pd
import matplotlib.pyplot as plt

df_gene = pd.read_csv("gene_table.txt")
Solution:
# number of genes (there is exactly one per row)
print(df_gene.shape[0])
Solution:
# minimum, maximum, average and median number of isoforms per gene
print(df_gene.transcript_count.min(),
      df_gene.transcript_count.max(),
      df_gene.transcript_count.mean(),
      df_gene.transcript_count.median())
Solution:
# plot a histogram of the number of known isoforms per gene
df_gene.transcript_count.plot(kind="hist", bins=100)
plt.show()
Solution:
# compute, for each gene_biotype, the number of associated genes
grouped = df_gene.groupby(df_gene.gene_biotype)
grouped_number_by_biotype = grouped.size()  # the result is a Series
print(grouped_number_by_biotype)
Solution:
rows = []
for biotype, subdf in df_gene.groupby(df_gene.gene_biotype):
    rows.append((len(subdf), biotype))

# alternative 1
rows.sort()
rows.reverse()
print(rows)

# alternative 2, more advanced
print(sorted(rows, reverse=True))
Solution:
print(len(df_gene.chromosome.unique()))
# or equivalently (unique() returns a numpy array)
print(df_gene.chromosome.unique().shape[0])
Solution: almost identical to the solution of point 5.
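For completeness, a sketch of that solution, assuming the point asks for the number of genes on each chromosome in decreasing order (the `chromosome` column name is taken from the solutions above; the data below is an invented stand-in for `df_gene`):

```python
import pandas as pd

# invented stand-in for df_gene loaded from gene_table.txt
df_gene = pd.DataFrame({"chromosome": ["1", "1", "2", "X", "1", "2"]})

# count genes per chromosome, then sort in decreasing order
rows = []
for chromosome, subdf in df_gene.groupby(df_gene.chromosome):
    rows.append((len(subdf), chromosome))
print(sorted(rows, reverse=True))  # [(3, '1'), (2, '2'), (1, 'X')]
```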
Solution:
for chromosome, subdf in df_gene.groupby(df_gene.chromosome):
    num_plus = len(subdf[subdf.strand == "+"])
    perc_plus = 100 * num_plus / len(subdf)
    print(chromosome, perc_plus)
Solution:
print(df_gene.groupby("gene_biotype", as_index=False)["transcript_count"].mean())
Solutions:
import pandas as pd
import matplotlib.pyplot as plt

df_tt = pd.read_csv("transcript_table.txt")
Solution:
print(df_tt.shape[0])
Solution:
# minimum, maximum, average and median length of human transcripts
print(df_tt.transcript_length.min(),
      df_tt.transcript_length.max(),
      df_tt.transcript_length.mean(),
      df_tt.transcript_length.median())
Solution:
# minimum, maximum, average and median length of the CDS of human
# transcripts (excluding values equal to 0)
cds = df_tt.cds_length[df_tt.cds_length > 0]
print(cds.min(), cds.max(), cds.mean(), cds.median())
Solution:
# compute the percentage of human transcripts with a CDS length that is
# a multiple of 3 (excluding values equal to 0)
cds = df_tt.cds_length[df_tt.cds_length > 0]
cds_number = cds.shape[0]
cds_multiple_number = cds[cds % 3 == 0].shape[0]
print(100 * cds_multiple_number / cds_number)
# Is the percentage equal to 100%?
# How can you explain coding sequences whose length is not a multiple
# of 3? (remember the rules of the genetic code)
Solution:
# minimum, maximum, average and median length of the UTRs of human
# transcripts (excluding values equal to 0)

# for 5' UTRs
filter_df = df_tt[df_tt.utr5_length > 0]
print(filter_df.utr5_length.min(),
      filter_df.utr5_length.max(),
      filter_df.utr5_length.mean(),
      filter_df.utr5_length.median())

# for 3' UTRs
filter_df = df_tt[df_tt.utr3_length > 0]
print(filter_df.utr3_length.min(),
      filter_df.utr3_length.max(),
      filter_df.utr3_length.mean(),
      filter_df.utr3_length.median())

# for both 5' and 3' UTRs (calculating the sum of lengths)
filter_df = df_tt[(df_tt.utr5_length > 0) & (df_tt.utr3_length > 0)]
sum_of_lengths = filter_df.utr5_length + filter_df.utr3_length
print(sum_of_lengths.min(),
      sum_of_lengths.max(),
      sum_of_lengths.mean(),
      sum_of_lengths.median())
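Note that chained selections such as df_tt[df_tt.utr5_length > 0][df_tt.utr3_length > 0] rely on index alignment between the two steps; combining the conditions with & in a single selection is more robust. A toy sketch (data invented):

```python
import pandas as pd

# invented toy lengths; only row 2 has both UTRs present
df = pd.DataFrame({"utr5_length": [0, 10, 20, 0],
                   "utr3_length": [5, 0, 30, 0]})

# parentheses are required around each condition, since & binds tightly
both = df[(df.utr5_length > 0) & (df.utr3_length > 0)]
print(both)  # only the row where both lengths are positive survives
```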
Solution:
# compute, for each transcript_biotype, the number of associated
# transcripts (a histogram), and print the biotypes with the number of
# associated transcripts in decreasing order
grouped = df_tt.groupby(df_tt.transcript_biotype)
grouped_number_by_biotype = grouped.size()  # the result is a Series
print(grouped_number_by_biotype.sort_values(ascending=False))
Solution:
# compute, for each transcript_biotype, the average transcript length,
# and print the results in increasing order
grouped = df_tt.groupby(df_tt.transcript_biotype)
mean_length_by_biotype = grouped["transcript_length"].mean()
print(mean_length_by_biotype.sort_values())
Solution:
# compute, for protein_coding transcripts, the average length of the
# 5' UTR, CDS and 3' UTR
filter_df = df_tt[df_tt.transcript_biotype == "protein_coding"]
print(filter_df.utr5_length.mean(),
      filter_df.cds_length.mean(),
      filter_df.utr3_length.mean())
Solution:
# compute, for each transcript_biotype and considering only canonical
# transcripts, the average number of exons
filter_df = df_tt[df_tt.canonical_flag == "T"]
grouped = filter_df.groupby(filter_df.transcript_biotype)
print(grouped["exon_count"].mean())
Solutions:
import pandas as pd
import matplotlib.pyplot as plt

df_exon = pd.read_csv("exon_table.txt")
Solution:
# exon length: end coordinate minus start coordinate, plus one
# (coordinates are inclusive on both ends)
df_length = df_exon.exon_chrom_end - df_exon.exon_chrom_start + 1
print(df_length.min(), df_length.max(), df_length.mean(), df_length.median())
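As a sanity check on the end - start + 1 formula, a toy example with invented coordinates: an exon spanning bases 10..12 contains 3 bases, and one spanning 100..100 contains exactly 1.

```python
import pandas as pd

# invented exon coordinates, inclusive on both ends
df = pd.DataFrame({"exon_chrom_start": [10, 100],
                   "exon_chrom_end": [12, 100]})
lengths = df.exon_chrom_end - df.exon_chrom_start + 1
print(lengths.tolist())  # [3, 1] -- a single-base exon has length 1
```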
Solution:
name = input("write a transcript name: ")
filtered = df_exon[df_exon.transcript_name == name].copy()
# add the exon length as a new column, then sort by start coordinate
filtered["exon_length"] = filtered.exon_chrom_end - filtered.exon_chrom_start + 1
print(filtered.sort_values("exon_chrom_start"))
Solutions:
import pandas as pd
import matplotlib.pyplot as plt

ts = pd.read_csv("expression_timeseries.txt", sep="\t")
Solution:
# number of genes
print(ts.shape[0])
# number of time steps (just count how many columns end with 'h')
print(len([c for c in ts.columns if c[-1] == "h"]))
Solution:
# if there are NaNs, then dropna() will remove at least one row
print("got nans?", ts.shape[0] != ts.dropna().shape[0])
# how many rows with at least one NaN are there? (just to double-check)
print(ts.shape[0] - ts.dropna().shape[0])
# fill in the missing values by propagating the last valid observation
ts = ts.ffill()
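To see what ffill() does, a minimal sketch on an invented Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, np.nan, 4.0])
# each NaN is replaced by the last preceding valid value
filled = s.ffill()
print(filled.tolist())  # [1.0, 1.0, 1.0, 4.0]
```

Note that a NaN in the very first position would stay NaN, since there is no earlier value to propagate.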
Note
In the following solutions we will assume that ffill() has already been called.
Solution: there are actually two expression profiles for the "ZFX" gene:

>>> expr_zfx = ts[ts["Gene Symbol"] == "ZFX"]
>>> expr_zfx
      SPOT Gene Symbol     0h   0.5h     3h     6h    12h
0        1         ZFX -0.092  0.126  0.109  0.043  0.074
1739  1740         ZFX -0.487  0.298 -0.058 -0.158 -0.155
Let’s plot the first; the other can be dealt with similarly:
# the first profile is a Series
first = expr_zfx.iloc[0]
# keep only the expression measurements and make sure they are floats
numbers = first[2:].astype(float)
numbers.plot(kind="line")
plt.savefig("output.png")
Solution: let’s take the mean over the time steps:
means = ts[ts.columns[2:]].mean(axis=1)
Here the first part extracts only those columns that encode expression measurements (from the third onwards), while axis=1 specifies that the average should be taken over columns, rather than over rows as we are used to. The latter case corresponds to axis=0, which is the default. Now:
variances = ts[ts.columns[2:]].var(axis=1)
does the same thing for the variance.
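To make the axis convention concrete, a toy sketch (column names invented):

```python
import pandas as pd

df = pd.DataFrame({"0h": [1.0, 3.0], "3h": [2.0, 5.0]})

# axis=0 (the default): one value per column, averaging over the rows
print(df.mean(axis=0).tolist())  # [2.0, 3.5]
# axis=1: one value per row, averaging over the columns
print(df.mean(axis=1).tolist())  # [1.5, 4.0]
```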
Let’s plot the mean and the variance:
means.plot(kind="line")
plt.savefig("means.png")
# clear the figure between plots
plt.clf()
# same thing for the variance
variances.plot(kind="line")
plt.savefig("variances.png")
Warning
The above exercise is slightly bugged: solving it "the Pandas way" requires more advanced material that is not provided by the notes, so the solution below is not fully Pandas-approved.
Solution:
# we know that the only columns we care about are from the third
# onwards -- for simplicity
for column in ts.columns[2:]:
    index_of_max = ts[column].idxmax()
    print(column, ts.loc[index_of_max])
Unfortunately most of the gene names are NaN; the solution is correct nevertheless.

Solution: same as below, except that the inner loop keeps track of the index of the maximally correlated gene.
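That idea can be sketched as follows, on an invented toy frame with the same layout as ts (two label columns followed by measurements); the nested loop tracks, for each row, the index of the most correlated other row:

```python
import pandas as pd

# invented stand-in for ts: SPOT, Gene Symbol, then expression values
ts = pd.DataFrame({"SPOT": [1, 2, 3],
                   "Gene Symbol": ["A", "B", "C"],
                   "0h": [0.1, 0.2, -0.1],
                   "3h": [0.4, 0.8, -0.4],
                   "6h": [0.2, 0.4, -0.2]})

best = []
for i in range(ts.shape[0]):
    row_i = ts.iloc[i, 2:].astype(float)
    best_j, best_corr = None, -2.0  # correlations live in [-1, 1]
    for j in range(ts.shape[0]):
        if j == i:
            continue
        row_j = ts.iloc[j, 2:].astype(float)
        c = row_i.corr(row_j)
        if c > best_corr:
            best_j, best_corr = j, c
    best.append(best_j)
print(best)  # rows 0 and 1 are proportional, so they pick each other
```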
Very elementary solution; it will take a long time to complete:
ts = pd.read_csv("expression_timeseries.txt", sep="\t")

num_rows = ts.shape[0]  # set this to 100 or so to make it faster
corr = []
for i in range(num_rows):
    row_i = ts.iloc[i, 2:].astype(float)
    corr_i = []
    for j in range(num_rows):
        row_j = ts.iloc[j, 2:].astype(float)
        corr_i.append(row_i.corr(row_j))
    corr.append(corr_i)

plt.imshow(pd.DataFrame(corr).values, cmap="gray", interpolation="nearest")
plt.savefig("corr.png")
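For the record, the whole matrix can also be obtained in one call: DataFrame.corr computes pairwise correlations between columns, so transposing the measurement block makes it correlate rows with rows. A toy sketch on invented data:

```python
import pandas as pd

# invented stand-in for ts; row B is exactly minus row A
ts = pd.DataFrame({"SPOT": [1, 2],
                   "Gene Symbol": ["A", "B"],
                   "0h": [0.1, -0.1],
                   "3h": [0.4, -0.4],
                   "6h": [0.2, -0.2]})

# corr() works column-wise, so transpose to correlate rows with rows
corr = ts[ts.columns[2:]].astype(float).T.corr()
print(corr)  # 2x2 matrix: 1 on the diagonal, -1 off it
```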
Left to the reader.