Import dependencies:
%matplotlib inline
import csv
import seaborn
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display, HTML
from sklearn.decomposition import PCA
Here, we will import the CSV downloaded from EDD and process it into the desired format.
First, let's define an auxiliary function:
def is_number(s):
    # Return True if s can be parsed as a float (used to spot time-point columns)
    try:
        float(s)
        return True
    except ValueError:
        return False
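As a minimal sketch of how this helper is used further down, the snippet below filters a list of column headers down to those that parse as numbers. The header names here are hypothetical stand-ins, not the actual columns of the EDD export.

```python
def is_number(s):
    """Return True if s can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False

# Hypothetical column headers, similar in shape to an EDD export
columns = ['Line Name', 'Measurement Type', '0.0', '24', '48.5']

# Keep only the headers that look like time points
time_points = [c for c in columns if is_number(c)]
print(time_points)
```

Note that `float` also accepts strings such as `'nan'` and `'inf'`, so this check assumes no column carries such a name.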
Then, we load the data into a data frame:
df = pd.read_csv('PCAP using EDD.csv')
feature_relationships = {'ATOB_ECOLI':'AtoB',
'ERG8_YEAST':'PMK',
'IDI_ECOLI':'idi',
'KIME_YEAST':'MK',
'MVD1_YEAST':'PMD',
'Q40322_MENSP':'LS',
'Q8LKJ3_ABIGR':'GPPS',
'Q9FD86_STAAU':'HMGR',
'Q9FD87_STAAU':'HMGS'}
We now replace the standard EDD protein labels with more familiar names (i.e., the ones used in Alonso-Gutierrez et al.):
line_names = df['Line Name'].unique()
measurement_types = df['Measurement Type'].unique()
time_points = [ candidate_point for candidate_point in df.columns.values if is_number(candidate_point)]
feature_names = ['AtoB','HMGS','HMGR','MK','PMK','PMD','idi','GPPS','LS']
target_names = ['Limonene',]
df = df[['Line Name','Measurement Type'] + time_points]
df = pd.pivot_table(df,values=time_points,index=['Line Name'],columns='Measurement Type',aggfunc=np.sum)
df.rename(columns=feature_relationships,inplace=True)
display(df)
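To make the reshaping step easier to follow, here is a small sketch on made-up data (the line names and values are assumptions, not taken from the EDD file). It shows that passing a list to `values` yields two-level columns of the form `(time point, measurement type)`, and that `rename` with a dictionary relabels the matching entries in those columns.

```python
import pandas as pd

# Hypothetical long-format data: one row per (line, measurement type)
toy = pd.DataFrame({
    'Line Name': ['A', 'A', 'B', 'B'],
    'Measurement Type': ['ATOB_ECOLI', 'Limonene', 'ATOB_ECOLI', 'Limonene'],
    '24': [1.0, 10.0, 2.0, 20.0],
})

# One row per line, one column per (time point, measurement type) pair
wide = pd.pivot_table(toy, values=['24'], index=['Line Name'],
                      columns='Measurement Type', aggfunc='sum')

# Swap the EDD protein label for the familiar name
wide = wide.rename(columns={'ATOB_ECOLI': 'AtoB'})
print(wide)
```

Each column can then be addressed by a tuple such as `('24', 'AtoB')`, which is the form the feature and target indices take in the next step.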
And, finally, we convert the data into a feature matrix and an objective column for use with scikit-learn:
feature_indices = [(time_points[0],feature_name) for feature_name in feature_names]
target_indices = [(time_points[0],target_name) for target_name in target_names]
# DataFrame.as_matrix was removed in pandas 1.0; select the columns and call to_numpy() instead
X = df[feature_indices].to_numpy().tolist()
y = df[target_indices].to_numpy().transpose().tolist()[0]
First, fit the PCA on the points from the initial experiments and transform them; then transform the points from the second experiment using that same fitted transformation:
limonene_pca = PCA(n_components=2)
transformed_points = limonene_pca.fit_transform(X[0:27])
new_transformed_points = limonene_pca.transform(X[27:])
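The sketch below illustrates, on synthetic data, what reusing the fitted transformation means: `transform` centers new points with the mean learned from the first batch and projects them onto the components learned from that batch. The array shapes and random values are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-ins for the two experiments (9 features each)
rng = np.random.default_rng(0)
first_batch = rng.normal(size=(27, 9))
second_batch = rng.normal(size=(3, 9))

pca = PCA(n_components=2)
first_scores = pca.fit_transform(first_batch)   # learns mean and components
second_scores = pca.transform(second_batch)     # reuses them, no refitting

# Equivalent manual projection using the fitted attributes
manual = (second_batch - pca.mean_) @ pca.components_.T
print(np.allclose(manual, second_scores))
```

Because the second batch is never used for fitting, both sets of points share the same axes and can be compared on a single plot.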
We then invert the x axis (the sign of a principal component is arbitrary) so that the plot matches the original figure rather than its mirror image:
transformed_x = [-1*point[0] for point in transformed_points]
transformed_y = [point[1] for point in transformed_points]
new_transformed_x = [-1*point[0] for point in new_transformed_points]
new_transformed_y = [point[1] for point in new_transformed_points]
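The same flip can be written in vectorized form. This is only a sketch with made-up scores: multiplying the first column by -1 mirrors the plot horizontally without changing the variance along either component.

```python
import numpy as np

# Made-up PCA scores: one row per point, columns are PC1 and PC2
scores = np.array([[1.5, -0.2],
                   [-0.5, 0.7],
                   [-1.0, -0.5]])

# Flip the sign of PC1 only; PC2 is left unchanged
flipped = scores * np.array([-1.0, 1.0])
flipped_x, flipped_y = flipped[:, 0], flipped[:, 1]

# The spread along each component is unaffected by the flip
same_variance = np.allclose(flipped.var(axis=0), scores.var(axis=0))
print(flipped_x, flipped_y, same_variance)
```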
Plot the first and second principal components (first experiment in blue, second experiment in red), with marker size scaled by limonene production:
y_scaled = [item/max(y)*400 for item in y]
plt.scatter(transformed_x,transformed_y,marker='+',s=y_scaled[0:27],linewidths=1)
plt.scatter(new_transformed_x,new_transformed_y,color='red',marker='+',s=y_scaled[27:],linewidths=1)
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()
The resulting plot corresponds to Fig. 4 in Alonso-Gutierrez et al.
Alonso-Gutierrez, Jorge, et al. "Principal component analysis of proteomics (PCAP) as a tool to direct metabolic engineering." Metabolic Engineering 28 (2015): 123-133.