In this assignment you will use widely used tools and approaches for transcriptomics and proteomics data analysis.
Part I: Proteomics
This study used shotgun proteomics to characterize the proteomes of mouse embryonic stem cell sub-cellular compartments.
Whole embryonic stem cell lysates were fractionated into nuclear and cytoplasmic components using low-speed centrifugation. One-dimensional gel-based fractionation was then used to separate the nuclear and cytoplasmic samples into 5 fractions each (Song et al., 2012). The data provided here correspond to one fraction each from the nuclear (Nuc) and cytoplasmic (Cyto) samples.
Assignment Tasks
Data files and formats
Download the MGF files, and write Bash or R code to determine how many spectra are in the Nuc file and how many in the Cyto file
Write Bash or R code to count the total number of ions in each file, and to calculate the mean number of ions per spectrum
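As a rough illustration of the counting logic for the two tasks above, here is a minimal sketch in Python (the assignment itself asks for Bash or R, so this would need porting). It assumes the usual MGF layout: each spectrum sits between a BEGIN IONS line and an END IONS line, header lines look like KEY=value, and each peak (ion) line starts with a numeric m/z value. The function name mgf_stats and the example spectra are made up for illustration.

```python
def mgf_stats(lines):
    """Return (n_spectra, n_ions, mean_ions_per_spectrum) for MGF-formatted lines."""
    n_spectra = 0
    n_ions = 0
    in_spectrum = False
    for raw in lines:
        line = raw.strip()
        if line == "BEGIN IONS":
            in_spectrum = True
            n_spectra += 1
        elif line == "END IONS":
            in_spectrum = False
        elif in_spectrum and line and line[0].isdigit():
            # Assumption: peak lines start with a numeric m/z value,
            # while header lines (TITLE=, PEPMASS=, CHARGE=) do not.
            n_ions += 1
    mean_ions = n_ions / n_spectra if n_spectra else 0.0
    return n_spectra, n_ions, mean_ions


# Made-up two-spectrum example, not taken from the assignment data.
example = """BEGIN IONS
TITLE=spectrum 1
PEPMASS=500.25
CHARGE=2+
100.1 20
200.2 35.5
300.3 10
END IONS
BEGIN IONS
TITLE=spectrum 2
PEPMASS=612.3
150.0 5
END IONS
""".splitlines()

print(mgf_stats(example))
```

For a real file, the same function can be given an open file handle in place of the example lines, once for the Nuc file and once for the Cyto file.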
MS search engines
Search the Nuc and Cyto files against a mouse sequence database using the X! Tandem search engine
Search both files again, using a different (human) sequence database
Download protein and peptide tables for each set of search results
Write Bash or R code that computes the numbers of total and unique peptides and proteins
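The total and unique counts can be sketched as follows, in Python for illustration only (the assignment asks for Bash or R). The sketch assumes the downloaded tables are tab-delimited with a header row; the column names "protein" and "sequence" are hypothetical and need to be replaced with the actual headers in the downloaded files.

```python
import csv
import io


def count_total_and_unique(handle, column):
    """Return (total, unique) counts of non-empty values in one table column."""
    reader = csv.DictReader(handle, delimiter="\t")
    values = [row[column] for row in reader if row[column]]
    return len(values), len(set(values))


# Made-up table; the column names are assumptions, not the real export headers.
example = "protein\tsequence\nP1\tAAAK\nP1\tCCCR\nP2\tAAAK\n"

total_pep, unique_pep = count_total_and_unique(io.StringIO(example), "sequence")
total_prot, unique_prot = count_total_and_unique(io.StringIO(example), "protein")
print(total_pep, unique_pep, total_prot, unique_prot)
```

"Total" here means the number of rows with a value in the column, and "unique" means the number of distinct values, so a peptide matched to two proteins is counted twice in the total but once in the unique count.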
Biological Interpretation
Using the GPM interface, analyze the Gene Ontology Cellular Compartments represented in the identified proteins
Plot a graph comparing the Nuc and Cyto cellular compartments
Discuss differences between the Nuc and Cyto results in terms of protein sub-cellular localization
Using the protein tables downloaded above, analyze the biological trends represented in the data (pathways, processes, etc.) using DAVID
Briefly (500 words maximum) discuss your findings and relate the analysis to the nuclear and cytoplasmic compartments that have been analyzed
Assignment Assessment
This assignment focuses on using tools for proteomic and transcriptomic analyses. The assignment will be assessed according to:
Your ability to use the tools correctly
Providing supporting code or documentation for your analyses
Your understanding and ability to interpret and discuss the results
Song J, Saha S, Gokulrangan G, Tesar PJ, Ewing RM. DNA and chromatin modification networks distinguish stem cell pluripotent ground states. Molecular & Cellular Proteomics. 2012;11(10):1036-47.
Part II: Transcriptomics
B. The Gene Expression Omnibus (GEO) is a large repository of gene-expression studies. For this task you will perform a statistical analysis of a given study in GEO and identify differentially expressed genes.
Download the data for the study from GEO
Analyze the data using GEOquery
Identify differentially expressed genes
Run those genes through DAVID or a similar tool
Analyze using Gene Ontology
Draw conclusions
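To make the "identify differentially expressed genes" step concrete, here is a minimal Python sketch of a per-gene comparison between two sample groups. It is not the GEOquery workflow itself (GEOquery is an R package); it only illustrates the underlying idea, assuming log2-scale expression values. It computes a log2 fold change and a Welch t-statistic per gene and ranks genes by the absolute statistic. All names and values are made up, and a real analysis would add p-values and multiple-testing correction.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t-statistic for two groups of values (unequal variances allowed)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def rank_genes(expr, group_a, group_b):
    """expr maps gene -> {sample: log2 expression}.

    Returns (gene, log2 fold change A minus B, t) tuples sorted by |t|, largest first.
    """
    results = []
    for gene, values in expr.items():
        a = [values[s] for s in group_a]
        b = [values[s] for s in group_b]
        results.append((gene, mean(a) - mean(b), welch_t(a, b)))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)


# Made-up log2 expression values for two hypothetical genes.
expr = {
    "geneA": {"c1": 5.0, "c2": 5.2, "c3": 4.8, "t1": 8.0, "t2": 8.2, "t3": 7.8},
    "geneB": {"c1": 6.0, "c2": 6.1, "c3": 5.9, "t1": 6.0, "t2": 5.9, "t3": 6.1},
}
ranked = rank_genes(expr, group_a=["t1", "t2", "t3"], group_b=["c1", "c2", "c3"])
for gene, log2fc, t in ranked:
    print(gene, round(log2fc, 2), round(t, 2))
```

The gene symbols at the top of such a ranking are what would be passed on to DAVID or a Gene Ontology analysis.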
Part III: Databases and Resources
A. The BioGRID database is a repository of protein interaction information.
In this task you will query the BioGRID database, extract interacting partners for your query protein and construct a network
Starting with one of the query proteins, search BioGRID and extract its interactors. For selected interactors, search again and add their interactors to your list
Filter the protein interactions according to the technique that was used to analyze them.
Are there technique-specific interactors?
Filter the PPIs according to confidence.
Take your list of interacting proteins and visualize the network using Cytoscape (use scripting or R to construct the SIF file)
Compare this network to the information in STRING
Are there more or fewer interactors...
Compare this network to TMM network of co-expression, are any of the protein interacting partners also co-expressed?
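The filtering and SIF-construction steps above could be scripted roughly as follows. This is a Python sketch under stated assumptions: the interactions have been downloaded as a tab-delimited file with a header row, and the column names used here ("Official Symbol Interactor A", "Official Symbol Interactor B", "Experimental System") match the download, which should be checked against the actual file. The partner names and rows in the example are made up and are not real BioGRID records.

```python
import csv
import io


def interactions_to_sif(handle, systems=None):
    """Return SIF lines (nodeA <tab> pp <tab> nodeB) from a tab-delimited table.

    If `systems` is given, only rows whose experimental system is in that
    set are kept. Duplicate and reversed pairs are collapsed into one edge.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    edges = set()
    for row in reader:
        if systems is not None and row["Experimental System"] not in systems:
            continue
        a, b = sorted((row["Official Symbol Interactor A"],
                       row["Official Symbol Interactor B"]))
        edges.add((a, "pp", b))  # "pp" = protein-protein interaction type
    return ["\t".join(edge) for edge in sorted(edges)]


# Toy rows for illustration only.
example = (
    "Official Symbol Interactor A\tOfficial Symbol Interactor B\tExperimental System\n"
    "EIF4A2\tPARTNER1\tAffinity Capture-MS\n"
    "PARTNER1\tEIF4A2\tAffinity Capture-MS\n"
    "EIF4A2\tPARTNER2\tTwo-hybrid\n"
)

all_edges = interactions_to_sif(io.StringIO(example))
ms_edges = interactions_to_sif(io.StringIO(example), systems={"Affinity Capture-MS"})
print("\n".join(all_edges))
```

Writing the returned lines to a file with a .sif extension gives something Cytoscape can import as a network. Comparing the edge lists produced with different `systems` sets is one way to look for technique-specific interactors.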
Use Cytoscape
Load the yeast galactose utilization network
Derive network properties (nodes, hubs, clusters)
Select 2 hubs and find out whether the two proteins are functionally related
Explore different modes of visualization
Use MCODE to cluster the network. How many clusters are found?
Find a hub node, identify it and write about why it is important
Examine clustering coefficients
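Cytoscape reports these properties directly, but it can help to see what they mean. The Python sketch below computes node degree, picks the highest-degree node as a hub, and computes the local clustering coefficient, which is the fraction of a node's neighbour pairs that are themselves connected. The toy edge list is made up and is not the yeast network.

```python
from collections import defaultdict
from itertools import combinations


def build_graph(edges):
    """Build an undirected adjacency map, ignoring self-loops."""
    graph = defaultdict(set)
    for a, b in edges:
        if a != b:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clustering_coefficient(graph, node):
    """Local clustering coefficient: 2 * links / (k * (k - 1)) for degree k."""
    neighbours = graph[node]
    k = len(neighbours)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(neighbours, 2) if v in graph[u])
    return 2 * links / (k * (k - 1))


# Toy network for illustration.
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
graph = build_graph(edges)
degrees = {node: len(nb) for node, nb in graph.items()}
hub = max(degrees, key=degrees.get)

print(hub, degrees[hub], clustering_coefficient(graph, hub))
```

In this toy network the hub has the highest degree but a low clustering coefficient, because most of its neighbours are not connected to each other.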
EXTRA NOTES:
PART 1 PROCESSING AND ANALYSING OMICS DATA
A. Mass spectrometry proteomics data
* A large survey of Affinity-Purification Mass-Spectrometry (AP-MS) experiments was performed in human cells (Ewing et al, 2007).
* Interacting proteins were identified for ~400 different bait proteins
* In this task, you will take some of the raw data (mass spectra) from this study and identify corresponding proteins
* Selected mass-spectrometry files are provided. Each of these files corresponds to a single affinity-purification mass-spectrometry experiment, and the bait protein used is indicated by the first part of the file name, which is the Gene Symbol.
* In addition, control experiments were performed so that non-specific interacting proteins could be identified. These are the files with Gene Symbol "na".
* Select one of the baits (EIF4A2, TCF1, CTNNBIP1, PSMD13) and search the mass-spectrometry files for that bait using the X! Tandem search engine. In addition, search the control files.
* Appropriate initial settings for the search engine are ...
Take some proteomics data from a spreadsheet (epi / esc?), compute statistics, identify differential expression, and do some analysis
Use Scaffold with an AP-MS file or a protein expression file:
Using spectral counting, identify those proteins that are differential and those that are contaminants
Look at a spectrum and find something out
Take the proteins and do some analysis
Look at peptides: what is the % coverage, how many peptides are there for a given protein, and which peptide was identified the most? Is it tryptic?
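The peptide-level questions above can be made concrete with a small Python sketch. It assumes you have the protein sequence and the list of identified peptide sequences as plain strings. The tryptic check uses a simplified rule (cleavage after K or R, ignoring the usual exception before proline), and the sequences below are made up for illustration.

```python
from collections import Counter


def sequence_coverage(protein, peptides):
    """Percentage of protein residues covered by at least one peptide."""
    covered = [False] * len(protein)
    for pep in peptides:
        start = protein.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein) if protein else 0.0


def is_fully_tryptic(protein, peptide):
    """Simplified check: both peptide ends are consistent with cleavage after K/R."""
    start = protein.find(peptide)
    if start == -1:
        return False
    end = start + len(peptide)
    n_term_ok = start == 0 or protein[start - 1] in "KR"
    c_term_ok = end == len(protein) or peptide[-1] in "KR"
    return n_term_ok and c_term_ok


# Made-up protein and peptide identifications (one peptide seen twice).
protein = "MKTAYIAKQRQISFVK"
identified = ["TAYIAK", "QISFVK", "TAYIAK"]

coverage = sequence_coverage(protein, set(identified))
most_common_peptide, n_times = Counter(identified).most_common(1)[0]
print(coverage, most_common_peptide, n_times,
      is_fully_tryptic(protein, most_common_peptide))
```

Here the most frequently identified peptide is preceded by K and ends in K, so it passes the simplified tryptic check, whereas a peptide such as AYIAK (preceded by T) would not.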