Assignment overview

In this assignment you will use commonly used tools and approaches for transcriptomics and proteomics data analysis.

Part I: Proteomics

This study used shotgun proteomics to study the proteomes of mouse embryonic stem cell sub-cellular compartments.

Whole embryonic stem cell lysates were fractionated into nuclear and cytoplasmic components using low-speed centrifugation. We then used one-dimensional gel-based fractionation to separate the nuclear and cytoplasmic samples into 5 fractions each (Song et al., 2012). The data provided here correspond to one of these fractions from each of the nuclear (Nuc) and cytoplasmic (Cyto) samples.

Assignment Tasks

Data files and formats
- Download the MGF files, and write Bash or R code to determine how many spectra are in the Nuc file and how many in the Cyto file
- Write Bash or R code to count the total number of ions in each file, and to calculate the mean number of ions per spectrum
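
The two counting tasks above can be sketched as follows. The assignment asks for Bash or R, so this Python version only illustrates the logic. It assumes the usual MGF layout: each spectrum sits between a `BEGIN IONS` and an `END IONS` line, header lines look like `KEY=VALUE`, and each fragment ion is one peak line starting with a digit (an m/z and intensity pair). The function name `mgf_summary` is hypothetical.

```python
def mgf_summary(path):
    """Return (spectra, ions, mean ions per spectrum) for one MGF file."""
    spectra = 0
    ions = 0
    in_spectrum = False
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if line == "BEGIN IONS":
                spectra += 1
                in_spectrum = True
            elif line == "END IONS":
                in_spectrum = False
            elif in_spectrum and line and line[0].isdigit():
                # Assumption: peak lines start with a digit (m/z value);
                # header lines such as TITLE= or PEPMASS= start with a letter.
                ions += 1
    mean = ions / spectra if spectra else 0.0
    return spectra, ions, mean
```

Calling `mgf_summary("ESC_Nuc_fraction.mgf")` and `mgf_summary("ESC_Cyto_fraction.mgf")` would give the three numbers for each file. The same logic carries over to a Bash or R solution.
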
MS search engines
- Search the Nuc and Cyto files against a mouse sequence database using the X! Tandem search engine
- Search both files again, using a different (human) sequence database
- Download protein and peptide tables for each set of search results
- Write Bash or R code that computes the numbers of total and unique peptides and proteins
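
The total versus unique counts in the last task can be sketched as below. This is an illustration in Python, not the required Bash or R solution. It assumes the downloaded tables are tab-delimited with a header row. The column names `protein` and `sequence` used in the usage note are hypothetical and need to be replaced by whatever the downloaded tables actually contain.

```python
import csv


def count_total_and_unique(path, column, delimiter="\t"):
    """Return (total non-empty entries, number of distinct entries) for one column."""
    values = []
    with open(path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter=delimiter):
            value = (row.get(column) or "").strip()
            if value:
                values.append(value)
    # "Total" counts every row with a value; "unique" collapses repeats.
    return len(values), len(set(values))
```

For a peptide table, `count_total_and_unique("nuc_peptides.tsv", "sequence")` would return the total and unique peptide counts (the file name is a placeholder). The same call on the protein column gives the protein counts.
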
Biological Interpretation
- Using the GPM interface, analyze the Gene Ontology Cellular Compartments represented in the identified proteins
- Plot a graph comparing the Nuc and Cyto cellular compartments
- Discuss differences between the Nuc and Cyto results in terms of protein sub-cellular localization
- Using the protein tables downloaded above, analyze the biological trends represented in the data (pathways, processes etc) using DAVID
- Briefly (500 words maximum) discuss your findings and relate the analysis to the nuclear and cytoplasmic compartments that have been analysed

Assignment Assessment

This assignment focuses on using tools for proteomic and transcriptomic analyses. The assignment will be assessed according to:

- Your ability to use the tools correctly
- The supporting code or documentation you provide for your analyses
- Your understanding of the results and your ability to interpret and discuss them

Data

Data Files
- MS/MS spectra from nuclear fraction
- MS/MS spectra from cytoplasmic fraction

Search engine settings
- Database: SwissProt
- Taxonomy: Mus musculus

Pre-computed search results
- ESC_Nuc_fraction.mgf searched on X! Tandem
- ESC_Cyto_fraction.mgf searched on X! Tandem

References

Song J, Saha S, Gokulrangan G, Tesar PJ, Ewing RM. DNA and chromatin modification networks distinguish stem cell pluripotent ground states. Molecular & Cellular Proteomics. 2012;11(10):1036-47.

Part II: Transcriptomics

B. The Gene Expression Omnibus (GEO) is a large repository of gene-expression studies. For this task you will perform statistical analysis on a given study in GEO and identify differentially expressed genes.
- Download some data from GEO
- Analyze the data using GEOquery
- Identify differentially expressed genes
- Run those genes through DAVID or a similar tool
- Analyze the results using Gene Ontology
- Draw conclusions
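
Identifying differentially expressed genes comes down to comparing expression between two sample groups, gene by gene. The sketch below illustrates that comparison in plain Python with a Welch t statistic. It is not GEOquery or any specific R package, and it stops short of p-values and multiple-testing correction, which a full analysis would need. It assumes log2-scale expression values and at least two samples per group. The function names and the `expression` data layout are hypothetical.

```python
from math import copysign, inf, sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Return (difference of means, Welch t statistic) for two lists of log2 values."""
    diff = mean(group_a) - mean(group_b)
    se = sqrt(variance(group_a) / len(group_a) + variance(group_b) / len(group_b))
    if se == 0:
        # No variability in either group: t is undefined, so report 0 or +/- infinity.
        return diff, (0.0 if diff == 0 else copysign(inf, diff))
    return diff, diff / se


def rank_genes(expression, samples_a, samples_b):
    """Rank genes by |t|. `expression` maps gene -> {sample name: log2 value}."""
    results = []
    for gene, values in expression.items():
        a = [values[s] for s in samples_a]
        b = [values[s] for s in samples_b]
        diff, t = welch_t(a, b)
        results.append((gene, diff, t))
    return sorted(results, key=lambda r: abs(r[2]), reverse=True)
```

On log2 data the difference of means is the log2 fold change, so the top of the ranked list gives candidate genes to take forward to DAVID or a Gene Ontology analysis.
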

Part III: Databases and Resources

A. The BioGRID database is a repository of protein interaction information. In this task you will query the BioGRID database, extract interacting partners for your query protein and construct a network.
- Starting with one of the query proteins, search BioGRID and extract its interactors. For selected interactors, search again and add these interactors to your list
- Filter the protein interactions according to the technique that was used to detect them. Are there technique-specific interactors?
- Filter the PPIs according to confidence
- Take your list of interacting proteins and visualize the network using Cytoscape (use scripting or R to construct the SIF file)
- Compare this network to the information in STRING. Are there more or fewer interactors...
- Compare this network to the TMM network of co-expression. Are any of the protein interacting partners also co-expressed?

Using Cytoscape:
- Load the yeast galactose utilization network
- Derive network properties (nodes, hubs, clusters)
- Select 2 hubs and find out whether the two proteins are functionally related
- Explore different modes of visualization
- Use MCODE to cluster the network. How many clusters are there?
- Find a hub node, identify it and write about why it is important
- Clustering coefficients

EXTRA NOTES:

PART 1 PROCESSING AND ANALYSING OMICS DATA

A. Mass spectrometry proteomics data
* A large survey of Affinity-Purification Mass-Spectrometry (AP-MS) experiments was performed in human cells (Ewing et al, 2007).
* Interacting proteins were identified for ~400 different bait proteins.
* In this task, you will take some of the raw data (mass spectra) from this study and identify the corresponding proteins.
* Selected mass-spectrometry files are provided. Each of these files corresponds to a single affinity-purification mass-spectrometry experiment, and the bait protein used is indicated by the first part of the file name, which is the Gene Symbol.
* In addition, control experiments were performed so that non-specific interacting proteins could be identified. These are the files with Gene Symbol "na".
* Select one of the baits (EIF4A2, TCF1, CTNNBIP1, PSMD13) and search the mass-spectrometry files for that bait using the X! Tandem search engine. In addition, search the control files.
* Appropriate initial settings for the search engine are ...

Take some proteomics data from a spreadsheet (epi / esc?), compute statistics, identify differential expression and do some analysis.

Use Scaffold with an AP-MS file or a protein expression file:
* Using spectral counting, identify those proteins that are differential and those that are contaminants
* Look at a spectrum and find something out
* Take the proteins and do some analysis
* Look at peptides: what is the % coverage? Count the number of peptides for a given protein. Which peptide was identified the most, and is it tryptic?
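
The peptide-level questions above (% coverage, the most frequently identified peptide, whether it is tryptic) can be sketched in Python as follows. This is only an illustration under stated assumptions: peptides are plain, non-empty amino-acid strings without modification annotations, the protein sequence is known, and "tryptic" means cleavage after K or R but not before P. All function names are hypothetical.

```python
from collections import Counter


def percent_coverage(protein_seq, peptides):
    """Percentage of residues in protein_seq covered by at least one peptide."""
    covered = [False] * len(protein_seq)
    for pep in set(peptides):
        if not pep:
            continue
        start = protein_seq.find(pep)
        while start != -1:
            for i in range(start, start + len(pep)):
                covered[i] = True
            start = protein_seq.find(pep, start + 1)
    return 100.0 * sum(covered) / len(protein_seq) if protein_seq else 0.0


def most_identified(peptides):
    """Return (peptide, count) for the most frequently identified peptide."""
    return Counter(peptides).most_common(1)[0]


def is_fully_tryptic(protein_seq, peptide):
    """True if the first occurrence of peptide has trypsin-like ends on both sides."""
    start = protein_seq.find(peptide)
    if start == -1:
        return False
    end = start + len(peptide)
    # N-terminal end: protein start, or preceded by K/R with no P at the cleavage site.
    n_ok = start == 0 or (protein_seq[start - 1] in "KR" and peptide[0] != "P")
    # C-terminal end: protein end, or the peptide ends in K/R and is not followed by P.
    c_ok = end == len(protein_seq) or (peptide[-1] in "KR" and protein_seq[end] != "P")
    return n_ok and c_ok
```

Counting the peptides for a given protein is then `len(peptides)` for total identifications, or `len(set(peptides))` for distinct sequences.
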