Mass-Spectrometry Search Engines

Mass-spectra are matched to peptides using a search engine so that proteins can be identified. In this tutorial we will use a command-line version of the Tandem search engine to match mass-spectra to protein sequences. There are also online web-based search engines such as Mascot; however, the online version of Mascot limits the number of spectra that may be searched.

Search engines work by predicting the peptide fragments that would be created when each protein in a sequence database is digested by a specific enzyme. For example, trypsin cleaves a protein after an arginine or lysine, unless that residue is followed by a proline. The search engine compares the unknown peptide fragment with the predicted fragments from the protein sequence database to generate a list of potential protein candidates. A scoring method is then used to identify the best match.
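
The trypsin rule described above can be illustrated with a small in-silico digest. This is only a sketch of the cleavage rule, not how Tandem or Mascot implement it: it ignores missed cleavages and fragment masses, and the function name `digest` and the example sequence are made up for illustration.

```shell
# Split a protein sequence after every K (lysine) or R (arginine),
# unless the next residue is P (proline). One peptide is printed per line.
digest() {
    printf '%s\n' "$1" | awk '{
        pep = ""
        n = length($0)
        for (i = 1; i <= n; i++) {
            aa  = substr($0, i, 1)      # current residue
            nxt = substr($0, i + 1, 1)  # next residue (empty at the end)
            pep = pep aa
            if ((aa == "K" || aa == "R") && nxt != "P") {
                print pep
                pep = ""
            }
        }
        if (pep != "") print pep        # trailing peptide with no cleavage site
    }'
}

# Hypothetical example sequence; prints MK, RPSAK and LR on separate lines
digest "MKRPSAKLR"
```

Note that the R followed by P is not cleaved, so it stays at the start of the second peptide.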

Using the Mascot Search Engine

In the Data file field, click Browse and navigate to the file of interest.
In the Database dropdown list, select the appropriate database.
In the Data format dropdown list, select the appropriate file format.
Click Start Search.
Note that the free online version of Mascot will only process up to 1200 spectra (i.e. very small mass-spectrometry files).
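
To check in advance whether a peak list fits under this limit, the spectra can be counted from the command line. The sketch below assumes an MGF file, in which every spectrum opens with a `BEGIN IONS` line; `count_spectra` and `my_sample.mgf` are hypothetical names used for illustration.

```shell
# Count the spectra in an MGF peak list by counting the "BEGIN IONS" lines.
count_spectra() {
    grep -c '^BEGIN IONS' "$1"
}

# Example usage (my_sample.mgf is a placeholder for your own file):
# count_spectra my_sample.mgf
```

If the number printed is above 1200, the file is too large for the free online Mascot search.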

Using the GPM Search Engine (tandem) via the Linux command line

(Requires access to a Linux server with the tandem programme installed)
Log in to iridis5_a.soton.ac.uk
Download this zip file to your directory on iridis5_a.soton.ac.uk
- Unzip the file:
```
unzip tandem-linux-workshop.zip
```
- This will create a directory:
```
tandem-linux-workshop/
```
- cd to the directory:
```
cd tandem-linux-workshop/bin/
```
- Run:
```
module load tandem 
```
- Test that tandem runs by running:
```
tandem.exe input.xml
```
You'll see several xml files in the bin directory; for the purposes of this tutorial you do not need to edit these.
Take a look at the default_input.xml file (this is where the parameters for the tandem programme are set).
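
A quick way to read the parameters without opening the whole file is to print each one as a `label = value` line. This sketch assumes the parameters are stored as `<note type="input" label="...">value</note>` elements, one per line, which is the usual layout of tandem input files; `list_params` is a hypothetical helper name.

```shell
# Print every input parameter in a tandem XML file as "label = value".
# Lines that are not <note type="input" ...> elements are skipped.
list_params() {
    sed -n 's/.*<note type="input" label="\([^"]*\)">\([^<]*\)<\/note>.*/\1 = \2/p' "$1"
}

# Example usage, run from the bin directory:
# list_params default_input.xml
```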
To search your mass-spectra data:
- copy your mass-spectra (DTA or MGF) file to this directory:
```
tandem-linux-workshop/bin/
```
- rename your mass-spectra file to:
```
test_spectra
```
- search the file against the human sequence database by running:
```
tandem input-human.xml
```
- search the file against the mouse sequence database by running:
```
tandem input-mouse.xml
```
- once the search has completed, you'll find a file in the same directory called
```
output*.xml
```
  (where * is a time/date stamp)
- this output file has all of the search results
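
The copy and rename steps above, plus finding the newest results file, can be sketched as two small helper functions. The names `stage_spectra` and `latest_output` and the file `my_sample.mgf` are hypothetical; the search itself is still started with the tandem commands shown above.

```shell
# Copy a peak list into a directory under the name test_spectra,
# which is the name used for the spectra file in this tutorial.
stage_spectra() {
    cp "$1" "$2/test_spectra"
}

# Print the most recently modified output*.xml file in a directory.
latest_output() {
    ls -t "$1"/output*.xml 2>/dev/null | head -n 1
}

# Example usage, run from tandem-linux-workshop/bin/
# (my_sample.mgf is a placeholder for your own file):
# stage_spectra ~/my_sample.mgf .
# tandem input-human.xml
# latest_output .
```
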
The output*.xml files contain lots of information about the peptides and proteins that have been identified.

Here's some code to extract the proteins identified and their expect values (the statistical significance of the match):

awk 'BEGIN{OFS=","} /^<protein/ {print $5,$2}' output.*.xml | sed 's/\(expect=\|label=\|\"\| \)//g;' | sort -t , -k 2,2n

The expect values reported here are actually log10(expect), so the more negative the number, the more significant the match.
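
To see the expectation values on their original scale, the logged values can be converted back with 10^x. The sketch below assumes comma-separated `label,log10(expect)` lines like those produced by the command above, and assumes the logarithm is base 10; `add_expect` is a hypothetical helper name.

```shell
# Read "label,log10_expect" lines on standard input and append the plain
# expectation value (10 raised to the logged value) as a third column.
add_expect() {
    awk -F, '{ printf "%s,%s,%.2e\n", $1, $2, 10 ^ ($2) }'
}

# Example with a made-up protein label; prints ENSP1,-3.0,1.00e-03
printf 'ENSP1,-3.0\n' | add_expect
```

The function can be added to the end of the extraction pipeline, for example `... | sort -t , -k 2,2n | add_expect`.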

Using the GPM Search Engine (tandem) on Windows

Download this zip file to a local desktop on your Windows PC
Unzip the file:
```
tandem-windows-workshop.zip
```
Start the cmd programme on Windows and navigate to the directory:
```
tandem-windows-workshop/bin/
```
Run tandem:
```
tandem.exe input.xml
```

You'll see several xml files in the bin directory; for the purposes of this tutorial you do not need to edit these.

Using Notepad, take a look at the default_input.xml file (this is where the parameters for the tandem programme are set).

To search your mass-spectra data:

copy your mass-spectra (DTA or MGF) file to this directory:
```
tandem-windows-workshop/bin/
```
rename your mass-spectra file to:
```
test_spectra
```
search the file against the human sequence database by running:
```
tandem.exe input-human.xml
```
search the file against the mouse sequence database by running:
```
tandem.exe input-mouse.xml
```
once the search has completed, you'll find a file in the same directory called
```
output*.xml
```
(where * is a time/date stamp)
this output file has all of the search results

The output*.xml files contain lots of information about the peptides and proteins that have been identified.

Using the GPM Search Engine via the Web interface

Unfortunately, the online version of Tandem is no longer available.
Open the GPM search form here
In the spectra field, click Browse and navigate to the spectra file (typically a DTA or MGF format peak list file)
Set other required parameters
- Sequence search Database (typically the relevant ENSEMBL database)
- Modifications. Most data will not require additional modifications in this tutorial
- Predefined Methods. The default parameters will work for most of the tutorial; selecting the Ion Trap method is required where indicated.
Click Find proteins
Once the search is complete, select Display gene/excel and click go to download the search results
Other views of the data are also possible on the GPM site. Click the go tab for Gene Ontology analysis