Mass-Spectrometry Search Engines

A search engine matches mass spectra to peptides so that proteins can be identified. In this tutorial we will use a command-line version of the Tandem search engine to match mass spectra to protein sequences. There are also web-based search engines such as Mascot; however, the free online version of Mascot limits the number of spectra that may be searched.

Search engines work by predicting the peptides that would be produced when each protein in a sequence database is digested by a specific enzyme. For example, trypsin cleaves a protein after an arginine or lysine residue, unless that residue is followed by a proline. The search engine compares each unknown peptide fragment with the predicted fragments from the protein sequence database to generate a list of candidate proteins. A scoring method is then used to identify the best match.
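
The trypsin rule described above can be sketched in a few lines of shell. This is only an illustration of the cleavage rule (cut after K or R unless the next residue is P), not how Tandem or Mascot is actually implemented; the function name trypsin_digest is hypothetical, and missed cleavages and modifications are ignored.

```shell
# Hypothetical helper: in-silico tryptic digest of one protein sequence
# given in one-letter amino-acid code. Prints one peptide per line.
# Rule from the text: cleave after K or R unless the next residue is P.
trypsin_digest() {
    printf '%s\n' "$1" | awk '{
        n = length($0)
        peptide = ""
        for (i = 1; i <= n; i++) {
            aa  = substr($0, i, 1)       # current residue
            nxt = substr($0, i + 1, 1)   # next residue ("" at the end)
            peptide = peptide aa
            if ((aa == "K" || aa == "R") && nxt != "P") {
                print peptide
                peptide = ""
            }
        }
        # Print whatever is left after the last cleavage site.
        if (peptide != "") print peptide
    }'
}
```

For example, running trypsin_digest MKRPAKLR prints the peptides MK, RPAK and LR on separate lines; the R in RPAK is not cleaved because it is followed by a proline.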

Using the Mascot Search Engine

  • In the Data file field, click Browse and navigate to the file of interest.
  • In the Database dropdown list, select the appropriate database.
  • In the Data format dropdown list, select the appropriate file format.
  • Click Start Search.
  • Note that the free online version of Mascot will only process up to 1200 spectra (i.e. very small mass-spectrometry files).
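
Before uploading, it may help to check whether a peak list is within the limit mentioned above. The sketch below assumes the file is in MGF format, where each spectrum starts with a BEGIN IONS line; the function names are hypothetical, and the 1200 figure is simply the limit quoted in this tutorial.

```shell
# Hypothetical helper: count the spectra in an MGF peak list.
# Each spectrum in an MGF file starts with a "BEGIN IONS" line.
count_mgf_spectra() {
    grep -c '^BEGIN IONS' "$1"
}

# Hypothetical check: succeeds if the file holds no more than 1200
# spectra (the limit quoted in this tutorial for the free Mascot server).
fits_mascot_free_limit() {
    [ "$(count_mgf_spectra "$1")" -le 1200 ]
}
```

For example, count_mgf_spectra my_spectra.mgf prints the number of spectra in the hypothetical file my_spectra.mgf.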

Using the GPM Search Engine (tandem) via the Linux command line

  • (Requires access to a Linux server with the tandem programme installed)
  • Log in to iridis5_a.soton.ac.uk
  • Download this zip file to your directory on iridis5_a.soton.ac.uk
    • Unzip the file:
      unzip tandem-linux-workshop.zip
    • This will create a directory:
      tandem-linux-workshop/
    • cd to the directory:
      tandem-linux-workshop/bin/
    • Run:
      module load tandem 
    • Test that tandem runs by running:
      tandem.exe input.xml
  • You'll see several xml files in the bin directory; for the purposes of this tutorial you do not need to edit these
  • Take a look at the default_input.xml file (this is where the parameters for the tandem programme are set)
  • To search your mass-spectra data:
    • copy your mass-spectra (DTA or MGF) file to this directory:
      tandem-linux-workshop/bin/
    • rename your mass-spectra file to:
      test_spectra
    • search the file against human sequence database by running:
      tandem input-human.xml
    • search the file against the mouse sequence database by running:
      tandem input-mouse.xml
    • once the search has completed, you'll find a file in the same directory called
      output*.xml
      (where * is a time/date stamp)
    • this output file has all of the search results
  • The output*.xml files contain lots of information about the peptides and proteins that have been identified
  • Here's some code to extract the proteins identified and their expect values (the statistical significance of the match):
    • awk 'BEGIN{OFS=","} /^<protein/ {print $5,$2}' output.*.xml | sed 's/\(expect=\|label=\|\"\| \)//g;' | sort -t , -k 2,2n
  • The expect values in the output are actually log(expect), so the larger negative numbers indicate a more significant match
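
Because the values are logarithms, it can be convenient to convert them back and keep only the most significant proteins. The sketch below reads the label,log(expect) lines produced by the awk/sed command above. The function name filter_by_log_expect and the default cutoff of -2 are hypothetical choices, and the conversion assumes the logarithm is base 10.

```shell
# Hypothetical helper: reads "label,log_expect" lines on stdin and keeps
# the proteins whose log(expect) is at or below a cutoff (default -2).
# Adds a third column with the expect value itself.
# Assumption: the logarithm is base 10.
filter_by_log_expect() {
    awk -F, -v cutoff="${1:--2}" '
        $2 + 0 <= cutoff + 0 { printf "%s,%s,%.3g\n", $1, $2, 10 ^ $2 }'
}
```

To use it, append | filter_by_log_expect -3 to the end of the awk/sed command above; this keeps only the proteins with log(expect) of -3 or lower.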

Using the GPM Search Engine (tandem) on Windows

  • Download this zip file to a local desktop on your Windows PC
  • Unzip the file:
    tandem-windows-workshop.zip
  • Start the cmd programme on Windows and navigate to the directory:
    tandem-windows-workshop/bin/
  • Run tandem:
    tandem.exe input.xml
  • You'll see several xml files in the bin directory; for the purposes of this tutorial you do not need to edit these
  • Using Notepad, take a look at the default_input.xml file (this is where the parameters for the tandem programme are set)
  • To search your mass-spectra data:
    • copy your mass-spectra (DTA or MGF) file to this directory:
      tandem-windows-workshop/bin/
    • rename your mass-spectra file to:
      test_spectra
    • search the file against the human sequence database by running:
      tandem.exe input-human.xml
    • search the file against the mouse sequence database by running:
      tandem.exe input-mouse.xml
    • once the search has completed, you'll find a file in the same directory called
      output*.xml
      (where * is a time/date stamp)
    • this output file has all of the search results
  • The output*.xml files contain lots of information about the peptides and proteins that have been identified

Using the GPM Search Engine via the Web interface

  • Note that the online version of Tandem is unfortunately no longer available
  • Open the GPM search form here
  • In the spectra field, click Browse and navigate to the file of spectra (typically a DTA or MGF format peak list file)
  • Set the other required parameters
    • Sequence search Database (typically the relevant ENSEMBL database)
    • Modifications. Most data in this tutorial will not require additional modifications
    • Predefined Methods. The default parameter will work for most of the tutorial; selecting the Ion Trap method is required where indicated
  • Click Find proteins
  • Once the search is complete, select Display gene/excel and click go to download the search results
  • Other views of the data are also possible on the GPM site. Click the go tab for Gene Ontology analysis