Getting started

Download the mini FASTA database to run the following examples. The database is tiny: it contains a single sequence, shown below (copying and pasting it into a file named mini.fasta works just as well):

>mini
CCGGCGAGCACTGCGTCCCGGGTCTGCTCTCCCACCCCGCTGATCTCAACCAGCTTGGCAATAACAGGCGCAGCACCGCGCCTGAT
CGCCAAGTTCAGTGCAGCCTGAGCGCCCTGCGCGGCCCCCTGAGGCTGAGCATTGCCGATCGCGGGCCCCTGCTCACCCTGGACGG
CCTCGGCGACCACAGCAGCGCCAGCTCCGATAGCAGCCTCCTGCTCCTGCTCCTGCTCCACAGCTCCACGACCGCGACCGCGACCG
CGACCACGAATGCCAGCAGCTCCACGCCCACGCTCCATAGCAGCAACAAAAGGGCTTGAGTGAGAACGTTAGAAGTAACGTTAAAA

Building the index

Before peptides can be queried with GPF, an index file of the genomic DNA sequence must be built. To do this, call gpfindex and specify the input and output file names:

$ ./gpfindex mini.fasta mini.fasta.gpfindex

There are a couple of optional parameters which can be passed to gpfindex, all of which have sensible default values. By default, gpfindex uses a tag size of 5 amino acids and trypsin for determination of enzymatic cleavage sites. Run the program without any parameters to see all options.

Querying the index

Query the index for a peptide with the following command:

$ ./gpfquery mini.fasta.gpfindex DHECQQLHAHAP

This produces the following output (to redirect the output to a CSV file, use the --csvOutputPath parameter):

Processing query 1 of 1...  done.
GPF search took 0.2 milliseconds.
Query,AA N-term,Peptide,AA C-term,Mass,Assembly,Intron length,Splice site
DHECQQLHAHAP,RDRDR,DHECQQLHAHAP,*QQQK,1384.5936,"{mini/mini}+259:36",,

The following table explains the columns in the CSV output:

Column: Description
Query: The peptide you searched for. Note that there may be multiple result rows for the same query peptide.
AA N-term: The amino acids flanking the GPF-aligned peptide at the N-terminus.
Peptide: The peptide resulting from the GPF search. This peptide may differ from your query unless you run the program with --similaritySearch no, which only looks for perfect matches (and is therefore faster). In either case, its mass matches the mass of the query peptide within the user-specified mass accuracy.
AA C-term: The amino acids flanking the GPF-aligned peptide at the C-terminus.
Mass: The monoisotopic mass of the DNA-aligned GPF peptide.
Assembly: A textual representation of the alignment.
  The part in curly braces gives the genomic DNA identifier and the contig, separated by a slash.
  The + or - sign denotes the reading direction.
  The rest is a list of offset:length tuples. Offsets are zero-based with respect to the contig. More than one tuple indicates a spliced alignment.
  Each offset refers to the N-terminal end of its part, and the tuples of a spliced alignment are listed starting with the N-terminal part.
Intron length: Length of the intron if this is a spliced alignment.
Splice site: Splice site dinucleotide pair if this is a spliced alignment.
  By default, GPF looks for GT/AG and GC/AG splice sites.
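
The Assembly column is compact enough to parse with a few lines of code. The following Python sketch is not part of GPF; the names Assembly, parse_assembly and intron_lengths are hypothetical helpers introduced here for illustration. It splits an assembly string into its components and derives the intron length from consecutive tuples. The reverse-strand rule (a part runs from its offset towards lower contig positions) is an assumption inferred from the reverse-strand example in this document.

```python
import re
from typing import NamedTuple


class Assembly(NamedTuple):
    genome: str                    # genomic DNA identifier
    contig: str                    # contig name
    strand: str                    # "+" or "-"
    parts: list                    # [(offset, length), ...], N-terminal part first


def parse_assembly(text: str) -> Assembly:
    """Parse an Assembly string such as '{mini/mini}+64:15,93:6'."""
    pattern = r"\{([^/}]+)/([^}]+)\}([+-])(\d+:\d+(?:,\d+:\d+)*)"
    match = re.fullmatch(pattern, text)
    if match is None:
        raise ValueError(f"unrecognised assembly string: {text!r}")
    genome, contig, strand, tuples = match.groups()
    parts = [tuple(int(n) for n in item.split(":")) for item in tuples.split(",")]
    return Assembly(genome, contig, strand, parts)


def intron_lengths(assembly: Assembly) -> list:
    """Number of nucleotides between consecutive parts of a spliced alignment."""
    gaps = []
    for (off_a, len_a), (off_b, _) in zip(assembly.parts, assembly.parts[1:]):
        if assembly.strand == "+":
            # Forward strand: part A covers off_a .. off_a + len_a - 1.
            gaps.append(off_b - (off_a + len_a))
        else:
            # Reverse strand (assumption): part A covers off_a - len_a + 1 .. off_a,
            # and the next part starts at a lower contig position.
            gaps.append((off_a - len_a) - off_b)
    return gaps


spliced = parse_assembly("{mini/mini}+64:15,93:6")
print(spliced.strand, spliced.parts, intron_lengths(spliced))
```

An unspliced alignment such as {mini/mini}+259:36 yields a single tuple and an empty list of intron lengths.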

A quick look at the six-frame translation shows that the peptide is perfectly tryptic, with a tryptic cleavage site at the N-terminus and a stop codon at the C-terminus:

------------------------------------------------------------------------------------------------------------------------
240  .    250  .    260  .    270  .    280  .    290  .    300  .    310  .    320  .    330  .    340  .    350  .    
CGACCGCGACCGCGACCGCGACCACGAATGCCAGCAGCTCCACGCCCACGCTCCATAGCAGCAACAAAAGGGCTTGAGTGAGAACGTTAGAAGTAACGTTAAAA
GCTGGCGCTGGCGCTGGCGCTGGTGCTTACGGTCGTCGAGGTGCGGGTGCGAGGTATCGTCGTTGTTTTCCCGAACTCACTCTTGCAATCTTCATTGCAATTTT
 R  P  R  P  R  P  R  P  R  M  P  A  A  P  R  P  R  S  I  A  A  T  K  G  L  E  *  E  R  *  K  *  R  *  
  D  R  D  R  D  R  D  H  E  C  Q  Q  L  H  A  H  A  P  *  Q  Q  Q  K  G  L  S  E  N  V  R  S  N  V  K 
T  T  A  T  A  T  A  T  T  N  A  S  S  S  T  P  T  L  H  S  S  N  K  R  A  *  V  R  T  L  E  V  T  L  K
R  G  R  G  R  G  R  G  R  I  G  A  A  G  R  G  R  E  M  A  A  V  F  P  S  S  H  S  R  *  F  Y  R  *  F
  V  A  V  A  V  A  V  V  F  A  L  L  E  V  G  V  S  W  L  L  L  L  L  A  Q  T  L  V  N  S  T  V  N  F 
 S  R  S  R  S  R  S  W  S  H  W  C  S  W  A  W  A  G  Y  C  C  C  F  P  K  L  S  F  T  L  L  L  T  L  
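
The same check can be made from the flanking amino acids in the CSV output alone. The sketch below is a simplified illustration, not GPF's own logic: the function terminus_count is hypothetical, it assumes that trypsin cleaves after K or R, it ignores the proline exception, and it treats a stop codon (*) or the end of the sequence as a valid terminus on either side. That last assumption is consistent with the examples on this page but is not confirmed by the GPF documentation.

```python
STOP = "*"
LABELS = {2: "tryptic", 1: "semi-tryptic", 0: "non-specific"}


def terminus_count(n_flank: str, peptide: str, c_flank: str) -> int:
    """Count how many ends of the peptide (0, 1 or 2) look like valid termini."""
    # N-terminus: the residue just before the peptide is K or R,
    # a stop codon, or missing (start of the sequence).
    before = n_flank[-1:]
    n_ok = before in ("", "K", "R", STOP)
    # C-terminus: the peptide itself ends in K or R, or it is followed
    # by a stop codon or by nothing (end of the sequence).
    after = c_flank[:1]
    c_ok = peptide[-1] in ("K", "R") or after in ("", STOP)
    return int(n_ok) + int(c_ok)


# Flanking residues taken from the "AA N-term" and "AA C-term" columns.
print(LABELS[terminus_count("RDRDR", "DHECQQLHAHAP", "*QQQK")])
```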
 

GPF is also capable of aligning partially correct peptides, so let’s try shuffling the C-terminal half of DHECQQLHAHAP and query DHECQQAAHHLP instead, which results in the same peptide alignment:

Processing query 1 of 1...  done.
GPF search took 0.1 milliseconds.
Query,AA N-term,Peptide,AA C-term,Mass,Assembly,Intron length,Splice site
DHECQQAAHHLP,RDRDR,DHECQQLHAHAP,*QQQK,1384.5936,"{mini/mini}+259:36",,

Because the N-terminal part of the peptide was left intact, GPF could still find the peptide. As long as a single tag from the query peptide (5 amino acids by default, I/L not distinguished) can be found in the genomic DNA, GPF will attempt to deduce a peptide. Furthermore, one enzymatic cleavage site is sufficient to find a peptide, which means that semi-tryptic peptides can also be found.
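
To see why the shuffled query still hits, it helps to list the tags the two peptides share and to compare their masses. The Python sketch below is an illustration only and does not reproduce GPF's search. The helpers tags and peptide_mass are hypothetical, and the mass table holds standard monoisotopic residue masses rounded to five decimals, so the last digits may differ slightly from GPF's output.

```python
# Standard monoisotopic residue masses in Da (rounded to five decimals).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # one water is added for the free N- and C-terminus


def peptide_mass(peptide: str) -> float:
    """Monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in peptide) + WATER


def tags(peptide: str, size: int = 5) -> set:
    """All substrings of the given size, with I and L treated as equal."""
    normalised = peptide.replace("I", "L")
    return {normalised[i:i + size] for i in range(len(normalised) - size + 1)}


query = "DHECQQAAHHLP"   # shuffled query
hit = "DHECQQLHAHAP"     # peptide reported by GPF

print(sorted(tags(query) & tags(hit)))                  # tags found in both
print(round(peptide_mass(query) - peptide_mass(hit), 4))  # mass difference
```

Both peptides have the same amino acid composition, so their masses are identical, and the intact N-terminal half provides the shared tags DHECQ and HECQQ.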

Let’s search for another peptide, QAQHRFS:

Query,AA N-term,Peptide,AA C-term,Mass,Assembly,Intron length,Splice site
QAQHRFS,PAWQ*,QAQHRFS,AA*AP,872.4250,"{mini/mini}+64:15,93:6",14,GC|AG

------------------------------------------------------------------------------------------------------------------------
0    .    10   .    20   .    30   .    40   .    50   .    60   .    70   .    80   .    90   .    100  .    110  .    
CCGGCGAGCACTGCGTCCCGGGTCTGCTCTCCCACCCCGCTGATCTCAACCAGCTTGGCAATAACAGGCGCAGCACCGCGCCTGATCGCCAAGTTCAGTGCAGCCTGAGCGCCCTGCGCG
GGCCGCTCGTGACGCAGGGCCCAGACGAGAGGGTGGGGCGACTAGAGTTGGTCGAACCGTTATTGTCCGCGTCGTGGCGCGGACTAGCGGTTCAAGTCACGTCGGACTCGCGGGACGCGC
 P  A  S  T  A  S  R  V  C  S  P  T  P  L  I  S  T  S  L  A  I  T  G  A  A  P  R  L  I  A  K  F  S  A  A  *  A  P  C  A 
  R  R  A  L  R  P  G  S  A  L  P  P  R  *  S  Q  P  A  W  Q  *  Q  A  Q  H  R  A  *  S  P  S  S  V  Q  P  E  R  P  A  R
   G  E  H  C  V  P  G  L  L  S  H  P  A  D  L  N  Q  L  G  N  N  R  R  S  T  A  P  D  R  Q  V  Q  C  S  L  S  A  L  R  
   A  L  V  A  D  R  T  Q  E  G  V  G  S  I  E  V  L  K  A  I  V  P  A  A  G  R  R  I  A  L  N  L  A  A  Q  A  G  Q  A  
  P  S  C  Q  T  G  P  R  S  E  W  G  A  S  R  L  W  S  P  L  L  L  R  L  V  A  G  S  R  W  T  *  H  L  R  L  A  R  R  P
 R  R  A  S  R  G  P  D  A  R  G  G  R  Q  D  *  G  A  Q  C  Y  C  A  C  C  R  A  Q  D  G  L  E  T  C  G  S  R  G  A  R 

The result is a spliced alignment which includes a frame shift. The peptide is also semi-tryptic. The splice site dinucleotide pair (GC at offsets 79-80 and AG at offsets 91-92) flanks the 14-nucleotide intron. GPF can also find spliced peptides in which the intron splits a coding nucleotide triplet:

Query,AA N-term,Peptide,AA C-term,Mass,Assembly,Intron length,Splice site
ALPIAAPAPI,AP*G*,ALPIAAPAPI,AASCS,932.5690,"{mini/mini}+135:16,190:14",39,GC|AG

------------------------------------------------------------------------------------------------------------------------
120  .    130  .    140  .    150  .    160  .    170  .    180  .    190  .    200  .    210  .    220  .    230  .    
GCCCCCTGAGGCTGAGCATTGCCGATCGCGGGCCCCTGCTCACCCTGGACGGCCTCGGCGACCACAGCAGCGCCAGCTCCGATAGCAGCCTCCTGCTCCTGCTCCTGCTCCACAGCTCCA
CGGGGGACTCCGACTCGTAACGGCTAGCGCCCGGGGACGAGTGGGACCTGCCGGAGCCGCTGGTGTCGTCGCGGTCGAGGCTATCGTCGGAGGACGAGGACGAGGACGAGGTGTCGAGGT
 A  P  *  G  *  A  L  P  I  A  G  P  C  S  P  W  T  A  S  A  T  T  A  A  P  A  P  I  A  A  S  C  S  C  S  C  S  T  A  P 
  P  P  E  A  E  H  C  R  S  R  A  P  A  H  P  G  R  P  R  R  P  Q  Q  R  Q  L  R  *  Q  P  P  A  P  A  P  A  P  Q  L  H
G  P  L  R  L  S  I  A  D  R  G  P  L  L  T  L  D  G  L  G  D  H  S  S  A  S  S  D  S  S  L  L  L  L  L  L  L  H  S  S  
A  G  Q  P  Q  A  N  G  I  A  P  G  Q  E  G  Q  V  A  E  A  V  V  A  A  G  A  G  I  A  A  E  Q  E  Q  E  Q  E  V  A  G  
  G  R  L  S  L  M  A  S  R  P  G  R  S  V  R  S  P  R  P  S  W  L  L  A  L  E  S  L  L  R  R  S  R  S  R  S  W  L  E  V
 G  G  S  A  S  C  Q  R  D  R  A  G  A  *  G  P  R  G  R  R  G  C  C  R  W  S  R  Y  C  G  G  A  G  A  G  A  G  C  S  W 

The coding nucleotide triplet that is split by the intron consists of the last nucleotide of the first exon part (G at offset 150) and the first two nucleotides of the second exon part (CG at offsets 190-191). After intron splicing, these fragments are assembled into GCG, resulting in an alanine residue (A). Note that, technically, this is not the same alanine residue as the one directly to the left of the C-terminal exon part in the translation above. GPF does not report that alternative because the resulting splice site dinucleotide pair would be GG/CA, which is not among the splice sites searched for in this query.
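
The split codon can be reproduced directly from the assembly tuples. The following Python sketch is an illustration rather than GPF code; the helper names are hypothetical. It cuts the two exon parts out of the mini sequence, joins them and translates the result with the standard genetic code.

```python
from itertools import product

# The mini sequence from the FASTA file above (344 nucleotides).
MINI = (
    "CCGGCGAGCACTGCGTCCCGGGTCTGCTCTCCCACCCCGCTGATCTCAACCAGCTTGGCAATAACAGGCGCAGCACCGCGCCTGAT"
    "CGCCAAGTTCAGTGCAGCCTGAGCGCCCTGCGCGGCCCCCTGAGGCTGAGCATTGCCGATCGCGGGCCCCTGCTCACCCTGGACGG"
    "CCTCGGCGACCACAGCAGCGCCAGCTCCGATAGCAGCCTCCTGCTCCTGCTCCTGCTCCACAGCTCCACGACCGCGACCGCGACCG"
    "CGACCACGAATGCCAGCAGCTCCACGCCCACGCTCCATAGCAGCAACAAAAGGGCTTGAGTGAGAACGTTAGAAGTAACGTTAAAA"
)

# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate complete codons; trailing nucleotides are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def forward_exons(parts):
    """Cut forward-strand exon parts (offset, length) out of the contig."""
    return [MINI[offset:offset + length] for offset, length in parts]


# Assembly {mini/mini}+135:16,190:14
exons = forward_exons([(135, 16), (190, 14)])
junction = exons[0][-1] + exons[1][:2]   # codon assembled across the intron
intron = MINI[135 + 16:190]

print(junction, CODON_TABLE[junction])   # split codon and its amino acid
print(intron[:2], intron[-2:])           # splice site dinucleotides
print(translate("".join(exons)))         # peptide after splicing
```

The first exon part contributes 16 nucleotides (five codons plus one nucleotide) and the second contributes 14 (two nucleotides plus four codons), which together give the ten residues of ALPIAAPAPI.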

Although all queries shown so far resulted in alignments on the forward strand, alignments on the reverse strand are found just as well. Take care when interpreting reverse-strand results, however: offsets are given from the N-terminal side of an alignment, and for spliced peptides the most N-terminal exon part is listed first. In the example below, the first tuple 325:21 therefore covers contig offsets 305 to 325, read from offset 325 downwards on the reverse strand. Take a close look at the following example CSV output and the corresponding genome view to wrap your head around this:

Query,AA N-term,Peptide,AA C-term,Mass,Assembly,Intron length,Splice site
RSHSSPFRGRG,*RYF*,RSHSSPFRGRG,AAGIR,1242.6326,"{mini/mini}-325:21,289:12",15,GT|AG

------------------------------------------------------------------------------------------------------------------------
240  .    250  .    260  .    270  .    280  .    290  .    300  .    310  .    320  .    330  .    340  .    350  .    
CGACCGCGACCGCGACCGCGACCACGAATGCCAGCAGCTCCACGCCCACGCTCCATAGCAGCAACAAAAGGGCTTGAGTGAGAACGTTAGAAGTAACGTTAAAA
GCTGGCGCTGGCGCTGGCGCTGGTGCTTACGGTCGTCGAGGTGCGGGTGCGAGGTATCGTCGTTGTTTTCCCGAACTCACTCTTGCAATCTTCATTGCAATTTT
 R  P  R  P  R  P  R  P  R  M  P  A  A  P  R  P  R  S  I  A  A  T  K  G  L  E  *  E  R  *  K  *  R  *  
  D  R  D  R  D  R  D  H  E  C  Q  Q  L  H  A  H  A  P  *  Q  Q  Q  K  G  L  S  E  N  V  R  S  N  V  K 
T  T  A  T  A  T  A  T  T  N  A  S  S  S  T  P  T  L  H  S  S  N  K  R  A  *  V  R  T  L  E  V  T  L  K
R  G  R  G  R  G  R  G  R  I  G  A  A  G  R  G  R  E  M  A  A  V  F  P  S  S  H  S  R  *  F  Y  R  *  F
  V  A  V  A  V  A  V  V  F  A  L  L  E  V  G  V  S  W  L  L  L  L  L  A  Q  T  L  V  N  S  T  V  N  F 
 S  R  S  R  S  R  S  W  S  H  W  C  S  W  A  W  A  G  Y  C  C  C  F  P  K  L  S  F  T  L  L  L  T  L  
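
A small Python sketch may help when converting reverse-strand tuples by hand. It is an illustration under one assumption inferred from this example, not a description of GPF internals: on the reverse strand, a tuple offset:length covers the contig offsets from offset - length + 1 to offset, and the coding sequence is the reverse complement of that range. The helper names are hypothetical.

```python
from itertools import product

# The mini sequence from the FASTA file above (344 nucleotides).
MINI = (
    "CCGGCGAGCACTGCGTCCCGGGTCTGCTCTCCCACCCCGCTGATCTCAACCAGCTTGGCAATAACAGGCGCAGCACCGCGCCTGAT"
    "CGCCAAGTTCAGTGCAGCCTGAGCGCCCTGCGCGGCCCCCTGAGGCTGAGCATTGCCGATCGCGGGCCCCTGCTCACCCTGGACGG"
    "CCTCGGCGACCACAGCAGCGCCAGCTCCGATAGCAGCCTCCTGCTCCTGCTCCTGCTCCACAGCTCCACGACCGCGACCGCGACCG"
    "CGACCACGAATGCCAGCAGCTCCACGCCCACGCTCCATAGCAGCAACAAAAGGGCTTGAGTGAGAACGTTAGAAGTAACGTTAAAA"
)

# Standard genetic code, codons enumerated in TCAG order.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna: str) -> str:
    """Translate complete codons; trailing nucleotides are ignored."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def reverse_complement(dna: str) -> str:
    return dna.translate(COMPLEMENT)[::-1]


def reverse_part(offset: int, length: int) -> str:
    """Coding sequence of a reverse-strand part (assumed to end at `offset`)."""
    start = offset - length + 1
    return reverse_complement(MINI[start:offset + 1])


# Assembly {mini/mini}-325:21,289:12
parts = [(325, 21), (289, 12)]
coding = "".join(reverse_part(offset, length) for offset, length in parts)

# The intron lies between the two parts: contig offsets 290 to 304.
intron = reverse_complement(MINI[289 + 1:325 - 21 + 1])

print(translate(coding))               # peptide read on the reverse strand
print(len(intron), intron[:2], intron[-2:])
```

Read this way, the first tuple covers offsets 305 to 325, the second covers 278 to 289, and the 15 nucleotides in between form the intron.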
 

What about validation?

GPF can be used for two scenarios:

  1. Align partially correct de novo predicted peptides to the genome.
  2. Locate peptides in the genome.

In the first case, GPF will output many different aligned peptides for every query peptide. Deciding which peptide is most likely correct for a specific MS/MS scan is outside the scope of GPF, but it can be done with a database search program such as OMSSA. Peptide validation can be achieved via a modified target/decoy strategy as described in this paper.

If you just want to quickly determine the position of peptides in a genome (scenario 2), you can call gpfquery with --similaritySearch no and --distinguishIL yes.