Performing mass spectrometry-based proteomics in organisms with minimal reference protein databases

PubPub Team; Second Person; Another Human

Don’t need background? Jump to “The method.”

The problem

Bottom-up, tandem mass spectrometry-based proteomics is a key technology for detecting both protein sequences and post-translational modifications like phosphorylation, sulfation, lipidation, or glycosylation. However, using this technique requires a database containing all protein sequences expected to exist in a biological sample set ((ref?)).

**How proteomic information can be generated or inferred, and which types of information depend on each other to be useful.**
Experimental mass spectrometry data (lower trapezoid) are decoded using hints generated by genomics and transcriptomics data (upper trapezoid). These two data types converge during the proteomics data analysis process, wherein experimental fragmentation mass spectra are compared to theoretical mass spectra (generated from genomic and transcriptomic sequencing experiments).

Why do we need a protein database?

Modern mass spectrometry-based protein identification techniques involve shattering peptides to generate patterns called fragmentation spectra. For the most part, each spectrum is unique, like a fingerprint. But fingerprints are only useful when we have something to compare them to. In that sense, a protein database is like a fingerprint database: it lets us 1) match each experimental spectrum (a fingerprint at a crime scene) to a known or predicted peptide (a fingerprint in the database), and 2) determine which larger protein that peptide came from (whose finger left the print).
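
To make the fingerprint analogy concrete, here is a minimal sketch of the matching idea: count how many peaks in each candidate's theoretical spectrum have a partner in the experimental spectrum, and keep the best-scoring candidate. The peptide names, peak lists, tolerance, and the shared-peak-count score are all made up for illustration; real search engines use more sophisticated scoring.

```python
from bisect import bisect_left


def shared_peak_count(experimental_mz, theoretical_mz, tol=0.02):
    """Count theoretical peaks with an experimental peak within +/- tol (m/z units)."""
    peaks = sorted(experimental_mz)
    matched = 0
    for mz in theoretical_mz:
        # First experimental peak at or above the lower edge of the tolerance window
        i = bisect_left(peaks, mz - tol)
        if i < len(peaks) and peaks[i] <= mz + tol:
            matched += 1
    return matched


def best_match(experimental_mz, candidates, tol=0.02):
    """Return (peptide, score) for the candidate that explains the most peaks."""
    scored = [
        (shared_peak_count(experimental_mz, peaks, tol), peptide)
        for peptide, peaks in candidates.items()
    ]
    score, peptide = max(scored)
    return peptide, score


# Hypothetical database: peptide name -> theoretical fragment m/z values (made up)
database = {
    "peptide_A": [100.0, 200.0, 300.0, 400.0],
    "peptide_B": [100.0, 250.0, 350.0, 450.0],
}
# Hypothetical experimental spectrum, including one unexplained peak at 600.0
spectrum = [100.01, 250.005, 349.99, 600.0]

print(best_match(spectrum, database))
```

In this toy case the spectrum is assigned to peptide_B, which explains three of the four experimental peaks.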


Before we do mass spec, we treat our experimental sample with a protease that chews all the proteins into smaller fragments, or peptides. Next, the peptides are run through the mass spectrometer, which fragments them and records their fragmentation spectra. How do we interpret these spectra? When we have a protein database for the organism we’re studying, we can computationally predict all the peptide sequences that would result from digesting every protein in the organism (the “proteome”) and generate what their fragmentation spectra would look like. By comparing these theoretical spectra to those from our experimental sample, we can decode the signal and deduce which of the reference peptides are actually in our sample. This process can be high-throughput, letting us identify many proteins very quickly.
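
As a rough illustration of this in-silico step, the sketch below digests a protein with a simplified trypsin rule (cleave after K or R unless the next residue is P) and computes singly charged b- and y-ion m/z values from standard monoisotopic residue masses. The function names, example sequence, and minimum peptide length are assumptions for illustration. Real search engines also handle missed cleavages, modifications, and multiple charge states.

```python
import re

# Standard monoisotopic residue masses (Da)
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def tryptic_digest(protein, min_length=6):
    """Split after K or R unless followed by P; keep peptides of at least min_length."""
    peptides = re.split(r"(?<=[KR])(?!P)", protein)
    return [p for p in peptides if len(p) >= min_length]


def fragment_mz(peptide):
    """Return (b_ions, y_ions): singly charged m/z values for each backbone cleavage."""
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)           # N-terminal fragment
        y_ions.append(sum(masses[i:]) + WATER + PROTON)   # C-terminal fragment
    return b_ions, y_ions


# Hypothetical protein sequence, for illustration only
protein = "MASGELLKPTTAVNRSSDEFGHK"
for peptide in tryptic_digest(protein):
    b, y = fragment_mz(peptide)
    print(peptide, [round(x, 3) for x in b[:3]], [round(x, 3) for x in y[:3]])
```

Note that the K in "LLKPTT" is not cleaved because it is followed by proline, so the example yields two peptides rather than three.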

Many organisms lack reference databases

Well-studied "model organisms" are well represented in public sequence repositories, and a quick search of NCBI or UniProt will likely yield good-quality reference proteomes assembled by other researchers. But for non-model organisms, reference databases are scarce. The method described here lets researchers run mass spectrometry-based proteomics experiments and, in a parallel workstream, build the new reference protein database needed to interpret the mass spec data.

At Arcadia, we want to find interesting components of tick saliva, especially those that interact with the human body. We used this new method to generate a data set from the saliva of lone star ticks, but we hope this approach will be broadly useful in enabling proteomics in any organism for which there is a paucity of reference genomic, transcriptomic, or proteomic data.

What’s new?

Using mass spectrometry for proteomic analysis is straightforward for organisms with pre-existing reference databases, but most non-model organisms lack such information. Our approach lets scientists simultaneously gather new proteomic data from mass spectrometry while doing RNA sequencing to create a protein database to compare it to.

Notably, while many transcriptomics studies rely on short-read RNA sequencing, our method uses long-read sequencing. This can be advantageous in resolving long repetitive genomic regions, speeding up genome assemblies, yielding more complete contigs, and providing insights into the full structures of transcripts without assembly.

Ultimately, this method generated a robust, long-read, transcriptome-based proteome database that compares reasonably well to pre-existing data. Our approach enabled detection of approximately 10% more peptide-spectrum matches (PSMs) and peptides than the prior database, and favored detection of longer protein sequences, which may enable a more complete understanding of function.

It may be helpful to check out our full description of the tick saliva data set we generated through this approach.

The strategy

We set out to create a comprehensive method for learning about the proteome in tissues from non-model organisms. We decided to use mass spectrometry to detect proteins in our sample of interest. Because specific protein sequences in mass spec data can generally only be identified by comparison to a reference, we knew we’d also need a reference protein database. Genomic, transcriptomic, and proteomic data are scarce for many non-model organisms, so we split our method into two parallel workstreams after initial sample collection (Figure 2): one uses RNA sequencing to build a reference protein database; the other performs proteomic mass spectrometry. The two workstreams come together in the final step, data analysis, because the mass spec data can only be interpreted using the transcriptome-based protein database.

We encountered a few key decision points in designing our approach, which are described in depth below (or skip ahead to “The method” for the step-by-step description of the overall approach). Let us know if you try this and tweak any of these procedural options. We’d be curious to hear how it influences the quality or nature of the resulting data.

Protein identification — mass spectrometry vs. immunoprecipitation or Edman degradation

We expect mass spectrometry to be advantageous in this context because it lets us analyze cell-free secretions. Importantly, it is suited to detecting non-encoded molecules and modifications, including protein post-translational modifications (e.g., phosphorylation, sulfation, lipidation, and glycosylation), non-ribosomal peptides, and small molecules (metabolomics). Other protein identification tools such as immunoprecipitation and Edman degradation are also available, but these methods can be low-throughput and require non-trivial amounts of purified protein, which can be difficult to obtain in some settings.

mRNA enrichment — poly-A enrichment vs. rRNA depletion

Ribosomal RNA (rRNA) tends to dominate the total RNA extracted from samples (~80% of total RNA) and obscures the protein-coding messenger RNA (mRNA) transcripts that we’re interested in profiling. Thus, we needed a way to enrich mRNA. One approach is rRNA depletion (negative enrichment), which captures and removes rRNA using complementary oligonucleotide probes designed against each species's rRNA sequences. The other, more common approach is positive enrichment of mRNA using oligo-(dT) probes that bind poly-A tails. rRNA depletion has the advantage of retaining non-coding RNA and mRNA without poly-A tails, but comes with the added burden of troubleshooting rRNA probe design for non-model organisms. Since this was our first attempt at transcriptome profiling, we took the path of least resistance and performed poly-A-based mRNA enrichment using oligo-(dT) probes.

RNA sequencing — Long-read vs. short-read

Sequencing technology selection was our most crucial decision point. Illumina's short-read sequencing is the dominant platform and enables the assembly of genomes and transcriptomes from highly accurate reads hundreds of base pairs in length. In contrast, the leading long-read platforms from Pacific Biosciences (PacBio) and Oxford Nanopore Technologies (ONT) can sequence contiguous nucleotide fragments in the multi-kilobase and megabase range, respectively. PacBio and ONT lagged behind Illumina over the last decade because of lower base-calling accuracy and sequencing depth, but recent technological improvements have brought their accuracy within competitive range of Illumina's.

Long-read sequencing data can be advantageous in several ways: resolving long repetitive genomic regions, speeding up genome assemblies, yielding more complete contigs, and revealing the full structures of transcripts without assembly. Our interest in full transcript structures led us to PacBio's relatively mature HiFi Iso-Seq methodology as a first choice. We also figured it would complement the Mulenga lab's short-read data set collected from the same tick species.

The method

The following is a high-level overview of our approach. You can view a detailed, step-by-step protocol on protocols.io.

+ FIGURE 2 (METHOD OVERVIEW)

Sample collection

Our efforts began with the excision of salivary glands from unfed female Amblyomma americanum ticks. While our interest lies in ticks, this method should work with tissue from any organism.

RNA extraction and quality control

We pooled about 10 ticks' worth of salivary gland tissue and obtained total RNA using a standard extraction kit.

We collected electropherograms to calculate the RNA integrity number (RIN), a score computed from the electropherogram trace, including the 28S and 18S ribosomal RNA (rRNA) peaks, that serves as a proxy for RNA quality.

A note on electropherograms from arthropod RNA:

We were surprised to find only one peak corresponding to the 18S subunit where we would normally see two peaks: one corresponding to the 18S subunit and one to the 28S subunit.

Some quick literature searches suggested that this is commonly observed with arthropod RNA. The arthropod 28S subunit is thought to fragment during sample preparation (due to structural instability), yielding two fragments whose peaks overlap with the 18S subunit’s peak [1][2].

mRNA enrichment

Next, we needed to enrich mRNA from the total RNA mixture, as rRNA tends to dominate. We used positive enrichment of mRNA via oligo-(dT) primers, which target mRNA containing poly-A tails.

RNA sequencing

We submitted our samples to the UC Berkeley QB3 genomics core for size selection (5 kb), PacBio library preparation, Sequel II HiFi sequencing, and Iso-Seq analysis.

Tandem mass spectrometry-based proteomics

In parallel with the RNA processing and sequencing steps, we prepared tryptic peptides from A. americanum salivary gland lysate and analyzed them by data-dependent LC-MS/MS on an Orbitrap mass spectrometer, acquiring both precursor (MS1) and fragment (MS2) scans at high resolution.
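
For readers new to data-dependent acquisition, the sketch below shows the general idea: from each survey scan, the instrument picks the most intense precursor ions for fragmentation and skips ions it fragmented recently (dynamic exclusion). The function name, the peak values, the number of precursors selected, and the exclusion tolerance are hypothetical and are not the acquisition settings used in this work.

```python
def select_precursors(survey_scan, excluded, top_n=3, tol=0.01):
    """Pick the top_n most intense precursors that are not on the exclusion list.

    survey_scan: list of (mz, intensity) pairs from one MS1 scan
    excluded: m/z values fragmented recently (dynamic exclusion list)
    """
    candidates = [
        (mz, intensity)
        for mz, intensity in survey_scan
        if not any(abs(mz - ex) <= tol for ex in excluded)
    ]
    # Most intense first
    candidates.sort(key=lambda peak: peak[1], reverse=True)
    return [mz for mz, _ in candidates[:top_n]]


# Hypothetical survey scan: (m/z, intensity)
scan = [(400.2, 1.0e5), (512.3, 5.0e5), (623.4, 2.0e5), (701.5, 9.0e4)]

# 512.3 was fragmented in a previous cycle, so it is skipped this time
print(select_precursors(scan, excluded=[512.3], top_n=2))
```

Because selection depends on intensity, low-abundance peptides may never be chosen for fragmentation, which is one reason identifications vary between runs.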

Transcriptomic and proteomic data analysis

We identified coding sequences in our transcriptome data using TransDecoder, CPAT, and ANGEL [3][4][5]. We combined the resulting outputs and collapsed redundant sequences by CD-HIT clustering at 100% similarity (c=1.0) [6][7], then used these CD-HIT-collapsed sequences as the database for proteomics mapping, assigning fragmentation spectra with a basic proteomic search. For functional analysis, we further clustered the sequences using CD-HIT at 95% similarity (c=0.95) and submitted the representative sequence of each 95% cluster for InterProScan analysis [8] and BUSCO analysis [9]. We also clustered sequences using CD-HIT at a 65% similarity cut-off (c=0.65) in order to ____.
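
The sketch below illustrates the two core ideas in this step: translating a transcript into a candidate protein, then collapsing redundant sequences before using them as a search database. It is a deliberately simplified stand-in, not the algorithm used by TransDecoder, CPAT, ANGEL, or CD-HIT. It assumes forward-strand transcripts, keeps only complete ORFs (ATG to stop), and treats "100% similarity" as exact duplicates or sequences fully contained in a longer one. All function names and example sequences are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def longest_orf(transcript, min_aa=3):
    """Longest ATG-to-stop protein across the three forward frames ('' if none)."""
    seq = transcript.upper().replace("U", "T")
    best = ""
    for frame in range(3):
        protein = "".join(
            CODON_TABLE.get(seq[i:i + 3], "X")  # X for codons with ambiguous bases
            for i in range(frame, len(seq) - 2, 3)
        )
        # Drop the final piece: it is not terminated by a stop codon
        for segment in protein.split("*")[:-1]:
            start = segment.find("M")
            if start != -1 and len(segment) - start > len(best):
                best = segment[start:]
    return best if len(best) >= min_aa else ""


def collapse_identical(proteins):
    """Keep one representative per exact duplicate or fully contained sequence."""
    representatives = []
    # Longest first, so shorter sequences are compared against longer ones
    for seq in sorted(set(proteins), key=lambda s: (-len(s), s)):
        if not any(seq in rep for rep in representatives):
            representatives.append(seq)
    return representatives


# Hypothetical transcript: the ORF ATG GCT TGG AAA TAA sits in the third frame
print(longest_orf("GGATGGCTTGGAAATAAGG"))
print(collapse_identical(["MAWK", "MAWKLLP", "MAWK", "MSTN"]))
```

Collapsing at 100% keeps every distinct protein sequence available for spectrum matching, while looser thresholds such as 95% trade that resolution for a smaller, less redundant set better suited to functional annotation.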

To see a representative output from this method, check out our tick saliva data set.

What’s next?

We developed this method to gain insight into the tick saliva proteome, and are now analyzing that data set. We’d like to try a version of this method that instead employs top-down proteomics, which lets us detect intact proteins instead of digested peptides, and could reveal ______.

If you decide to try this or a similar method in your own research, we’d love to hear how it goes. Let us know if you have any questions!
