From: ncbi-seminar-admin@ncbi.nlm.nih.gov on behalf of Geer, Lewis (NIH/NLM/NCBI) [lewisg@mail.nih.gov]
Sent: Wednesday, July 10, 2002 6:16 PM
To: List PROTIG; 'ncbi-seminar@ncbi.nlm.nih.gov'
Subject: NCBI seminar 11am, Jul 22 on high-throughput proteomics methods

NCBI SEMINAR ANNOUNCEMENT

Monday, July 22nd at 11am, Natcher, 6th floor south conference room (45/6AS29)

Statistical models for high-throughput proteomics

Alexey Nesvizhskii
Institute for Systems Biology, Seattle, WA

A major goal of proteomics research is to catalogue and quantify the proteins and protein complexes present in cells grown under a variety of conditions. Tandem mass spectrometry (MS/MS) has been particularly useful for determining the protein components of complex mixtures. Proteins in a sample are first digested into smaller peptides, usually by the enzyme trypsin, and subjected to reverse-phase chromatography. Peptides are then ionized and fragmented to produce signature MS/MS spectra that are used for identification. Most frequently, peptide identifications are made by searching MS/MS spectra against a sequence database to find the best-matching database peptide.

A current challenge for high-throughput proteomics is to use database search results from large numbers of MS/MS spectra to derive a list of identified peptides and their corresponding proteins. This task necessarily entails distinguishing correct peptide assignments from false identifications among database search results.

In my talk, I will discuss the challenges of dealing with the enormous amount of data generated in high-throughput proteomics experiments. Then, I will describe a robust and accurate statistical approach (based on the Expectation-Maximization algorithm and Bayesian probability theory) to compute probabilities that peptide identifications made by tandem mass spectrometry and database search algorithms such as SEQUEST are correct.
Each peptide assignment to a spectrum is evaluated with respect to all other assignments in the data set, which necessarily include some incorrect assignments. By employing database search scores and the properties of the assigned peptides, the method learns to distinguish correctly assigned from incorrectly assigned peptides in the data set and computes, for each peptide assignment to a spectrum, a probability of being correct. The computed probabilities that peptides are correctly assigned to spectra are then used as input to a statistical model that estimates the likelihood that the proteins corresponding to those peptides are present in the sample.

Our approach enables high-throughput analysis of proteomics data by eliminating or significantly reducing the need to manually validate database search results. In addition, it can facilitate the benchmarking of various experimental procedures and serve as a common standard by which the results of different experimental groups can be compared.

Contact: Lewis Geer
301 435 5888
lewisg@mail.nih.gov