Peptides mass spectrometry submissions

Introduction
Submission requirements

Metadata spreadsheet
Data files
Results spreadsheet

Deposit instructions

Introduction

GEO can accept high-throughput proteomic data generated by mass spectrometry technologies.

We aim to capture result- and conclusion-level information, with enough data and descriptive information to allow the experiment to be understood and the underlying data to be analyzed, including:

Lists of identified proteins

Lists of identified peptides used in protein identification

Any additional information such as scores, significance or quality information

Relevant peak lists

Standardized input and output from search engines

Relevant descriptive information about the biological samples, instrumentation, and informatics

Our procedures for submission and display of proteomic data are currently under development. Nevertheless, we can still accept and issue accession numbers for this data type once all required files have been provided. These accession numbers are stable and will not change, so you can cite them in your manuscript. Data may be held private until published and reviewer access to private data is supported.

Submission requirements

A standard submission has three required components, summarized below; each is described in detail in its own section:

Metadata spreadsheet
'Metadata' refers to descriptive information and protocols for the overall experiment and individual Samples, as well as references to associated files.
Metadata is supplied by completing all fields of the Metadata template spreadsheet within the NCBI_Peptides_Submission_Template Excel file.
Metadata content guidelines are provided within the template spreadsheet and in the table below.

Data files
Raw data files and any supporting peptide identification output files that describe the link between the raw data and the results.

Results spreadsheet
The list of proteins discovered for each Sample, the peptides used to identify those proteins, and the spectra used to identify the peptides. Modifications can also be specified.
The required format is shown in the Results spreadsheet within the NCBI_Peptides_Submission_Template Excel file, and in the table below.

Metadata spreadsheet

The metadata spreadsheet fields and content guidelines are as follows:

SERIES
This section describes the overall experiment.

title Unique title (less than 120 characters) that describes the overall study.

summary A thorough description of the goals and objectives of this study. The abstract from the associated publication may be suitable. Include as much text as necessary to thoroughly describe the study.

overall design Indicate how many Samples are analyzed, whether replicates are included, whether there are control and/or reference Samples, etc.

type Keyword(s) that generally describe the type of study. Examples include: time course, dose response, disease state analysis, tissue comparison, stress response, genetic modification, etc.

contributor Each contributor is listed on a separate line as "Firstname,Initial,Lastname", for example, "John,H,Smith" or "Jane,Doe"
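
The title length limit and the contributor format lend themselves to a quick check before the spreadsheet is filled in. The following is a minimal sketch, not part of the GEO tooling: the function name check_series and the wording of its messages are hypothetical, and it assumes contributors are supplied as one string per line.

```python
def check_series(title, contributors):
    """Return a list of problems found in the SERIES title and contributor lines."""
    problems = []
    if not title.strip():
        problems.append("title is empty")
    elif len(title) >= 120:
        # The guideline asks for fewer than 120 characters.
        problems.append(f"title has {len(title)} characters; fewer than 120 expected")
    for line in contributors:
        parts = [part.strip() for part in line.split(",")]
        # Accept "Firstname,Initial,Lastname" or "Firstname,Lastname".
        if len(parts) not in (2, 3) or not all(parts):
            problems.append(f"contributor not in Firstname,Initial,Lastname form: {line!r}")
    return problems


if __name__ == "__main__":
    print(check_series("Muscle proteome after exercise", ["John,H,Smith", "Jane,Doe"]))
```

An empty list means that neither rule was violated; the check says nothing about whether the title is unique or descriptive.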

PROTOCOLS
This section includes protocols and fields which are common to all Samples.
Protocols which are applicable to specific Samples should be included in the SAMPLES section instead.

growth protocol The conditions that were used to grow or maintain organisms or cells prior to protein preparation.

treatment protocol The treatments applied to the biological material prior to protein extraction.

extract protocol The protocol used to extract and prepare the protein.

digestion protocol The enzyme used to digest the sample, duration of digestion, whether in gel or in solution, temperature.

separation method The method(s) used to separate the protein mixtures (e.g., column chromatography, gel electrophoresis, capillary electrophoresis), with sufficient supporting details. In general, follow the appropriate MIAPE guidelines.

mass spectrometer The characteristics, ion sources, fragmentation method, and major components of the instrument used. In general, follow the MIAPE: Mass Spectrometry guidelines.

quantification protocol Describe any protocol used to quantify the peptides/proteins.

data processing Provide details of any post-processing performed upon the raw data.

platform The generic instrument type.

SAMPLES
This section lists and describes each of the biological Samples under investigation.

sample ID Unique identifier for each biological sample. This is a local ID that will not appear on the final records.

title Unique title that describes the Sample. We suggest that you use the convention: [biomaterial]-[condition(s)]-[replicate number], e.g., Muscle_exercised_60min_rep2.

source name Briefly identify the biological material and the experimental variable(s), e.g., vastus lateralis muscle, exercised, 60 min.

organism Organism from which the biological material was derived. Use standard NCBI Taxonomy nomenclature.

characteristics List all available characteristics of the biological source, including factors not necessarily under investigation, e.g., Strain: C57BL/6, Gender: female, Age: 45 days, Tissue: bladder tumor, Tumor stage: Ta. Multiple 'characteristics' columns can be included.

description Additional information not provided in the other fields, or paste in broad descriptions that cannot be easily dissected into the other fields.

SAMPLE FILES
This section lists all of the files associated with the experiment and their relationship to each other.
Each Sample may have multiple rows, one for each file.

sample ID Unique identifier for each biological sample. This is a local ID that will not appear on the final records.

results file File that lists all of the proteins, peptides, and matching spectra for each Sample. See the Results spreadsheet for the required format. Each Sample must have only one Results file.

fraction An ordinal number for the gel slice, or an "x,y" coordinate for 2D gels.

raw file The name of the file containing the instrument generated (raw) data for each fraction.

raw file type The raw data type (e.g., mzData, mzXML, mzML, mgf, pkl, sqt). Note: Separate dta files are not accepted.

peptide identification output file The name of the peptide identification search output file for each raw file, matching spectra to peptides. Can be from a protein sequence library search, a spectral library search, or other means of matching spectra to peptides. There may be more than one per raw file.

peptide identification file type The algorithm or method that generated the peptide identification output file, e.g. OMSSA, Mascot, X!Tandem, Sequest, NIST MS, PEAKS, manual inspection, etc.
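
Because each Sample may span several rows but must have only one Results file, the SAMPLE FILES section can be checked for consistency before submission. The sketch below assumes the rows have already been read into dictionaries keyed by the field names above; the function name samples_with_bad_results_file is hypothetical.

```python
from collections import defaultdict


def samples_with_bad_results_file(rows):
    """Return sample IDs that do not map to exactly one results file.

    `rows` is an iterable of dicts with at least the keys
    "sample ID" and "results file" (one dict per SAMPLE FILES row).
    """
    files_by_sample = defaultdict(set)
    for row in rows:
        sample_id = row["sample ID"].strip()
        results_file = row.get("results file", "").strip()
        files_by_sample[sample_id]  # register the sample even if the cell is blank
        if results_file:
            files_by_sample[sample_id].add(results_file)
    return sorted(s for s, files in files_by_sample.items() if len(files) != 1)


if __name__ == "__main__":
    rows = [
        {"sample ID": "S1", "results file": "S1_results.txt", "fraction": "1"},
        {"sample ID": "S1", "results file": "S1_results.txt", "fraction": "2"},
        {"sample ID": "S2", "results file": "", "fraction": "1"},
    ]
    print(samples_with_bad_results_file(rows))
```

A Sample is reported both when it has no Results file and when its rows name more than one.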

Data files

Two types of supplementary data files are required with each submission:

Raw data:

The raw data containing the MS1 and MS2 information from the instrument. The preferred raw data format is mzXML or mzML that contains both the MS1 and MS2 data from a single fraction. Alternatively, text-based formats such as MGF or PKL may be accepted if the original data is no longer available. We cannot accept the binary data file from the instrument (e.g., .raw or .wiff) since it is proprietary and we are unable to process it.

Peptide identification output:

The peptide identification output files from any program used to match the MS2 spectra to the peptides. We accept Mascot DAT files, OMSSA ASN.1 or XML formatted files, or any search engine output that has been converted to PepXML. If no search engine was used, or the format is not yet supported, then the Results spreadsheet must include spectra references (see below).
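
As a rough pre-submission check, raw file names can be screened by extension. This is a sketch only: the extension sets are assumptions derived from the format names mentioned in this document, real files may use other extensions, and the function name check_raw_file is hypothetical. An extension match does not prove that the file content is valid.

```python
from pathlib import Path

# Assumed extensions for the raw file types named in this document.
ACCEPTED_RAW = {".mzdata", ".mzxml", ".mzml", ".mgf", ".pkl", ".sqt"}
# Proprietary instrument binaries and separate dta files are not accepted.
REJECTED_RAW = {".raw", ".wiff", ".dta"}


def check_raw_file(name):
    """Classify a raw file name as 'accepted', 'rejected', or 'unknown'."""
    suffix = Path(name).suffix.lower()
    if suffix in REJECTED_RAW:
        return "rejected"
    if suffix in ACCEPTED_RAW:
        return "accepted"
    return "unknown"


if __name__ == "__main__":
    for name in ["07FEB15_ABRF_FT_100a.mzXML", "07FEB15_ABRF_FT_100a.raw"]:
        print(name, check_raw_file(name))
```

Files reported as 'unknown' are not necessarily unacceptable; they simply need a manual look.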

Results spreadsheet

A Results file must list the proteins discovered in one Sample in the experiment. A separate file must be generated for each Sample. For each protein, the peptides must be listed, and for each peptide, the matching spectra must be listed. If matching spectra are omitted, then every matching spectrum in the peptide identification output files is assumed to be correct. The spectrum_list is a comma-separated list of spectrum_file_name:id (where 'id' is the spectrum number). The spectrum file extension may be omitted from the file name. The required format is shown in the Results spreadsheet within the NCBI_Peptides_Submission_Template Excel file and in the following table:

Protein	Peptide	Spectrum_list
CATA_MOUSE	FSTVAGESGSADTVRDPR	07FEB15_ABRF_FT_100a:2171, 07FEB15_ABRF_FT_100a:2177, 07FEB15_ABRF_FT_100a:2183
	GPLLVQDVVFTDEMAHFDR	07FEB15_ABRF_FT_100a:3653, 07FEB15_ABRF_FT_100a:3660
	GPLLVQDVVFTDEMAHFDRER	07FEB15_ABRF_FT_100a:3231, 07FEB15_ABRF_FT_100a:3495, 07FEB15_ABRF_FT_100a:3499
	LCENIAGHLKDAQLFIQK	07FEB15_ABRF_FT_100a:2967, 07FEB15_ABRF_FT_100a:2968
	LFAYPDTHR	07FEB15_ABRF_FT_50a:2395
	LVNADGEAVYCK	07FEB15_ABRF_FT_100a:2151, 07FEB15_ABRF_FT_100a:2157, 07FEB15_ABRF_FT_100a:2161
	VWPHKDYPLIPVGK	07FEB15_ABRF_FT_100a:2768, 07FEB15_ABRF_FT_100a:2774, 07FEB15_ABRF_FT_50a:2808
CATD_HUMAN	AIGAVPLIQGEYMIPCEK	07FEB15_ABRF_FT_100a:3305, 07FEB15_ABRF_FT_100a:3310
	FDGILGMAYPR	07FEB15_ABRF_FT_100a:3258, 07FEB15_ABRF_FT_10a:3109, 07FEB15_ABRF_FT_10a:3111
	ISVNNVLPVFDNLMQQK	07FEB15_ABRF_FT_100a:3771, 07FEB15_ABRF_FT_100a:3775
	LVDQNIFSFYLSR	07FEB15_ABRF_FT_100a:3705, 07FEB15_ABRF_FT_100a:3711, 07FEB15_ABRF_FT_100a:3716
	QVFGEATKQPGITFIAAK	07FEB15_ABRF_FT_100a:2882
	VSTLPAITLK	07FEB15_ABRF_FT_100a:2857
HBA3_PANTR	VGAHAGZYGAEALER	07FEB15_ABRF_FT_100a:2245, 07FEB15_ABRF_FT_25a:2301, 07FEB15_ABRF_FT_50a:2289
	VLSPADKTNVK	07FEB15_ABRF_FT_100a:1892
KCRM_HUMAN	FEEILTR	07FEB15_ABRF_FT_5a_070216183448:2394
	FKLNYKPEEEYPDLSK	07FEB15_ABRF_FT_100a:2690
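
The layout above can be read back programmatically, which is a convenient way to confirm that a Results file is well formed. The sketch below assumes the Results spreadsheet has been exported as tab-delimited text with the header row shown; the function names parse_spectrum_list and parse_results are hypothetical. A blank Protein cell is treated as continuing the previous protein, as in the table.

```python
import csv
import io


def parse_spectrum_list(cell):
    """Split 'file:id, file:id' into (file_name, spectrum_number) tuples."""
    refs = []
    for item in cell.split(","):
        item = item.strip()
        if not item:
            continue
        file_name, sep, spectrum_id = item.rpartition(":")
        if not sep or not file_name:
            raise ValueError(f"expected spectrum_file_name:id, got {item!r}")
        refs.append((file_name, int(spectrum_id)))
    return refs


def parse_results(text):
    """Return {protein: {peptide: [(file_name, spectrum_number), ...]}}."""
    results = {}
    current_protein = None
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    next(reader)  # skip the header row: Protein, Peptide, Spectrum_list
    for row in reader:
        if not any(cell.strip() for cell in row):
            continue  # ignore blank lines
        row = row + [""] * (3 - len(row))  # pad rows with an omitted spectrum list
        protein, peptide, spectra = (cell.strip() for cell in row[:3])
        if protein:
            current_protein = protein
        if current_protein is None:
            raise ValueError("peptide row appears before any protein row")
        results.setdefault(current_protein, {})[peptide] = parse_spectrum_list(spectra)
    return results


if __name__ == "__main__":
    sample = (
        "Protein\tPeptide\tSpectrum_list\n"
        "CATA_MOUSE\tLFAYPDTHR\t07FEB15_ABRF_FT_50a:2395\n"
        "\tLVNADGEAVYCK\t07FEB15_ABRF_FT_100a:2151, 07FEB15_ABRF_FT_100a:2157\n"
    )
    print(parse_results(sample))
```

A peptide with an omitted spectrum list is returned with an empty list, matching the rule that all spectra in the peptide identification output are then assumed to be correct.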

If the peptide identification output files are in a supported format, then modification information need not be listed. Modifications are listed using the UNIMOD ID number. Fixed modifications for given residues are listed separately and are assumed to apply to all residues of that type. The modified peptide string is given for each applicable spectrum. In a modified peptide string, each modified residue is followed by its UNIMOD ID in parentheses; fixed modifications need not be listed.

Example table listing fixed modifications:

Modification	Residues
5	K, R, C
34	C, R

Example table listing variable modifications:

Peptide	Mod String	Spectrum File	Spectrum ID
LSVEALNSLTGEFK	LSV(18)EALNSL(24)TGEFK	07FEB15_ABRF_FT_100a	3350
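
A modified peptide string of this form can be split into the plain sequence and a map of modified positions. The sketch below is illustrative only: the function name parse_mod_string is hypothetical, positions are reported 1-based by assumption, and residues are assumed to be single uppercase letters.

```python
import re

# One residue, optionally followed by a UNIMOD ID in parentheses.
_RESIDUE = re.compile(r"([A-Z])(?:\((\d+)\))?")
# The whole string must consist only of such residues.
_WHOLE = re.compile(r"(?:[A-Z](?:\(\d+\))?)+")


def parse_mod_string(mod_string):
    """Return (plain_sequence, {1-based position: UNIMOD ID})."""
    if not _WHOLE.fullmatch(mod_string):
        raise ValueError(f"not a valid modified peptide string: {mod_string!r}")
    sequence = []
    mods = {}
    for position, match in enumerate(_RESIDUE.finditer(mod_string), start=1):
        residue, unimod_id = match.groups()
        sequence.append(residue)
        if unimod_id is not None:
            mods[position] = int(unimod_id)
    return "".join(sequence), mods


if __name__ == "__main__":
    print(parse_mod_string("LSV(18)EALNSL(24)TGEFK"))
    # prints ('LSVEALNSLTGEFK', {3: 18, 9: 24})
```

Comparing the returned plain sequence with the Peptide column is a simple way to catch typing errors in the Mod String column.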

Deposit instructions

Zip or tar all files into one archive and transfer to us using the 'other' option on the Direct deposit page. If you find that your files are too large to transfer in this manner, please email us at geo@ncbi.nlm.nih.gov and we will send you FTP instructions.
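
Bundling the files into a single archive can be scripted. The following is a minimal sketch using Python's standard tarfile module; the function name build_archive, the directory layout, and the choice of gzip compression are assumptions, not requirements stated here.

```python
import tarfile
from pathlib import Path


def build_archive(source_dir, archive_path):
    """Pack every file under source_dir into one gzip-compressed tar archive."""
    source_dir = Path(source_dir)
    archive_path = Path(archive_path)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in sorted(source_dir.rglob("*")):
            # Skip directories and the archive itself if it lives in source_dir.
            if path.is_file() and path.resolve() != archive_path.resolve():
                # Store paths relative to source_dir so the archive has no absolute paths.
                tar.add(path, arcname=path.relative_to(source_dir).as_posix())
    return archive_path


if __name__ == "__main__":
    # Hypothetical directory holding the completed template, raw data,
    # peptide identification output, and Results files.
    build_archive("submission", "submission.tar.gz")
```

Listing the archive contents afterwards (for example with tarfile.open(...).getnames()) confirms that every file referenced in the metadata spreadsheet was included.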

This submission procedure is currently under development and is subject to change. However, the accession numbers we assign to your data are stable and will not change, so there will be no need to resubmit your data once development is complete. If you have any questions or concerns about these instructions, please do not hesitate to contact us at geo@ncbi.nlm.nih.gov.
