GEO sequence submission

Introduction
Categories of sequence submissions
Deposit instructions

Introduction

GEO accepts various categories of sequence data generated by next-generation sequencing methodologies (e.g., Illumina, 454 Life Sciences, and Applied Biosystems SOLiD).

We accept data for studies that examine gene expression, gene regulation, epigenetics, or other studies where measuring molecular abundance is central to the experimental design (see below for links to example records).

Data provision and standards:
The GEO database supports and encourages provision of all elements of a study with a view to facilitating comprehensive interpretation of an experiment (see draft MINSEQE proposal).

GEO sequence submission procedures are designed to encourage provision of all the following elements:

thorough descriptions of the biological samples under investigation, and procedures to which they were subjected

thorough descriptions of the technical protocols used to generate and process the data

processed data files (e.g., filtered sequence reads, detection counts, ChIP-seq peak lists)

original short read format sequence files, which will be uploaded to NCBI's Short Read Archive sequence database

Administration:
All standard GEO administration and processing procedures apply to sequence submissions. These include:

Unique and stable GEO accession numbers are issued to experiments; these accessions can be cited in manuscripts

GEO accession numbers are typically issued within 2-5 business days after completion of submission

Data can be held private until publication

Reviewers can have password-controlled access to private records

Submitters can update their records at any time

More information on these aspects is provided in our FAQ.

Categories of sequence submissions

We accept

Studies concerning gene expression, gene regulation, epigenetics, or other studies where measuring molecular abundance is central to the experimental design.

Examples include:

mRNA expression profiling (example)

ChIP-seq (example)

bisulfite sequencing (example)

small RNA discovery and profiling (example)

SAGE (see Web submission instructions)

All accepted data should have a quantitative component, e.g., a sequence abundance count. If you have questions about whether GEO can accept your data type, please do not hesitate to contact us at geo@ncbi.nlm.nih.gov.

We do not accept

whole genome sequencing

resequencing projects

survey sequencing, whole exome sequencing, etc.

For information on how to submit these types of data to NCBI, please contact the Short Read Archive database at sra@ncbi.nlm.nih.gov.

Deposit instructions

Sequence data should be submitted using a modified GEOarchive format, which is composed of a Metadata spreadsheet, Processed data files, and Raw data files:

Metadata spreadsheet

'Metadata' refers to descriptive information and protocols for the overall experiment and individual Samples, and references to external processed and raw data file names. Information is supplied by completing all fields of a metadata template spreadsheet:

Illumina metadata spreadsheet (template and example)

454 metadata spreadsheet (template and example)

Guidelines on the content of each field are provided within the spreadsheets.

NOTE: These templates may change slightly in coming months, so please download the template immediately before you intend to use it.

Processed data file(s)

Requirements for processed data files are not yet fully standardized and will depend on the nature of the experiment. Examples include any sequence alignment or mapping information you generate, and tab-delimited files containing filtered, unique sequence reads or peak lists and detection counts, preferably processed as described in any accompanying manuscript. More than one processed data file may be accepted per Sample.
The file names should be referenced as appropriate in the Metadata spreadsheet.
You can include as many columns as necessary in the tables to thoroughly describe your data.
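
As an illustration of the kind of processed file described above, the following minimal sketch collapses a list of filtered reads into unique sequences with detection counts and prints them as tab-delimited text. The function name, column names and read values are assumptions made for this example only; they are not a required GEO layout.

```python
from collections import Counter


def count_unique_reads(reads):
    """Collapse reads into (sequence, count) pairs, most abundant first."""
    counts = Counter(reads)
    # Sort by descending count, then by sequence, so the order is stable.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Made-up reads standing in for filtered sequence data.
reads = ["ACGT", "TTGA", "ACGT", "ACGT", "TTGA", "GGCC"]
unique_counts = count_unique_reads(reads)

# Hypothetical column names; add further columns as your data require.
lines = ["sequence\tcount"]
lines += [f"{sequence}\t{count}" for sequence, count in unique_counts]
print("\n".join(lines))
```

Additional columns (for example, mapping coordinates) can be appended to each row in the same way.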

Processed data may be supplied either as an individual file per Sample or as a multi-sample matrix file (see example). The example is shown in a spreadsheet for clarity; please provide your data as plain-text, tab-delimited table(s).
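
A multi-sample matrix can be sketched as follows. This is an illustration only: the sample names, feature names and the choice to write 0 for a feature not detected in a sample are assumptions of the example, not GEO requirements.

```python
import csv
import io


def build_matrix(per_sample_counts):
    """Merge per-sample {feature: count} dicts into a header and matrix rows.

    Assumption of this sketch: a feature absent from a sample gets a count of 0.
    """
    samples = sorted(per_sample_counts)
    features = sorted({f for counts in per_sample_counts.values() for f in counts})
    header = ["feature"] + samples
    rows = [[f] + [per_sample_counts[s].get(f, 0) for s in samples] for f in features]
    return header, rows


def to_tab_delimited(header, rows):
    """Render the matrix as plain-text, tab-delimited lines."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical counts for two Samples.
per_sample = {
    "Sample1": {"miR-A": 12, "miR-B": 3},
    "Sample2": {"miR-A": 7, "miR-C": 5},
}
header, rows = build_matrix(per_sample)
matrix_text = to_tab_delimited(header, rows)
print(matrix_text, end="")
```

Writing the table with a tab delimiter rather than saving a spreadsheet keeps the file readable as plain text.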

Raw data files

The raw data files should be the original short read format sequence files. The names of these files should be referenced as appropriate in the Metadata spreadsheet.
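
Before packaging, it may help to confirm that every file name referenced in the Metadata spreadsheet is actually present. The sketch below assumes you have already copied the referenced names out of the spreadsheet into a list; the helper name and the file names are hypothetical.

```python
import tempfile
from pathlib import Path


def find_missing_files(referenced_names, data_dir):
    """Return the referenced file names that are not present in data_dir."""
    present = {p.name for p in Path(data_dir).iterdir() if p.is_file()}
    return sorted(name for name in referenced_names if name not in present)


# Demonstration with a throwaway directory; all names are made up.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "sample1_lane3.srf").write_bytes(b"")
    referenced = ["sample1_lane3.srf", "sample2_lane4.srf"]
    missing = find_missing_files(referenced, tmp)

print(missing)
```

An empty result means every referenced name was found in the directory.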

It is very important to provide raw data files with your submission. These files will be uploaded to NCBI's Short Read Archive sequence database, which has tools to help users view, browse and download sequence data. Also, without raw data your submission may not meet the requirements of the journal you are publishing with. We understand that the volumes of raw data can be very large and difficult to transfer; please contact us if you need advice on this matter.

Accepted file types and packaging instructions are as follows:

Technology	Accepted File Type(s)	Notes
Illumina (please choose one of these three options)	.srf (preferred)	Users should download the Staden io_lib package in order to get the solexa2srf utility. If the goal is to provide data that can be filtered for quality and displayed, then the “processed” (calibrated) data series should be included in the output. To produce a primary analysis SRF submission file for a lane’s worth of data, change the working directory to the run folder and run: solexa2srf -N <run>:%l:%t: -n %x:%y -o <center_name>:<run>:<lane>.srf s_<lane>_*_seq.txt where <center_name> is the short name of the sequencing center or other individual name, <run> is the flowcell name for the run (for example 080117_EAS56_0068), and <lane> is the desired lane. Each flowcell contains 8 lanes but not all lanes are used for production. Also, some lanes are devoted to other projects. Finally, the size of the SRF file produced by this process can be expected to be about 2 GB. For these reasons, it is desirable to produce one SRF file per lane. The SRF file format is nearly optimal in terms of footprint, so there is nothing to be gained by further compressing these files. Therefore, please provide .srf files uncompressed.
	.fastq (see example)	Contains base calls and phred-like quality scores per read. The .fastq files are high-level sequence data output from “one-channel secondary analysis” and are appropriate for applications where the main goal is abundance measurement, e.g., miRNA profiling or ChIP-seq.
	_seq.txt _prb.txt _sig2.txt (see example)	_seq.txt contains base calls per read; _prb.txt contains per-channel pseudo-phred quality scores; _sig2.txt contains phase-corrected signal intensity values. This low-level instrument data output from “four-channel primary analysis” is appropriate for most applications. It is important to package these files in the form: <all data from one lane>.tar.gz. We cannot process these data if they are packaged incorrectly.
454	.sff	Contains flowgram (base call, phred quality score, flow value). The .sff files should reflect the sequencing run setup. If the entire picotitre plate was used, then one .sff file per run should be submitted. If the picotitre plate was divided into two or more regions, then a .sff file for each region should be submitted. If a .sff file contains more than one run, or more than one region in the run, please break up this file into constituent parts using the sfffile utility from the “Off Rig” software package provided by Roche. The read names found in the .sff file are meaningful and reflect the addressing scheme for the picotitre plate as well as a globally unique run id. Please do not rewrite these names, as such addressing information will be lost. The .sff file format is nearly optimal in terms of footprint, so there is little to be gained by further compressing these files. Therefore, please provide .sff files uncompressed. Your sequencing data may have been produced by the 454 contract sequencing center (454MSC). Please ask 454MSC to provide .sff files for your project.
AB SOLiD	.srf	Instructions for converting SOLiD system reads to .srf files using solid2srf are provided on the Applied Biosystems solid2srf site. The SRF file format is nearly optimal in terms of footprint, so there is nothing to be gained by further compressing these files. Therefore, please provide .srf files uncompressed.
HeliScope	to be determined	Please contact us for instructions if you want to submit HeliScope data
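
Because one SRF file per lane is preferred for Illumina data, the solexa2srf command line given in the table can be generated once per lane. The sketch below only fills in the template shown in the table and prints the resulting commands; it does not run solexa2srf. The center name "MYCENTER" and the lane numbers are placeholders, and the function name is our own.

```python
def solexa2srf_command(center_name, run, lane):
    """Fill in the solexa2srf template from the table for a single lane.

    The returned string is meant to be run from inside the run folder, where
    the shell expands the s_<lane>_*_seq.txt pattern.
    """
    parts = [
        "solexa2srf",
        "-N", f"{run}:%l:%t:",
        "-n", "%x:%y",
        "-o", f"{center_name}:{run}:{lane}.srf",
        f"s_{lane}_*_seq.txt",
    ]
    return " ".join(parts)


# Placeholder values: "MYCENTER" and the lane list are assumptions; the run
# name is the example flowcell name given in the table.
run_name = "080117_EAS56_0068"
commands = [solexa2srf_command("MYCENTER", run_name, lane) for lane in (3, 4)]
for command in commands:
    print(command)
```

Only lanes that belong to the study need to be listed, since not every lane on a flowcell is used for production or for the same project.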

The Metadata spreadsheet, Processed data file(s) and Raw data files should be zipped or tarred together and transferred directly to GEO by selecting the 'GEOarchive' option on the Direct Deposit page. If you find that your file archive is too large to transfer using this option, please contact us for details on where to FTP your data.
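
One way to tar the three components together is sketched below with Python's standard tarfile module. The file names are placeholders, and storing every file at the top level of the archive is an assumption of this example rather than a stated GEO requirement. The archive is written without gzip compression, in keeping with the note above that .srf and .sff files gain little from further compression.

```python
import tarfile
import tempfile
from pathlib import Path


def build_archive(archive_path, files):
    """Write an uncompressed tar archive holding the given files.

    Each file is stored under its base name, so the archive is flat.
    """
    with tarfile.open(str(archive_path), "w") as archive:
        for path in files:
            archive.add(str(path), arcname=Path(path).name)
    return Path(archive_path)


# Demonstration with throwaway placeholder files; all names are made up.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    names = ["metadata.xls", "sample1_counts.txt", "sample1_lane3.srf"]
    for name in names:
        (tmp / name).write_bytes(b"placeholder")
    archive_path = build_archive(tmp / "submission.tar", [tmp / n for n in names])
    with tarfile.open(str(archive_path)) as archive:
        archived_names = sorted(archive.getnames())

print(archived_names)
```

Listing the archive contents after writing it is a quick check that the Metadata spreadsheet, processed files and raw files are all included.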

These submission procedures and requirements will be refined in coming months. However, the accession numbers we assign to your data are stable and will not change. If you have any suggestions or concerns regarding any of these issues, please email us at geo@ncbi.nlm.nih.gov.
