Scientific Supercomputing at the NIH

Genome Alignment with SOAP on Helix

[Usage] [Output] [Notes] [Sample Session] [Documentation]

SOAP is a program for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The program is designed to handle the huge amounts of short reads generated by parallel sequencing using the new generation Illumina-Solexa sequencing technology.

SOAP is compatible with numerous applications, including:

SOAP contains 4 programs, please choose proper program to save RAM while fit your job better:

SOAP is a command-driven program, which supports multi-threaded parallel computing, and has a batch module for multiple query sets. These programs are located under:

/usr/local/soap/soap_1.11

Usage

a) Common options:
Usage: soap [options]

-a <str> query a file, *.fq or *.fa format
-d <str> reference sequences file, *.fa format
-o <str> output alignment file
-s <int> seed size, default=10. [read>18,s=8; read>22,s=10, read>26, s=12]
-v <int> maximum number of mismatches allowed on a read, <=5. default=2bp
-g <int> maximum gap size allowed on a read, default=0bp
-w <int> maximum number of equal best hits to count, smaller will be faster, <=10000
-e <int> will not allow gap exist inside n-bp edge of a read, default=3bp
-z <char> initial quality, default=@ [Illumina is using '@', Sanger Institute is using '!']
-c <int> how to trim low-quality at 3-end?

0: don't trim;
1-10: trim n-bps at 3-end for all reads;
11-20: trim first bp and (n-10)-bp at 3-end for all reads;
21-30: trim (n-20)-bp at 3-end and redo alignment if the original read have no hit;
31-40: trim first bp and (n-30)-bp at 3-end and redo alignment if the original read have no hit;
41-50: iteratively trim (n-40)-bp at 3-end until getting hits;
51-60: if no hit, trim first bp and iteratively trim (n-50)bp at 3-end until getting hits;
default: 0

-f <int> filter low-quality reads containing >n Ns, default=5
-r [0,1,2] how to report repeat hits, 0=none; 1=random one; 2=all, default=1
-t read ID in output file, [name, order in input file], default: name
-n <int> do alignment on which reference chain? 0:both; 1:forward only; 2:reverse only. default=0
-p <int> number of processors to use, default=1

Options for pair-end alignment:

-b <str> query b file
-m <int> minimal insert size allowed, default=400
-x <int> maximal insert size allowed, default=600
-2 <str> output file of unpaired alignment hits

Options for mRNA tag alignment:

-T <int> type of tag, 0:DpnII, GATC+16; 1:NlaIII, CATG+17. default=-1[not mRNA tag]

Options for miRNA alignment:

-A <str> 3-end adapter sequence, default=[not miRNA]
-S <int> number of mismatch allowed in adapter, default=0
-M <int> minimum length of a miRNA, default=17
-X <int> maximum length of a miRNA, default=26
-h help

b) Command lines:

SOAP provides batch model for alignment of multiple query datasets onto the samereference, which will avoid loading reference and constructing indexing hash for multiple times. The <parameter file> contains options for each query:

<parameter file>:
-a q1.fa -o out1.sop -s 12
-a q2.fa -o out2.sop -s 12
...
-a qn.fa -o outn.sop -s 10

Format of Output

One line for One hit. The columns are separated by '\t'.

1)id: id of read;
2)seq: full sequence of read. the read will be converted to the complementary sequence if mapped on the reverse chain of reference;
3)qual: quality of sequence. corresponding to sequence, to be consistent with seq, it will be converted too if mapped on reverse chain;
4)number of hits: number of equal best hits. the reads with no hits will be ignored;
5)a/b: flag only meaningful for pair-end alignment, to distinguish which file the read is belonging to;
6)length: length of the read, if aligned after trimming, it will report the information of trimmed read;
7)+/-: alignment on the direct(+) or reverse(-) chain of the reference;
8)chr: id of reference sequence;
9)location: location of first bp on the reference, counted from 1;
10)types: type of hits.

"0": exact match.
1~100 RefAllele->OffsetQueryAlleleQual": number of mismatches, followed by detailed mutation sites and switch of allele types. Offset is relative to the initial location on reference.

'OffsetAlleleQual': offset, allele, and quality. Example:

"2 A->10T30 C->13A32" means there are two mismatches, one on location+10 of reference, and the other on location+13 of reference. The allele on reference is A and C respectively, while query allele type and its quality is T,30 and A,32.

"100+n Offset": n-bp insertion on read. Example: "101 15" means 1-bp insertion on read, start after location+15 on reference.

"200+n Offset": n-bp deletion on read. Example: "202 16" means 2-bp deletion on query, start after 16bp on reference.

Important Notes

1) Simple rules to set parameter of seed size

a)S*2+3<=Min(Read length);
b)Hash size=4^S, normally S<=12bp;
c)Larger S will be faster

2) Avoid very short sequence in reference

The program will collapse if there are some reference sequences with size shorter than the query reads. So please make sure that short reference sequences were prefiltered before launching the alignment.

Sample Session on Helix

<helix> 229% cd /data/userID/soap_run_1
<helix> 230% /usr/local/soap/soap_1.11/soap -a infile.fasta -d backbonefile.fasta -o outfile.soap -s 12

The infile.fasta contains 4 million reads of 33-bp Solexa sequencing data looking like this:

>HWI-EAS3_2:6:1:860:735
GTTTCCGTAGTGTAGTGGTTATTCTTATTCCGT
>HWI-EAS3_2:6:1:446:8
GGTAGATCTGATGTCTGGTGAGTCGTATGCCGT
>HWI-EAS3_2:6:1:29:559
GCATAAGATTAGAAGGTCGTATGCCGTCTTCTG
..........
..........
>HWI-EAS9_2:6:1:454:45
GTACAGTACTGTGATAACTGATCGTATGCCGTC
>HWI-EAS9_2:6:1:181:407
GTACAGTACTGTGATAACTGAATCGTATGCCGT

The backbonefile.fasta is a 250 million bp genomic sequence in fasta format as well.

This test run on helix used around 3.5G memory for 18 min or so.

Version

1.11

Documentation

http://soap.genomics.org.cn/