Sequence walkers: a graphical method to display how binding proteins interact with DNA or RNA sequences

Thomas D. Schneider

National Cancer Institute, Frederick Cancer Research and Development Center, Laboratory of Mathematical Biology, PO Box B, Frederick, MD 21702-1201, USA

Received April 8, 1997; Revised and Accepted July 7, 1997

ABSTRACT

A graphical method is presented for displaying how binding proteins and other macromolecules interact with individual bases of nucleotide sequences. Characters representing the sequence are either oriented normally and placed above a line indicating favorable contact, or upside-down and placed below the line indicating unfavorable contact. The positive or negative height of each letter shows the contribution of that base to the average sequence conservation of the binding site, as represented by a sequence logo. These sequence 'walkers' can be stepped along raw sequence data to visually search for binding sites. Many walkers, for the same or different proteins, can be simultaneously placed next to a sequence to create a quantitative map of a complex genetic region. One can alter the sequence to quantitatively engineer binding sites. Database anomalies can be visualized by placing a walker at the recorded positions of a binding molecule and by comparing this to locations found by scanning the nearby sequences. The sequence can also be altered to predict whether a change is a polymorphism or a mutation for the recognizer being modeled.

INTRODUCTION

Sequence logos are a graphical method that uses letters to quantitatively depict the average sequence conservation and base frequencies in a set of aligned sequences (1). Logos have been used to help understand DNA/protein interactions (2-4), RNA/protein interactions (5), protein structure (6-9) and English word structure (10). However, as useful as they are for characterizing an entire set of sequences, logos convey only a vague idea of how a protein would interact with a specific DNA sequence.

The walker method described here solves this problem by combining letter graphics with a unique weight matrix (11-14) defined by information theory (15-17). Although a walker looks like a logo, it is not the same. In a logo, a stack of letters depicts the relative frequencies of bases or amino acids at each position in an aligned set of sequences. The height of the stack is the sequence conservation, measured in bits of information. In contrast, walkers apply to a single sequence, so in a walker only a single letter is drawn for each position on the sequence (Fig. 1). The height of the letter is in bits, and represents that base's contribution to the sequence conservation of the entire set of sequences. A walker represents the individuals that make up the logo, with the logo representing the average sequence conservation (18). As a walker is moved along a DNA sequence, one can immediately see how the matrix 'responds' to particular sequences. This new method is complementary to, and a natural extension of, sequence logos. Walkers allow one to visualize complex genetic regions, to interpret their structure, to understand the effects of sequence changes and to simultaneously engineer overlapping binding sites.
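
The relationship between the weight matrix and the letter heights of a walker can be illustrated with a minimal Python sketch. It assumes the standard individual information weight for DNA, Riw(b, l) = 2 + log2 f(b, l) bits, where f(b, l) is the frequency of base b at position l of the aligned sites, and it omits the small-sample correction for brevity. The function names, the toy aligned sites and the optional `pseudocount` parameter are illustrative assumptions; they are not taken from the programs listed in this paper.

```python
import math

BASES = "ACGT"

def individual_information_matrix(sites, pseudocount=0.0):
    """Build a weight matrix (one dict per position) in bits.

    Assumed form: Riw(b, l) = 2 + log2 f(b, l), with no small-sample
    correction. `pseudocount` is a hypothetical convenience for avoiding
    -infinity when a base never occurs at a position.
    """
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all aligned sites must have the same length")
    matrix = []
    for l in range(length):
        column = [s[l] for s in sites]
        total = len(column) + 4 * pseudocount
        weights = {}
        for b in BASES:
            f = (column.count(b) + pseudocount) / total
            weights[b] = 2.0 + math.log2(f) if f > 0 else float("-inf")
        matrix.append(weights)
    return matrix

def walker(matrix, sequence, start):
    """Letter heights (bits) for a walker whose first position is `start`.

    Positive heights correspond to upright letters above the line,
    negative heights to upside-down letters below it.
    """
    window = sequence[start:start + len(matrix)]
    if len(window) < len(matrix):
        raise ValueError("walker extends past the end of the sequence")
    return [matrix[l][b] for l, b in enumerate(window)]

def scan(matrix, sequence, threshold=0.0):
    """Step the walker along the sequence; report (start, total bits)
    for every placement whose summed height exceeds `threshold`."""
    hits = []
    for start in range(len(sequence) - len(matrix) + 1):
        total_bits = sum(walker(matrix, sequence, start))
        if total_bits > threshold:
            hits.append((start, total_bits))
    return hits

# Toy aligned sites (made up for illustration only).
sites = ["ACGT", "ACGA", "ACTT", "AAGT"]
matrix = individual_information_matrix(sites)
heights = walker(matrix, "GGACGTGG", 2)
hits = scan(matrix, "GGACGTGG")
print(heights)
print(hits)
```

In this sketch the sum of the letter heights of one walker is the individual information of that placement, and averaging that sum over the sites used to build the matrix gives the total height of the corresponding logo (again ignoring the small-sample correction).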

Program	Version	Function
dbmutate	1.30	mutate GenBank database entries
dnaplot	3.40	graph individual information across large DNA sequences
exon	1.86	convert exons and CDSs to features for a lister map
lister	8.63	list sequences with translation, features, walkers and hand-defined marks
live	1.14	add a color bar to a lister map to show DNA periodicity
makewalker	3.47	walk an information weight matrix across a sequence
mergemarks	1.04	merge live marks with hand-defined marks
ri	2.37	compute individual information weight matrix and individual information for every site
scan	2.88	scan sequences with an individual information matrix to find sites
xyplo	8.63	general x,y data plotter

INTRODUCTION

MATERIALS AND METHODS

RESULTS AND DISCUSSION

Mathematical basis of walkers: individual information

Walkers

Graphical searching

Viewing medium-sized regions of sequence

Displaying complex genetic regions

Quantitative genetic engineering

Detecting database anomalies

Distinguishing mutations from polymorphisms

NOTE ADDED IN PROOF

ACKNOWLEDGEMENTS

REFERENCES