CBB-Seminar, Tues. June 15, 11:00 am, 8th floor conference room

Paul Kitts

NEW TOOLS TO MINIMIZE THE INCIDENCE AND IMPACT OF VECTOR CONTAMINATION

Maintaining high sequence quality in public databases is critical given the increasing role of bioinformatics in the scientific arena. Database sequences that contain segments of foreign origin can confound the interpretation of any analysis performed on a data set that includes them. For example, a similarity search can produce hits that are based only on shared segments of foreign sequence. Although nucleotide sequence analyses are most susceptible to the effects of contamination, protein analyses can also be affected, since foreign sequence segments are occasionally found within a coding region.

The most common source of contamination is the vectors used to clone and sequence the source DNA/RNA. Improved methods for detecting vector-derived sequences will therefore help to reduce the incidence and impact of contaminated sequences in public databases. To this end, a specialized database of non-redundant vector sequences has been developed for use in screening nucleotide sequences for contamination. In addition, the similarity search parameters for detecting vector contamination have been optimized. Finally, criteria for automatic categorization of segments of potential vector contamination have been defined.

The result of this work is a quick, sensitive, and discriminating screen for vector contamination. The new screening system is currently being used to identify nucleic acid sequences with vector contamination before they are added to GenBank. Another application for the new screen would be to suppress the effects of vector contamination already present in public databases by masking vector sequences prior to a similarity search.
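
The masking application mentioned at the end of the abstract could be sketched roughly as follows. This is only an illustration, not the screening system described in the talk: the function name `mask_segments`, the 0-based half-open coordinate convention, and the use of `N` as the mask character are all assumptions made for the example, and the segment coordinates are presumed to come from some earlier screening step.

```python
def mask_segments(sequence, segments, mask_char="N"):
    """Return a copy of `sequence` with each (start, end) segment masked.

    Hypothetical helper: segments are assumed to be 0-based, half-open
    intervals reported by an earlier vector screen.
    """
    masked = list(sequence)
    for start, end in segments:
        # Clamp to the sequence bounds so out-of-range coordinates do not raise.
        start = max(0, start)
        end = min(len(masked), end)
        for i in range(start, end):
            masked[i] = mask_char
    return "".join(masked)


# Example: mask a suspected vector segment at each end of a short read.
query = "ACGTACGTAC"
print(mask_segments(query, [(0, 3), (8, 12)]))
```

Masked positions can no longer contribute to a match, so a similarity search run on the masked query would not report hits that rest only on the shared vector segments.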