National Biospecimen Network

Bioinformatics and Data Management

The National Biospecimen Network (NBN) will require an information system to track biospecimens, support the collection and dissemination of key clinical and biological data associated with them, provide analytical capabilities for genomic and proteomic based research, and manage information needed for NBN administration. This module outlines the requirements and architectural issues for the NBN information system, and proposes approaches to establishing such a system using a combination of new development and adaptation of existing systems.

4.1 Introduction
The field of biomedical informatics encompasses the full range of technologies needed for clinical, biomedical, and genetics research, including applications for computational biology such as the study of probabilistic methods for gene analysis or techniques for studying protein folding. This module focuses on the particular requirements of the NBN for a multidimensional system that can capture and manage data serving a broad range of applications that begins at the bedside and progresses through the genetics laboratory.
The NBN’s utility is maximized with a well-designed and powerful information system, and biospecimens will prove more useful for research when accompanied by annotation of relevant clinical data. By design, not only the quality but also the accessibility of the biospecimens and associated data must be considered to be of the utmost importance, as acquiring data without a way to use or share the information is pointless.¹ The Tissue Access Working Group and the NBN Design Team recognized that the development of the NBN information system as an open-access tool using scalable and extensible architecture is at the core of its efficient functioning and serves to buttress the NBN’s purpose of supporting the integration and exchange of information for cancer research.
Identifying and implementing the best information technology (IT) architecture for the NBN will be a critical success factor and must be a priority in the NBN planning and budgeting process. This module reviews the architectural requirements for an informatics superstructure that supports searchable Internet-based approaches to an open-access but data-secure system; the collection and dissemination of biospecimens and data; reliable information exchange between NBN components and existing clinical information systems, subject to appropriate standards; data mining tools to facilitate scientific discovery; and management of the business functions of the NBN. This system must include an optimal design for the protection of patient confidentiality that allows patients reasonable access to aggregated research results.
4.2 Bioinformatics Support of Existing Biospecimen Resources
There are a number of genomics-based resources that depend on robust IT systems. This section describes a few of the better known examples that can help guide the development of the NBN infrastructure or that have characteristics and/or the capability to cooperate in a linked system should the effort to build the NBN be focused on integrating existing (unconnected) systems. These are listed in alphabetical order, with more extensive discussions for some examples in the appendices.
Ardais Corporation
Ardais Corporation is a privately held clinical genomics company whose goal is to accelerate biomedical research by applying actual human disease, in the form of human tissue samples, as the discovery model in pharmaceutical research (see Appendix Q). The Ardais Biomaterials and Information for Genomic Research (BIGR™) System encompasses a unique repository, called the BIGR™ Library, with more than 170,000 research-quality tissue samples representing a broad diversity of disease. The samples are collected through the National Clinical Genomics Initiative, a strategic collaboration between Ardais and four leading U.S. medical centers.
Ardais has estimated that 45 percent of its staff are working in bioinformatics. As part of their comprehensive IT/bioinformatics system, Ardais identified a number of specific needs including:

Structured collection and quality control of clinical data
Oncology-specific clinical data suited to researcher needs
Sample tracking
Case linking
Web-based access
Experimental design
Browsing
Remote deployment
Multisite coordination

To satisfy IT and bioinformatics requirements, Ardais created the BIGR™ System as a discovery platform for application to drug discovery and development. The system includes a centralized, shared clinical genomics repository that encompasses tissue samples, molecular derivatives, and associated clinical information accessed by an array of bioinformatics tools. The BIGR™ Library is a scalable, Web-deployable, Java-based system architecture that is managed from a central database, with strict controls based on individual user privileges. The Library is multiinstitutional and will allow Web-based access to the repository on a researcher’s desktop computer. Clinical and demographic data are gathered and reviewed to ensure consistency and comparability, enabling researchers to navigate through a highly structured, consistent, and comparable library of materials and associated data. It is possible to search samples using patient attributes, diagnosis, tissue type, appearance, sample composition, pair-ability of diseased and matched normal from the same patient, in addition to viewing supporting data, including digital images.
First Genetic Trust
First Genetic Trust (FGT) is a business that develops IT solutions to address data, privacy, confidentiality, and ethical challenges in genomics and proteomics (see Appendix R). It supports genetic research as a trusted third party by providing a highly secure Web-based IT infrastructure for genetic banking, which serves as the cornerstone of an integrated research solution for patient recruitment and informed consent and for medical and genetic data acquisition, transfer, storage, and analysis. The patient, physician, investigators, administrators, and laboratory personnel can dynamically interface via the Web for patient education, information regarding the scope of the proposed research, and the consent process. The physician has similar access for aggregating phenotypical clinical data and obtaining clinical samples.
To address privacy and confidentiality protection, FGT has developed the enTRUST Genetic Banking System, with a Web-based architecture, and a highly secure, distributed genetic banking system. This system uses “Virtual Vault,” Hewlett Packard’s military-grade operating system, which leverages standard security technology for encryption and intrusion detection and exceeds both Health Insurance Portability and Accountability Act (HIPAA) and European Directive requirements for data collection, consent, accuracy, and security. A patient is assigned an encrypted electronic identifier that serves as a virtual private identity that is stored in one dataset; phenotypic or clinical information is stored in a second dataset; and genotypic data are stored in a third dataset. The three datasets are linked through the patient’s virtual private identity.
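The three-dataset arrangement described above can be illustrated with a minimal sketch. It is not FGT's implementation: a keyed hash (HMAC-SHA256) stands in for the encrypted identifier, and the names `virtual_private_identity`, `GeneticBank`, and `SECRET_KEY` are hypothetical. The point is only that identity, phenotype, and genotype records live in separate stores that share nothing but the pseudonymous identifier.

```python
import hashlib
import hmac
import secrets

# Hypothetical key held only by the trusted third party (an assumption of this sketch).
SECRET_KEY = secrets.token_bytes(32)


def virtual_private_identity(patient_identifier: str, key: bytes = SECRET_KEY) -> str:
    """Derive a stable pseudonymous identifier using a keyed hash (HMAC-SHA256)."""
    return hmac.new(key, patient_identifier.encode("utf-8"), hashlib.sha256).hexdigest()


class GeneticBank:
    """Three separate stores that share only the pseudonymous identifier."""

    def __init__(self) -> None:
        self.identities = {}  # dataset 1: identifying information
        self.phenotypes = {}  # dataset 2: phenotypic or clinical information
        self.genotypes = {}   # dataset 3: genotypic data

    def enroll(self, patient_identifier: str, identity: dict,
               phenotype: dict, genotype: dict) -> str:
        """Store each kind of record in its own dataset under the same identifier."""
        vpi = virtual_private_identity(patient_identifier)
        self.identities[vpi] = identity
        self.phenotypes[vpi] = phenotype
        self.genotypes[vpi] = genotype
        return vpi

    def research_view(self, vpi: str) -> dict:
        """Join phenotype and genotype data without reading the identity dataset."""
        return {
            "vpi": vpi,
            "phenotype": self.phenotypes[vpi],
            "genotype": self.genotypes[vpi],
        }
```

In this sketch a researcher-facing query would call only `research_view`, so identifying information is never part of the joined result.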
FGT aggregates data via a Web-based architecture that interfaces with existing datasets. Data are accumulated, cleaned, aggregated, and stored in a repository. A common architecture in the system provides for distributed, centralized sample banking. The FGT research management tools are all Web-based. They include consent and reconsent modules (including information feedback to the patient, such as genetic counseling), clinical and genomic data capture, the ability to configure specific studies, sample logistics and banking, remote clinical data capture, study contract storage, and bioinformatics. Data representation standards support data exchange and mining, including aggregation of complex studies.
Since FGT is a trusted third-party banking technology provider, data access rights and policies are determined by the sponsor of the banking initiative. Public and “managed data access” models can both be accommodated. Histopathological image data are not currently available, but it is technologically feasible to provide them. The design protocol can be written to automatically aggregate clinical updates and secular outcomes.
NCI Center for Bioinformatics
The NCI Center for Bioinformatics (NCICB) has developed a comprehensive enterprise architecture for biomedical data management and several analytical tools to facilitate translational research, much of which could be leveraged for use in the NBN. Several of these components are described here. All are free, open-source software and are in production today (see ncicb.nci.nih.gov).
Bioinformatics Core Infrastructure. The NCICB infrastructure backbone is called caCORE; it consists of technology tools and services as well as an overall data management architecture (see ncicb.nci.nih.gov/core). It sets up common data elements and structured interactions between these elements through object models. caCORE has a common ontologic representative environment. It begins with a comprehensive cross-mapping of the various discipline vocabularies, first classified into a metathesaurus and then redistributed into a common framework, thus creating a National Cancer Institute (NCI) cancer thesaurus. From this thesaurus, 3,000-plus trial-structured data reporting elements have been extracted. Because the trial elements are based on a standard metadata repository, they can be defined, shared, and manipulated by the various communities.
This “multitier” system architecture is derived from modern software engineering designs and is implemented using the Unified Modeling Language (UML) and Java 2 Enterprise Edition. The features of the caCORE that make it a candidate for adoption by the NBN include:

A multitier design that allows for the addition of multiple independent database sources that do not have to be physically colocated. This would enable a flexible deployment topology across multiple NBN sites.
A modeling paradigm designed to be understood by nonprogrammers, which uses cancer bioinformatics infrastructure “objects” (CaBIO).
Well-defined, documented application programming interfaces (APIs), which allow programming teams that did not develop the original architecture to use the full power of the system to write their own applications.
Built-in support for many biomedical data types, including the human genome sequences and features, single nucleotide polymorphisms, gene expression patterns and sequences, therapeutic agents, clinical trial protocols, and many others.
The Cancer Data Standards Repository, part of caCORE, which is a data element (metadata) database and tool suite. Such data elements are constructed from controlled vocabularies and thus provide for semantic consistency over time and across collection sites. The NBN could create and manage the data elements it needs for data collection and sharing using this resource.

Microarray Data Management and Analysis. The NCICB’s caArray project consists of a minimum information about a microarray experiment (MIAME)-compliant microarray (caArray) database and tools for microarray data analysis and visualization. Originally built to support the NCI Director’s challenge initiative, it is currently used in several NCI research programs. The goals of the project are to make microarray data publicly available and to develop and bring together open-source tools to analyze and visualize these data. The caArray database connects to CaBIO, permitting access to a variety of genomic, cancer model, and clinical trials information. The primary interface to caArray is the Gene Expression Data Portal, a facility for uploading and retrieving microarray experiment data as well as performing some types of analysis (gedp.nci.nih.gov). The project has recently released a pathway visualization tool, which allows researchers to view the expression levels of genes on the array via visual pathway diagrams.
Two new applications will be released in September 2003: a gene expression data analysis workbench and a genomic viewer. The data analysis workbench is a desktop tool that will include a richer collection of analysis and visualization functions, including filters and normalizers, clustering tools, color mosaic images, dendrograms, and pathways. WebCGH is a Web-based tool for the analysis and visualization of Comparative Genomic Hybridization (CGH) data. The application will enable users to create whole genome plots using CGH data stored in the caArray database and focus on a chromosome or chromosomal region of interest.
Clinical Trials. The NCICB has constructed a clinical trial protocol management system for the Specialized Programs of Research Excellence (SPORE). The system supports administrative entry and tracking of trial protocols being reviewed and launched by the SPORE program. The NBN could conceivably leverage this system to manage research proposals for biospecimens.
Image and Other Data. Various information-capture modules are placed on top of this infrastructure (e.g., caEXPRESS, caIMAGE, caClinicaltrials, caModelsDB, and caLIMS). For example, caLIMS, the laboratory information management system (LIMS), describes how the data were collected. Image and pathology capture are managed by caIMAGE. A large collection of objects that describe clinical trials and the extant cancer model are provided by caClinicaltrials and caModelsDB. A prototype integration application to facilitate cross-disciplinary reasoning is the Cancer Molecular Analysis Project. It allows users to move from molecular profiles through clinical trials and is applicable to different fields. NCICB has constructed an image file management and delivery system called caIMAGE (caimage.nci.nih.gov) that could potentially be leveraged to support the NBN’s histopathology image data needs. The NCICB system does not include a tissue management infrastructure for tissue inventory and control, as this is not part of its charge. Other organizations, however, may be able to provide off-the-shelf infrastructure for this purpose (e.g., Cooperative Human Tissue Network, Cancer and Leukemia Group B, Daedalus Software).
Security. The vast majority of the data in NCICB are already publicly accessible. For data requiring more secure access, particular authorization is required. The system uses Web-browser interfaces and an infrastructure distributed through a variety of formats that allows users to write external applications to “reach through” to the data. In essence, it is a repository that can be partially accessed with off-the-shelf technology.
Shared Pathology Informatics Network
In 2001, NCI awarded two cooperative agreements to develop a Shared Pathology Informatics Network (SPIN), defined as a model Web-based system to access data related to archived human specimens at multiple institutions. Two groups were funded at $13.5 million over 5 years to work collaboratively on SPIN. Data will be accessed from existing pathology and other medical databases. The ability to automatically access information from medical databases is the first step toward the long-term goal of developing informatics systems to support NCI’s efforts to improve researchers’ access to human specimens and clinical data. The systems to be developed by the network will be able to respond automatically to authorized queries by identifying, obtaining, collating, and returning data for those cases that meet defined search criteria. Patients’ names and other identifying information are to be encrypted or otherwise modified to protect patient privacy and confidentiality and to comply with applicable confidentiality regulations. In addition to improving access to clinical data, the system is expected to provide researchers with the means to quickly identify and determine the availability of archived specimens and data that meet their research needs.
Both projects are well into their second year, and notable progress has been made. Approximately 20 hospitals have now established peer-to-peer data sharing arrangements, and information on over a million pathology specimens is now accessible online. The current program will not actually make the specimens available to researchers, only information about them. SPIN is designed to provide a proof of principle. Actual specimen transfer would be the objective for a follow-on 5-year project under discussion.
The United Kingdom National Cancer Tissue Resource
One of the primary goals of the United Kingdom National Cancer Tissue Resource (NCTR) is to develop an information grid that can automatically and seamlessly incorporate all relevant data from each new patient into the appropriate database, input that patient’s data into existing predictive models, and transmit that information to the clinician in the clinical environment (see Appendix O). As a first step, the NCTR proposed that the University of Cambridge, with the University of Leeds, the University of Glasgow, and the Peterborough Hospital Research Tissue Bank Network, lead the development of a pilot information system to include the development of an “informatics hub,” which will build, maintain, and integrate heterogeneous and distributed databases (see Figure 4-1). Ultimately, these databases will include the specimen bank data, clinical and pathology data, clinical trials and outcome data, and research results. The base data would comprise a minimum dataset for the NCTR.
This hub will connect to present and future NHS information systems, including other clinical trials networks. There will also be a need to develop data mining tools that will synthesize these heterogeneous data sources into diagnostic and prognostic information on which clinicians can base decisions. The NCTR is working closely with commercial software developers to create research informatics platforms. Grid technology is expected to underlie the architecture, and it is recognized that there will need to be a security plan.

Figure 4-1. Adapted from the United Kingdom National Cancer Tissue Resource Bioinformatics Schema
4.3 NBN Bioinformatics System Requirements and Recommendations
There is an understandable tendency to start information systems from scratch; however, during the past 10 years the healthcare IT industry and the clinical and research pathology industries have made substantial investments in the development of sophisticated clinical and pathology information systems. Moreover, the Government and other organizations have developed information systems that may meet many of the requirements of the NBN information system, in whole or in part. Thus, it will be essential to identify the appropriate balance of “buy” versus “build.” Truly critical system requirements should not be forfeited for shrink-wrapped solutions, and modifying commercial systems and paying ongoing licensing fees may dramatically increase the long-term costs of ownership. On the other hand, the NBN may be able to negotiate favorably with IT vendors if it can demonstrate sufficient volume of usage.
The buy/build/modify decision cannot be made until the architecture and requirements are specified. On the other hand, the requirements analysis needs to be informed by a thorough and up-to-date survey of existing systems, both commercial and government held, because requirements are not developed in a vacuum and features and functions that are deemed to be “absolutely required” or “desired” might be influenced by what is readily available (and the associated price). For example, 10 years ago “online instant access at no communications cost to the user” might have been deemed “expensive but desirable”; the Internet has made this a “no-cost must-have.” Automated teller machines are an example of how data exchange standards can revolutionize an industry. Similarly, consider how microarray technology has changed the requirements analysis from what it might have been a few years ago.
The size and complexity of the NBN information system will undoubtedly be impressive. There are clearly a number of challenges surrounding the building of big systems. The failure rate and cost of information systems are both exponentially related to size and complexity. Establishing an optimal information system management model is critical to the success of the NBN. The NBN Operations Center and the Board of Governors must understand that the informatics infrastructure itself will need to be adequately staffed and funded and ought to be supported as an informatics research enterprise, which is likely to promote quality improvements and attract the best talent, along with the research support that often accompanies it.² It is clearly of paramount importance that a fully developed design be accompanied by realistic budgets, scalable implementation, and vigorous management of the social and political landscape via attendant authority.
4.3.1 Management of the NBN Information System
It was the sense of the Design Team that while general features and common data elements should incorporate input from diverse communities, information architecture design and implementation by committee does not work well for large software projects. It was recommended that a clearly empowered management model serve as the foundation of the NBN information system. Thus, the day-to-day management of the architecture design and development process must be guided by a strong and legitimized hand.³ In essence, it is proposed that one individual have authority over project personnel, budget, design, and architecture and that this person be accountable to the governing principles of the NBN. It is important to establish authority early.
Although the decision-making authority for management and for design must be held by the system manager and architect, they will need to work closely on requirements analysis with:

Bioinformatics counterparts who have developed biospecimen bank systems and who may be linked to this system directly or more loosely coupled via middleware in the spirit of grid computing
Users (principally scientists and research administrators)
Pathologists and laboratory scientists
Biospecimen donors and users
Pure data users
Clinical information system experts
Patients
Bioinformatics experts with knowledge of similar systems

4.3.2 Design of the NBN Information System
It is recommended that the technical design of the NBN information system be created by the overall system manager’s appointed architect. This section provides guidance on starting principles and will outline areas for which requirements will need to be developed. A detailed statement of the design is beyond the scope of this document.
4.3.2.1 Data Standards
The development of the NBN data standards could be challenging, primarily because the data standards and vocabularies that different user communities use are so heterogeneous (both within and between the groups), because the data structures and required reports are so inherently complex, and because of the need for longitudinal data. Any resulting models will have to take into account Health Level Seven (HL7) and UMLS and incorporate what is current at NCI. It is possible that some new data standards will need to be developed (hopefully as a variation of current ones). To illustrate, consider the wide variety of data representation required to encode such disparate data sources as encoded clinical pathology reports, text histopathology reports, demographic and clinical information, insurance records, clinical trials protocols, and microarray research results.
4.3.2.2 Minimum Dataset and Data Location
Assuring the availability of a minimal dataset should be an important element of the NBN system design. In addition, because modern computer communications systems make it possible to access data stored in many locations from many other (different) locations, it is unnecessary to build a single comprehensive database that would contain all NBN data. Nevertheless, connecting disparate information systems will likely be an evolutionary process, greatly facilitated by the more widespread adoption of data standards, such as the Systematized Nomenclature of Medicine (SNOMED), which has recently been adopted by the Department of Health and Human Services (HHS) as a standard.⁴ A common lexicon of terms will not only create a new standard in health care and biomedical research but also enable researchers to mine databases with greater efficiency and added confidence that all available relevant information has been detected.
4.3.2.3 Architecture
It is proposed that the architecture be a standards-based distributed system, with a central database at first that will evolve over time to be highly distributed, with the central database being limited to pointers and storage of any relatively stable, highly used administrative data. It was noted that sites that may be major data contributors (e.g., community hospitals) may be least equipped to handle a distributed data architecture.
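A minimal sketch of the "central database limited to pointers" idea follows. All names (`Pointer`, `CentralIndex`, `SiteStore`, `fetch`) are hypothetical, and an in-memory dictionary stands in for whatever storage a site actually uses. The central index records only where a specimen's records live; the records themselves remain at the contributing sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pointer:
    """Location of one record: which site holds it and its local identifier."""
    site: str
    record_id: str


class CentralIndex:
    """Central store that keeps pointers only, never the records themselves."""

    def __init__(self) -> None:
        self._pointers = {}  # specimen ID -> list of Pointer

    def register(self, specimen_id: str, pointer: Pointer) -> None:
        self._pointers.setdefault(specimen_id, []).append(pointer)

    def locate(self, specimen_id: str) -> list:
        return list(self._pointers.get(specimen_id, []))


class SiteStore:
    """Stand-in for a site's local system; its internals are not prescribed."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._records = {}

    def put(self, record_id: str, record: dict) -> None:
        self._records[record_id] = record

    def get(self, record_id: str) -> dict:
        return self._records[record_id]


def fetch(index: CentralIndex, sites: dict, specimen_id: str) -> list:
    """Resolve every pointer for a specimen by asking the site that holds it."""
    return [sites[p.site].get(p.record_id) for p in index.locate(specimen_id)]
```

Because callers go through `fetch`, records could later be moved between sites by updating pointers alone, which is one way the gradual shift from a central to a distributed system might be accommodated.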
The NBN architecture would specify standards at Open System Interconnection (OSI) Level Seven only and rely on HL7 standards wherever possible, developing new ones as needed.⁵ Thus, the NBN information system would be built in a way that will not specify how a local site or Business Unit stores its data, what operating systems, hardware, or software it uses, or even what type of internal communications modalities it uses. Because it is certain that there will be data links to heterogeneous and distributed databases of ancillary clinical and pathological information, it would be wasteful over the long term to create duplicate local data stores and other capabilities except for selected, highly used data. On the other hand, for the first year or two, until the system is well established, it is likely to be relatively centralized, and there will inevitably be some duplication.
However, it is strongly recommended that the NBN information system require that the data arrive in a certain format, and be packaged in a precise way, via the Internet. The NBN systems will know how to “open” the package and interpret the contents. Practically, this means that one of the first orders of business for the Bioinformatics Unit within the Operations Center will be the creation of a data model.
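The packaging requirement can be sketched as a small envelope format that a site fills in and the NBN system validates on receipt. Everything specific here is an assumption made for illustration: the use of JSON, the field names, the `FORMAT_VERSION` value, and the checksum are not part of the blueprint, which leaves the actual data model to the Bioinformatics Unit.

```python
import hashlib
import json

FORMAT_VERSION = "0.1"  # hypothetical version label
REQUIRED_FIELDS = {"specimen_id", "tissue_type", "preservation"}  # hypothetical minimum


def _check_fields(record: dict) -> None:
    """Reject records that lack any of the required fields."""
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"record is missing fields: {sorted(missing)}")


def pack(site: str, record: dict) -> str:
    """Wrap a record in an envelope carrying its format version and a checksum."""
    _check_fields(record)
    payload = json.dumps(record, sort_keys=True)
    envelope = {
        "format_version": FORMAT_VERSION,
        "site": site,
        "payload": payload,
        "sha256": hashlib.sha256(payload.encode("utf-8")).hexdigest(),
    }
    return json.dumps(envelope)


def unpack(package: str) -> dict:
    """Open a package, verify version and checksum, and return the record."""
    envelope = json.loads(package)
    if envelope.get("format_version") != FORMAT_VERSION:
        raise ValueError("unsupported format version")
    payload = envelope["payload"]
    if hashlib.sha256(payload.encode("utf-8")).hexdigest() != envelope["sha256"]:
        raise ValueError("payload checksum mismatch")
    record = json.loads(payload)
    _check_fields(record)
    return record
```

A site can produce such a package from any local system, which is consistent with the principle that the NBN specifies the exchange format rather than how sites store their data.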
There may be times when the NBN will make recommendations with respect to software or hardware, based on experience. For example, it may learn that a certain software toolkit provides reliable data translations from Hospital Information System X to NBN standards. In addition, the NBN may develop specialized software (or even hardware) for use at local sites or within the Business Units, perhaps on a fee basis. Use of NBN-recommended or developed software, to the extent that it provides system-wide efficiencies, may be encouraged by incentives (e.g., a guarantee that the data will be formatted correctly or NBN will fix it; or perhaps a lower charge for another service). It is most likely that the NBN will develop software for local use if the NBN is asking for specialized data (e.g., on a rare disease) or for data for which standards do not yet exist (e.g., microarray data).
All communications will be Internet protocol-based, possibly with grid storage and job distribution capabilities strongly enabled; only later is it likely that the NBN will take advantage of parallel processing for advanced computation. The importance of a multitiered architecture has been emphasized by Working Group members.⁶
The system should probably be designed as a tightly coupled and extensible set of independent modules, each one associated with a specific set of functions. The design should fully support “Plug & Play” operations, where new modules (as well as new versions of existing modules) can be immediately deployed, as long as they interact with (and possibly extend) well-defined interfaces. Identifying these modules—and assuring their true functionality—will be one of the first jobs of the system architect. For example, all modules should be designed with robust and flexible key structures that allow them to be used in unanticipated fashions.
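One possible reading of the “Plug & Play” requirement is a registry that accepts any module satisfying a shared interface and lets a newer version replace an older one without changes to callers. The sketch below assumes a hypothetical `Module` interface with a `name`, a `version`, and a `handle` method; none of these names come from the blueprint.

```python
from typing import Protocol


class Module(Protocol):
    """Interface every pluggable module is assumed to satisfy (hypothetical)."""
    name: str
    version: int

    def handle(self, request: dict) -> dict: ...


class ModuleRegistry:
    """Holds one active module per name and routes requests to it."""

    def __init__(self) -> None:
        self._modules = {}

    def plug(self, module: Module) -> None:
        """Deploy a new module, or replace an existing one with a newer version."""
        current = self._modules.get(module.name)
        if current is not None and module.version <= current.version:
            raise ValueError(f"{module.name} v{module.version} is not newer than the deployed version")
        self._modules[module.name] = module

    def dispatch(self, name: str, request: dict) -> dict:
        """Send a request to the module currently registered under this name."""
        if name not in self._modules:
            raise KeyError(f"no module named {name!r} is deployed")
        return self._modules[name].handle(request)


class SampleTrackingV1:
    name = "sample_tracking"
    version = 1

    def handle(self, request: dict) -> dict:
        return {"specimen_id": request["specimen_id"], "status": "in storage"}


class SampleTrackingV2:
    """Extends the V1 response with a location while keeping the same interface."""
    name = "sample_tracking"
    version = 2

    def handle(self, request: dict) -> dict:
        return {
            "specimen_id": request["specimen_id"],
            "status": "in storage",
            "location": request.get("location", "unknown"),
        }
```

Callers use only `dispatch("sample_tracking", ...)`, so deploying `SampleTrackingV2` requires no change on their side.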
Vertical (functional) modules would allow for the easy addition of new functionality, e.g., a new data analysis algorithm or a new remote data capture module; horizontal (behavioral) modules can be used to drive the global behavior of the system by integrating the different functionalities via well-defined contracts and workflow patterns (Table 4-1). In practice, even the vertical functions, if properly developed, could be used throughout the system. For example, a strong education component could be included in the Specimen and Data Acquisition Business Unit as part of the informed consent process, in the Advanced Analysis Business Unit to bring a new technician up to speed, and in the Patient Relations Business Unit to help donors understand how their sample type is being used.
The NCICB platform, with its rich APIs and multitier architecture, could be integrated with other systems that provide the additionally needed functionality, such as a sample inventory management and tracking system. The NCICB platform currently does not include an impenetrable security and encryption mechanism; however, the Ardais model suggests that private patient data could remain at the primary specimen collection sites and that only deidentified clinical information needs to be transmitted into the biospecimen informatics network.
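The approach of keeping private patient data at the collection site can be sketched as follows. The field names in `ALLOWED_FIELDS`, the `CollectionSite` class, and the random specimen code are assumptions made for illustration, and an allow-list alone would not by itself satisfy regulatory deidentification requirements. The sketch shows only the division of responsibility: the site keeps the link between a code and a patient, and only allow-listed fields plus the code are transmitted.

```python
import secrets

# Hypothetical allow-list of fields considered safe to transmit.
ALLOWED_FIELDS = ("diagnosis", "tissue_type", "age_at_collection", "sex")


class CollectionSite:
    """Primary collection site that retains all identifying information locally."""

    def __init__(self, site_code: str) -> None:
        self.site_code = site_code
        self._code_to_patient = {}  # link table that never leaves the site

    def export(self, record: dict) -> dict:
        """Build the outbound record: allow-listed fields plus a random code."""
        code = f"{self.site_code}-{secrets.token_hex(8)}"
        self._code_to_patient[code] = record["patient_id"]
        outbound = {field: record[field] for field in ALLOWED_FIELDS if field in record}
        outbound["specimen_code"] = code
        return outbound

    def reidentify(self, specimen_code: str) -> str:
        """Resolve a code back to a patient; possible only at the collection site."""
        return self._code_to_patient[specimen_code]
```

Using an allow-list rather than a list of fields to strip means that any field not explicitly approved, including ones added later, stays at the site by default.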
One important virtue of the modular approach is that well-designed (but extensible) security functions that protect confidentiality can be used throughout the NBN, enhancing overall security while allowing site-specific customization (e.g., e-consenting forms that might vary across hospitals). Essential functions are designed once to serve the NBN securely as a whole.
Table 4-1. Candidate “Plug & Play” Modules

Vertical Modules
(local functions)

Authentication and authorization
Sample banking
Case report form design
Cohort/study management
Informed consent management
Data analysis
Reporting
Education
Questionnaire
Data warehousing
Sample tracking
Remote data capture
Informed consent form design
Recontact
Clinical stratification
Knowledge management
LIMS integration
Study enrollment
E-signature
Genetic data banking

Horizontal Modules
(global behaviors)

Workflow management
Study participant withdrawal
Role assignment
Sample request management
Verification of data upload

Figure 4-2. Mapping of Business Units and Their Key Functions to Bioinformatics
4.3.4 System Functions Supported by Bioinformatics
The NBN must rely on its bioinformatics system to support the overall integration and exchange of data for all of the other business units. Figure 4-2 provides a schematic view of the system functions mapped to the business functions outlined in 6. Governance and Business Models, and this section provides a brief discussion of these system functions.
4.3.4.1 Research Administration and Support
Managing the research process will be greatly facilitated by the application of bioinformatics at every level. Tasks will include:

Developing an e-consenting and reconsenting function, which will require the development of data standards for consenting information and associated electronic signature
Developing a database of research activities, researchers, and grants
Creating a Web site about the NBN and its capabilities
Administering the NBN Biospecimen Utilization Review Committee to provide equitable access to biospecimens and data
Developing approaches to report activities to the NBN Operations Center and the Board
Facilitating the identification of candidate specimens via user-friendly interfaces (e.g., users should not need to know Medical Subject Heading terms when looking for data about specific types of tumors)
Developing a mechanism where data or analyses derived from NBN specimens can be added to the system for the benefit of future research

4.3.4.2 Biospecimen and Data Acquisition
The system will be required to manage the data for acquisition, basic analysis, storage, and distribution. In particular, the system will need to be able to identify biospecimens; add/delete/modify information; and track their location, availability, size, state (fixed, frozen, or both), and many other factors to be defined as NBN standards in 3. Biospecimen and Data Collection and Distribution. To support those broad activities, specific functionality will include the following:

Informed e-consent and reconsenting management, tying the specific elements of the consent to the actual samples. Computerized informed consent forms should be defined.
Integration with LIMS for tissue storage, retrieval, transformation, and shipping/receiving tracking
Management of pathology reports and other longitudinal clinical data, including coding reports according to NBN standards, and ensuring that a true longitudinal picture of the patient can be created and studied
Support for representing and monitoring the initial staging procedure and the standards for that procedure
Tissue preparation (technology application) details and output
E-signed workflow management to ensure that appropriate procedures are followed
Developing a mechanism to communicate results to clinical providers
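
The tracking requirements above can be illustrated with a minimal Python sketch. It is not a proposed NBN schema: the class names, field names, and the idea of storing permitted uses as a set of strings are all assumptions made for illustration. The sketch only shows how a specimen record might carry its location, size, and state while staying tied to the specific consent under which it was collected.

```python
from dataclasses import dataclass
from enum import Enum


class PreservationState(Enum):
    FIXED = "fixed"
    FROZEN = "frozen"
    BOTH = "both"


@dataclass
class ConsentRecord:
    # Hypothetical structure: the specific consent elements tied to a sample.
    consent_id: str
    permitted_uses: frozenset
    withdrawn: bool = False


@dataclass
class Specimen:
    specimen_id: str
    location: str
    size_mg: float
    state: PreservationState
    consent: ConsentRecord
    available: bool = True

    def usable_for(self, use: str) -> bool:
        # A specimen is usable only if it is in stock and its consent covers the use.
        return (
            self.available
            and not self.consent.withdrawn
            and use in self.consent.permitted_uses
        )


class SpecimenRegistry:
    """In-memory stand-in for the add/delete/modify and tracking functions."""

    def __init__(self) -> None:
        self._by_id = {}

    def add(self, specimen: Specimen) -> None:
        if specimen.specimen_id in self._by_id:
            raise ValueError(f"duplicate specimen id: {specimen.specimen_id}")
        self._by_id[specimen.specimen_id] = specimen

    def get(self, specimen_id: str) -> Specimen:
        return self._by_id[specimen_id]

    def move(self, specimen_id: str, new_location: str) -> None:
        self._by_id[specimen_id].location = new_location

    def delete(self, specimen_id: str) -> None:
        del self._by_id[specimen_id]

    def find_usable(self, use: str) -> list:
        return [s for s in self._by_id.values() if s.usable_for(use)]
```

Because each specimen holds a reference to its consent record, marking that consent as withdrawn immediately excludes the specimen from `find_usable` without touching the inventory data.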

4.3.4.3 Storage and Distribution/Basic Analysis
It is anticipated that there will be regional storage and allocation of specimens, and that at these sites some preliminary analyses will be performed. At a minimum, systems will need to be created to:

Receive, store, retrieve, and ship samples, and provide confirmation of receipt
Manage inventory
Conform with International Air Transport Association regulations

4.3.4.4 Advanced Analysis
A large amount of semistructured and unstructured bioinformatics data is expected to be generated from microarray analyses and other technologies. Important tasks in the bioinformatics arena will include the following:

Implementing and enhancing existing standards for gene expression microarray data, tissue microarray data, and proteomics data—probably by participation in a larger process—will be an important part of the bioinformatics work of this project. The MIAME standards are Object Management Group approved, and several tools exist for creating XML documents for data exchange from a MIAME-compliant database.⁷
Establishing detailed specifications for how to add or link these kinds of data to the NBN system and update, delete, and archive data will be required.
Developing integration with LIMS or LIMS-like systems, via Logical Observation Identifiers Names and Codes (LOINC)®-based HL7 methods, will also be an important consideration.⁸
Developing a mechanism where data and/or analyses generated from NBN biospecimens by one user can be archived and shared with other potential users.

4.3.4.5 Patient Relations
The NBN will maintain a firewall between the donor’s identifying information and research materials. The bioinformatics system will act as an honest information broker. If patients are to be able to request that their unused specimens be withdrawn from the repositories, then the NBN must record the specimens in a way that they can be linked to the original donor. If research results have implications for care, the community of NBN donors will be contacted, but never individuals (see 2. Management of Ethical and Legal Considerations). Thus, the system will need to provide an effective way to communicate key research findings to patients (biospecimen providers) and their families (see also 5. Communications). This will involve:

Developing a way to communicate to patients (and their families, as appropriate) findings with implications for treatment and counseling (genetic, participation in trials).
Developing education and outreach materials for all the constituencies, including study participants, investigators, and the public at large.
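
One way to picture the honest broker role described in this section is the following Python sketch. It is an illustration under stated assumptions, not the NBN design: the `HonestBroker` and `ResearchRepository` classes, their method names, and the use of random hexadecimal codes are all hypothetical. The point is only that the donor-to-specimen link lives on one side of the firewall, while the research side sees opaque codes.

```python
import secrets


class HonestBroker:
    """Holds the only link between donor identity and specimen codes."""

    def __init__(self) -> None:
        # Identified side of the firewall; never exposed to researchers.
        self._codes_by_donor = {}

    def register_specimen(self, donor_id: str) -> str:
        # Issue a random code that carries no information about the donor.
        code = secrets.token_hex(8)
        self._codes_by_donor.setdefault(donor_id, set()).add(code)
        return code

    def withdraw_donor(self, donor_id: str) -> set:
        # Return the codes to pull from the repository and forget the link.
        return self._codes_by_donor.pop(donor_id, set())


class ResearchRepository:
    """Research side: stores deidentified annotations keyed by code only."""

    def __init__(self) -> None:
        self._records = {}

    def store(self, code: str, deidentified_data: dict) -> None:
        self._records[code] = deidentified_data

    def remove(self, codes: set) -> int:
        # Remove every listed code that is still held; report how many were removed.
        removed = 0
        for code in codes:
            if self._records.pop(code, None) is not None:
                removed += 1
        return removed

    def codes(self) -> set:
        return set(self._records)
```

A withdrawal request would then be handled by the broker alone: it looks up the donor, passes the resulting codes to `ResearchRepository.remove`, and the research side never learns whose specimens were withdrawn.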

4.3.4.6 Bioinformatics and Data Management
The NBN must be able to manage the information system tasks that cut across all the business units and that will be a shared (virtual) resource. The Bioinformatics and Data Management Business Unit supports a series of very general information system tasks that are discussed here.
Reporting. The design of the reports ultimately drives all system requirements for database applications. The NBN data architecture must allow the ability to:

Facilitate researcher access to information.
Enable or support the creation of longitudinal “virtual” studies to follow a cohort of patients through clinical outcome. (This is useful for early testing of a biomarker or diagnostic tool, but it requires modestly sophisticated information. Validation may require more sophisticated information because results might ultimately be tested in a clinical trial).
Permit operational and management reporting.
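
As a rough illustration of the longitudinal “virtual” study requirement, the Python sketch below groups deidentified events into per-patient timelines and then selects a cohort by an entry event and a later outcome event. The event tuple layout, the event type strings, and both function names are assumptions made for this example; a real system would draw these from the NBN data model and its standard dictionary.

```python
from collections import defaultdict
from datetime import date


def build_timelines(events):
    """Group (patient_code, date, event_type) tuples into sorted histories."""
    timelines = defaultdict(list)
    for patient_code, when, kind in events:
        timelines[patient_code].append((when, kind))
    for history in timelines.values():
        history.sort()  # chronological order, ties broken by event type
    return dict(timelines)


def virtual_cohort(timelines, entry_event, outcome_event):
    """Select patients with the entry event; report days until the outcome.

    Patients who have not (yet) reached the outcome map to None.
    """
    cohort = {}
    for patient_code, history in timelines.items():
        entry = next((d for d, k in history if k == entry_event), None)
        if entry is None:
            continue  # patient never entered the cohort
        outcome = next(
            (d for d, k in history if k == outcome_event and d >= entry), None
        )
        cohort[patient_code] = None if outcome is None else (outcome - entry).days
    return cohort
```

For example, calling `virtual_cohort(build_timelines(events), "specimen_collected", "recurrence")` would follow every patient with a collected specimen through to a recorded recurrence, if any.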

Database and Data Model. It is likely that there will be a central core data facility, with a standard dictionary. SNOMED is a widely accepted standard, which makes it a plausible choice for the NBN. The NBN may have some dictionary requirements that go beyond SNOMED, and it is recommended that, whatever auxiliary functions are developed, the NBN dictionary cross-walk with other dictionaries. Similarly, the data system will need associated business logic and development and reporting tools. Parts of the system will be distributed. The distribution structure will evolve as technologies change and clinical data become more available (and more massive).
Candidates for some local stores include selected clinical (patient) data, trials data (research on these and related specimens, including data and bibliographic information), and analytic data (histology and histochemical, microarray, and other genetic and chemical analyses). A detailed data model will need to be developed that includes at a minimum existing clinical, public health, insurer/Centers for Medicare and Medicaid Services, pathology, histology (possibly image), research, and bibliographic information. The model will have to allow patients to be followed longitudinally and linked to family members. It may also be desirable to collect data related to specific ethnic groups. It remains to be determined whether animal data related to the area of research should also be connected. It will be important to determine a hierarchy of information that would be required in a first version versus what is desired in what will probably be a series of later, more sophisticated versions.
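The dictionary cross-walk recommended above can be sketched in Python as a two-way mapping between external vocabularies and a core NBN concept identifier. Everything named here is hypothetical: the `DictionaryCrosswalk` class, the vocabulary labels, and especially the codes, which are placeholders and not real SNOMED or site codes. The sketch assumes, for simplicity, that each concept has at most one code per vocabulary.

```python
from typing import Dict, Optional, Tuple


class DictionaryCrosswalk:
    """Maps codes from several vocabularies onto core concept identifiers."""

    def __init__(self) -> None:
        # (vocabulary, code) -> core concept id
        self._to_core: Dict[Tuple[str, str], str] = {}
        # core concept id -> {vocabulary: code}
        self._from_core: Dict[str, Dict[str, str]] = {}

    def link(self, core_id: str, vocabulary: str, code: str) -> None:
        existing = self._to_core.get((vocabulary, code))
        if existing is not None and existing != core_id:
            # Refuse to map one external code to two different concepts.
            raise ValueError(f"{vocabulary}:{code} already maps to {existing}")
        self._to_core[(vocabulary, code)] = core_id
        # Simplifying assumption: one code per vocabulary per concept;
        # a later link for the same pair replaces the earlier one.
        self._from_core.setdefault(core_id, {})[vocabulary] = code

    def to_core(self, vocabulary: str, code: str) -> Optional[str]:
        return self._to_core.get((vocabulary, code))

    def translate(self, from_vocab: str, code: str, to_vocab: str) -> Optional[str]:
        # Translate by going through the core concept; None if no path exists.
        core_id = self.to_core(from_vocab, code)
        if core_id is None:
            return None
        return self._from_core[core_id].get(to_vocab)
```

Routing every translation through the core identifier means that adding a new vocabulary requires only links to the core dictionary, not a separate mapping to every other vocabulary already in use.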
Security/Controlled Access. The bioinformatics system should have strong authentication that supports the confidentiality and access rules determined by the governance body (which may in turn be advised by confidentiality and legal experts). Differential levels of data identification will be maintained in parallel, offering varying levels of access depending on the type of user (e.g., system administrator, NBN staff, researchers, patients).
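A minimal Python sketch of differential access levels follows. The actual rules would be set by the governance body, so the field names, role names, and the clearance assigned to each role here are purely hypothetical placeholders. The sketch shows one possible mechanism: each field is tagged with a sensitivity level, and a view of a record is filtered to what the requesting role is cleared to see.

```python
from enum import IntEnum


class Sensitivity(IntEnum):
    DEIDENTIFIED = 1
    LIMITED = 2
    IDENTIFIED = 3


# Hypothetical policy tables; real values would come from NBN governance.
FIELD_SENSITIVITY = {
    "specimen_code": Sensitivity.DEIDENTIFIED,
    "diagnosis": Sensitivity.DEIDENTIFIED,
    "collection_site": Sensitivity.LIMITED,
    "donor_name": Sensitivity.IDENTIFIED,
}

ROLE_CLEARANCE = {
    "researcher": Sensitivity.DEIDENTIFIED,
    "nbn_staff": Sensitivity.LIMITED,
    "honest_broker": Sensitivity.IDENTIFIED,
}


def view_for(record: dict, role: str) -> dict:
    """Return only the fields the given role is cleared to see."""
    if role not in ROLE_CLEARANCE:
        raise PermissionError(f"unknown role: {role}")
    clearance = ROLE_CLEARANCE[role]
    return {
        name: value
        for name, value in record.items()
        # Fields missing from the policy table are treated as fully
        # identified, so unclassified data is hidden by default.
        if FIELD_SENSITIVITY.get(name, Sensitivity.IDENTIFIED) <= clearance
    }
```

Treating unclassified fields as identified means that a newly added data element stays hidden from lower-clearance roles until someone explicitly classifies it.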
User Support. User support will be a priority. There will be many types of users including researchers, the other business units, NBN Operations staff, and informatics staff at sites. User support will include helping with implementation and establishing minimal configurations (including needed communications infrastructures), updates, and user help. User support may also include a Web site, automated voice response, manuals, a help desk, and help with remote configurations. The system will only be as good as the support. The NBN will need to consider IT staffing at selected sites at a level commensurate with that of some private sector firms.
4.3.5 Quality Assurance
Demonstrating that quality measures have been set and attained (or that remedial actions have been taken) will be one of the most challenging aspects of both the program management and the bioinformatics tasks. Developing quality measurement tasks will include:

Working with the QA Unit to define measurable objectives and pertinent regulations and procedures that the NBN must meet in supporting the bioinformatics infrastructure
Developing an executive information system that reports on how each of the business units is doing in meeting their scientific, financial, and management objectives, and ensuring that the system is adding value
Assuring regulatory compliance

4.4 Summary of Key Findings and Recommendations
The NBN will require an intelligent information system to track biospecimens, key associated clinical and biological data, and the information needed for research administration. The development of the system will be a major project that must be done simultaneously with the development of the NBN as a whole, rather than being a standalone or later add-on. It will be essential to identify the appropriate balance of “buy” versus “build.” Key features of the NBN information system are as follows:

While recommendations about the general architecture and common data elements require broad input, to maximize efficiency, a central architect should be designated to build and manage the bioinformatics infrastructure. It is proposed that this central architect have authority over project personnel, budget, design, and architecture, and also be accountable to the governing principles of the NBN.
The NBN bioinformatics system should be standards-based (e.g., SNOMED, HL7, or MIAME for data; Internet for communications) to enable data and information exchange among system components and the researchers who use them. In designing the bioinformatics system, a standards-based approach will allow flexibility to employ individualized approaches, while reducing the difficulty involved in developing a comprehensive system that links diverse components of the NBN. The NBN should not have specific requirements by the operational units for databases or hardware systems.
Assuring the availability of a minimum dataset and location should be an important element of the NBN system design.
Modularized components (“plug and play”) will be used (and developed) to the extent possible, allowing the NBN data architecture to build upon best practices from other repositories.
Critical benchmarks of success for the NBN data system include ease of data entry and retrieval, highly responsive user support, and commitment to the NBN Quality Assurance process.
While Bioinformatics and Data Management will represent one of the business units of the NBN, it will also traverse and integrate all the business units and will be a shared (virtual) resource underpinning all of the NBN operations over time.
Development and implementation will start with a centralized database, and then move to a more decentralized yet interconnected model as the system matures, capabilities are strengthened, and requirements are clarified.


Footnotes

¹ Maurer S.M., Firestone R.B., and Scriver C.R. (2000). Science’s neglected legacy: Large, sophisticated databases cannot be left to chance and improvisation. Nature, Vol. 405 (May 11): 117-120.

² See 6. Governance and Business Models for discussion of the proposed management structure for the NBN.

³ Brooks F.P., Jr. (1975). The Mythical Man-Month. New York: Addison-Wesley.

⁴ Department of Health and Human Services. (2003). HHS Launches New Efforts to Promote Paperless Health Care System. Press Release (July 1); or www.hhs.gov/news/press/2003pres/20030701.html

⁵ “Level Seven” refers to the highest level of the International Standards Organization’s communications model for OSI—the application level. The application level addresses the definition of the data to be exchanged, the timing of the interchange, and the communication of certain errors to the application. The seventh level supports such functions as security checks, participant identification, availability checks, exchange mechanism negotiations, and most importantly, data exchange structuring (from www.hl7.org).

⁶ Additional information on grid architecture may be found at the following Web sites: www.grids-center.org/grids/grids_primer.asp; https://gsd.msfc.nasa.gov/FD40/papers/SpaceOps2002/spaceops02_p_t2_83.pdf.

⁷ Tissue Microarray Standards have been published recently. See Berman J.J., Edgerton M.E., and Friedman B.A. (2003). The tissue microarray data exchange specification: A community-based, open source tool for sharing tissue microarray data. BMC Med. Inform. Decis. Mak. 3:5.

⁸ The LOINC database provides a universal code system for reporting laboratory and other clinical observations. Its purpose is to identify observations in electronic messages such as HL7 observation messages, so that when hospitals, health maintenance organizations, pharmaceutical manufacturers, researchers, and public health departments receive such messages from multiple sources, they can automatically file the results in the right slots of their medical records, research, and/or public health systems. For each observation, the database includes a code (of which 25,000 are laboratory test observations), a long formal name, a “short” 30-character name, and synonyms. The database comes with a mapping program called Regenstrief LOINC Mapping Assistant (RELMA™) to assist the mapping of local test codes to LOINC codes and to facilitate browsing of the LOINC results. Both LOINC and RELMA are available at no cost from http://www.regenstrief.org/loinc/. The LOINC medical database carries records for 30,000 different observations. LOINC codes are being used by large reference laboratories and federal agencies, e.g., the CDC and the Department of Veterans Affairs, and are part of the HIPAA attachment proposal. McDonald C.J., Huff S.M., Suico J.G., et al. (2003). LOINC, A Universal Standard for Identifying Laboratory Observations: A 5-Year Update. Clinical Chemistry, Vol. 49:624-633.


U.S. Department of Health and Human Services
National Institutes of Health
National Cancer Institute