Bioinformatics and Data Management
The National Biospecimen Network (NBN) will require an information system to track
biospecimens, support the collection and dissemination of key clinical and biological data
associated with them, provide analytical capabilities for genomic and proteomic based research,
and manage information needed for NBN administration. This module outlines the requirements
and architectural issues for the NBN information system, and proposes approaches to
establishing such a system using a combination of new development and adaptation of existing
systems.
4.1 Introduction
The field of biomedical informatics encompasses the full range of technologies needed for
clinical, biomedical, and genetics research, including applications for computational biology
such as the study of probabilistic methods for gene analysis or techniques for studying protein
folding. This module focuses on the particular requirements of the NBN for a multidimensional
system that can capture and manage data serving a broad range of applications that begins at the
bedside and progresses through the genetics laboratory.
The NBN’s utility is maximized with a well-designed and powerful information system, and
biospecimens will prove more useful for research when accompanied by annotation of relevant
clinical data. By design, not only the quality but also the accessibility of the biospecimens and
associated data must be considered to be of the utmost importance, as acquiring data without a
way to use or share the information is pointless.1 The Tissue Access Working Group and the NBN
Design Team recognized that developing the NBN information system as an open-access tool
with a scalable and extensible architecture is at the core of its efficient functioning and
serves to buttress the NBN’s purpose of supporting the integration and exchange of information
for cancer research.
Identifying and implementing the best information technology (IT) architecture for the NBN will
be a critical success factor and must be a priority in the NBN planning and budgeting process.
This module reviews the architectural requirements for an informatics superstructure that
supports searchable Internet-based approaches to an open-access but data-secure system; the
collection and dissemination of biospecimens and data; reliable information exchange between
NBN components and existing clinical information systems, subject to appropriate standards;
data mining tools to facilitate scientific discovery; and management of the business functions of
the NBN. This system must include an optimal design for the protection of patient confidentiality
that allows patients reasonable access to aggregated research results.
4.2 Bioinformatics Support of Existing Biospecimen Resources
There are a number of genomics-based resources that depend on robust IT systems. This section
describes a few of the better known examples that can help guide the development of the NBN
infrastructure or that have characteristics and/or the capability to cooperate in a linked system
should the effort to build the NBN be focused on integrating existing (unconnected) systems.
These are listed in alphabetical order, with more extensive discussions for some examples in the
appendices.
Ardais Corporation
Ardais Corporation is a privately held clinical genomics company whose goal is to accelerate
biomedical research by applying actual human disease, in the form of human tissue samples, as
the discovery model in pharmaceutical research (see Appendix Q). The Ardais Biomaterials and
Information for Genomic Research (BIGR™) System encompasses a unique repository, called
the BIGR™ Library, with more than 170,000 research-quality tissue samples representing a
broad diversity of disease. The samples are collected through the National Clinical Genomics
Initiative, a strategic collaboration between Ardais and four leading U.S. medical centers.
Ardais has estimated that 45 percent of its staff are working in bioinformatics. As part of their
comprehensive IT/bioinformatics system, Ardais identified a number of specific needs including:
- Structured collection of and quality control of clinical data
- Clinical data ontology specific to researcher needs
- Sample tracking
- Case linking
- Web-based access
- Experimental design
- Browsing
- Remote deployment
- Multisite coordination
To satisfy IT and bioinformatics requirements, Ardais created the BIGR™ System as a discovery
platform for application to drug discovery and development. The system includes a centralized,
shared clinical genomics repository that encompasses tissue samples, molecular derivatives, and
associated clinical information accessed by an array of bioinformatics tools. The BIGR™
Library is a scalable, Web-deployable, Java-based system architecture that is managed from a
central database, with strict controls based on individual user privileges. The Library is multiinstitutional
and will allow Web-based access to the repository on a researcher’s desktop
computer. Clinical and demographic data are gathered and reviewed to ensure consistency and
comparability, enabling researchers to navigate through a highly structured, consistent, and
comparable library of materials and associated data. It is possible to search samples using patient
attributes, diagnosis, tissue type, appearance, sample composition, pair-ability of diseased and
matched normal from the same patient, in addition to viewing supporting data, including digital
images.
First Genetic Trust
First Genetic Trust (FGT) is a business that develops IT solutions to address data, privacy,
confidentiality, and ethical challenges in genomics and proteomics (see Appendix R). It is
focused on supporting genetic research as a trusted third party by providing a highly secure
Web-based IT infrastructure for genetic banking. This infrastructure is intended as the cornerstone
of an integrated research solution for patient recruitment and informed consent and for medical
and genetic data acquisition, transfer, storage, and analysis. The patient, physician, investigators, administrators,
and laboratory personnel can dynamically interface via the Web for patient education,
information regarding the scope of the proposed research, and the consent process. The physician
has similar access to aggregation of phenotypical clinical data and to obtaining clinical samples.
To address privacy and confidentiality protection, FGT has developed the enTRUST Genetic
Banking System, a highly secure, distributed genetic banking system with a Web-based
architecture. This system uses “Virtual Vault,” Hewlett Packard’s military-grade operating system,
which leverages standard security technology for encryption and intrusion detection and exceeds
both Health Insurance Portability and Accountability Act (HIPAA) and European Directive
requirements for data collection, consent, accuracy, and security. A patient is assigned an
encrypted electronic identifier that serves as a virtual private identity that is stored in one dataset;
phenotypic or clinical information is stored in a second dataset; and genotypic data are stored in
a third dataset. The three datasets are linked through the patient’s virtual private identity.
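The three-dataset separation described above can be illustrated with a minimal sketch. It is not FGT's implementation: the store names, the secret key, and the use of a keyed hash (HMAC-SHA-256) in place of the unspecified encryption scheme are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical secret held only by the trusted third party (assumption).
SECRET_KEY = b"demo-key-not-for-production"


def virtual_private_identity(patient_identifier: str) -> str:
    """Derive a keyed pseudonym from a real-world identifier.

    A keyed hash stands in here for the "encrypted electronic identifier";
    the actual scheme is not described in the text.
    """
    digest = hmac.new(SECRET_KEY, patient_identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()


# Three separate stores; only the first holds identifying information.
identity_store = {}   # virtual private identity -> real identifier
phenotype_store = {}  # virtual private identity -> clinical data
genotype_store = {}   # virtual private identity -> genotypic data


def enroll(patient_identifier: str, clinical: dict, genotype: dict) -> str:
    """Write one patient into all three datasets under the same pseudonym."""
    vpi = virtual_private_identity(patient_identifier)
    identity_store[vpi] = patient_identifier
    phenotype_store[vpi] = clinical
    genotype_store[vpi] = genotype
    return vpi


def research_view(vpi: str) -> dict:
    """Join phenotype and genotype data without touching the identity store."""
    return {
        "vpi": vpi,
        "clinical": phenotype_store[vpi],
        "genotype": genotype_store[vpi],
    }
```

In this sketch a researcher works only with the output of research_view; relinking to a named patient would require access to the identity store, which can be held under separate controls.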
FGT aggregates data via a Web-based architecture that interfaces with existing datasets. Data are
accumulated, cleaned, aggregated, and stored in a repository. A common architecture in the
system provides for distributed, centralized sample banking. The FGT research management
tools are all Web-based. They include consent and reconsent modules (including information
feedback to the patient, such as genetic counseling), clinical and genomic data capture, the
ability to configure specific studies, sample logistics and banking, remote clinical data capture,
study contract storage, and bioinformatics. Data representation standards support data exchange
and mining, including aggregation of complex studies.
Since FGT is a trusted third-party banking technology provider, data access rights and policies
are determined by the sponsor of the banking initiative. Public and “managed data access”
models can both be accommodated. Histopathological image data are not currently available, but
it is technologically feasible to provide them. The design protocol can be written to automatically
aggregate clinical updates and secular outcomes.
NCI Center for Bioinformatics
The NCI Center for Bioinformatics (NCICB) has developed a comprehensive enterprise
architecture for biomedical data management and several analytical tools to facilitate
translational research, much of which could be leveraged for use in the NBN. Several of these
components are described here. All are free, open-source software and are in production today
(see ncicb.nci.nih.gov).
Bioinformatics Core Infrastructure. The NCICB infrastructure backbone is called caCORE; it
consists of technology tools and services as well as an overall data management architecture (see
ncicb.nci.nih.gov/core). It sets up common data elements and structured interactions between
these elements through object models. caCORE has a common ontologic representation
environment. It begins with a comprehensive cross-mapping of the various discipline
vocabularies, first classified into a metathesaurus and then redistributed into a common
framework, thus creating a National Cancer Institute (NCI) cancer thesaurus. From this
thesaurus, 3,000-plus trial-structured data reporting elements have been extracted. Because the
trial elements are based on a standard metadata repository, they can be defined, shared, and
manipulated by the various communities.
This “multitier” system architecture is derived from modern software engineering designs and is
implemented using the Unified Modeling Language (UML) and Java 2 Enterprise Edition. The
features of the caCORE that make it a candidate for adoption by the NBN include:
- A multitier design that allows for the addition of multiple independent database sources
that do not have to be physically colocated. This would enable a flexible deployment
topology across multiple NBN sites.
- A modeling paradigm designed to be understood by nonprogrammers, which uses cancer
bioinformatics infrastructure “objects” (CaBIO).
- Well-defined, documented application programming interfaces (APIs), which allow
programming teams that did not develop the original architecture to use the full power of
the system to write their own applications.
- Built-in support for many biomedical data types, including the human genome sequences
and features, single nucleotide polymorphisms, gene expression patterns and sequences,
therapeutic agents, clinical trial protocols, and many others.
- The Cancer Data Standards Repository, part of caCORE, which is a data element
(metadata) database and tool suite. Such data elements are constructed from controlled
vocabularies and thus provide for semantic consistency over time and across collection
sites. The NBN could create and manage the data elements it needs for data collection
and sharing using this resource.
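To make the idea of data elements built from controlled vocabularies concrete, the following is a minimal sketch. It does not use the caCORE or Cancer Data Standards Repository APIs; the DataElement class, the element name, and the permissible values are hypothetical and chosen only to show how a shared definition can enforce consistency across collection sites.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """A hypothetical common data element with a closed value list."""
    name: str
    definition: str
    permissible_values: frozenset

    def validate(self, value) -> bool:
        # A value is acceptable only if it comes from the controlled list.
        return value in self.permissible_values


# Illustrative element; the values are assumptions, not an NBN standard.
SPECIMEN_STATE = DataElement(
    name="specimen_state",
    definition="Preservation state of a stored biospecimen",
    permissible_values=frozenset({"fixed", "frozen", "fixed and frozen"}),
)


def check_record(record: dict, elements: list) -> list:
    """Return (field, value) pairs that are missing or outside the vocabulary."""
    problems = []
    for element in elements:
        value = record.get(element.name)
        if value is None or not element.validate(value):
            problems.append((element.name, value))
    return problems
```

Because every site validates against the same element definition, a record accepted at one site means the same thing at another, which is the semantic consistency the list above refers to.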
Microarray Data Management and Analysis. The NCICB’s caArray project consists of a
minimum information about a microarray experiment (MIAME)-compliant microarray (caArray)
database and tools for microarray data analysis and visualization. Originally built to support the
NCI Director’s challenge initiative, it is currently used in several NCI research programs. The
goals of the project are to make microarray data publicly available and to develop and bring
together open-source tools to analyze and visualize these data. The caArray database connects to
CaBIO, permitting access to a variety of genomic, cancer model, and clinical trials information.
The primary interface to caArray is the Gene Expression Data Portal, a facility for uploading and
retrieving microarray experiment data as well as performing some types of analysis
(gedp.nci.nih.gov). The project has recently released a pathway visualization tool, which allows
researchers to view the expression levels of genes on the array via visual pathway diagrams.
Two new applications will be released in September 2003: a gene expression data analysis
workbench and a genomic viewer. The data analysis workbench is a desktop tool that will
include a richer collection of analysis and visualization functions, including filters and
normalizers, clustering tools, color mosaic images, dendrograms, and pathways. WebCGH is a
Web-based tool for the analysis and visualization of Comparative Genomic Hybridization (CGH)
data. The application will enable users to create whole genome plots using CGH data stored in
the caArray database and focus on a chromosome or chromosomal region of interest.
Clinical Trials. The NCICB has constructed a clinical trial protocol management system for the
Specialized Programs of Research Excellence (SPORE). The system supports administrative
entry and tracking of trial protocols being reviewed and launched by the SPORE program. The
NBN could conceivably leverage this system to manage research proposals for biospecimens.
Image and Other Data. Various information-capture modules are placed on top of this
infrastructure (e.g., caEXPRESS, caIMAGE, caClinicaltrials, caModelsDB, and caLIMS). For
example, caLIMS, the laboratory information management system (LIMS), describes how the
data were collected. Image and pathology capture are managed by caIMAGE. A large collection
of objects that describe clinical trials and the extant cancer model are provided by caClinicaltrials
and caModelsDB. A prototype integration application to facilitate cross-disciplinary reasoning is
the Cancer Molecular Analysis Project. It allows users to move from molecular profiles through
clinical trials and is applicable to different fields. NCICB has constructed an image file
management and delivery system called caIMAGE (caimage.nci.nih.gov) that could potentially
be leveraged to support the NBN’s histopathology image data needs. The NCICB system does
not include a tissue management infrastructure for tissue inventory and control, as this is not part
of its charge. Other organizations, however, may be able to provide off-the-shelf infrastructure
for this purpose (e.g., Cooperative Human Tissue Network, Cancer and Leukemia Group B,
Daedalus Software).
Security. The vast majority of the data in NCICB are already publicly accessible. For data
requiring more secure access, particular authorization is required. The system uses Web-browser
interfaces and an infrastructure distributed through a variety of formats that allows users to write
external applications to “reach through” to the data. In essence, it is a repository that can be
partially accessed with off-the-shelf technology.
Shared Pathology Informatics Network
In 2001, NCI awarded two cooperative agreements to develop a Shared Pathology Informatics
Network (SPIN), defined as a model Web-based system to access data related to archived human
specimens at multiple institutions. Two groups were funded at $13.5 million over 5 years to
work collaboratively on SPIN. Data will be accessed from existing pathology and other medical
databases. The ability to automatically access information from medical databases is the first step
toward the long-term goal of developing informatics systems to support NCI’s efforts to improve
researchers’ access to human specimens and clinical data. The systems to be developed by the
network will be able to respond automatically to authorized queries by identifying, obtaining,
collating, and returning data for those cases that meet defined search criteria. Patients’ names and
other identifying information are to be encrypted or otherwise modified to protect patient privacy
and confidentiality and to comply with applicable confidentiality regulations. In addition to
improving access to clinical data, the system is expected to provide researchers with the means to
quickly identify and determine the availability of archived specimens and data that meet their
research needs.
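The query pattern described above (each institution answers an authorized query and returns only de-identified matches) might be sketched as follows. This is not the SPIN protocol: the field names are hypothetical, and identifying fields are simply dropped here, whereas the text allows for encryption or other modification.

```python
# Hypothetical set of fields treated as identifying (assumption).
IDENTIFYING_FIELDS = {"name", "medical_record_number"}


def deidentify(record: dict) -> dict:
    """Return a copy of the record without identifying fields."""
    return {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}


def matches(record: dict, criteria: dict) -> bool:
    """True when every search criterion equals the record's value."""
    return all(record.get(field) == wanted for field, wanted in criteria.items())


def query_network(sites: dict, criteria: dict) -> list:
    """Collect de-identified matching cases from every participating site."""
    results = []
    for site_name, records in sites.items():
        for record in records:
            if matches(record, criteria):
                hit = deidentify(record)
                hit["site"] = site_name  # tells the researcher where to ask
                results.append(hit)
    return results
```

The returned list describes which specimens exist and where, which corresponds to the information-only scope of the current program.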
Both projects are well into their second year, and notable progress has been made.
Approximately 20 hospitals have now established peer-to-peer data sharing arrangements, and
information on over a million pathology specimens is now accessible online. The current
program will not actually make the specimens available to researchers, only information about
them. SPIN is designed to provide a proof of principle. Actual specimen transfer would be the
objective for a follow-on 5-year project under discussion.
The United Kingdom National Cancer Tissue Resource
One of the primary goals of the United Kingdom National Cancer Tissue Resource (NCTR) is to
develop an information grid that can automatically and seamlessly incorporate all relevant data
from each new patient into the appropriate database, input that patient’s data into existing
predictive models, and transmit that information to the clinician in the clinical environment (see
Appendix O). As a first step, the NCTR proposed that the University of Cambridge, with the
University of Leeds, the University of Glasgow, and the Peterborough Hospital Research Tissue
Bank Network, lead the development of a pilot information system to include the development of
an “informatics hub,” which will build, maintain, and integrate heterogeneous and distributed
databases (see Figure 4-1). Ultimately, these databases will include the specimen bank data,
clinical and pathology data, clinical trials and outcome data, and research results. The base
data would comprise a minimum dataset for the NCTR.
This hub will connect to present and future NHS information systems, including other clinical
trials networks. There will also be a need to develop data mining tools that will synthesize these
heterogeneous data sources into diagnostic and prognostic information on which clinicians can
base decisions. The NCTR is working closely with commercial software developers to create
research informatics platforms. Grid technology is expected to underlie the architecture, and it is
recognized that there will need to be a security plan.
Figure 4-1. Adapted from the United Kingdom National Cancer Tissue Resource Bioinformatics Schema
4.3 NBN Bioinformatics System Requirements and Recommendations
There is an understandable tendency to start information systems from scratch; however, during
the past 10 years the healthcare IT industry and the clinical and research pathology industries
have made substantial investments in the development of sophisticated clinical and pathology
information systems. Moreover, the Government and other organizations have developed
information systems that may meet many of the requirements of the NBN information system, in
whole or in part. Thus, it will be essential to identify the appropriate balance of “buy” versus
“build.” Truly critical system requirements should not be forfeited for shrink-wrapped solutions,
and modifying commercial systems and paying ongoing licensing fees may dramatically increase
the long-term costs of ownership. On the other hand, the NBN may be able to negotiate
favorably with IT vendors if it can demonstrate sufficient volume of usage.
The buy/build/modify decision cannot be made until the architecture and requirements are
specified. On the other hand, the requirements analysis needs to be informed by a thorough and
up-to-date survey of existing systems, both commercial and government held, because
requirements are not developed in a vacuum and features and functions that are deemed to be
“absolutely required” or “desired” might be influenced by what is readily available (and the
associated price). For example, 10 years ago “online instant access at no communications cost to
the user” might have been deemed “expensive but desirable”; the Internet has made this a “no-cost
must-have.” Automated teller machines are an example of how data exchange standards can
revolutionize an industry. Similarly, consider how microarray technology has changed the
requirements analysis from what it might have been a few years ago.
The size and complexity of the NBN information system will undoubtedly be impressive. There
are clearly a number of challenges surrounding the building of big systems. The failure rate and
cost of information systems are both exponentially related to size and complexity. Establishing
an optimal information system management model is critical to the success of the NBN. The
NBN Operations Center and the Board of Governors must understand that the informatics
infrastructure itself will need to be adequately staffed and funded and ought to be supported as an
informatics research enterprise, which is likely to promote quality improvements and attract top
talent and the associated research support that often accompanies it.2 It is clearly of
paramount importance that a fully developed design be accompanied by realistic budgets,
scalable implementation, and vigorous management of the social and political landscape via
attendant authority.
4.3.1 Management of the NBN Information System
It was the sense of the Design Team that while general features and common data elements
should incorporate input from diverse communities, information architecture design and
implementation by committee does not work well for large software projects. It was
recommended that a clearly empowered management model serve as the foundation of the NBN
information system. Thus, the day-to-day management of the architecture design and
development process must be guided by a strong and legitimized hand.3 In essence, it is proposed
that one individual have authority over project personnel, budget, design, and architecture and
that this person be accountable to the governing principles of the NBN. It is important to
establish authority early.
Although the decision-making authority for management and for design must be held by the
system manager and architect, they will need to work closely on requirements analysis with:
- Bioinformatics counterparts who have developed biospecimen bank systems and who
may be linked to this system directly or more loosely coupled via middleware in the spirit
of grid computing
- Users (principally scientists and research administrators)
- Pathologists and laboratory scientists
- Biospecimen donors and users
- Pure data users
- Clinical information system experts
- Patients
- Bioinformatics experts with knowledge of similar systems
4.3.2 Design of the NBN Information System
It is recommended that the technical design of the NBN information system be created by the
overall system manager’s appointed architect. This section provides guidance on starting
principles and will outline areas for which requirements will need to be developed. A detailed
statement of the design is beyond the scope of this document.
4.3.2.1 Data Standards
The development of the NBN data standards could be challenging, primarily because the data
standards and vocabularies that different user communities use are so heterogeneous (both within
and between the groups), because the data structures and required reports are so inherently
complex, and because of the need for longitudinal data. Any resulting models will have to take
into account Health Level Seven (HL7) and UMLS and incorporate what is current at NCI. It is
possible that some new data standards will need to be developed (hopefully as a variation of
current ones). To illustrate, consider the wide variety of data representation required to encode
such disparate data sources as encoded clinical pathology reports, text histopathology reports,
demographic and clinical information, insurance records, clinical trials protocols, and microarray
research results.
4.3.2.2 Minimum Dataset and Data Location
Assuring the availability of a minimal dataset should be an important element of the NBN
system design. In addition, because modern computer communications systems make it possible
to access data stored in many locations from many other (different) locations, it is unnecessary to
build a single comprehensive database that would contain all NBN data. Nevertheless,
connecting disparate information systems will likely be an evolutionary process, greatly
facilitated by the more widespread adoption of data standards, such as the Systematized
Nomenclature of Medicine (SNOMED), which has recently been adopted by the Department of
Health and Human Services (HHS) as a standard.4 A common lexicon of terms will not only
create a new standard in health care and biomedical research but also enable researchers to mine
databases with greater efficiency and added confidence that all available relevant information has
been detected.
4.3.2.3 Architecture
It is proposed that the architecture be a standards-based distributed system, with a central
database at first that will evolve over time to be highly distributed, with the central database
being limited to pointers and storage of any relatively stable, highly used administrative data. It
was noted that sites that may be major data contributors (e.g., community hospitals) may be least
equipped to handle a distributed data architecture.
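The end state described above, a central database limited to pointers with the data held at distributed sites, can be sketched in a few lines. The class and method names are hypothetical, and in-memory dictionaries stand in for real databases and network calls.

```python
class CentralIndex:
    """Central store that keeps only pointers: specimen ID -> site name."""

    def __init__(self):
        self._pointers = {}

    def register(self, specimen_id: str, site_name: str) -> None:
        self._pointers[specimen_id] = site_name

    def locate(self, specimen_id: str) -> str:
        return self._pointers[specimen_id]


class Site:
    """A contributing site that keeps its own records locally."""

    def __init__(self, name: str):
        self.name = name
        self._records = {}

    def store(self, specimen_id: str, record: dict, index: CentralIndex) -> None:
        # The full record stays local; only a pointer goes to the center.
        self._records[specimen_id] = record
        index.register(specimen_id, self.name)

    def fetch(self, specimen_id: str) -> dict:
        return self._records[specimen_id]


def resolve(index: CentralIndex, sites: dict, specimen_id: str) -> dict:
    """Follow the central pointer to the owning site and fetch the record."""
    return sites[index.locate(specimen_id)].fetch(specimen_id)
```

A site that cannot yet host its own data store could, under this arrangement, be represented by a Site object hosted centrally on its behalf, which is one way to accommodate less well-equipped contributors during the early, more centralized phase.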
The NBN architecture would specify standards at Open Systems Interconnection (OSI) Layer 7
(the application layer) only and rely on HL7 standards wherever possible, developing new ones as needed.5
Thus, the NBN information system would be built in a way that will not specify how a local site
or Business Unit stores its data, what operating systems, hardware, or software it uses, or even
what type of internal communications modalities it uses. Because it is certain that there will be
data links to heterogeneous and distributed databases of ancillary clinical and pathological
information, it would be wasteful over the long term to create duplicate local data stores and
other capabilities except for selected, highly used data. On the other hand, for the first year or
two, until the system is well established, it is likely to be relatively centralized, and there will
inevitably be some duplication.
However, it is strongly recommended that the NBN information system require that the data
arrive in a certain format, and be packaged in a precise way, via the Internet. The NBN systems
will know how to “open” the package and interpret the contents. Practically, this means that one
of the first orders of business for the Bioinformatics Unit within the Operations Center will be
the creation of a data model.
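As an illustration of what requiring data to arrive "in a certain format" and "packaged in a precise way" could mean in practice, the sketch below wraps a payload in a versioned envelope with a checksum and shows the receiving side opening it. The envelope fields, the version string, and the choice of JSON are assumptions; the text does not specify a package format.

```python
import hashlib
import json

FORMAT_VERSION = "0.1"  # hypothetical version label


def pack(site_id: str, payload: dict) -> str:
    """Wrap a payload in an envelope carrying a version and a checksum."""
    body = json.dumps(payload, sort_keys=True)
    envelope = {
        "format_version": FORMAT_VERSION,
        "site_id": site_id,
        "sha256": hashlib.sha256(body.encode("utf-8")).hexdigest(),
        "body": body,
    }
    return json.dumps(envelope)


def unpack(package: str) -> dict:
    """Open a package, rejecting unknown versions and corrupted bodies."""
    envelope = json.loads(package)
    if envelope.get("format_version") != FORMAT_VERSION:
        raise ValueError("unsupported format version")
    body = envelope["body"]
    if hashlib.sha256(body.encode("utf-8")).hexdigest() != envelope["sha256"]:
        raise ValueError("checksum mismatch")
    return json.loads(body)
```

The local site remains free to store its data however it likes; only the envelope and the data model behind the payload need to be agreed on.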
There may be times when the NBN will make recommendations with respect to software or
hardware, based on experience. For example, it may learn that a certain software toolkit provides
reliable data translations from Hospital Information System X to NBN standards. In addition, the
NBN may develop specialized software (or even hardware) for use at local sites or within the
Business Units, perhaps on a fee basis. Use of NBN-recommended or developed software, to the
extent that it provides system-wide efficiencies, may be encouraged by incentives (e.g., a
guarantee that the data will be formatted correctly or NBN will fix it; or perhaps a lower charge
for another service). It is most likely that the NBN will develop software for local use if the NBN
is asking for specialized data (e.g., on a rare disease) or for data for which standards do not yet
exist (e.g., microarray data).
All communications will be Internet protocol-based, possibly with grid storage and job
distribution capabilities strongly enabled; only later is it likely that the NBN will take advantage
of parallel processing for advanced computation. The importance of a multitiered architecture
has been emphasized by Working Group members.6
The system should probably be designed as a loosely coupled and extensible set of independent
modules, each one associated with a specific set of functions. The design should fully support
“Plug & Play” operations, where new modules (as well as new versions of existing modules) can
be immediately deployed, as long as they interact with (and possibly extend) well-defined
interfaces. Identifying these modules, and assuring their true functionality, will be one of the
first jobs of the system architect. For example, all modules should be designed with robust and
flexible key structures that allow them to be used in unanticipated fashions.
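One way to picture the "Plug & Play" requirement is a small registry in which every module implements the same interface and a new version can replace an old one without changes to its callers. The interface, the registry, and the two sample-tracking modules below are hypothetical and serve only to illustrate the design principle.

```python
from abc import ABC, abstractmethod


class Module(ABC):
    """Hypothetical interface that every pluggable module must implement."""
    name: str
    version: str

    @abstractmethod
    def handle(self, request: dict) -> dict:
        """Process a request and return a response."""


class ModuleRegistry:
    """Keeps one active module per name; plugging a new version replaces it."""

    def __init__(self):
        self._modules = {}

    def plug(self, module: Module) -> None:
        if not isinstance(module, Module):
            raise TypeError("module must implement the Module interface")
        self._modules[module.name] = module

    def dispatch(self, name: str, request: dict) -> dict:
        return self._modules[name].handle(request)


class SampleTrackingV1(Module):
    name = "sample_tracking"
    version = "1.0"

    def handle(self, request: dict) -> dict:
        return {"specimen_id": request["specimen_id"], "status": "located"}


class SampleTrackingV2(Module):
    name = "sample_tracking"
    version = "2.0"

    def handle(self, request: dict) -> dict:
        # Extends the V1 response while keeping its original fields.
        return {
            "specimen_id": request["specimen_id"],
            "status": "located",
            "handled_by_version": self.version,
        }
```

Callers address a module by name through dispatch, so deploying SampleTrackingV2 requires no change on their side as long as the original response fields are preserved.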
Vertical (functional) modules would allow for the easy addition of new functionality, e.g., a new
data analysis algorithm or a new remote data capture module; horizontal (behavioral) modules
can be used to drive the global behavior of the system by integrating the different functionalities
via well-defined contracts and workflow patterns (Table 4-1). In practice, even the vertical
functions, if properly developed, could be used throughout the system. For example, a strong
education component could be included in the Specimen and Data Acquisition Business Unit as
part of the informed consent process, in the Advanced Analysis Business Unit to bring a new
technician up to speed, and in the Patient Relations Business Unit to help donors understand how
their sample type is being used.
The NCICB platform, with its rich APIs and multitier architecture, could be integrated with other
systems that provide the additionally needed functionality, such as a sample inventory
management and tracking system. The NCICB platform currently does not include an
impenetrable security and encryption mechanism; however, the Ardais model suggests that
private patient data could remain at the primary specimen collection sites and that only
deidentified clinical information needs to be transmitted into the biospecimen informatics
network.
One important virtue of the modular approach is that well-designed (but extensible) security
functions that protect confidentiality can be used throughout the NBN, enhancing overall
security while allowing site-specific customization (e.g., e-consenting forms that might vary
across hospitals). Essential functions are designed once to serve the NBN securely as a whole.
Table 4-1. Candidate “Plug & Play” Modules
Vertical Modules (local functions)
- Authentication and authorization
- Sample banking
- Case report form design
- Cohort/study management
- Informed consent management
- Data analysis
- Reporting
- Education
- Questionnaire
- Data warehousing
- Sample tracking
- Remote data capture
- Informed consent form design
- Recontact
- Clinical stratification
- Knowledge management
- LIMS integration
- Study enrollment
- E-signature
- Genetic data banking
Horizontal Modules (global behaviors)
- Workflow management
- Study participant withdrawal
- Role assignment
- Sample request management
- Verification of data upload
Figure 4-2. Mapping of Business Units and Their Key Functions to Bioinformatics
4.3.4 System Functions Supported by Bioinformatics
The NBN must rely on its bioinformatics system to support the overall integration and exchange
of data for all of the other business units. Figure 4-2 provides a schematic view of the system
functions mapped to the business functions outlined in 6. Governance and Business
Models, and this section provides a brief discussion of these system functions.
4.3.4.1 Research Administration and Support
Managing the research process will be greatly facilitated by the application of bioinformatics at
every level. Tasks will include:
- Developing an e-consenting and reconsenting function, which will require the
development of data standards for consenting information and associated electronic
signature
- Developing a database of research activities, researchers, and grants
- Creating a Web site about the NBN and its capabilities
- Administering the NBN Biospecimen Utilization Review Committee to provide equitable
access to biospecimens and data
- Developing approaches to report activities to the NBN Operations Center and the Board
- Facilitating the identification of candidate specimens via user-friendly interfaces (e.g.,
users should not need to know Medical Subject Heading terms when looking for data
about specific types of tumors)
- Developing a mechanism by which data or analyses derived from NBN specimens can be
added to the system for the benefit of future research
4.3.4.2 Biospecimen and Data Acquisition
The system will be required to manage the data for acquisition, basic analysis, storage, and
distribution. In particular, the system will need to be able to identify biospecimens;
add/delete/modify information; and track their location, availability, size, state (fixed, frozen, or
both), and many other factors to be defined as NBN standards in Module 3, Biospecimen and Data
Collection and Dissemination. To support those broad activities, specific functionality will
include the following:
- Informed e-consent and reconsenting management, tying the specific elements of the
consent to the actual samples. Computerized informed consent forms should be defined.
- Integration with LIMS for tissue storage, retrieval, transformation, and shipping/receiving
tracking
- Management of pathology reports and other longitudinal clinical data, including coding
reports according to NBN standards, and ensuring that a true longitudinal picture of the
patient can be created and studied
- Support for representing and monitoring the initial staging procedure and the standards
for that procedure
- Tissue preparation (technology application) details and output
- E-signed workflow management to ensure that appropriate procedures are followed
- Developing a mechanism to communicate results to clinical providers
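As an illustration only, the tracking and consent-linkage requirements listed above might be modeled along the following lines. This is a minimal sketch in Python, not a proposed NBN design: the class names, field names, the milligram unit for size, and the consent element labels are all hypothetical, since the actual data elements remain to be defined as NBN standards.

```python
from dataclasses import dataclass, field
from enum import Enum


class PreservationState(Enum):
    """Specimen states named in the text: fixed, frozen, or both."""
    FIXED = "fixed"
    FROZEN = "frozen"
    BOTH = "both"


@dataclass
class Consent:
    consent_id: str
    # Hypothetical consent elements, e.g. "cancer_research" or "recontact".
    permitted_uses: set[str] = field(default_factory=set)
    withdrawn: bool = False


@dataclass
class Biospecimen:
    specimen_id: str
    consent: Consent          # ties the specimen to the consent it was given under
    location: str
    state: PreservationState
    size_mg: float            # assumed unit; the source only says "size"
    available: bool = True

    def may_be_used_for(self, use: str) -> bool:
        """True only if the specimen is available, the consent has not been
        withdrawn, and the consent explicitly covers the requested use."""
        return (
            self.available
            and not self.consent.withdrawn
            and use in self.consent.permitted_uses
        )


# Example usage with made-up identifiers.
consent = Consent("C-001", {"cancer_research"})
specimen = Biospecimen(
    "S-001", consent, "Freezer 3, Rack B", PreservationState.FROZEN, 250.0
)
```

Keeping the consent as a separate record referenced by each specimen means that a reconsent or withdrawal is recorded once and takes effect for every specimen linked to it.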
4.3.4.3 Storage and Distribution/Basic Analysis
It is anticipated that there will be regional storage and allocation of specimens, and that at these
sites some preliminary analyses will be performed. At a minimum, systems will need to be
created to:
- Receive, store, retrieve, and ship samples, and provide confirmation of receipt
- Manage inventory
- Conform with International Air Transport Association regulations
4.3.4.4 Advanced Analysis
A large amount of semistructured and unstructured bioinformatics data is expected to be
generated from microarray analyses and other technologies. Important tasks in the bioinformatics
arena will include the following:
- Implementing and enhancing existing standards for gene expression microarray data,
tissue microarray data, and proteomics data, probably by participation in a larger
process, will be an important part of the bioinformatics work of this project. The
MIAME standards are Object Management Group approved, and several tools exist for
creating XML documents for data exchange from a MIAME-compliant database.7
- Establishing detailed specifications for how to add or link these kinds of data to the NBN
system, and how to update, delete, and archive them, will be required.
- Developing integration with LIMS or LIMS-like systems, via Logical Observation
Identifiers Names and Codes (LOINC)®-based HL7 methods, will also be an important
consideration.8
- Developing a mechanism by which data and/or analyses generated from NBN biospecimens
by one user can be archived and shared with other potential users.
4.3.4.5 Patient Relations
It must be emphasized that the NBN will maintain a firewall between the donor’s identifying
information and research materials. The bioinformatics system will act as an honest information
broker. If patients are to be able to request that their unused specimens be withdrawn
from the repositories, the NBN must record specimens in a way that allows them to be linked
to the original donor. If research results have implications for care, the community of NBN
donors will be contacted, but never individuals (see Module 2, Management of Ethical and Legal
Issues).
Thus, the system will need to provide an effective way to communicate key research findings to
patients (biospecimen providers) and their families (see also Module 5, Communications). This
will involve:
- Developing a way to communicate to patients (and their families, as appropriate) findings
with implications for treatment and counseling (genetic, participation in trials).
- Developing education and outreach materials for all the constituencies, including study
participants, investigators, and the public at large.
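One way to picture the honest broker role described in this subsection is a component that issues random research codes for specimens and keeps the donor link private, so that a withdrawal request can still be resolved to specimens. The sketch below is illustrative only and rests on assumptions: the `HonestBroker` class, its method names, and the `NBN-` code prefix are hypothetical, and a real system would add persistence, auditing, and access control.

```python
import secrets


class HonestBroker:
    """Holds the donor-to-specimen link; researchers only ever see codes."""

    def __init__(self) -> None:
        # Private mapping from donor identifier to issued research codes.
        self._donor_to_codes: dict[str, list[str]] = {}

    def register_specimen(self, donor_id: str) -> str:
        """Issue a random research code that carries no donor information."""
        code = "NBN-" + secrets.token_hex(8)
        self._donor_to_codes.setdefault(donor_id, []).append(code)
        return code

    def codes_for_withdrawal(self, donor_id: str) -> list[str]:
        """Resolve a donor's withdrawal request to the affected research codes."""
        return list(self._donor_to_codes.get(donor_id, []))
```

Because the codes are random rather than derived from the donor identifier, the link can be followed only through the broker, which is what allows withdrawal without exposing identity to researchers.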
4.3.4.6 Bioinformatics and Data Management
The NBN must be able to manage the information system tasks that cut across all the business
units and that will be a shared (virtual) resource. The Bioinformatics and Data Management
Business Unit supports a series of very general information system tasks that are discussed here.
Reporting. The design of the reports ultimately drives all system requirements for database
applications. The NBN data architecture must allow the ability to:
- Facilitate researcher access to information.
- Enable or support the creation of longitudinal “virtual” studies to follow a cohort of
patients through clinical outcome. (This is useful for early testing of a biomarker or
diagnostic tool, but it requires modestly sophisticated information. Validation may
require more sophisticated information because results might ultimately be tested in a
clinical trial.)
- Permit operational and management reporting.
Database and Data Model. It is likely that there will be a central core data facility, with a
standard dictionary. SNOMED is a widely accepted standard, which makes its use within the NBN
an even more plausible choice. The NBN may have some dictionary requirements that go beyond
SNOMED, and it is recommended that, whatever auxiliary functions are developed, the NBN
dictionary cross-walk with other dictionaries. Similarly, the data system will need
associated business logic as well as development and reporting tools. Parts of the system
will be distributed. The distribution structure will evolve as technologies change and clinical data
become more available (and more massive).
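The recommended cross-walk between an NBN dictionary and other dictionaries can be thought of as a lookup from each NBN term to its equivalents elsewhere. The following Python sketch is a hedged illustration only: the `CROSSWALK` table, the `translate` function, and every code in it are placeholders invented for the example, not real SNOMED or other vocabulary identifiers.

```python
from typing import Optional

# Hypothetical cross-walk table: NBN-local term -> codes in other dictionaries.
# All codes below are placeholders, not real vocabulary identifiers.
CROSSWALK: dict[str, dict[str, str]] = {
    "NBN:0001": {"SNOMED": "PLACEHOLDER-A", "OTHER": "PLACEHOLDER-X"},
    "NBN:0002": {"SNOMED": "PLACEHOLDER-B"},
}


def translate(nbn_code: str, target: str) -> Optional[str]:
    """Return the target-dictionary code for an NBN term, or None if unmapped."""
    return CROSSWALK.get(nbn_code, {}).get(target)
```

Returning `None` for an unmapped term, rather than raising an error, makes gaps in the cross-walk easy to report and fill in over time.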
Candidates for some local stores include selected clinical (patient) data, trials data (research on
these and related specimens, including data and bibliographic information), and analytic data (histology and
histochemical, microarray, and other genetic and chemical analyses). A detailed data model will
need to be developed that includes at a minimum existing clinical, public health, insurer/Centers
for Medicare and Medicaid Services, pathology, histology (possibly image), research, and
bibliographic information. The model will have to allow patients to be followed longitudinally
and linked to family members. It may also be desirable to collect data related to specific ethnic
groups. It remains to be determined whether animal data related to the area of research should
also be connected. It will be important to determine a hierarchy of information that would be
required in a first version versus what is desired in what will probably be a series of later, more
sophisticated versions.
Security/Controlled Access. The bioinformatics system should have strong authentication that
supports the confidentiality and access rules determined by the governance body (which may in
turn be advised by confidentiality and legal experts). Differential levels of data identification will
be maintained in parallel, offering varying levels of access depending on the type of user (e.g.,
system administrator, NBN staff, researchers, patients).
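Differential levels of data identification by user type could be expressed as a simple policy table, as sketched below in Python. This is an assumption-laden illustration: the three identification levels, the role names, and especially the assignment of levels to roles are hypothetical placeholders, since the actual rules would be set by the governance body.

```python
from enum import IntEnum


class IdentificationLevel(IntEnum):
    """Hypothetical levels, ordered from least to most identifying."""
    DEIDENTIFIED = 1
    CODED = 2
    IDENTIFIED = 3


# Placeholder policy: the most identifying level each role may view.
# Real assignments would be determined by the NBN governance body.
MAX_LEVEL_BY_ROLE: dict[str, IdentificationLevel] = {
    "researcher": IdentificationLevel.DEIDENTIFIED,
    "nbn_staff": IdentificationLevel.CODED,
    "system_administrator": IdentificationLevel.IDENTIFIED,
}


def can_view(role: str, level: IdentificationLevel) -> bool:
    """Allow access only for known roles, up to the role's maximum level."""
    allowed = MAX_LEVEL_BY_ROLE.get(role)
    return allowed is not None and level <= allowed
```

Unknown roles are denied by default, so adding a new type of user requires an explicit policy decision rather than inheriting access by accident.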
User Support. User support will be a priority. There will be many types of users including
researchers, the other business units, NBN Operations staff, and informatics staff at sites. User
support will include helping with implementation and establishing minimal configurations
(including needed communications infrastructures), updates, and user help. User support may
also include a Web site, automated voice response, manuals, a help desk, and help with remote
configurations. The system will only be as good as the support. The NBN will need to consider
IT staffing at selected sites at a level commensurate with that of some private sector firms.
4.3.5 Quality Assurance
Demonstrating that quality measures have been set and attained (or that remedial actions have
been taken) will be one of the most challenging aspects of both the program management and the
bioinformatics tasks. Developing quality measurement tasks will include:
- Working with the QA Unit to define measurable objectives and pertinent regulations and
procedures that the NBN must meet in supporting the bioinformatics infrastructure
- Developing an executive information system that reports on how each of the business
units is doing in meeting their scientific, financial, and management objectives, and
ensuring that the system is adding value
- Assuring regulatory compliance
4.4 Summary of Key Findings and Recommendations
The NBN will require an intelligent information system to track biospecimens, key associated
clinical and biological data, and the information needed for research administration. The
development of the system will be a major project that must be done simultaneously with the
development of the NBN as a whole, rather than being a standalone or later add-on. It will be
essential to identify the appropriate balance of “buy” versus “build.” Key features of the NBN
information system are as follows:
- While recommendations about the general architecture and common data elements
require broad input, to maximize efficiency, a central architect should be designated to
build and manage the bioinformatics infrastructure. It is proposed that this central
architect have authority over project personnel, budget, design, and architecture, and also
be accountable to the governing principles of the NBN.
- The NBN bioinformatics system should be standards-based (e.g., SNOMED, HL7, or
MIAME for data; Internet for communications) to enable data and information exchange
among system components and the researchers who use them. In designing the
bioinformatics system, a standards-based approach will allow flexibility to employ
individualized approaches, while reducing the difficulty involved in developing a
comprehensive system that links diverse components of the NBN. The NBN should not
impose specific database or hardware requirements on the operational units.
- Assuring the availability of a minimum dataset and location should be an important
element of the NBN system design.
- Modularized components (“plug and play”) will be used (and developed) to the extent
possible, allowing the NBN data architecture to build upon best practices from other
repositories.
- Critical benchmarks of success for the NBN data system include ease of data entry and
retrieval, highly responsive user support, and commitment to the NBN Quality Assurance
process.
- While Bioinformatics and Data Management will represent one of the business units of
the NBN, it will also traverse and integrate all the business units and will be a shared
(virtual) resource underpinning all of the NBN operations over time.
- Development and implementation will start with a centralized database, and then move to
a more decentralized yet interconnected model as the system matures, capabilities are
strengthened, and requirements clarified.