CENDI PRINCIPALS AND ALTERNATES MEETING
National Archives and Records Administration
College Park, MD
March 25, 2008

Minutes

Keeping Up with a Changing Environment

Cooperation, Lessons Shared, and the Advance of Federal Technologies
Impacts and Changes on the Horizon for Scientific Communications and STI
NARA Showcase

 Welcome

Mr. Ryan, CENDI Chair, opened the meeting at 9:10 am.  He thanked NARA for hosting the meeting. 

NARA’s Chief Information Officer, Martha Morphy, welcomed CENDI. The relationships and understanding between NARA and other CENDI agencies are very valuable and support the records transfer process as well as NARA’s other work with the agencies. She sees the CENDI meeting as a great opportunity for many NARA staff and managers to participate. The work that CENDI does is important.

Keeping Up with a Changing Environment

“Cooperation, Lessons Shared, and the Advance of Federal Technologies”
Susan Turnbull, Senior Program Advisor, General Services Administration 

In her position as co-chair of the Emerging Technology Subcommittee of the CIO Council’s Architecture and Infrastructure Committee, Ms. Turnbull conducts monthly Collaborative Expedition Workshops. This has been ongoing for the last seven years and a process has evolved for addressing some issues that agencies continue to face today. The workshops originated within her office at GSA when she put together a study on the digital divide in the spring of 2001. This seemed like a good opportunity to turn to a collaborative process. The activity set in motion the philosophy and infrastructure for the Expedition Workshops.

At that time, three emerging trends were identified that remain important today: XML metadata, N11 services (911, 311, 411), and networked improvement communities, or communities of practice (CoPs). Today, Networked Improvement Communities are co-organizers of the monthly Expedition Workshops.

One goal of the series is to mature emerging technologies together, to reduce stove-piping that fragments and reduces the return on efforts.   The intelligence community has emphasized the need for this approach, as insularity becomes a vulnerability.
The Emerging Technology Subcommittee that Ms. Turnbull co-chairs uses the metaphor of moving from "stovepipes" to "wind chimes," where the right balance of alignment and autonomy is retained.

Also, the workshops are intended to open dialogue across science and service agencies, in order to enhance early discernment around promising capabilities as they move from research to operational settings.

One initiative of the Emerging Technology Subcommittee is Strategy Markup Language (StratML), an XML markup for organizations' strategic plans. Numerous federal agency strategic plans, as well as strategic plans from other organizations and international governments, have already been rendered in StratML. The common markup makes them searchable and displayable in a common form. The CENDI goals and objectives document is included. The StratML site is at http://xml.gov/stratml/index.htm.
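
As an illustration of how a common markup makes plans searchable in a uniform way, the following minimal Python sketch parses a simplified, hypothetical StratML-like document. The element names and sample content are assumptions made for illustration, not the official StratML schema.

    # Illustrative only: a simplified, hypothetical StratML-like document and a
    # small parser. Element names are assumptions, not the official schema.
    import xml.etree.ElementTree as ET

    SAMPLE_PLAN = """
    <StrategicPlan>
      <Name>Example Agency Strategic Plan</Name>
      <Goal>
        <Name>Improve public access to scientific information</Name>
        <Objective><Name>Digitize priority collections</Name></Objective>
        <Objective><Name>Expose catalog records to web search engines</Name></Objective>
      </Goal>
    </StrategicPlan>
    """

    def list_goals_and_objectives(xml_text):
        """Return (goal, [objectives]) pairs from a StratML-like document."""
        root = ET.fromstring(xml_text)
        results = []
        for goal in root.findall("Goal"):
            goal_name = goal.findtext("Name", default="")
            objectives = [o.findtext("Name", default="") for o in goal.findall("Objective")]
            results.append((goal_name, objectives))
        return results

    for goal, objectives in list_goals_and_objectives(SAMPLE_PLAN):
        print(goal)
        for obj in objectives:
            print("  -", obj)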

The Emerging Technology Life Cycle process is another initiative of the Emerging Technology Subcommittee. An instance document can be submitted at http://et.gov. Typically, a Community of Interest composed of federal agency representatives and others will form around it to explore the possibilities and mature the capability.

A problem-centered approach helps to focus each monthly workshop. Ms. Turnbull shared a list of some of the high level questions that have been the focus of previous workshops. This approach supports dialog across science and service agencies, and people leave with a larger sense of the whole and a better capacity for discernment around emerging technologies.  Approximately 60-80 people attend each month.

Increasingly, they are being asked to conduct specific workshops by CIOs and others. The workshops are public events, held monthly at the National Science Foundation. All presentations are fully archived, and frequently audio files are archived on the GSA web site as well. A blog was set up and has been in use since 2002. Since 2004, they have been using a collaborative work environment that includes a wiki, discussion forum, shared file repository, and portal.

In the winter of 2004, Ms. Turnbull witnessed and participated in a self-organizing group of 150 global volunteers in the aftermath of the Tsunami. This brought to the fore the power of collaboration and technologies that support new forms of collaboration.

In 2005, Ms. Turnbull was named as one of three leads commissioned by OMB and the CIO Council to produce a meta-model for data and information sharing in the federal government in 180 days. The Federal Data Reference Model was developed by designated staff of agency CIOs using a collaborative workspace provided by GSA. Over 300 documents were created and shared, input was received from over 500 people through five public workshops, and all comments were reconciled. GSA's public wiki complemented the closed workspace used by the agency representatives. During this time, the Geospatial CoP also developed a geospatial profile for OMB in the public wiki.

Collaborative workspaces augment people's natural ability to dialog and share. With a quality collaborative workspace, they were able to archive all the conversations and produce a product with many more people involved. They are using "best of breed" open services, and GSA has also paid for some software. These scaffolding tools are available.

Workshops are now being co-organized with diverse communities, including the federal IT research and development community (Subcommittee on Networking and IT R&D, http://nitrd.gov). They are also moving toward real national scenarios. To date, they have hosted 71 workshops and added more than 20 CoPs.

Throughout this process, Ms. Turnbull has learned several lessons from organizing via CoPs. First, don't lead with the technology; it is important to strike the right balance between needs and technology. An environment must be created that allows people to see the whole picture and avoid insularity. A sustained dialog is important, to advance trust and increase the quality of the conversations. The cultural "DOTs" should be connected: Dialog, Openness, and Transparency, which in turn enable needed relationships to be built and sustained. The workshops benefit from plausible scenarios that draw on expertise from a number of different agencies. Many agencies can play a role around a given scenario and assume strategic leadership roles, while thinking out loud together to gain greater understanding around a complex problem. A shared purpose serves as a powerful organizing force behind successful public workshops.

Ms. Turnbull highlighted other projects where agencies are collaborating and new ideas are being investigated. These include the Federal Funding Accountability and Transparency Act site (http://usaspending.gov), which has a public comment wiki. The MAX Collaborative Workspace (https://max.omb.gov) is a wiki-like tool available to Executive Branch agencies to improve sharing and collaboration. Federal employees and agency-sponsored contractors can get accounts, and the user base is growing by about 100 each week. The Ontolog Community of Practice (http://ontolog.cim3.net) has over 600 people from 26 countries. Since the fall, they have been participating with the Ontolog CoP and NASA around issues at the intersection of knowledge management, ontology, and decision support. The virtual conferences support multiple ways of participating, including teleconference with slides or participation via the NASA Island in Second Life, where "avatars" gather in an auditorium that includes a display of the slides. This approach (virtual worlds) shows promise for scenario-based learning that includes rapid modeling and simulation, and recording and replay, to build community capacity in uncertain settings such as disaster response. There is also an NIH Neuroscience Information Framework beta test site that links to Science.gov and lightly integrates disparate information resources under an umbrella search capability.

There are many instances of using social networking to enhance government activities. NASA Click Workers are volunteers from around the world who are mapping Mars. The contributions are exceeding what the scientists would be able to do alone. Engaging young people is also a significant aspect of this program.

We have extraordinary talent, but how can we be smarter about organizing ourselves? Organizations do not give up their boundaries or autonomy by lightly aligning with others to become more agile and effective. This happens at the level of individual employees and at the organization level. Cultural differences can create barriers, but these can be overcome through sustained dialog, which improves trust and discernment across the barriers.

Ms. Turnbull invited CENDI members to attend the Expedition Workshops. She gave a list of the upcoming topics, which range from identity management to the science of innovation. The next workshop will draw on a White House Homeland Security Council national preparedness scenario with a strong focus on identity management. It will be held on April 30 as part of the National Institute of Standards and Technology's (NIST) Interoperability Week.

Ms. Turnbull also offered the workshops as a venue for exploration of issues that matter to CENDI members. It is important to frame the workshops around broad national and global challenges. CENDI will add Ms. Turnbull to its observers’ list.

Discussion:

Ms. Frierson mentioned that the USDA is seeking to develop models that advance existing approaches for peer-reviewed journal publishing. This suggestion also surfaced in an Expedition Workshop held in September 2007 to plan future workshops. A workshop on peer review is planned for July 15, drawing on challenges and initiatives from NSF, NIH, and possibly the Patent and Trademark Office. A Peer-to-Patent pilot is underway at USPTO to open the review process to a broad base of invited experts. Ms. Carroll expressed interest in a future workshop on complexity.

“Impacts and Changes on the Horizon for Scientific Communications and STI”
Dr. Clifford Lynch, Director, Coalition for Networked Information (CNI)  
Special Interest Topic: Digitization at Member Agencies 

Dr. Lynch identified several scholarly communication issues that are on CNI's radar screen. The open access movement, the notion that scientific and scholarly communication should be openly accessible, is still a point of significant discussion. This particularly applies to journal articles funded with public money or those from foundations that see public access as part of their mission. Policies, such as the NIH deposit requirement, have been made, and now organizations, academics, and publishers are scrambling to deal with non-trivial workflow implications. In addition, Harvard's Faculty of Arts and Sciences passed a resolution that calls for default deposit of journal articles in a Harvard repository, opening up the research to the public without barriers. Again, the implementation and staff implications are non-trivial. Similar resolutions are being discussed at other major research universities.

The SCOAP3 proposal from CERN (European Organization for Nuclear Research) proposes to buy out the entire literature of high energy physics in order to make it publicly accessible. This changes the subscription model that is in place with the publishers; the publishers would be paid for the work they do, but the end result becomes publicly accessible. CERN has put a considerable amount of money on the table and has developed a framework for this initiative. It is being worked through the US university research library community.

The second issue CNI is following is a general set of changes to the research process known as cyberscience or e-science, involving the use of networked instrumentation, large observational and simulation databases, high performance computing and networking, visualization technologies, and virtual organizations to conduct science. This agenda has moved forward through NSF (led by the NSF Office of Cyberinfrastructure) and received major funding from the European Union (EU), the United Kingdom (UK), and some Asian countries. People are beginning to assess how cyberscience will impact the whole communications chain. Data management, data preservation, curation, and policies around data use and re-use are being discussed. A set of questions is arising about how this relates to the traditional journal publishing system. How do we ensure the integrity of this data, and how is it evaluated?

There is a big discussion about how you relate data sets to journal articles. One model is that the data sets become ancillary information to the journal article and it becomes the responsibility of the publisher to preserve them. However, the publishers are concerned about the preservation commitment for the data compared to the article itself, and the scope of peer review of this information. Should the peer review reflect on the ancillary materials? The second model would have the data going into a repository separate from the journal article. The repository would be cited in the article through an accession number. This approach has been enforced and institutionalized in areas such as genomics and crystallography. In these cases, the repositories are disciplinary in nature, and are either international or part of international networks of continental or national repositories. There is a third possible model in which the institutions that host the researchers take responsibility, perhaps because there is no disciplinary repository infrastructure. Institutional repository development is now moving into data management, including the storage of software and data sets.

Paralleling the movement for open access is a movement toward open data: questions are being asked about who owns data and what the expectations are for sharing. This is highly variable by disciplinary culture and funder practice. Some disease- or problem-oriented organizations have terms in their grants that require data sharing even before the research results are published, because they are focused on the most expeditious way to solve the problem. Data management issues are occurring in the humanities and social sciences as well. Complications are showing up with regard to data re-use, including issues of the privacy of human subjects.

NSF recently issued a DataNet call for proposals. The funding agenda is to underwrite the data curation part of the cyberinfrastructure, establish large-scale collaborations to curate and preserve data, and build capacity and conduct research pilots. NSF isn't eager to pay for the long-term curation of data resulting from its grants. The universities are nervous and uncomfortable about a large unfunded mandate in this area. In cases where there are no obvious national repositories in place, universities might pool expertise across institutions through some kind of cooperative. This has resulted, of course, in discussions about the impact on overhead.

The first round of the DataNet call resulted in pre-proposals in early January 2008. About two dozen responses were received. Approximately 10 submitters were invited to submit full proposals, with awards expected this summer. Another round will occur in the next fiscal year. The aggregate funding is about $100 million over five years.

NSF also launched a funding program last year for scientific data sharing communities. The goal was to identify scientific communities that want to share data and need help getting started, providing $250,000 for workshops, tools, etc. These awards may prove to be very high leverage in certain communities.

A "blue ribbon" task force (chaired by Brian Lavoie from OCLC and Fran Berman from the San Diego Supercomputer Center), funded by NSF, Mellon, and the UK's Joint Information Systems Committee (JISC), with staff support from the Council on Library and Information Resources (CLIR), has a remit to work on sustainable models for data resources. The models are, however, limited. "Public good" should be considered, but we don't have good criteria for defining the core scholarly and scientific resources. We also do not have good models for assessing the risk if data are not saved and must be recreated. There are some activities that can be funded out of user fees, but often this just moves the money around in a closed system, increasing the overhead and building up barriers.

The National Academies have convened an annual e-journal summit for the last seven to eight years. The summit began when journals were making their transition from paper to electronic. The subjects have evolved to include issues of licensing, preservation, business models, and the impact of the open access movement. One of the major segments this year was about research integrity, including peer review in a digital world, the ability to touch up or doctor digital images that appear in publications, and how to deal with highly controversial articles. Some publishers are now saying that reviewers may demand underlying data as part of the peer review process, and they may look for a second laboratory to validate the findings.

There were hall conversations about the fragility of the peer review system resulting from several phenomena, including the substantial growth in the amount of published material, particularly from outside the US. Many of these materials have mechanical problems that add to the regular peer review issues. Journals are finding it harder to get reviewers, and there is concern that the quality of reviews is deteriorating. This is an emerging issue that could potentially get quite serious. Court cases surrounding the openness of peer review notes will add to the pressure on peer review. The issues around clinical trials and pharmaceutical companies may become a flash point similar to the NIH deposit mandate. The peer review process may be reaching a crisis point.

Now that there are large corpora of scholarly publications in digital form, Dr. Lynch is seeing a renaissance of interest in large scale data mining and language processing in biomedicine and other disciplines. This results in a whole new collection of issues. What does it mean for the literature to be open? Can I only get an individual article or can I download the whole corpus to compute on it? Does it mean that the publisher is obligated to provide a landing pad for these data mining tools? This issue is starting to pop up in license agreements and in open access conversations.

Increasingly, authors and publishers are realizing that the audience for their work is not just human but machine. It is helpful to embed microformats and other types of markup to support the extraction of things that are described by ontologies and vocabularies. How much of this kind of semantic markup is the responsibility of the author versus the editorial process? What are the ground rules for adding markup after the articles have been published? This calls the stability of the literature into question.
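
The minutes do not include an example of such markup. As a rough illustration of the idea only, the following Python sketch shows an invented bit of inline markup in an article fragment and a small extractor that pulls out the machine-readable terms; the class name, attribute, and identifiers are hypothetical, not a particular microformat standard.

    # Rough illustration of machine-readable markup embedded in article text and
    # a small extractor. The "chem-compound" class, the data-id attribute, and
    # the identifiers are invented for this sketch.
    from html.parser import HTMLParser

    ARTICLE_HTML = """
    <p>The assay used <span class="chem-compound" data-id="COMPOUND:0001">water</span>
    and <span class="chem-compound" data-id="COMPOUND:0002">ethanol</span> as solvents.</p>
    """

    class CompoundExtractor(HTMLParser):
        """Collect (identifier, text) pairs for spans marked as chem-compound."""
        def __init__(self):
            super().__init__()
            self._current_id = None
            self.compounds = []

        def handle_starttag(self, tag, attrs):
            attrs = dict(attrs)
            if tag == "span" and attrs.get("class") == "chem-compound":
                self._current_id = attrs.get("data-id")

        def handle_data(self, data):
            if self._current_id is not None:
                self.compounds.append((self._current_id, data.strip()))
                self._current_id = None

    parser = CompoundExtractor()
    parser.feed(ARTICLE_HTML)
    print(parser.compounds)  # [('COMPOUND:0001', 'water'), ('COMPOUND:0002', 'ethanol')]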

Large scale computation and open literature have also resulted in plagiarism detection systems like Turnitin.com that are used in K-12 and some higher education institutions. There is some evidence that plagiarism is happening more often and that conventions regarding citation and copyright are not being adhered to worldwide. This presents a new problem for publishers, and there is work to integrate plagiarism detection systems into editorial workflow management.
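
The minutes do not describe how such detection systems work internally. As a generic illustration only (not Turnitin's actual method), one common ingredient is comparing overlapping word n-grams between a submission and candidate sources:

    # Generic illustration of one ingredient of plagiarism detection: comparing
    # overlapping word n-grams ("shingles") between two texts with a Jaccard
    # similarity score. Real commercial systems are far more sophisticated.
    def shingles(text, n=3):
        words = text.lower().split()
        return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

    def jaccard_similarity(a, b, n=3):
        sa, sb = shingles(a, n), shingles(b, n)
        if not sa or not sb:
            return 0.0
        return len(sa & sb) / len(sa | sb)

    submitted = "the peer review system is under strain as published output grows"
    source = "as published output grows the peer review system is under strain"
    print(round(jaccard_similarity(submitted, source), 2))  # high overlap score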

The discussions about the literature will increasingly impact collections as well. The selection of record series may soon be dealt with by computer because there isn't enough time for people to read everything. This approach is already showing up in litigation systems, resulting in a whole set of ways to deal with enormous corpora of evidence of various kinds. Dr. Lynch believes that this will become a major factor in how we think about archives and special collections in the future.

In February, CNI convened a workshop on name management and authorial identity, and the report should be available in May or June. Some publishers are considering assigning identities to authors. For example, Elsevier would assign an actionable name when an author first enters its publication system. There are similar initiatives throughout higher education, resulting in the unique identification of people from potential applicants to the alumni with whom the universities want to stay connected.

The Library of Congress commissioned and just received in January 2008 a major report on the future of bibliographic control, which talks a good deal about internationalization of name authority, and about the infrastructural value of name authority files and systems. Journals are starting to realize that there are actually a lot of names that, when Romanized, turn into the same character string, and authors may want to be known by their real names. Some journals are now allowing authors to have the Romanized name and the home character set in Unicode.

Journal publishers are starting to discuss the need for full name authority files. ISI Thomson and others have these name authorities for authors and their institutions. As people become more interested in bibliometrics and web metrics, these authority files will become more important. In addition, there is an initiative to have researchers go through the ISI database and claim their papers. This has raised questions about potential fraud. Also, as authors begin to hold rather than transfer copyrights, there will be a need to find authors over a longer period of time.

Cliff referred the audience to the CNI web site at www.cni.org as a source for additional information, and invited everybody to use the site to sign up for the CNI-announce mailing list; many of the documents he mentioned can be found by checking the archives of that list.

NARA Showcase

Federal Register and Digital Signatures – Jim Hemphill and Stephen Frattini

The Federal Register and NARA were created at the same time. With the New Deal, agencies became involved in regulation, but it was hard for those being regulated to know about the regulations. The Federal Register system was established to gather and disseminate federal regulations from the Executive Branch agencies in an organized way. They produce the book every single day. This has been done for 72 years and, until about 10 years ago, the process was not much different from what it was at the beginning.

While they have moved to an electronic editing and publishing system, there is a wide diversity in the information technologies and skills of the user base. The question is how to grab something that is moving fast and get it into a system, while doing a publication on schedule when there is no room for delay or error.

The Federal Register System is part of the Federal Regulatory System, which spans approximately 300 agencies and 1,000 offices. It is just a piece of a much bigger process over which they often have little control. Some agencies have the most sophisticated technologies and software while others are relatively unsophisticated. The materials to be incorporated also vary widely. How do you get the material into a usable format with the correct content type? These are legal documents; they must be tracked and cannot be changed by the process.

The Federal Register system is one of the few in government using public key infrastructure (PKI), receiving digitally signed and encrypted documents from anywhere in the government. GPO is their certificate authority. The Office of the Federal Register publishes the Federal Register, and GPO prints and distributes it. The eDOCS document management system, which began in 1996, is now receiving over 550 digitally signed documents each month from 38 agencies. Only a small number of agencies still hand deliver documents. All documents are processed as electronic materials; even manuscripts are converted into electronic format. Editing occurs on only a single copy, and then the document is transmitted electronically to GPO.
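
The minutes do not detail the signing and verification workflow. As a generic sketch only, the following Python code (using the third-party cryptography package) shows how a receiving system might verify a detached signature on a submitted document against the sender's certificate; it is not the actual eDOCS or GPO implementation, and the file names and algorithm choices are assumptions.

    # Generic sketch of verifying a detached digital signature on a submitted
    # document; not the actual eDOCS/GPO workflow. File names are hypothetical,
    # and an RSA key with PKCS#1 v1.5 padding and SHA-256 is assumed.
    from cryptography import x509
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives import hashes
    from cryptography.hazmat.primitives.asymmetric import padding

    def verify_document(document_path, signature_path, certificate_pem_path):
        """Return True if the signature over the document verifies."""
        with open(document_path, "rb") as f:
            document = f.read()
        with open(signature_path, "rb") as f:
            signature = f.read()
        with open(certificate_pem_path, "rb") as f:
            certificate = x509.load_pem_x509_certificate(f.read())

        public_key = certificate.public_key()
        try:
            public_key.verify(signature, document, padding.PKCS1v15(), hashes.SHA256())
            return True
        except InvalidSignature:
            return False

    # Example call with hypothetical file names:
    # verify_document("rule.xml", "rule.xml.sig", "agency_cert.pem")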

The material recently became available through an electronic public inspection desk, which alerts people a day earlier than the actual publication of the Register. This is a significant step in making the government transparent. However, challenges remain, including the need to scrub the tracked changes left by the intensely collaborative rule-making process, conversion difficulties with different forms of content such as maps and tables, and GPO's continued transition from print to public access.

Federal Records Center: New Services in the Records Center Program – Meg Phillips

The Federal Records Center (FRC) program includes 17 centers that store records remaining in agency legal custody. In addition to storage, the FRCs perform about 13 million reference activities per year. Efforts are underway to provide an integrated, broader range of record keeping services for the agencies, covering both paper and electronic records.

They have been performing production scanning at the centers in Ft. Worth and St. Louis since 2006. As these services get fully ramped up, the FRC is adding them in Atlanta, Georgia, and Riverside, California as well. The production scanning process is extremely flexible. Agencies can give a sample of the work to be done and the FRC will provide a cost. They can supply various levels of metadata, indexing, etc. NARA standards can be met, if needed. Custom services are available for special formats, such as micrographics, photographs, books, and cards.

The SmartScan product is geared toward small batches currently being stored in paper. Agencies can request same-day conversion from paper. The Electronic Records Vault provides climate control to slow deterioration of electronic media, protection from magnetic fields, particulate filtering, etc. This service is available at Suitland, Maryland, and Ft. Worth, Texas. The levels of service range from box tracking (as if it were paper) to item-by-item tracking, with barcodes at the level of the CD or magnetic tape. Media disintegration services are available at the same locations to shred electronic media and provide certified witnesses.

The FRC is researching online repositories for e-records storage and services. They hope to offer truly modern services for born digital records. This would involve managing the records and not just the media. It would involve a portal-based model for providing shared access to a compliant e-collection. A pilot project is currently underway with the Electronic Records Archive. One of the four pilot agencies for the ERA project will be sending its temporary records to see what still needs to be done through the ERA before offering the FRC service. This pilot will also help to determine the true cost per unit per unit of time so they can determine how to price this service.

Archival Research Catalog – Pam Wright

The Archival Research Catalog (ARC) is the online catalog of NARA’s nationwide holdings in the Washington, DC, and regional archives, and in the Presidential Libraries. It serves as the central source and the first stop for research at the Archives. It has been on the web since 2002. The goal is to have 60 percent of the total holdings described and included this year and to add five percent each year. As many as 20,000 series were added to the catalog this year.

There are approximately 127,000 digital images in ARC. The Digital Harvest Project and partnerships are being used to increase the number of images that are available digitally. One of the current partners is Footnote.com, with about 20 million images available on its web site. Footnote gives the metadata to NARA; NARA uploads it to ARC and then links out to the image on Footnote. The goal is to make ARC the hub and point to others. The overall NARA strategy is available at www.archives.gov/digitization/.

The ARC had over one million visits last year. About 93 percent are Americans. Genealogists and family historians are the primary users, followed by researchers and then academics and students. The majority of the outreach and web page design is geared toward these groups.

In order to make the holdings more accessible, ARC worked with Google to produce site maps. This approach is very easy, requires little maintenance, and can be shared with other search engines because of the "standard" protocol. They piloted the project to ensure that IT resources and reference staff wouldn't be overwhelmed. ARC now has about 25 site maps with about 25,000 URLs in each. The inconsistency of Google's relevance ranking is an issue; the algorithms are so sophisticated that it is hard to determine why results come and go.
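
The "standard" protocol referred to here is the Sitemaps protocol, a simple XML format that lists URLs and caps each file at 50,000 entries. The following is a minimal Python sketch of generating one such file; the catalog URLs are invented placeholders, not actual ARC record locations.

    # Minimal sketch of generating a sitemap file in the standard Sitemaps
    # protocol format. The catalog URLs below are invented placeholders.
    import xml.etree.ElementTree as ET

    SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"

    def write_sitemap(urls, path):
        """Write one sitemap file; the protocol caps each file at 50,000 URLs."""
        urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
        for url in urls[:50000]:
            entry = ET.SubElement(urlset, "url")
            ET.SubElement(entry, "loc").text = url
        ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

    # Hypothetical catalog record URLs, e.g. one per described item:
    records = [f"https://example.gov/catalog/record/{i}" for i in range(1, 25001)]
    write_sitemap(records, "sitemap-001.xml")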

Ms. Wright showed the newly redesigned homepage. The old homepage was very text heavy. The new site has a simple search box on the homepage, with focused buttons for the top user groups. With this redesign, 7 out of 10 visitors come into the database from the homepage, versus 1 out of 10 before. In addition, they have created topical subject pages built on pre-coordinated keyword searches.

In April, they will roll out a new web database interface. They took this opportunity to improve the usability of the interface, conducting three rounds of usability testing. The result looks more like the regular search engines people are accustomed to. In the long term, ARC will investigate the online public access system of the future in connection with the ERA.

Discussion

The inclusion of the ARC as a Science.gov database was discussed. It will be necessary to segment the content to focus on science. The ARC could also be added to WorldCat as a way of promoting its use.

National Declassification Initiative Program at NARA – Paul Wester

The Modern Records Program is responsible for records appraisal, and interacts with CENDI agencies in this area. What is the business case for scientific records? What is the value proposition to build into the grants process to ensure long-term records management of this information? Robert Chadduck from NARA is involved with the San Diego Supercomputer Center in this effort. There are also political issues, since public interest groups are very involved.

There are challenges within NARA that arise from National Security Information Executive Order 12958, which called for automatic declassification by 2006. The agencies spent about seven years getting their houses in order. This has resulted in about 160,000 cubic feet of records being brought to the archives that must now be dealt with by NARA or by another agency. They must decide what needs to stay secure and what doesn't.

Mr. Wester described the various components of the declassification initiative. While agencies have been doing declassification themselves, NARA must now deal with the process. The Interagency Referral Center has hundreds of contractor staff representing the agencies involved in working through the records. A training and certification program for the representatives has been developed to help them understand and consider equities from other agencies that might be included in their own.

They are also focused on how information security, information management, and records management integrate. This is being addressed as part of the regular retention schedule decisions in an attempt to break down stovepipes. Current work is with the intelligence agencies and includes a large number of audiovisual records that must be dealt with in addition to text.

The morning program concluded at approximately 12:00 Noon.
