Contents
1. Introduction
This document specifies the NIST/DARPA 1998 evaluation of speech recognition technology on broadcast news in English. The purpose of this evaluation is to foster research on the problem of accurately transcribing broadcast news speech and to measure objectively the state of the art. The evaluation will deal with the following types of television and radio shows:
This program material includes a combination of read speech and spontaneous speech, as well as a combination of recording environments in broadcast studios and in the field. It is expected that this material will provide an impetus to improve core speech recognition capability; to improve the adaptability of recognition systems to new speakers, dialects, and recording environments; and to improve the systems’ abilities to cope with the problems of unknown words, spontaneous speech, and unconstrained syntax and semantics.
The 1998 evaluation will be similar to that conducted in 1997, with the following changes: The primary evaluation task (Hub 4) will consist of the recognition of selected segments from shows, and/or excerpts from long speeches. As described below, there will be two "Spokes" associated with Hub 4 in 1998 -- a "10X System" Spoke, and an "Information Extraction" Spoke. Test materials provided by NIST will be identical for both the primary task and the spokes.
2. Definitions and Terminology
A "show" is a particular television or radio broadcast production, encompassing all of the dates and times of broadcast. Examples include "CNN Headline News" and "NPR All Things Considered".
An "episode" is an instance of a show on a particular date (and possibly time), such as "All Things Considered on July 5, 1997" or "CNN Headline News at 1000 EDT on July 5, 1997".
A "story" is a continuous portion of an episode that discusses a single topic or event.
3. Evaluation Test Data
The evaluation test data will be selected from the audio component of a variety of television and radio broadcast news sources. The data will consist of approximately three hours of speech divided into two data sets. The first data set (set1) will be taken from episodes broadcast between October 15 and November 14, 1996, and corresponds closely in time and composition to the 1997 Hub-4 Test Set, which has also been designated as the 1998 Development Test Set. The second set (set2) will be taken from a different variety of shows broadcast in June 1998. Each data set will be selected by NIST from larger pools collected by the LDC. The two sets will be distributed in separate files, and the epoch for each set will be "known". The data will be selected according to the following guidelines:
4. Training Data
The acoustic and language model data distributed for training, development test, and evaluation test usage in previous DARPA/NIST CSR evaluations constitute the "baseline" training data for the 1998 Hub-4 evaluation. These data include Broadcast News, Marketplace, TIMIT, Resource Management, ATIS, Wall Street Journal, North American Business News, Switchboard, Macrophone, and Call Home corpora as well as textual corpora consisting of broadcast news transcripts, newswire text, and language modelling tools and collections the LDC has made available to the CSR community.
Sites may make use of supplemental privately-acquired training material if such material is made available to the research community as follows: Privately-acquired data must be made available to the LDC in a form suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the Government elects to have the data published and an implied statement that the data is legally unencumbered. Delivery of the data to LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1999.
Sites may not make use of any news-oriented materials for training purposes of any kind (e.g., acoustical, language model, etc.) which are dated during the Set1 epoch (15-OCT-96 - 14-NOV-96) or after 28-FEB-98.
4.1 Acoustic Training Data
The suggested acoustic training data for this evaluation includes the approximately 200 hours of Broadcast News acoustical training data released by the LDC to date as well as the other corpora listed above. These data were recorded from shows that are similar in form and content to shows used for the evaluation test data. Note that the shows, but not particular episodes, in the training and evaluation test data may overlap. The time epochs for the training data and test set2 data are distinct. Note, however, that there will be temporal overlap between the training data and the set1 data. Therefore, sites may not train on recordings or transcripts of news-oriented broadcasts recorded during the set1 epoch (15-OCT-96 - 14-NOV-96) or after 28-FEB-98.
If sites make use of other acoustic data that they acquire privately or from outside sources, including additional untranscribed audio training data distributed by LDC for this evaluation, they must also supply as a contrast condition the evaluation results obtained from the same system trained only on the baseline acoustic training data.
4.2 Language Model Training Data
Text corpora available from the LDC including transcriptions of news broadcasts and newswire text may be used in constructing language models for the evaluation. Text conditioning tools for these texts are available from the LDC.
In addition to the texts for language model training provided by LDC and NIST, sites may make use of supplemental language model data that they acquire privately or from commercial sources so long as these materials are made available to the research community as indicated above. As indicated above, sites should not use any news-oriented materials dated during the set1 epoch (15-OCT-96 - 14-NOV-96) or after 28-FEB-98 for training.
5. Development Test Data
The development test set for this evaluation consists of the 1997 Hub-4 evaluation test set. These data are similar in form and content and epoch to the "set1" portion of the 1998 evaluation test data.
Use of the 1997 Evaluation Test Set as an additional resource for system development is permitted.
6. Summary of Show Sources
The following lists represent the television and radio programs for which the LDC has negotiated redistribution rights, and which the LDC has recorded for use in Hub 4 training and test sets.
7. Annotation of Data
NIST and the LDC have developed a transcription and annotation system to aid in the development and evaluation of speech recognition technology. The transcription/annotation format has been modified in 1998 to accommodate tags required in the Information Extraction Spoke. The new format, Universal Transcription Format (UTF), also adds tag information to self-identify the appropriate DTD to apply to the transcript. The new format is intended to be extensible to other speech evaluation domains and, in the future, to be usable as both a reference transcription and system output format. Most of the changes in the format involve the addition of tags/attributes to the 1997 Hub-4 format, so it should be relatively easy for Hub-4 ASR sites to adapt to the new format. For consistency, all prior Broadcast News corpora (1996 - 1998 training, development test, and evaluation test material) have been reformatted into this new format. These transcripts are available in a single compendium from the LDC via LDC Order Number LDC98E10. A subset of the training and devtest files in the release have been IE-NE tagged. Documentation regarding the UTF format is included in the release. See the README file in the release for further details.
More information on annotation may be obtained from the 1997 Hub-4 annotation specification, which is available at ftp://jaguar.ncsl.nist.gov/csr96/h4/h4annot.ps. Note that a different annotation system has been implemented by the LDC for the second hundred hours of training material.
NIST will provide reference transcriptions and annotations for the evaluation test set after the recognition results have been submitted.
8. Evaluation Conditions
Participating sites are required to conduct a single evaluation over all of the evaluation data. For sites that do not wish to implement their own segmentation algorithms, NIST will supply speech segmentation information for the evaluation data using an automatic segmentation utility provided by Carnegie Mellon University. The latest version of the CMUseg Acoustic Segmentation Software is available from the NIST Speech Software Website. Sites may use the NIST-supplied segmentation information, or they may perform their own segmentation and classification. Any recognition approach is allowed, including running a decoder in unsupervised transcription mode. Any audio segment in the evaluation test data may be used to help decode any other segment of audio. (In other words, adaptation techniques may make use of audio across episode and show boundaries.)
9. Hub-4 Test Conditions (Hub and Spokes)
The Hub-4 Broadcast News Evaluation will include a "Hub" transcription task and two "Spoke" Tasks:
9.1 Transcription Hub
This task is similar to the 1997 Hub 4 English-language task; the primary evaluation metric will be word error.
9.2 "10X System" Spoke
This task involves submission of results for systems that run in less than or equal to 10X real time on a single processor (i.e., less than or equal to ~30 hours to process the ~3 hour evaluation test set). In the accompanying system description, system developers must document all computational resources used for the system, including processor type(s) and memory resources, and including discussion of processing time-allocation for the various signal-processing, segmentation, and decoding components of the system.
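The 10X requirement reduces to simple arithmetic on processing time versus audio duration. The sketch below illustrates the check; the function names are illustrative only and are not part of the evaluation tooling.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (xRT): processing time divided by audio duration."""
    return processing_seconds / audio_seconds

def meets_10x_limit(processing_seconds: float, audio_seconds: float) -> bool:
    """True if the run finished at or under 10 times real time."""
    return real_time_factor(processing_seconds, audio_seconds) <= 10.0

# Example: a ~3-hour test set processed in 28 hours runs at ~9.3xRT,
# which is within the 10X limit; 31 hours would not be.
```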
9.3 Information Extraction - Named Entity (IE-NE) Spoke
The Hub-4 Broadcast News Evaluation this year will include a new "Information Extraction" Spoke which will involve the implementation and evaluation of automatic Named Entity tagging as applied to the Hub-4 Reference and recognizer-produced transcriptions. This new spoke is based on the Message Understanding Conference (MUC) Named Entity task which involved the tagging of person, organization, location names and other entities in newswire text.
The Hub-4 IE-NE Spoke will explore the coupling of recognition and entity tagging technologies as an initial step toward creating new robust speech understanding technologies.
The spoke also supports the exploration of Named-Entity-based scoring as an alternative to traditional Word Error Rate (WER) based scoring in evaluating continuous speech recognition performance. Whereas WER scoring is useful in evaluating recognition for dictation-oriented applications, NE-based scoring should be useful in highlighting "content word" errors which are critical in information search, detection, and tracking applications.
9.3.1 Participation Levels:
The Hub-4 Information Extraction Spoke will include 3 levels of participation:
Full-IE-NE: (Evaluation conditions 4-6, see below)
Quasi-IE-NE: (Evaluation conditions 4-5, see below)
Baseline-IE-NE: (Evaluation conditions 1-3, see below).
9.3.2 Tagged Entities:
The Hub-4 98 Evaluation IE-NE Spoke will require identification of the following entities:
This is to be accomplished by applying automatic tagging software to the standard Hub-4 recognizer output files (the CTM-format files used in scoring speech recognition word error). The tagged output format is still being developed and will be specified at a later date. The automatically-generated IE-NE tagged transcripts will be scored against a set of hand-tagged reference transcripts using the SAIC Named Entity Scoring software based on the MUC-7 Named Entity Scorer. See below for details regarding obtaining the software.
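A CTM file carries one time-marked word per line. As a minimal sketch, assuming the common field layout (file, channel, begin time, duration, word, optional confidence) and `;;` comment lines, recognizer output could be parsed like this; the class and function names are illustrative:

```python
from typing import List, NamedTuple, Optional

class CtmToken(NamedTuple):
    file: str
    channel: str
    begin: float       # word start time in seconds
    duration: float    # word duration in seconds
    word: str
    confidence: Optional[float]

def parse_ctm(lines) -> List[CtmToken]:
    """Parse CTM-format recognizer output, skipping blanks and ';;' comments."""
    tokens = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        parts = line.split()
        conf = float(parts[5]) if len(parts) > 5 else None
        tokens.append(CtmToken(parts[0], parts[1], float(parts[2]),
                               float(parts[3]), parts[4], conf))
    return tokens
```

A downstream tagger would then operate on the `word` sequence while the time marks allow the tags to be aligned back against the audio.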
The Hub-4 IE-NE Task Definition Version 4.8 document has been derived from the MUC-7 Named Entity Task Definition and refined to accommodate the Hub-4 spoken broadcast news domain.
9.3.3 Data Sets:
IE-NE-tagged "Reference" transcripts and "Baseline" recognizer transcripts will be provided for each of the following data sets:
The IE tags for the reference transcripts for each data set will be generated by human annotators at MITRE and SAIC. For the evaluation test data, the reference transcripts will be redundantly annotated with IE tags by several annotators. Inter-annotator agreement will then be measured and used to determine the reference set of tags for scoring.
9.3.4 Evaluation Conditions:
The IE-NE Spoke has 6 conditions which will be evaluated: the first 3 using the BBN baseline NE tagger implemented at NIST, and the last 3 implemented by sites using their own taggers.
9.3.5 Participation Levels/Required Evaluation Conditions:
9.3.6 IE Scoring:
NIST will score the 6 possible transcript/tag combination conditions using the Named Entity Scorer developed by SAIC. See below for details regarding obtaining the software.
9.3.7 Software:
The initial release of the Hub-4 IE-NE Scoring Software (Version 0.6) which was developed by SAIC is now available. The latest publicly available version of the scoring software is also available from the NIST Speech Software Website.
9.3.8 IE-NE Evaluation Schedule
This schedule was revised on October 5 and pertains only to the Hub-4 Full-IE-NE and Quasi-IE-NE evaluation conditions.
November 2, 1998 - Deadline for site commitment to participate
November 23, 1998 - Evaluation test data to be at participating sites, test begins
December 14 (0700 EST) - Deadline for submission of all test results
December 21, 1998 - NIST releases scores
February 1999 - Workshop for Hub 4 participating sites
9.3.9 Contact:
NIST has worked closely with MITRE and SAIC, drawing on their experience in the MUC evaluations, to develop the specifications, corpora, and software for the IE-NE Spoke. However, so that a central flow of information may be maintained, please address any questions and copy all correspondence regarding the IE-NE Spoke to John Garofolo at NIST (jgarofolo@nist.gov). Questions and comments to be directed to all IE-NE participants may be sent to the Hub-4 IE-NE email list (hub4_ie_list@jaguar.ncsl.nist.gov). Please send email to John Garofolo if you'd like to be added to the list.
10. Scoring
Sites will generate decodings that include word time alignments. The same scoring algorithm (SCLITE) used for the 1997 Hub 4 evaluation will be used for the "Transcription Hub" and "10X Spoke" of this evaluation. Word error will be the primary metric. In addition, new complementary "Named Entity" metrics and scoring software will be employed for the "IE-NE Spoke". NIST will tabulate and report word error rates over the entire dataset.
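Word error is computed from a minimum-cost alignment of hypothesis words against reference words. The following is a minimal sketch of that core computation; the official SCLITE tool additionally handles time alignments, optionally deletable words, and orthographic mappings.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words,
    computed via Levenshtein distance over whitespace-separated word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.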
NIST will also tabulate and report Word Error Rates for various subsets of test material to examine performance for different conditions. As in previous evaluations, these will include the effect of the following annotated phenomena on recognition:
Special attention will be given to the F0 condition. This condition is of particular interest because the absence of other complicating factors such as background noise, music and non-native dialects focuses attention on basic speech recognition issues common to all conditions.
Immediately after the evaluation, NIST will provide the complete annotation record for the evaluation test material, to facilitate the analysis of performance by individual sites.
Evaluating sites are encouraged to submit the output of their system for a portion of the development test to NIST prior to the formal evaluation, to verify that the system output is processed properly by the NIST scoring software.
The current version of the NIST SCLITE Speech Recognition Scoring Software is available from the NIST Speech Software Website. Revisions are made periodically and will be announced in email.
10.1 Orthographic Rules
System developers should familiarize themselves with the orthographic transformations and rules used in preparing both the reference transcriptions and system hypothesis transcriptions prior to official scoring by NIST so that they can obtain the most accurate scoring of their systems.
10.1.1 SNOR Format
The transcription format employed for scoring is called SNOR (Standard Normalized Orthographic Representation). The SNOR format is derived from the detailed transcription format used in Hub-4 via a filter which will be made publicly available in August. The SNOR format provides a common format for recognition output. In doing so, it removes lexical details which are not part of the current Hub-4 research focus (such as capitalization, punctuation, etc.) from the transcription format to define and simplify the recognition and scoring process. A SNOR-normalized transcription consists of text strings made up of ASCII characters and has the following constraints:
The human-generated reference transcripts are stored in their original detailed format and are translated into SNOR prior to scoring the output of an ASR system. It is important that these transformations are properly included in the design of the recognition systems, so that the system output may be scored optimally.
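As an illustration of the kind of normalization involved, the toy filter below uppercases text and strips punctuation other than apostrophes. These specific rules are assumptions for the example only; the authoritative constraints are those in the SNOR specification and the NIST-supplied filter.

```python
import re

def to_snor_like(text: str) -> str:
    """Toy SNOR-style normalizer (illustrative assumptions only):
    uppercase ASCII words, apostrophes kept for contractions,
    other punctuation dropped, whitespace collapsed to single spaces."""
    text = text.upper()
    text = re.sub(r"[^A-Z' ]+", " ", text)   # drop punctuation and digits
    return " ".join(text.split())
```

Running both reference and hypothesis text through the same normalizer ensures that scoring compares words rather than incidental formatting.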
10.1.2 Orthographic Normalization
After the reference transcripts are converted into SNOR, both the reference and ASR-produced transcripts will be transformed via a NIST-supplied transcript filter. The filter, specially updated for 1998, will be available on the 1998 Hub-4 Website in August. The orthographic map file which will be used in the official scoring will not be made available until after the Hub test results are received and scored by NIST. The version of the filter used in the 1997 Hub-4 evaluation, tranfilt-1.8.tar.Z, and the 1997 Hub-4 orthographic map file, en971128.glm, are currently available. The sections below on "Multiple Spellings" and "Contractions" describe the mapping process.
10.1.3 Notes on the Handling of Special Orthographic Conditions
This section describes orthographic conditions and speech phenomena which require special processing. Note that some of these conditions are scored as "optionally deletable". In these cases, an ASR system will not be penalized with an error for omitting the output of (or "deleting") the particular word in the system output. However, reference words marked as such will still count towards the total number of reference words during Word Error Rate computations.
11. Multiple Systems Running a Single Test
In order to discourage the running of several systems on a single test to improve one’s chances of scoring well, sites must designate one system as the primary system if more than one system is run on a single test. This designation must be made before the test is begun. Results must be reported for all systems run on any test.
12. System Descriptions
Sites are required to submit a standardized system description to NIST along with the results for each system run on any test. The format for these system descriptions is as follows:
SITE/SYSTEM NAME
HUB-4 {CORE/CONTRAST} TEST
1) PRIMARY TEST SYSTEM DESCRIPTION:
2) ACOUSTIC TRAINING:
3) GRAMMAR TRAINING:
4) RECOGNITION LEXICON DESCRIPTION:
5) DIFFERENCES FOR EACH CONTRASTIVE TEST:
6) NEW CONDITIONS FOR THIS EVALUATION:
7) REFERENCES:
Evaluating sites will be required to provide a written description at the Workshop of computational resource requirements including processor speed and storage requirements used to produce the evaluation results, and to publish information about the complexity of new algorithms.
13. Sharing of System Output
As in last year's Hub-4 evaluation, we encourage sites to permit the sharing of their system output files with other participants. Since this practice appeared to be broadly accepted last year, this year NIST will assume that participating sites are willing to allow NIST to make their output available to other participants for diagnostic and research purposes. Therefore, if you do NOT want to share your output, please notify David Pallett (dpallett@nist.gov) at NIST in email prior to submitting your results for scoring.
14. Site Commitments
Sites interested in participating in the 1998 Hub 4 evaluation should notify NIST no later than September 29, 1998. NIST will ensure that participating sites receive appropriate training and devtest material in a timely fashion after authorization to do so from the LDC. Sites must be members in good standing with the LDC or have made a test-only agreement with the LDC prior to being given access to the Hub-4 corpora.
Site commitments are used to control evaluation and to manage evaluation resources. It is imperative that sites honor their commitments in order for the evaluation to have beneficial impact. Sites must notify NIST as soon as possible, prior to the distribution of the evaluation data, if it appears that a commitment may not be honored. Defaulting on a commitment may jeopardize permission to participate, and to obtain early distributions of future test data, in subsequent evaluations.
15. Workshop
A workshop will be held in February 1999 for presenting evaluation results and discussing the technology used in the Hub 4 evaluation. Evaluation results will be reported by NIST, and invited and contributed presentations will be made by evaluation participants. Presentations and results at the Workshop will be published in a written publicly-available Proceedings. N.B. Participants will be required to deliver camera-ready copies of their papers (plus release approvals) at least one week prior to the workshop.
16. Schedule
Note that a different schedule applies to participants in the Full-IE-NE and Quasi-IE-NE evaluations. See Section 9.3.8 for the IE-NE Spoke schedule.
September 29, 1998 - Deadline for site commitment to participate
October 6, 1998 - Deadline for sites to submit Devtest results (optional)
October 13, 1998 - Evaluation test data to be at participating sites, test begins
November 10, 1998 (0700 EST) - Deadline for submission of hub primary test results
November 18, 1998 - NIST releases scores for the hub primary test results
November 20, 1998 (0700 EST) - Deadline for submission of spoke and hub contrast test results
November 25, 1998 - NIST releases scores for the spoke and hub contrast test results
February 1999 - Workshop for Hub 4 participating sites