Contents
1. Introduction
This document specifies the NIST/DARPA 1998 evaluation of speech recognition technology on broadcast news in English. The purpose of this evaluation is to foster research on the problem of accurately transcribing broadcast news speech and to measure objectively the state of the art. The evaluation will deal with the following types of television and radio shows:
This program material includes a combination of read speech and spontaneous speech, as well as a combination of recording environments in broadcast studios and in the field. It is expected that this material will provide an impetus to improve core speech recognition capability; to improve the adaptability of recognition systems to new speakers, dialects, and recording environments; and to improve the systems’ abilities to cope with the problems of unknown words, spontaneous speech, and unconstrained syntax and semantics.
The 1998 evaluation will be similar to that conducted in 1997, with the following changes: The primary evaluation task (Hub 4) will consist of the recognition of selected segments from shows, and/or excerpts from long speeches. As described below, there will be two "Spokes" associated with Hub 4 in 1998 -- a "10X System" Spoke, and an "Information Extraction" Spoke. Test materials provided by NIST will be identical for both the primary task and the spokes.
2. Definitions and Terminology
A "show" is a particular television or radio broadcast production, encompassing all of the dates and times of broadcast. Examples include "CNN Headline News" and "NPR All Things Considered".
An "episode" is an instance of a show on a particular date (and possibly time), such as "All Things Considered on July 5, 1997" or "CNN Headline News at 1000 EDT on July 5, 1997".
A "story" is a continuous portion of an episode that discusses a single topic or event.
3. Evaluation Test Data
The evaluation test data will be selected from the audio component of a variety of television and radio broadcast news sources. The data will consist of approximately three hours of speech divided into two data sets. The first data set (set1) will be taken from episodes broadcast between October 15 and November 14, 1996, and corresponds closely in time and composition to the 1997 Hub-4 Test Set, which has also been designated as the 1998 Development Test Set. The second set (set2) will be taken from a different variety of shows broadcast in June 1998. Each data set will be selected by NIST from larger pools collected by the LDC. The two sets will be distributed in separate files, and the epoch for each set will be "known". The data will be selected according to the following guidelines:
4. Training Data
The acoustic and language model data distributed for training, development test, and evaluation test usage in previous DARPA/NIST CSR evaluations constitute the "baseline" training data for the 1998 Hub-4 evaluation. These data include Broadcast News, Marketplace, TIMIT, Resource Management, ATIS, Wall Street Journal, North American Business News, Switchboard, Macrophone, and Call Home corpora as well as textual corpora consisting of broadcast news transcripts, newswire text, and language modelling tools and collections the LDC has made available to the CSR community.
Sites may make use of supplemental privately-acquired training material if such material is made available to the research community as follows: Privately-acquired data must be made available to the LDC in a form suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the Government elects to have the data published and an implied statement that the data is legally unencumbered. Delivery of the data to LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1999.
Sites may not make use of any news-oriented materials for training purposes of any kind (e.g., acoustical, language model, etc.) which are dated during the Set1 epoch (15-OCT-96 - 14-NOV-96) or after 28-FEB-98.
4.1 Acoustic Training Data
The suggested acoustic training data for this evaluation includes the approximately 200 hours of Broadcast News acoustical training data released by the LDC to date as well as the other corpora listed above. These data were recorded from shows that are similar in form and content to shows used for the evaluation test data. Note that the shows, but not particular episodes, in the training and evaluation test data may overlap. The time epochs for the training data and test set2 data are distinct. Note, however, that there will be temporal overlap between the training data and the set1 data. Therefore, sites may not train on recordings or transcripts of news-oriented broadcasts recorded during the set1 epoch (15-OCT-96 - 14-NOV-96) or after 28-FEB-98.
If sites make use of other acoustic data that they acquire privately or from outside sources, including additional untranscribed audio training data distributed by LDC for this evaluation, they must also supply as a contrast condition the evaluation results obtained from the same system trained only on the baseline acoustic training data.
4.2 Language Model Training Data
Text corpora available from the LDC including transcriptions of news broadcasts and newswire text may be used in constructing language models for the evaluation. Text conditioning tools for these texts are available from the LDC.
In addition to the texts for language model training provided by LDC and NIST, sites may make use of supplemental language model data that they acquire privately or from commercial sources so long as these materials are made available to the research community as indicated above. As indicated above, sites should not use any news-oriented materials dated during the set1 epoch (15-OCT-96 - 14-NOV-96) or after 28-FEB-98 for training.
5. Development Test Data
The development test set for this evaluation consists of the 1997 Hub-4 evaluation test set. These data are similar in form and content and epoch to the "set1" portion of the 1998 evaluation test data.
Use of the 1997 Evaluation Test Set as an additional resource for system development is permitted.
6. Summary of Show Sources
The following lists represent the television and radio programs for which the LDC has negotiated redistribution rights, and which the LDC has recorded for use in Hub 4 training and test sets.
7. Annotation of Data
NIST and the LDC have developed a transcription and annotation system to aid in the development and evaluation of speech recognition technology. The transcription/annotation format has been modified in 1998 to accommodate tags required in the Information Extraction Spoke. The new format, Universal Transcription Format (UTF), also adds tag information to self-identify the appropriate DTD to apply to the transcript. The new format is intended to be extensible to other speech evaluation domains and, in the future, to be usable as both a reference transcription and system output format. Most of the changes in the format involve the addition of tags/attributes to the 1997 Hub-4 format, so it should be relatively easy for Hub-4 ASR sites to adapt to the new format. For consistency, all prior Broadcast News corpora (1996 - 1998 training, development test, and evaluation test material) have been reformatted into this new format. These transcripts are available in a single compendium from the LDC via LDC Order Number LDC98E10. A subset of the training and devtest files in the release have been IE-NE tagged. Documentation regarding the UTF format is included in the release. See the README file in the release for further details.
More information on annotation may be obtained from the 1997 Hub-4 annotation specification, which is available at ftp://jaguar.ncsl.nist.gov/csr96/h4/h4annot.ps. Note that a different annotation system has been implemented by the LDC for the second hundred hours of training material.
NIST will provide reference transcriptions and annotations for the evaluation test set after the recognition results have been submitted.
8. Evaluation Conditions
Participating sites are required to conduct a single evaluation over all of the evaluation data. For sites that do not wish to implement their own segmentation algorithms, NIST will supply speech segmentation information for the evaluation data using an automatic segmentation utility provided by Carnegie Mellon University. The latest version of the CMUseg Acoustic Segmentation Software is available from the NIST Speech Software Website. Sites may use the NIST-supplied segmentation information, or they may perform their own segmentation and classification. Any recognition approach is allowed, including running a decoder in unsupervised transcription mode. Any audio segment in the evaluation test data may be used to help decode any other segment of audio. (In other words, adaptation techniques may make use of audio across episode and show boundaries.)
9. Hub-4 Test Conditions (Hub and Spokes)
The Hub-4 Broadcast News Evaluation will include a "Hub" transcription task and two "Spoke" Tasks:
9.1 Transcription Hub
This task is similar to the 1997 Hub 4 English-language task; the primary evaluation metric will be word error.
9.2 "10X System" Spoke
This task involves submission of results for systems that run in less than or equal to 10X real time on a single processor (i.e., less than or equal to ~30 hours to process the ~3 hour evaluation test set). In the accompanying system description, system developers must document all computational resources used for the system, including processor type(s) and memory resources, and including discussion of processing time-allocation for the various signal-processing, segmentation, and decoding components of the system.
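The 10X requirement reduces to simple arithmetic on processing time versus audio duration. The sketch below illustrates the check; the function names are illustrative only and are not part of the evaluation tooling.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Real-time factor (xRT): processing time divided by audio duration."""
    return processing_seconds / audio_seconds

def meets_10x_limit(processing_seconds: float, audio_seconds: float) -> bool:
    """True if the run finished at or under 10 times real time."""
    return real_time_factor(processing_seconds, audio_seconds) <= 10.0

# Example: a ~3-hour test set processed in 28 hours runs at ~9.3xRT,
# which is within the 10X limit; 31 hours would not be.
```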
9.3 Information Extraction - Named Entity (IE-NE) Spoke
The Hub-4 Broadcast News Evaluation this year will include a new "Information Extraction" Spoke which will involve the implementation and evaluation of automatic Named Entity tagging as applied to the Hub-4 Reference and recognizer-produced transcriptions. This new spoke is based on the Message Understanding Conference (MUC) Named Entity task which involved the tagging of person, organization, location names and other entities in newswire text.
The Hub-4 IE-NE Spoke will explore the coupling of recognition and entity tagging technologies as an initial step toward creating new robust speech understanding technologies.
The spoke also supports the exploration of Named-Entity-based scoring as an alternative to traditional Word Error Rate (WER) based scoring in evaluating continuous speech recognition performance. Whereas WER scoring is useful in evaluating recognition for dictation-oriented applications, NE-based scoring should be useful in highlighting "content word" errors which are critical in information search, detection, and tracking applications.
9.3.1 Participation Levels:
The Hub-4 Information Extraction Spoke will include 3 levels of participation:
Full-IE-NE: (Evaluation conditions 4-6, see below)
Quasi-IE-NE: (Evaluation conditions 4-5, see below)
Baseline-IE-NE: (Evaluation conditions 1-3, see below).
9.3.2 Tagged Entities:
The Hub-4 98 Evaluation IE-NE Spoke will require identification of the following entities:
This is to be accomplished by applying automatic tagging software to the standard Hub-4 recognizer output files (the CTM-format files used in scoring speech recognition word error). The tagged output format is still being developed and will be specified at a later date. The automatically-generated IE-NE tagged transcripts will be scored against a set of hand-tagged reference transcripts using the SAIC Named Entity Scoring software based on the MUC-7 Named Entity Scorer. See below for details regarding obtaining the software.
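A CTM file carries one time-marked word per line. As a minimal sketch, assuming the common field layout (file, channel, begin time, duration, word, optional confidence) and `;;` comment lines, recognizer output could be parsed like this; the class and function names are illustrative:

```python
from typing import List, NamedTuple, Optional

class CtmToken(NamedTuple):
    file: str
    channel: str
    begin: float       # word start time in seconds
    duration: float    # word duration in seconds
    word: str
    confidence: Optional[float]

def parse_ctm(lines) -> List[CtmToken]:
    """Parse CTM-format recognizer output, skipping blanks and ';;' comments."""
    tokens = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        parts = line.split()
        conf = float(parts[5]) if len(parts) > 5 else None
        tokens.append(CtmToken(parts[0], parts[1], float(parts[2]),
                               float(parts[3]), parts[4], conf))
    return tokens
```

A downstream tagger would then operate on the `word` sequence while the time marks allow the tags to be aligned back against the audio.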
The Hub-4 IE-NE Task Definition Version 4.8 document has been derived from the MUC-7 Named Entity Task Definition and refined to accommodate the Hub-4 spoken broadcast news domain.
9.3.3 Data Sets:
IE-NE-tagged "Reference" transcripts and "Baseline" recognizer transcripts will be provided for each of the following data sets:
The IE tags for the reference transcripts for each data set will be generated by human annotators at MITRE and SAIC. For the evaluation test data, the reference transcripts will be redundantly annotated with IE tags by several annotators. Inter-annotator agreement will then be measured and used to determine the reference set of tags for scoring.
9.3.4 Evaluation Conditions:
The IE-NE Spoke has 6 conditions which will be evaluated: the first 3 using the BBN baseline NE tagger implemented at NIST, and the last 3 implemented by sites using their own taggers.
9.3.5 Participation Levels/Required Evaluation Conditions:
9.3.6 IE Scoring:
NIST will score the 6 possible transcript/tag combination conditions using the Named Entity Scorer developed by SAIC. See below for details regarding obtaining the software.
9.3.7 Software:
The initial release of the Hub-4 IE-NE Scoring Software (Version 0.6) which was developed by SAIC is now available. The latest publicly available version of the scoring software is also available from the NIST Speech Software Website.
9.3.8 IE-NE Evaluation Schedule
This schedule was revised on October 5 and pertains only to the Hub-4 Full-IE-NE and Quasi-IE-NE evaluation conditions.
November 2, 1998 - Deadline for site commitment to participate
November 23, 1998 - Evaluation test data to be at participating sites, test begins
December 14 (0700 EST) - Deadline for submission of all test results
December 21, 1998 - NIST releases scores
February 1999 - Workshop for Hub 4 participating sites
9.3.9 Contact:
NIST has worked closely with MITRE and SAIC, drawing on their experience in the MUC evaluations, to develop the specifications, corpora, and software for the IE-NE Spoke. However, so that a central flow of information may be maintained, please address any questions and copy all correspondence regarding the IE-NE Spoke to John Garofolo at NIST (jgarofolo@nist.gov). Questions and comments to be directed to all IE-NE participants may be sent to the Hub-4 IE-NE email list (hub4_ie_list@jaguar.ncsl.nist.gov). Please send email to John Garofolo if you'd like to be added to the list.
10. Scoring
Sites will generate decodings that include word time alignments. The same scoring algorithm (SCLITE) used for the 1997 Hub 4 evaluation will be used for the "Transcription Hub" and "10X Spoke" of this evaluation. Word error will be the primary metric. In addition, new complementary "Named Entity" metrics and scoring software will be employed for the "IE-NE Spoke". NIST will tabulate and report word error rates over the entire dataset.
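Word error is computed from a minimum-cost alignment of hypothesis words against reference words. The following is a minimal sketch of that core computation; the official SCLITE tool additionally handles time alignments, optionally deletable words, and orthographic mappings.

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words,
    computed via Levenshtein distance over whitespace-separated word tokens."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i                      # all deletions
    for j in range(len(h) + 1):
        d[0][j] = j                      # all insertions
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(r)][len(h)] / len(r)
```

Note that WER can exceed 100% when the hypothesis contains many insertions, which is why it is an error rate rather than an accuracy.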
NIST will also tabulate and report Word Error Rates for various subsets of test material to examine performance for different conditions. As in previous evaluations, these will include the effect of the following annotated phenomena on recognition:
Special attention will be given to the F0 condition. This condition is of particular interest because the absence of other complicating factors such as background noise, music and non-native dialects focuses attention on basic speech recognition issues common to all conditions.
Immediately after the evaluation, NIST will provide the complete annotation record for the evaluation test material, to facilitate the analysis of performance by individual sites.
Evaluating sites are encouraged to submit the output of their system for a portion of the development test to NIST prior to the formal evaluation, to verify that the system output is processed properly by the NIST scoring software.
The current version of the NIST SCLITE Speech Recognition Scoring Software is available from the NIST Speech Software Website. Revisions are made periodically and will be announced in email.
10.1 Orthographic Rules
System developers should familiarize themselves with the orthographic transformations and rules used in preparing both the reference transcriptions and system hypothesis transcriptions prior to official scoring by NIST so that they can obtain the most accurate scoring of their systems.
10.1.1 SNOR Format
The transcription format employed for scoring is called SNOR (Standard Normalized Orthographic Representation). The SNOR format is derived from the detailed transcription format used in Hub-4 via a filter which will be made publicly available in August. The SNOR format provides a common format for recognition output. In doing so, it removes lexical details which are not part of the current Hub-4 research focus (such as capitalization, punctuation, etc.) from the transcription format to define and simplify the recognition and scoring process. A SNOR-normalized transcription consists of text strings made up of ASCII characters and has the following constraints:
The human-generated reference transcripts are stored in their original detailed format and are translated into SNOR prior to scoring the output of an ASR system. It is important that these transformations are properly included in the design of the recognition systems, so that the system output may be scored optimally.
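As an illustration of the kind of normalization involved, the toy filter below uppercases text and strips punctuation other than apostrophes. These specific rules are assumptions for the example only; the authoritative constraints are those in the SNOR specification and the NIST-supplied filter.

```python
import re

def to_snor_like(text: str) -> str:
    """Toy SNOR-style normalizer (illustrative assumptions only):
    uppercase ASCII words, apostrophes kept for contractions,
    other punctuation dropped, whitespace collapsed to single spaces."""
    text = text.upper()
    text = re.sub(r"[^A-Z' ]+", " ", text)   # drop punctuation and digits
    return " ".join(text.split())
```

Running both reference and hypothesis text through the same normalizer ensures that scoring compares words rather than incidental formatting.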
10.1.2 Orthographic Normalization
After the reference transcripts are converted into SNOR, both the reference and ASR-produced transcripts will be transformed via a NIST-supplied transcript filter. The filter, specially updated for 1998, will be available on the 1998 Hub-4 Website in August. The orthographic map file which will be used in the official scoring will not be made available until after the Hub test results are received and scored by NIST. The version of the filter used in the 1997 Hub-4 evaluation, tranfilt-1.8.tar.Z, and the 1997 Hub-4 orthographic map file, en971128.glm, are currently available. The sections below on "Multiple Spellings" and "Contractions" describe the mapping process.
10.1.3 Notes on the Handling of Special Orthographic Conditions
This section describes orthographic conditions and speech phenomena which require special processing. Note that some of these conditions are scored as "optionally deletable". In these cases, an ASR system will not be penalized with an error for omitting the output of (or "deleting") the particular word in the system output. However, reference words marked as such will still count towards the total number of reference words during Word Error Rate computations.
11. Multiple Systems Running a Single Test
In order to discourage the running of several systems on a single test to improve one’s chances of scoring well, sites must designate one system as the primary system if more than one system is run on a single test. This designation must be made before the test is begun. Results must be reported for all systems run on any test.
12. System Descriptions
Sites are required to submit a standardized system description to NIST along with the results for each system run on any test. The format for these system descriptions is as follows:
SITE/SYSTEM NAME
HUB-4 {CORE/CONTRAST} TEST
1) PRIMARY TEST SYSTEM DESCRIPTION:
2) ACOUSTIC TRAINING:
3) GRAMMAR TRAINING:
4) RECOGNITION LEXICON DESCRIPTION:
5) DIFFERENCES FOR EACH CONTRASTIVE TEST:
6) NEW CONDITIONS FOR THIS EVALUATION:
7) REFERENCES:
Evaluating sites will be required to provide a written description at the Workshop of computational resource requirements including processor speed and storage requirements used to produce the evaluation results, and to publish information about the complexity of new algorithms.
13. Sharing of System Output
As in last year's Hub-4 evaluation, we encourage sites to permit the sharing of their system output files with other participants. Since this practice appeared to be broadly accepted last year, this year NIST will assume that participating sites are willing to allow NIST to make their output available to other participants for diagnostic and research purposes. Therefore, if you do NOT want to share your output, please notify David Pallett (dpallett@nist.gov) at NIST in email prior to submitting your results for scoring.
14. Site Commitments
Sites interested in participating in the 1998 Hub 4 evaluation should notify NIST no later than September 29, 1998. NIST will ensure that participating sites receive appropriate training and devtest material in a timely fashion after authorization to do so from the LDC. Sites must be members in good standing with the LDC or have made a test-only agreement with the LDC prior to being given access to the Hub-4 corpora.
Site commitments are used to control evaluation and to manage evaluation resources. It is imperative that sites honor their commitments in order for the evaluation to have beneficial impact. Sites must notify NIST as soon as possible, prior to the distribution of the evaluation data, if it appears that a commitment may not be honored. Defaulting on a commitment may jeopardize permission to participate, and to obtain early distributions of future test data, in subsequent evaluations.
15. Workshop
A workshop will be held in February 1999 for presenting evaluation results and discussing the technology used in the Hub 4 evaluation. Evaluation results will be reported by NIST, and invited and contributed presentations will be made by evaluation participants. Presentations and results at the Workshop will be published in a written publicly-available Proceedings. N.B. Participants will be required to deliver camera-ready copies of their papers (plus release approvals) at least one week prior to the workshop.
16. Schedule
Note that a different schedule applies to participants in the Full-IE-NE and Quasi-IE-NE evaluations. See Section 9.3.8 for the IE-NE Spoke schedule.
September 29, 1998 - Deadline for site commitment to participate
October 6, 1998 - Deadline for sites to submit Devtest results (optional)
October 13, 1998 - Evaluation test data to be at participating sites, test begins
November 10, 1998 (0700 EST) - Deadline for submission of hub primary test results
November 18, 1998 - NIST releases scores for the hub primary test results
November 20, 1998 (0700 EST) - Deadline for submission of spoke and hub contrast test results
November 25, 1998 - NIST releases scores for the spoke and hub contrast test results
February 1999 - Workshop for Hub 4 participating sites