Dragon report describing the production of the ASR transcripts.
---------------------------------------------------------------

We put the files into separate subdirectories (named by the first two
digits of the conversation number) because some users (and shells) are
bothered by 5000 files in a single directory.

At NIST's request, Dragon undertook the automatic recognition of
release 2 of the Switchboard I corpus. The object of the project was
to evaluate techniques of speaker identification (SID) based on
speaker-specific language modeling, and to compare results based on
true transcripts to results based on actual ASR output. This effort
was bound to lead to better recognition results than a fair test
would, because Dragon's Switchboard models were trained on many of the
conversations which were recognized. Nevertheless, it was judged that
the output would still be useful for evaluating this SID technique.

Release 2 of the Switchboard corpus consists of 2435 conversations; we
also included 3 release 1 conversations inadvertently omitted by LDC
from the release 2 CDs, for 2438 conversations (4876 sides) in all. The
corpus comprises about 3 million words of conversation and amounts to
263 hours of speech after automatic inter-turn silence removal.

We recognized the corpus with Dragon's 1998 Switchboard evaluation
system (for a description see Peskin et al., "Improvements in
recognition of conversational telephone speech", Proc. ICASSP-99,
Phoenix, 1999). In order to complete the job in a reasonable amount of
time, we skipped the time-consuming second round of jackknifing
adaptation. Running the initial round of speaker-independent
recognition followed by the final pass of MLLR adaptation and
recognition, the system ran at about 70 x RT on a mixture of PII and
PIII machines; at that rate the 263 hours of speech came to roughly
18,400 processor-hours, or 2.1 cpu-years of computation.

The overall word error rate was 20.8%, roughly ten points (absolute)
better than the performance of the system on a fair test. See below
for a discussion of the effect of training the acoustic and language
models on the test data.

To facilitate the comparison of an SID system based on true text to
one based on ASR output, we compiled a table of word error rates by
conversation side, relative to the ISIP transcripts. The transcripts
were normalized by lowercasing and by removing word fragments and
non-speech events, while retaining words spoken over laughter (the
token '[laughter-delicacy]' became 'delicacy'). Each side's transcript
was then concatenated into a single utterance and scored against the
recognizer output, as sketched below. Our in-house scoring algorithm
uses the same insertion, deletion, and substitution penalties as
NIST's, but requires an exact match between reference and recognized
word to be scored as correct. It is thus somewhat harsher than NIST's,
which collapses certain word variants into a single token.
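For concreteness, the following is a minimal Python sketch of the
normalization and scoring just described; it is an illustration, not
our production code. The token conventions assumed here (word
fragments marked with a leading or trailing hyphen, non-speech events
enclosed in square brackets) are a simplification of the ISIP format,
and the scorer is a plain Levenshtein alignment with unit penalties
and exact token matching.

    import re

    def normalize(tokens):
        """Lowercase; drop fragments and non-speech events; keep words
        spoken over laughter ('[laughter-delicacy]' -> 'delicacy')."""
        out = []
        for tok in tokens:
            tok = tok.lower()
            m = re.fullmatch(r'\[laughter-(.+)\]', tok)
            if m:                                   # word over laughter
                out.append(m.group(1))
            elif tok.startswith('['):               # non-speech event
                continue
            elif tok.startswith('-') or tok.endswith('-'):  # fragment
                continue
            else:
                out.append(tok)
        return out

    def wer(ref, hyp):
        """Word error rate in percent: Levenshtein alignment with unit
        insertion, deletion, and substitution costs, and exact token
        matching only. Assumes a non-empty reference."""
        row = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, row[0] = row[0], i
            for j, h in enumerate(hyp, 1):
                cur = min(row[j] + 1,               # deletion
                          row[j - 1] + 1,           # insertion
                          prev + (r != h))          # substitution/match
                prev, row[j] = row[j], cur
        return 100.0 * row[-1] / len(ref)

A whole side is then scored as wer(normalize(ref_tokens), hyp_tokens),
with the side's reference transcript concatenated into the single
token list ref_tokens.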
The resulting per-side error rates ranged from 3.5% to as high as
188% (!). We looked at a few of the very worst sides, and
[re]discovered some [probably well-known] errors in the acoustic data
and the ISIP transcriptions. For example,

  - 2223 and 2786 appear to have incorrect ISIP transcripts.
  - 4188b appears to have an incorrect ISIP transcript.
  - 3243: the wav data for the a and b sides appear to be identical.
  - 2674b: the wav data for the b side has both conversation halves at
    nearly equal amplitude.

Excluding these clearly suspect sides, error rates still ranged up to
95%. Some of the worst remaining results are traceable to failures in
our acoustic utterance chopper: while the chopper rejects most
crosstalk, it can be fooled by a strong signal from the other side of
the conversation (cf. sw02167).

Motivated by curiosity about the consequences of testing on training
data, we divided the 4876 Switchboard conversation sides into 10
categories. The first category is 'suspect', comprising sides with
obvious disabling errors in the transcript or audio data; nine sides
fell into this category. For each remaining side we considered three
levels of contamination, for both the acoustic and the language model:

  side    -- trained on this side,
  speaker -- trained on other side[s] by this speaker, but not this side,
  clean   -- did not train on any sides by this speaker.

The resulting 3x3 table yields the nine further categories, three of
which are empty (a code sketch of this bookkeeping appears at the end
of this report). In the following table, the AM and LM columns give
the nature of contamination of the acoustic and language models used
for recognition, WER is the word error rate in percent, nWords is the
total number of words in the reference ISIP transcripts, and nSides is
the number of conversation sides in the category.

  AM        LM           WER    nWords   nSides
  ----------------------------------------------
  ---- suspect ----     95.09      5516        9
  side      side        18.12   1972473     3095
  side      speaker     undef         0        0
  side      clean       undef         0        0
  speaker   side        24.15    198028      330
  speaker   speaker     32.98     32560       43
  speaker   clean       undef         0        0
  clean     side        24.88    778845     1288
  clean     speaker     31.69     68285      110
  clean     clean       44.57       258        1

Had we designed this experiment to examine the effect of training
contamination, we would be able to come to firmer conclusions. But
because we are working with "found" data, we are limited to comparing
the error rates of categories consisting of disjoint sets of
conversation sides; still, dozens of sides probably constitute a
reasonable sample.

  - Unsurprisingly, training both the LM and the AM on the side
    (side-side training) gives the best results.
  - Comparing the error rates for speaker-side to speaker-speaker, and
    clean-side to clean-speaker, training the LM on the side appears
    to give roughly 7-9 points of improvement over training it on
    other sides by the speaker.
  - The AM appears to benefit from training on the side, as compared
    to the speaker: side-side vs. speaker-side gives a six point
    improvement.
  - We don't have the data to say whether training the LM on the
    speaker yields an improvement over a clean LM: the comparison
    categories side-clean, speaker-clean, and clean-clean contain 0,
    0, and 1 side, respectively.
  - Comparing speaker-side to clean-side, and speaker-speaker to
    clean-speaker, it is unclear whether there is any advantage to
    training the AM on the speaker, as compared to a clean AM.
  - The only totally clean conversation side for Dragon in the entire
    Switchboard I corpus is sw03822a.
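To make the category definitions concrete, here is a minimal Python
sketch of the contamination bookkeeping referred to above. The data
structures are assumptions for illustration (a set of training sides
per model and a map from conversation side to speaker), not our actual
bookkeeping.

    def contamination(side, trained_sides, speaker_of):
        """Contamination of one model (AM or LM) for a test side:
        'side'    -- the model was trained on this very side,
        'speaker' -- trained on other sides by the same speaker,
        'clean'   -- trained on no sides by this speaker."""
        if side in trained_sides:
            return 'side'
        spk = speaker_of[side]
        if any(speaker_of[s] == spk for s in trained_sides):
            return 'speaker'
        return 'clean'

    def category(side, am_train, lm_train, speaker_of, suspect):
        """One of the ten categories in the table above: 'suspect', or
        an (AM, LM) contamination pair such as ('clean', 'side')."""
        if side in suspect:
            return 'suspect'
        return (contamination(side, am_train, speaker_of),
                contamination(side, lm_train, speaker_of))

With such (hypothetical) inputs, category('sw03822a', am_train,
lm_train, speaker_of, suspect) would return ('clean', 'clean'),
corresponding to the table's single clean-clean side.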