Dragon report describing the production of the ASR transcripts.
---------------------------------------------------------------

We put the files into separate subdirectories (named by the first two
digits of the conversation number) because some users (and shells) are
bothered by 5000 files in a single directory.

At NIST's request, Dragon undertook the automatic recognition of
release 2 of the Switchboard I corpus. The object of the project was
to evaluate techniques of speaker identification (SID) based on
speaker-specific language modeling, and to compare results based on
true transcripts to results based on actual ASR output. This effort
was bound to lead to better recognition results than a fair test
would, because Dragon's Switchboard models were trained on many of the
conversations which were recognized. Nevertheless, it was judged that
the output would still be useful for evaluating this SID technique.

Release 2 of the Switchboard corpus consists of 2435 conversations; we
also included 3 release 1 conversations inadvertently omitted by LDC
from the release 2 CDs, for 2438 conversations (4876 sides) in all. The
corpus comprises about 3 million words of conversation and amounts to
263 hours of speech after automatic inter-turn silence removal.

We recognized the corpus with Dragon's 1998 Switchboard evaluation
system (for a description see Peskin et al., "Improvements in
recognition of conversational telephone speech", Proc. ICASSP-99,
Phoenix, 1999). In order to complete the job in a reasonable amount of
time, we skipped the time-consuming second round of jackknifing
adaptation. Running the initial round of speaker-independent
recognition followed by the final pass of MLLR adaptation and
recognition, the system ran at about 70 x RT on a mixture of PII and
PIII machines; at that rate the 263 hours of speech came to roughly
18,400 processor-hours, or 2.1 cpu-years of computation.

The overall word error rate was 20.8%, roughly ten points (absolute)
better than the performance of the system on a fair test. See below
for a discussion of the effect of training the acoustic and language
models on the test data.

To facilitate the comparison of an SID system based on true text to
one based on ASR output, we compiled a table of word error rates by
conversation side, relative to the ISIP transcripts. The transcripts
were normalized by lowercasing and by removing word fragments and
non-speech events, while retaining words spoken over laughter (the
token '[laughter-delicacy]' became 'delicacy'). Each side's transcript
was then concatenated into a single utterance and scored against the
recognizer output, as sketched below. Our in-house scoring algorithm
uses the same insertion, deletion, and substitution penalties as
NIST's, but requires an exact match between reference and recognized
word to be scored as correct. It is thus somewhat harsher than NIST's,
which collapses certain word variants into a single token.
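For concreteness, the following is a minimal Python sketch of the
normalization and scoring just described; it is an illustration, not
our production code. The token conventions assumed here (word
fragments marked with a leading or trailing hyphen, non-speech events
enclosed in square brackets) are a simplification of the ISIP format,
and the scorer is a plain Levenshtein alignment with unit penalties
and exact token matching.

    import re

    def normalize(tokens):
        """Lowercase; drop fragments and non-speech events; keep words
        spoken over laughter ('[laughter-delicacy]' -> 'delicacy')."""
        out = []
        for tok in tokens:
            tok = tok.lower()
            m = re.fullmatch(r'\[laughter-(.+)\]', tok)
            if m:                                   # word over laughter
                out.append(m.group(1))
            elif tok.startswith('['):               # non-speech event
                continue
            elif tok.startswith('-') or tok.endswith('-'):  # fragment
                continue
            else:
                out.append(tok)
        return out

    def wer(ref, hyp):
        """Word error rate in percent: Levenshtein alignment with unit
        insertion, deletion, and substitution costs, and exact token
        matching only. Assumes a non-empty reference."""
        row = list(range(len(hyp) + 1))
        for i, r in enumerate(ref, 1):
            prev, row[0] = row[0], i
            for j, h in enumerate(hyp, 1):
                cur = min(row[j] + 1,               # deletion
                          row[j - 1] + 1,           # insertion
                          prev + (r != h))          # substitution/match
                prev, row[j] = row[j], cur
        return 100.0 * row[-1] / len(ref)

A whole side is then scored as wer(normalize(ref_tokens), hyp_tokens),
with the side's reference transcript concatenated into the single
token list ref_tokens.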
The resulting per-side error rates ranged from 3.5% to as high as
188% (!). We looked at a few of the very worst sides, and
[re]discovered some [probably well-known] errors in the acoustic data
and the ISIP transcriptions. For example,

  - 2223 and 2786 appear to have incorrect ISIP transcripts.
  - 4188b appears to have an incorrect ISIP transcript.
  - 3243: the wav data for the a and b sides appear to be identical.
  - 2674b: the wav data for the b side has both conversation halves at
    nearly equal amplitude.

Excluding these clearly suspect sides, error rates still ranged up to
95%. Some of the worst remaining results are traceable to failures in
our acoustic utterance chopper: while the chopper rejects most
crosstalk, it can be fooled by a strong signal from the other side of
the conversation (cf. sw02167).

Motivated by curiosity about the consequences of testing on training
data, we divided the 4876 Switchboard conversation sides into 10
categories. The first category is 'suspect', comprising sides with
obvious disabling errors in the transcript or audio data; nine sides
fell into this category. For each remaining side we considered three
levels of contamination, for both the acoustic and the language model:

  side    -- trained on this side,
  speaker -- trained on other side[s] by this speaker, but not this side,
  clean   -- did not train on any sides by this speaker.

The resulting 3x3 table yields the nine further categories, three of
which are empty (a code sketch of this bookkeeping appears at the end
of this report). In the following table, the AM and LM columns give
the nature of contamination of the acoustic and language models used
for recognition, WER is the word error rate in percent, nWords is the
total number of words in the reference ISIP transcripts, and nSides is
the number of conversation sides in the category.

  AM        LM           WER    nWords   nSides
  ----------------------------------------------
  ---- suspect ----     95.09      5516        9
  side      side        18.12   1972473     3095
  side      speaker     undef         0        0
  side      clean       undef         0        0
  speaker   side        24.15    198028      330
  speaker   speaker     32.98     32560       43
  speaker   clean       undef         0        0
  clean     side        24.88    778845     1288
  clean     speaker     31.69     68285      110
  clean     clean       44.57       258        1

Had we designed this experiment to examine the effect of training
contamination, we would be able to come to firmer conclusions. But
because we are working with "found" data, we are limited to comparing
the error rates of categories consisting of disjoint sets of
conversation sides; still, dozens of sides probably constitute a
reasonable sample.

  - Unsurprisingly, training both the LM and the AM on the side
    (side-side training) gives the best results.
  - Comparing the error rates for speaker-side to speaker-speaker, and
    clean-side to clean-speaker, training the LM on the side appears
    to give roughly 7-9 points of improvement over training it on
    other sides by the speaker.
  - The AM appears to benefit from training on the side, as compared
    to the speaker: side-side vs. speaker-side gives a six point
    improvement.
  - We don't have the data to say whether training the LM on the
    speaker yields an improvement over a clean LM: the comparison
    categories side-clean, speaker-clean, and clean-clean contain 0,
    0, and 1 side, respectively.
  - Comparing speaker-side to clean-side, and speaker-speaker to
    clean-speaker, it is unclear whether there is any advantage to
    training the AM on the speaker, as compared to a clean AM.
  - The only totally clean conversation side for Dragon in the entire
    Switchboard I corpus is sw03822a.
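To make the category definitions concrete, here is a minimal Python
sketch of the contamination bookkeeping referred to above. The data
structures are assumptions for illustration (a set of training sides
per model and a map from conversation side to speaker), not our actual
bookkeeping.

    def contamination(side, trained_sides, speaker_of):
        """Contamination of one model (AM or LM) for a test side:
        'side'    -- the model was trained on this very side,
        'speaker' -- trained on other sides by the same speaker,
        'clean'   -- trained on no sides by this speaker."""
        if side in trained_sides:
            return 'side'
        spk = speaker_of[side]
        if any(speaker_of[s] == spk for s in trained_sides):
            return 'speaker'
        return 'clean'

    def category(side, am_train, lm_train, speaker_of, suspect):
        """One of the ten categories in the table above: 'suspect', or
        an (AM, LM) contamination pair such as ('clean', 'side')."""
        if side in suspect:
            return 'suspect'
        return (contamination(side, am_train, speaker_of),
                contamination(side, lm_train, speaker_of))

With such (hypothetical) inputs, category('sw03822a', am_train,
lm_train, speaker_of, suspect) would return ('clean', 'clean'),
corresponding to the table's single clean-clean side.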