a particular query should match was split across two document lines, then the match would fail. After the conference, we found a relatively easy way to rectify this design deficiency and were able to run tests a33 through a36, and r6 and r7. Unfortunately, this change greatly affected the system's response to the query and negation thresholds, and we were not able to run enough tests to find the optimum values for these parameters. The end result is that the best results we have for this improved system are still not as good as the official results we reported.

* Another serious problem with the original system came to light during the official conference. As Donna Harman pointed out in her closing talk, no one has yet really explored the full ramifications of changes in term weighting strategies. Our original system used a very simple-minded linear-falloff weighting scheme. Our assumption was that concept strings appeared in reverse order of importance. However, we began to suspect, based on different concepts that were mentioned in a number of the conference presentations, that this was overly simplistic. We decided to implement a straightforward exponential-decay weighting scheme. In this approach, the first query string gets a weight of 100, the second gets a weight of, say, 90, the third a weight of 81, the fourth a weight of 72, and so on, with each succeeding weight taking 90% of its predecessor's value. Unfortunately, we did not have time to tune the system's response to this change either, and its results (a34, a35, a36, r6, and r7) are worse than the official results as well. However, it appears that there is plenty of room for experimentation with different term weighting schemes, and we will continue working with them.
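The paper gives only the numbers, so the following is a minimal sketch of the two weighting schemes in Python. The exact linear-falloff formula is our assumption, since the text does not spell it out, and the exponential scheme yields 72.9 where the text rounds to 72.

```python
def linear_falloff_weights(query_strings, first_weight=100.0):
    """Original scheme (formula assumed): weights decrease linearly
    with position, so the last string still gets a nonzero weight."""
    n = len(query_strings)
    return {s: first_weight * (n - i) / n
            for i, s in enumerate(query_strings)}

def exponential_decay_weights(query_strings, first_weight=100.0, decay=0.9):
    """Revised scheme: each weight is 90% of its predecessor's,
    giving 100, 90, 81, 72.9, ... for decay=0.9."""
    return {s: first_weight * decay ** i
            for i, s in enumerate(query_strings)}
```

Tuning then amounts to varying the decay factor (or the slope of the linear scheme) and re-running the retrieval experiments.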
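Likewise, the paper does not describe how the line-boundary deficiency above was repaired; one simple approach, sketched below under that assumption, is to collapse line breaks to spaces before matching.

```python
import re

def normalize_line_breaks(document_text):
    """Collapse newlines and surrounding whitespace to single spaces
    so a query string split across two document lines can still match."""
    return re.sub(r"\s*\n\s*", " ", document_text)

# A query string straddling a line break fails before normalization:
doc = "new rules on air\npollution were announced"
assert "air pollution" not in doc
assert "air pollution" in normalize_line_breaks(doc)
```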
* In our early work with the system before sending in the official results, we did a fair amount of testing using just the Associated Press document set. We were perhaps misled by how well the system did on these documents, and missed some chances to improve the system earlier. The test labelled a35-AP shows what the results look like for a35 if we restrict the new system to just returning AP documents, and restrict the relevance judgements to just AP documents. Even with this imperfectly tuned new version, we see that the system is capable of significantly better performance. It is unclear why there should be such variation between the retrievability of the AP documents and the other document collections. At its best, our system performed as well as most of the systems that participated in TREC-1. However, there is ample room for improvement, as we have noted above, especially in comparison to many of the systems that came back for TREC-2.

5.0 Further Research

The TREC-2 task is the first real application for our N-gram-based multiple-query system. As in any experiment of this nature, the results and problems suggest many more possible avenues of research. These ideas fall into two categories.

5.1 Analyzing the Current System's Performance

Further analysis of the existing system will allow us to better understand its behavior and limitations. Some ways to do that include:

* It is likely that generating query strings from the topic concept strings may have significantly limited performance. For example, Topic 74, about instances where the U.S. government propounds conflicting policies, completely failed to mention terms such as policy or regulation in the concept list. Thus, our system had only a very small chance of finding matching documents. Zimmerman's filtering system [4] did well with handcrafted queries, so we should also try manually generated queries.

* Currently the system has a hard-coded cutoff threshold of 40 for the weighted aggregate score. The purpose of the threshold was to prevent the system from returning results that were guaranteed to be noise because of their very low score. This value was set more or less arbitrarily, so we should experiment with changing this threshold to determine its true effect. In all likelihood, it could be a fair amount higher, preventing the system from generating other useless low-scoring results.

* Currently the system sets a cap of three times the maximum N-gram score for any query string score. Again, this value was determined only by a very rough empirical process, so we should experiment with changing this cap to see how much impact it has. (A sketch treating both this cap and the cutoff threshold as tunable parameters appears after Section 5.2.)

5.2 Extending the System

We can also make some significant changes to the system to explore possibilities for other performance improvements.

* Currently the system treats upper and lower case alike for both documents and queries. Since acronyms and brand names sometimes have different meanings from uncapitalized words having the same letters, perhaps there is a way to take the case of letters into account when computing a match. That is, we could count a match in which the letter case also agrees more heavily than one that matches only when case is ignored, as sketched below.
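The case-sensitive matching idea is only a suggestion in the paper, so the following sketch, and in particular the bonus factor for exact-case matches, is purely illustrative.

```python
def ngrams(s, n=3):
    """All character N-grams of s (here, trigrams by default)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def case_aware_overlap(query, text, n=3, exact_case_bonus=2.0):
    """Count N-grams shared case-insensitively, weighting those that
    also agree in case more heavily. The bonus factor is illustrative."""
    text_grams = set(ngrams(text, n))
    text_grams_folded = {g.lower() for g in text_grams}
    score = 0.0
    for g in ngrams(query, n):
        if g in text_grams:                   # match with case intact
            score += exact_case_bonus
        elif g.lower() in text_grams_folded:  # match only when case-folded
            score += 1.0
    return score

print(case_aware_overlap("AIDS", "the AIDS epidemic"))  # 4.0: case agrees
print(case_aware_overlap("AIDS", "she aids the team"))  # 2.0: case-folded only
```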
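Finally, here is the promised sketch of the two hard-coded parameters from Section 5.1 recast as tunable arguments. The use of a simple sum for the weighted aggregate is our assumption, since the paper does not specify how query string scores are combined.

```python
def postprocess_scores(query_string_scores, max_ngram_score,
                       cap_factor=3.0, cutoff=40.0):
    """Cap each query string score at cap_factor times the maximum
    N-gram score, then sum them (aggregation method assumed) and drop
    the document if the aggregate falls below the cutoff threshold."""
    cap = cap_factor * max_ngram_score
    aggregate = sum(min(score, cap) for score in query_string_scores)
    return aggregate if aggregate >= cutoff else None
```

Sweeping cap_factor and cutoff over a grid of values would show how sensitive the rankings are to either heuristic.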