a particular query should match was split across two document lines, then the match would fail. After the conference, we found a relatively easy way to rectify this design deficiency and were able to run tests a33 through a36, and r6 and r7. Unfortunately, this change greatly affected the system's response to the query and negation thresholds, and we were not able to run enough tests to find the optimum values for these parameters. The end result is that the best results we have for this improved system are still not as good as the official results we reported.

* Another serious problem with the original system came to light during the official conference. As Donna Harman pointed out in her closing talk, no one has yet really explored the full ramifications of changes in term weighting strategies. Our original system used a very simple-minded linear-falloff weighting scheme. Our assumption was that concept strings appeared in reverse order of importance. However, we began to suspect, based on different concepts that were mentioned in a number of the conference presentations, that this was overly simplistic. We decided to implement a straightforward exponential-decay weighting scheme. In this approach, the first query string gets a weight of 100, the second gets a weight of, say, 90, the third a weight of 81, the fourth a weight of 72, and so on, with each succeeding weight taking 90% of its predecessor's value. Unfortunately, we did not have time to tune the system's response to this change either, and its results (a34, a35, a36, r6, and r7) are worse than the official results as well. However, it appears that there is plenty of room for experimentation with different term weighting schemes, and we will continue working with them.
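The paper gives only the numbers, so the following is a minimal sketch of the two weighting schemes in Python. The exact linear-falloff formula is our assumption, since the text does not spell it out, and the exponential scheme yields 72.9 where the text rounds to 72.

```python
def linear_falloff_weights(query_strings, first_weight=100.0):
    """Original scheme (formula assumed): weights decrease linearly
    with position, so the last string still gets a nonzero weight."""
    n = len(query_strings)
    return {s: first_weight * (n - i) / n
            for i, s in enumerate(query_strings)}

def exponential_decay_weights(query_strings, first_weight=100.0, decay=0.9):
    """Revised scheme: each weight is 90% of its predecessor's,
    giving 100, 90, 81, 72.9, ... for decay=0.9."""
    return {s: first_weight * decay ** i
            for i, s in enumerate(query_strings)}
```

Tuning then amounts to varying the decay factor (or the slope of the linear scheme) and re-running the retrieval experiments.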
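Likewise, the paper does not describe how the line-boundary deficiency above was repaired; one simple approach, sketched below under that assumption, is to collapse line breaks to spaces before matching.

```python
import re

def normalize_line_breaks(document_text):
    """Collapse newlines and surrounding whitespace to single spaces
    so a query string split across two document lines can still match."""
    return re.sub(r"\s*\n\s*", " ", document_text)

# A query string straddling a line break fails before normalization:
doc = "new rules on air\npollution were announced"
assert "air pollution" not in doc
assert "air pollution" in normalize_line_breaks(doc)
```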
* In our early work with the system before sending in the official results, we did a fair amount of testing using just the Associated Press document set. We were perhaps misled by how well the system did on these documents, and missed some chances to improve the system earlier. The test labelled a35-AP shows what the results look like for a35 if we restrict the new system to just returning AP documents, and restrict the relevance judgements to just AP documents. Even with this imperfectly tuned new version, we see that the system is capable of significantly better performance. It is unclear why there should be such variation between the retrievability of the AP documents and the other document collections. At its best, our system performed as well as most of the systems that participated in TREC-1. However, there is ample room for improvement, as we have noted above, especially in comparison to many of the systems that came back for TREC-2.

5.0 Further Research

The TREC-2 task is the first real application for our N-gram-based multiple-query system. As in any experiment of this nature, the results and problems suggest many more possible avenues of research. These ideas fall into two categories.

5.1 Analyzing the Current System's Performance

Further analysis of the existing system will allow us to better understand its behavior and limitations. Some ways to do that include:

* It is likely that generating query strings from the topic concept strings may have significantly limited performance. For example, Topic 74, about instances where the U.S. government propounds conflicting policies, completely failed to mention terms such as policy or regulation in the concept list. Thus, our system had only a very small chance of finding matching documents. Zimmerman's filtering system [4] did well with handcrafted queries, so we should also try manually generated queries.

* Currently the system has a hard-coded cutoff threshold of 40 for the weighted aggregate score. The purpose of the threshold was to prevent the system from returning results that were guaranteed to be noise because of their very low score. This value was set more or less arbitrarily, so we should experiment with changing this threshold to determine its true effect. In all likelihood, it could be a fair amount higher, preventing the system from generating other useless low-scoring results.

* Currently the system sets a cap of three times the maximum N-gram score for any query string score. Again, this value was determined only by a very rough empirical process, so we should experiment with changing this cap to see how much impact it has. (A sketch treating both this cap and the cutoff threshold as tunable parameters appears after Section 5.2.)

5.2 Extending the System

We can also make some significant changes to the system to explore possibilities for other performance improvements.

* Currently the system treats upper and lower case alike for both documents and queries. Since acronyms and brand names sometimes have different meanings from uncapitalized words having the same letters, perhaps there is a way to take the case of letters into account when computing a match. That is, we could count a match in which the letter case also agrees more heavily than one that matches only when case is ignored, as sketched below.
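The case-sensitive matching idea is only a suggestion in the paper, so the following sketch, and in particular the bonus factor for exact-case matches, is purely illustrative.

```python
def ngrams(s, n=3):
    """All character N-grams of s (here, trigrams by default)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def case_aware_overlap(query, text, n=3, exact_case_bonus=2.0):
    """Count N-grams shared case-insensitively, weighting those that
    also agree in case more heavily. The bonus factor is illustrative."""
    text_grams = set(ngrams(text, n))
    text_grams_folded = {g.lower() for g in text_grams}
    score = 0.0
    for g in ngrams(query, n):
        if g in text_grams:                   # match with case intact
            score += exact_case_bonus
        elif g.lower() in text_grams_folded:  # match only when case-folded
            score += 1.0
    return score

print(case_aware_overlap("AIDS", "the AIDS epidemic"))  # 4.0: case agrees
print(case_aware_overlap("AIDS", "she aids the team"))  # 2.0: case-folded only
```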
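Finally, here is the promised sketch of the two hard-coded parameters from Section 5.1 recast as tunable arguments. The use of a simple sum for the weighted aggregate is our assumption, since the paper does not specify how query string scores are combined.

```python
def postprocess_scores(query_string_scores, max_ngram_score,
                       cap_factor=3.0, cutoff=40.0):
    """Cap each query string score at cap_factor times the maximum
    N-gram score, then sum them (aggregation method assumed) and drop
    the document if the aggregate falls below the cutoff threshold."""
    cap = cap_factor * max_ngram_score
    aggregate = sum(min(score, cap) for score in query_string_scores)
    return aggregate if aggregate >= cutoff else None
```

Sweeping cap_factor and cutoff over a grid of values would show how sensitive the rankings are to either heuristic.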