Karen Sparck Jones, 19 May 2004

Summary of DUC evaluations so far [and some related information]
-----------------------------------------------------------------

Planning meeting for DUC 01, Nov 2000

[NAACL 2001 Workshop on Automatic Summarisation:
 11 research papers, mostly non-DUC participants]


DUC 01  Sept 2001
-----------------

intrinsic evaluation
generic summaries - extracts (or near-extracts)

Tasks:
   single documents - 100 words (call this Task 1)
   multiple documents - 50, 100, 200, 400 words (call these Tasks 2+)

Data:
   10 analysts x 1 set of documents x 6 types (eg event, biog)
   news material
   10 documents per set, 30 sets training, 30 sets test
   human summaries
      - single document: 100 word author perspective
      - multiple document: 400 word report,
        this reduced by hand to smaller sizes
   automatic baselines
      - single document: first 100 words
      - multiple document: a) first n words
                           b) first n sentences, to word limit

Evaluation:
   compare model (aka reference) human summary with
   peer (system/baseline/human) summary
   on peer quality, ie grammaticality, cohesion, organisation
   by peer coverage, recall, precision with respect to model
   using model units (human), peer units (auto), and the SEE program

Participation:
   14 participating teams (organisations)
   wide range of summarising strategies

Results: (broad brush summary)
   baselines <= systems < humans
   but large variations across documents/sets

Observations:
   problems with applying measures

Comment:
   Paul Over - `the systems are not producing junk'

[Colocated SIGIR Workshop on Automatic Summarising:
 6 research papers, 2 from non-DUC authors]


DUC 02  July 2002
-----------------

intrinsic evaluation
generic summaries - abstracts and extracts

Tasks:
   single documents - 100 words (call this Task 1)
   multiple documents - abstracts: 10, 50, 100, 200 words (call these Tasks 2A+)
                      - extracts: 200, 400 words (Tasks 2B+)

Data:
   analysts/assessors chose 15 sets of documents x 4 types (eg event, biog)
   news material
   10 documents per set, 60 sets test
   human summaries
      abstracts
      - single document: 100 word author perspective
      - multiple document: 200 word report,
        this reduced by hand to smaller sizes
      extracts
      - multiple document: 400 word, reduced by hand to smaller
   automatic baselines
      - single document: first 100 words
      - multiple document: a) first n words in most recent document
                           b) first sentence in documents in temporal order,
                              to word limit

Evaluation:
   compare model (aka reference) human summary with
   peer (system/baseline/human) summary
   on peer quality, ie grammaticality, cohesion, organisation
   by peer coverage, recall, precision with respect to model
   using model units (human), peer units (auto), and the SEE program
   (a schematic sketch of this unit-based scoring follows this section)

Participation:
   16 participating teams (organisations)
   wide range of summarising strategies

Results: (very broad brush summary)
   baselines <= systems < humans
   but large variations across documents/sets
   overall performance rather low

Observations:
   human assessment - not so many problems as DUC 01
   many model units not covered by systems

Comments:
   though 3-way split (single abstracts, multi abstracts, multi extracts),
   each a reasonable number of teams for comparisons
   newcomer participants focused on single document

[Colocated ACL Workshop on Automatic Summarisation:
 6 research papers, 4 involving non-DUC authors]
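The intrinsic comparison used in DUC 01 and DUC 02 comes down to overlap
between units of the model summary and units of the peer summary, as judged
by an assessor through the SEE interface. The following is a minimal sketch
of how unit-based coverage and precision figures could be derived once the
assessor's judgements are in hand; the unit sets and the function below are
illustrative assumptions only, not the actual SEE or NIST scoring code
(which, among other things, grades partial coverage rather than using binary
marks).

   # Hypothetical illustration of unit-based intrinsic scoring, not the SEE
   # tool itself: assessor judgements are assumed to arrive as plain sets of
   # unit identifiers.

   def unit_scores(model_units, peer_units, covered_model_units, used_peer_units):
       """Recall-style coverage over model units and precision over peer units.

       model_units / peer_units: all unit ids in the model / peer summary
       covered_model_units: model units the assessor judged expressed by the peer
       used_peer_units: peer units that contributed to some model unit
       """
       coverage = len(covered_model_units) / len(model_units) if model_units else 0.0
       precision = len(used_peer_units) / len(peer_units) if peer_units else 0.0
       return coverage, precision

   # Example: 10 model units, 8 peer units; 6 model units judged covered,
   # drawing on 5 of the peer units.
   print(unit_scores({f"m{i}" for i in range(10)}, {f"p{i}" for i in range(8)},
                     {"m0", "m1", "m2", "m3", "m5", "m7"},
                     {"p1", "p2", "p4", "p5", "p6"}))   # -> (0.6, 0.625)

On this view, "many model units not covered by systems" (the recurring
observation above) translates directly into low coverage values against the
single human model.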

DUC 03  May 2003
----------------

intrinsic, simulated extrinsic evaluation
generic summaries - abstracts or extracts

Tasks:
   single documents - very short extracts/abstracts (Task 1)
   multiple documents - short extracts/abstracts
      events (Task 2)
      viewpoints (Task 3)
      topics (Task 4)

Data:
   assessors chose 30 sets of documents x 3 types
      (TDT event, TREC, TREC novelty - novel sentences marked)
   news material
   human summaries
      abstracts
      - single document: very short TDT, viewpoint TREC
      - multiple document: short TDT, TREC, novelty
   manual baseline
      - single document (T1): document headline element
   automatic baselines
      - multiple document:
        T2, T3: a) first n words in most recent document
                b) first sentence in documents in temporal order, to word limit
        T4:     a) first 100 words / first relevant sentences of top-ranked document
                b) first relevant sentence in documents in rank order

Evaluation:
   intrinsic
      - compare model (aka reference) human summary with
        peer (system/baseline/human) summary
        on peer quality, ie grammaticality, cohesion, organisation
        by peer coverage, precision/recall with respect to model
        using model units (human), peer units (auto), and the SEE program
   extrinsic
      - usefulness (T1), responsiveness (T4)

Participation:
   21 participating teams (organisations)
   wide range of summarising strategies

Results: (very broad brush summary)
   baselines <= systems < humans, though variations across documents/sets
      and across tasks
   can usually distinguish between three system classes - top performers,
      middle performers (most), bottom performers - but not within classes
   same performance ordering (ie baselines <= systems < manual) on
      peer quality (T2-4), coverage (T1-4), usefulness (T1), responsiveness (T4)
   system summaries could rate reasonably on usefulness and responsiveness
      though deficient in coverage
   overall level of system performance not high (quite low)

Observations:
   still disagreement problems in human assessors
   many model units not covered by systems

Comments:
   though multiple tasks, each a reasonable number of teams for comparisons
   newcomer participants spread across tasks

[Colocated HLT-NAACL Workshop on Text Summarisation:
 10 research papers, 4 involving non-DUC authors]


DUC 04  May 2004
----------------

intrinsic evaluation
generic summaries, query oriented summaries

Tasks:
   single documents
      - very short English extracts/abstracts (Task 1)
      - very short English extracts/abstracts of Arabic documents (Task 3)
   multiple documents
      - short English extracts/abstracts on events (Task 2)
      - short English abstracts/extracts of Arabic documents, on events (Task 4)
      - short English abstracts/extracts focused by questions (Task 5)
   (note Tasks 3 and 4 used machine translated English output for the Arabic
    documents: participants were not asked to translate the Arabic documents,
    or to summarise in Arabic and then translate the Arabic summaries; manual
    document translations were comparison inputs)

Data:
   assessors chose 50 sets of documents (T1, T2), including 25 sets (T3, T4)
      (TDT event)
   assessors chose 50 sets of documents germane to a broad `Who?' question
      (T5) (TREC)
   news material
   human summaries
      abstracts - very short, short (T1-T4), short focused (T5)
   manual baseline
      - single document (T1): first m bytes
   automatic baselines
      T2: first n bytes in most recent document
      T3: first m bytes in best translation
      T4: first n bytes in best translation of most recent document
      T5: first n bytes of most recent document

Evaluation:
   intrinsic
      - compare model (aka reference) human summary with
        peer (system/baseline/human) summary
      - T1-T4 by ROUGE ngram matching (in various forms)
        (a simplified sketch of such ngram scoring follows this section)
      - T5 on peer quality, ie grammaticality, cohesion, organisation
        by peer coverage, precision/recall with respect to model
        using model units (human), peer units (auto), and the SEE program
   extrinsic
      - responsiveness (T5)
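The ROUGE scoring above reduces the peer/model comparison to surface ngram
overlap. The following is a minimal sketch of two such scores, assuming
simple whitespace tokenisation and a single model summary; it illustrates the
idea only and is not the official ROUGE package (which adds stemming and
stopword options, jackknifing over multiple models, and bootstrap confidence
intervals).

   # Simplified illustration of two ROUGE-style scores, not the official
   # ROUGE toolkit used at DUC 04.

   from collections import Counter

   def rouge_1_recall(peer: str, model: str) -> float:
       """Unigram recall: fraction of model-summary words matched in the peer."""
       peer_counts = Counter(peer.lower().split())
       model_counts = Counter(model.lower().split())
       overlap = sum(min(c, peer_counts[w]) for w, c in model_counts.items())
       total = sum(model_counts.values())
       return overlap / total if total else 0.0

   def lcs_length(a: list, b: list) -> int:
       """Length of the longest common subsequence of two token lists."""
       prev = [0] * (len(b) + 1)
       for x in a:
           cur = [0]
           for j, y in enumerate(b, 1):
               cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
           prev = cur
       return prev[-1]

   def rouge_l_recall(peer: str, model: str) -> float:
       """Recall based on the longest common subsequence variant of ROUGE."""
       p, m = peer.lower().split(), model.lower().split()
       return lcs_length(p, m) / len(m) if m else 0.0

   model = "the storm forced thousands of residents to evacuate the coast"
   peer = "thousands of coastal residents were forced to evacuate before the storm"
   print(rouge_1_recall(peer, model), rouge_l_recall(peer, model))

The Observations below contrast these variants: the unigram version is the
least demanding, with the longest common subsequence version tracking it.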

Participation:
   25 participating teams (organisations/distinct groups within organisations)
   wide range of strategies, with some developments for T5

Results: (broad brush summary)
   baselines <= systems < humans (across all tasks)
   many systems indistinguishable, differences only between extremes
   no one team consistently superior across tasks
   same overall picture across performance measure types, ie
      intrinsic a) by quality and coverage, b) by ROUGE;
      extrinsic by responsiveness
   participants able to tackle Arabic alright (~10 teams)
   machine translation for Arabic downgrades performance (difference per
      system between MT and manual translations not significant, but pattern
      of differences across systems unlikely due to chance)
   scores on all measures rather low, with best on least demanding version
      of ROUGE / coverage, then on quality, then on responsiveness

Observations:
   quality questions worked decently, and useful
   ROUGE variations clearly showed unigram test least demanding, but tracked
      by `longest common subsequence' version; however ROUGE is a very coarse
      method of summary evaluation
   T5 more substantial as purpose-oriented summary than previous forms, but
      also difficult; responsiveness assessment worked alright

Comments:
   slow increase in number of participants from DUC 01-04 encouraging, also
      in number of teams undertaking each task
   low scores on intrinsic evaluations discouraging, but suggest part of the
      problem is limitations of human summary comparison as evaluation mode,
      though the other part is limitations of extractive-style automatic
      summarising (coverage scores against a single human model are naturally
      low, but there is still room for improvement against this ceiling)
   extrinsic purpose-oriented task evaluation is coming along, but needs more
      development
   extractive-style summarising has many weaknesses, so better extrinsic
      evaluations are needed both to determine its value in real contexts and
      to promote work on non-extractive methods
   news material is getting boring

[no colocated summarising workshop; no independent team summarising papers
 at the HLT-NAACL conference]

Note: Working Group on Road Map for future DUCs presented initial proposals
      for DUC 05-07 at DUC 04