Karen Sparck Jones, 19 May 2004

Summary of DUC evaluations so far [and some related information]
-----------------------------------------------------------------

Planning meeting for DUC 01, Nov 2000

[NAACL 2001 Workshop on Automatic Summarisation:
 11 research papers, mostly non-DUC participants]


DUC 01  Sept 2001
-----------------

intrinsic evaluation
generic summaries - extracts (or near-extracts)

Tasks:
   single documents - 100 words (call this Task 1)
   multiple documents - 50, 100, 200, 400 words (call these Tasks 2+)

Data:
   10 analysts x 1 set of documents x 6 types (eg event, biog)
   news material
   10 documents per set, 30 sets training, 30 sets test
   human summaries
      - single document: 100 word author perspective
      - multiple document: 400 word report,
        this reduced by hand to smaller sizes
   automatic baselines
      - single document: first 100 words
      - multiple document: a) first n words
                           b) first n sentences, to word limit

Evaluation:
   compare model (aka reference) human summary with
   peer (system/baseline/human) summary
   on peer quality, ie grammaticality, cohesion, organisation
   by peer coverage, recall, precision with respect to model
   using model units (human), peer units (auto), and the SEE program

Participation:
   14 participating teams (organisations)
   wide range of summarising strategies

Results: (broad brush summary)
   baselines <= systems < humans
   but large variations across documents/sets

Observations:
   problems with applying measures

Comment:
   Paul Over - `the systems are not producing junk'

[Colocated SIGIR Workshop on Automatic Summarising:
 6 research papers, 2 from non-DUC authors]


DUC 02  July 2002
-----------------

intrinsic evaluation
generic summaries - abstracts and extracts

Tasks:
   single documents - 100 words (call this Task 1)
   multiple documents - abstracts: 10, 50, 100, 200 words (call these Tasks 2A+)
                      - extracts: 200, 400 words (Tasks 2B+)

Data:
   analysts/assessors chose 15 sets of documents x 4 types (eg event, biog)
   news material
   10 documents per set, 60 sets test
   human summaries
      abstracts
      - single document: 100 word author perspective
      - multiple document: 200 word report,
        this reduced by hand to smaller sizes
      extracts
      - multiple document: 400 word, reduced by hand to smaller
   automatic baselines
      - single document: first 100 words
      - multiple document: a) first n words in most recent document
                           b) first sentence in documents in temporal order,
                              to word limit

Evaluation:
   compare model (aka reference) human summary with
   peer (system/baseline/human) summary
   on peer quality, ie grammaticality, cohesion, organisation
   by peer coverage, recall, precision with respect to model
   using model units (human), peer units (auto), and the SEE program
   (a schematic sketch of this unit-based scoring follows this section)

Participation:
   16 participating teams (organisations)
   wide range of summarising strategies

Results: (very broad brush summary)
   baselines <= systems < humans
   but large variations across documents/sets
   overall performance rather low

Observations:
   human assessment - not so many problems as DUC 01
   many model units not covered by systems

Comments:
   though 3-way split (single abstracts, multi abstracts, multi extracts),
   each a reasonable number of teams for comparisons
   newcomer participants focused on single document

[Colocated ACL Workshop on Automatic Summarisation:
 6 research papers, 4 involving non-DUC authors]
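The intrinsic comparison used in DUC 01 and DUC 02 comes down to overlap
between units of the model summary and units of the peer summary, as judged
by an assessor through the SEE interface. The following is a minimal sketch
of how unit-based coverage and precision figures could be derived once the
assessor's judgements are in hand; the unit sets and the function below are
illustrative assumptions only, not the actual SEE or NIST scoring code
(which, among other things, grades partial coverage rather than using binary
marks).

   # Hypothetical illustration of unit-based intrinsic scoring, not the SEE
   # tool itself: assessor judgements are assumed to arrive as plain sets of
   # unit identifiers.

   def unit_scores(model_units, peer_units, covered_model_units, used_peer_units):
       """Recall-style coverage over model units and precision over peer units.

       model_units / peer_units: all unit ids in the model / peer summary
       covered_model_units: model units the assessor judged expressed by the peer
       used_peer_units: peer units that contributed to some model unit
       """
       coverage = len(covered_model_units) / len(model_units) if model_units else 0.0
       precision = len(used_peer_units) / len(peer_units) if peer_units else 0.0
       return coverage, precision

   # Example: 10 model units, 8 peer units; 6 model units judged covered,
   # drawing on 5 of the peer units.
   print(unit_scores({f"m{i}" for i in range(10)}, {f"p{i}" for i in range(8)},
                     {"m0", "m1", "m2", "m3", "m5", "m7"},
                     {"p1", "p2", "p4", "p5", "p6"}))   # -> (0.6, 0.625)

On this view, "many model units not covered by systems" (the recurring
observation above) translates directly into low coverage values against the
single human model.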

DUC 03  May 2003
----------------

intrinsic, simulated extrinsic evaluation
generic summaries - abstracts or extracts

Tasks:
   single documents - very short extracts/abstracts (Task 1)
   multiple documents - short extracts/abstracts
      events (Task 2)
      viewpoints (Task 3)
      topics (Task 4)

Data:
   assessors chose 30 sets of documents x 3 types
      (TDT event, TREC, TREC novelty - novel sentences marked)
   news material
   human summaries
      abstracts
      - single document: very short TDT, viewpoint TREC
      - multiple document: short TDT, TREC, novelty
   manual baseline
      - single document (T1): document headline element
   automatic baselines
      - multiple document:
        T2, T3: a) first n words in most recent document
                b) first sentence in documents in temporal order, to word limit
        T4:     a) first 100 words / first relevant sentences of top-ranked document
                b) first relevant sentence in documents in rank order

Evaluation:
   intrinsic
      - compare model (aka reference) human summary with
        peer (system/baseline/human) summary
        on peer quality, ie grammaticality, cohesion, organisation
        by peer coverage, precision/recall with respect to model
        using model units (human), peer units (auto), and the SEE program
   extrinsic
      - usefulness (T1), responsiveness (T4)

Participation:
   21 participating teams (organisations)
   wide range of summarising strategies

Results: (very broad brush summary)
   baselines <= systems < humans, though variations across documents/sets
      and across tasks
   can usually distinguish between three system classes - top performers,
      middle performers (most), bottom performers - but not within classes
   same performance ordering (ie baselines <= systems < manual) on
      peer quality (T2-4), coverage (T1-4), usefulness (T1), responsiveness (T4)
   system summaries could rate reasonably on usefulness and responsiveness
      though deficient in coverage
   overall level of system performance not high (quite low)

Observations:
   still disagreement problems in human assessors
   many model units not covered by systems

Comments:
   though multiple tasks, each a reasonable number of teams for comparisons
   newcomer participants spread across tasks

[Colocated HLT-NAACL Workshop on Text Summarisation:
 10 research papers, 4 involving non-DUC authors]


DUC 04  May 2004
----------------

intrinsic evaluation
generic summaries, query oriented summaries

Tasks:
   single documents
      - very short English extracts/abstracts (Task 1)
      - very short English extracts/abstracts of Arabic documents (Task 3)
   multiple documents
      - short English extracts/abstracts on events (Task 2)
      - short English abstracts/extracts of Arabic documents, on events (Task 4)
      - short English abstracts/extracts focused by questions (Task 5)
   (note Tasks 3 and 4 used machine translated English output for the Arabic
    documents: participants were not asked to translate the Arabic documents,
    or to summarise in Arabic and then translate the Arabic summaries; manual
    document translations were comparison inputs)

Data:
   assessors chose 50 sets of documents (T1, T2), including 25 sets (T3, T4)
      (TDT event)
   assessors chose 50 sets of documents germane to a broad `Who?' question
      (T5) (TREC)
   news material
   human summaries
      abstracts - very short, short (T1-T4), short focused (T5)
   manual baseline
      - single document (T1): first m bytes
   automatic baselines
      T2: first n bytes in most recent document
      T3: first m bytes in best translation
      T4: first n bytes in best translation of most recent document
      T5: first n bytes of most recent document

Evaluation:
   intrinsic
      - compare model (aka reference) human summary with
        peer (system/baseline/human) summary
      - T1-T4 by ROUGE ngram matching (in various forms)
        (a simplified sketch of such ngram scoring follows this section)
      - T5 on peer quality, ie grammaticality, cohesion, organisation
        by peer coverage, precision/recall with respect to model
        using model units (human), peer units (auto), and the SEE program
   extrinsic
      - responsiveness (T5)
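The ROUGE scoring above reduces the peer/model comparison to surface ngram
overlap. The following is a minimal sketch of two such scores, assuming
simple whitespace tokenisation and a single model summary; it illustrates the
idea only and is not the official ROUGE package (which adds stemming and
stopword options, jackknifing over multiple models, and bootstrap confidence
intervals).

   # Simplified illustration of two ROUGE-style scores, not the official
   # ROUGE toolkit used at DUC 04.

   from collections import Counter

   def rouge_1_recall(peer: str, model: str) -> float:
       """Unigram recall: fraction of model-summary words matched in the peer."""
       peer_counts = Counter(peer.lower().split())
       model_counts = Counter(model.lower().split())
       overlap = sum(min(c, peer_counts[w]) for w, c in model_counts.items())
       total = sum(model_counts.values())
       return overlap / total if total else 0.0

   def lcs_length(a: list, b: list) -> int:
       """Length of the longest common subsequence of two token lists."""
       prev = [0] * (len(b) + 1)
       for x in a:
           cur = [0]
           for j, y in enumerate(b, 1):
               cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
           prev = cur
       return prev[-1]

   def rouge_l_recall(peer: str, model: str) -> float:
       """Recall based on the longest common subsequence variant of ROUGE."""
       p, m = peer.lower().split(), model.lower().split()
       return lcs_length(p, m) / len(m) if m else 0.0

   model = "the storm forced thousands of residents to evacuate the coast"
   peer = "thousands of coastal residents were forced to evacuate before the storm"
   print(rouge_1_recall(peer, model), rouge_l_recall(peer, model))

The Observations below contrast these variants: the unigram version is the
least demanding, with the longest common subsequence version tracking it.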

Participation:
   25 participating teams (organisations/distinct groups within organisations)
   wide range of strategies, with some developments for T5

Results: (broad brush summary)
   baselines <= systems < humans (across all tasks)
   many systems indistinguishable, differences only between extremes
   no one team consistently superior across tasks
   same overall picture across performance measure types, ie
      intrinsic a) by quality and coverage, b) by ROUGE;
      extrinsic by responsiveness
   participants able to tackle Arabic alright (~10 teams)
   machine translation for Arabic downgrades performance (difference per
      system between MT and manual translations not significant, but pattern
      of differences across systems unlikely due to chance)
   scores on all measures rather low, with best on least demanding version
      of ROUGE / coverage, then on quality, then on responsiveness

Observations:
   quality questions worked decently, and useful
   ROUGE variations clearly showed unigram test least demanding, but tracked
      by `longest common subsequence' version; however ROUGE is a very coarse
      method of summary evaluation
   T5 more substantial as purpose-oriented summary than previous forms, but
      also difficult; responsiveness assessment worked alright

Comments:
   slow increase in number of participants from DUC 01-04 encouraging, also
      in number of teams undertaking each task
   low scores on intrinsic evaluations discouraging, but suggest part of the
      problem is limitations of human summary comparison as evaluation mode,
      though the other part is limitations of extractive-style automatic
      summarising (coverage scores against a single human model are naturally
      low, but there is still room for improvement against this ceiling)
   extrinsic purpose-oriented task evaluation is coming along, but needs more
      development
   extractive-style summarising has many weaknesses, so better extrinsic
      evaluations are needed both to determine its value in real contexts and
      to promote work on non-extractive methods
   news material is getting boring

[no colocated summarising workshop; no independent team summarising papers
 at the HLT-NAACL conference]

Note: Working Group on Road Map for future DUCs presented initial proposals
      for DUC 05-07 at DUC 04