NIST 2008 Machine Translation Evaluation - (Open MT-08)

Official Evaluation Results

Date of release: Fri Jun 06, 2008

Version: mt08_official_release_v0

The NIST 2008 Machine Translation Evaluation (MT-08) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state-of-the-art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-08 evaluation plan.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT-08 was an evaluation of research algorithms, the MT-08 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the equipment, instruments, software, or materials are necessarily the best available for the purpose.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain or in the amount of data used to build a system can greatly influence system performance, and changes in the task protocols could reveal different performance strengths and weaknesses for these same systems.

For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.


Evaluation Tasks

The MT-08 evaluation consisted of four tasks. Each task required a system to perform translation from a given source language into the target language. The source and target language pairs that made up the four MT-08 tasks were:

  • Arabic to English
  • Chinese to English
  • Urdu to English
  • English to Chinese

Evaluation Conditions

MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differed solely in the amount of data available for use in the training and development of the core MT engine. The evaluation conditions were called the "Constrained Data Track" and the "Unconstrained Data Track".

Submissions that do not fall into the categories described above are not reported in the final release.

Evaluation Data

Source Data

MT-08 evaluation data sets contained documents drawn from newswire text documents and web-based newsgroup documents. The source documents were encoded in UTF-8.

The test data was selected from a pool of data collected by the LDC during July 2007. The selection process sought a variety of sources (see below) and publication dates while meeting the target test set size.

Source Language | Newswire Sources | Newsgroup / Web Sources
Arabic | AAW, AFP, AHR, ASB (Assabah), HYT, NHR, QDS, XIN (Xinhua News Agency) | various web forums
Chinese | AFP, CNS, GMW, PDA, PLA, XIN | various web forums
Urdu | BBC, JNG, PTB, VOA | various web forums
English | AFP, APW, LTW, NYT, XIN | n/a

Reference Data

MT-08 reference data consists of four independently generated, high-quality translations produced by professional translation companies. Each translation agency was required to have native speaker(s) of the source and target languages working on the translations.

Current versus Progress Data Division

For those willing to abide by the strict processing rules, a "PROGRESS" test set was distributed for use as a blind benchmark across several evaluations. Teams that processed this data submitted their translations to NIST and then deleted all related files (source, translations, and any other derivative files). The scores on the progress test set were reported to the participants but are not reported here. Future Open MT evaluations will report PROGRESS test set scores from year to year.

Performance Measurement

Machine translation quality was measured automatically using an N-gram co-occurrence metric developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that the system output shares with one or more high-quality reference translations: the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from 0 to 1, with 1 being the best possible score. A detailed description of BLEU can be found in Papineni, Roukos, Ward, and Zhu (2001), "BLEU: a Method for Automatic Evaluation of Machine Translation" (keyword = RC22176).
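
For illustration only, here is a minimal sentence-level sketch in Python of the BLEU-4 idea (clipped n-gram precisions combined with a brevity penalty). This is not NIST's mteval implementation, which aggregates counts over the whole test set; the function names are ours.

    from collections import Counter
    from math import exp, log

    def ngrams(tokens, n):
        # All n-grams of the token sequence, with their counts.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu4(candidate, references):
        # Geometric mean of clipped 1- to 4-gram precisions, times a
        # brevity penalty (Papineni et al., 2001). Sentence-level sketch.
        cand = candidate.split()
        refs = [r.split() for r in references]
        precisions = []
        for n in range(1, 5):
            cand_counts = ngrams(cand, n)
            # Clip each candidate n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for ref in refs:
                for gram, cnt in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], cnt)
            clipped = sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
            precisions.append(clipped / max(sum(cand_counts.values()), 1))
        if min(precisions) == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        c = len(cand)
        r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]  # closest ref length
        brevity_penalty = 1.0 if c > r else exp(1 - r / c)
        return brevity_penalty * exp(sum(log(p) for p in precisions) / 4)

With four references per segment, as in MT-08, clipping credits an N-gram as long as any one reference confirms it.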

Although BLEU was the official metric for MT-08, measuring translation quality remains an ongoing research topic in the MT community, and, as noted above, no single metric has been deemed completely indicative of all aspects of system performance.

Automatic metrics reported:

  • BLEU-4 (primary metric)
  • IBM BLEU
  • NIST
  • TER
  • METEOR

Other metrics (to be) reported:

Evaluation Participants

The table below lists the organizations that participated in MT-08.

Site ID | Organization | Location
apptek | Applications Technology Inc. | USA
auc | The American University in Cairo | Egypt
basistech | Basis Technology | USA
bbn | BBN Technologies | USA
bjut-mtg | Beijing University of Technology, Machine Translation Group | China
cas-ia | Chinese Academy of Sciences, Institute of Automation | China
cas-ict | Chinese Academy of Sciences, Institute of Computing Technology | China
cas-is | Chinese Academy of Sciences, Institute of Software | China
cmu-ebmt | Carnegie Mellon | USA
cmu-smt | Carnegie Mellon, interACT | USA
cmu-xfer | Carnegie Mellon | USA
columbia | Columbia University | USA
cued | University of Cambridge, Dept. of Engineering | UK
edinburgh | University of Edinburgh | UK
google | Google | USA
hit-ir | Harbin Institute of Technology, Information Retrieval Laboratory | China
hkust | Hong Kong University of Science and Technology | China
ibm | IBM | USA
lium | Universite du Maine (Le Mans), Laboratoire d'Informatique | France
msra | Microsoft Research Asia | China
nrc | National Research Council | Canada
nthu | National Tsing Hua University | Taiwan
ntt | NTT Communication Science Laboratories | Japan
qmul | Queen Mary University of London | UK
sakhr | Sakhr Software Co. | Egypt
sri | SRI International | USA
stanford | Stanford University | USA
uka | Universitaet Karlsruhe | Germany
umd | University of Maryland | USA
upc-lsi | Universitat Politecnica de Catalunya, LSI | Spain
upc-talp | Universitat Politecnica de Catalunya, TALP | Spain
xmu-iai | Xiamen University, Institute of Artificial Intelligence | China

Collaborations

Site ID | Organization | Location
ibm_umd | IBM / University of Maryland | USA
jhu_umd | Johns Hopkins University / University of Maryland | USA
isi_lw | USC-ISI / Language Weaver Inc. | USA
msr_msra | Microsoft Research / Microsoft Research Asia | USA / China
msr_nrc_sri | Microsoft Research / Microsoft Research Asia / National Research Council Canada / SRI International | USA / China / Canada
nict_atr | NICT / ATR | Japan
nrc_systran | National Research Council Canada / SYSTRAN | Canada / France

Evaluation Systems

Each site/team could submit up to four systems for evaluation, with one system marked as its primary system. The primary system indicated the site/team's best effort. This official public version of the results reports results only for the primary systems. Note that these charts show an absolute ranking according to the primary metric.

Systems that failed to meet the requirements for either track are not reported here.

"significance groups*" shows areas where the wilcoxon signed rank test was not able to differenciate system performance at the 95% confidence level. That is, if two systems belong to the same significance group (by sharing the same number), then they are determined to be comparble, based n BLEU-4 scoring.


Results Section

Contains Valid On-Time Submissions

Late and corrected submissions will be linked here.


Overall System Results

Arabic to English (primary system) Results

Entire Current Evaluation Test Set

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | google_arabic_constrained_primary | 0.4557 | 0.4526 | 10.8821 | 48.535 | 0.6857
2 | IBM-UMD_arabic_constrained_primary | 0.4525 | 0.4300 | 10.6183 | 48.436 | 0.6539
3 | IBM_arabic_constrained_primary | 0.4507 | 0.4276 | 10.5904 | 48.547 | 0.6530
3 | bbn_arabic_constrained_primary | 0.4340 | 0.4290 | 10.6590 | 49.599 | 0.6784
4 | LIUM_arabic_constrained_primary | 0.4298 | 0.4105 | 10.2732 | 50.484 | 0.6490
5 | isi-lw_arabic_constrained_primary | 0.4248 | 0.4227 | 10.4077 | 51.820 | 0.6695
6 | CUED_arabic_constrained_primary | 0.4238 | 0.4018 | 9.9486 | 51.557 | 0.6274
6 | SRI_arabic_constrained_primary | 0.4229 | 0.4031 | 10.1935 | 49.780 | 0.6430
7 | Edinburgh_arabic_constrained_primary | 0.4029 | 0.3833 | 9.9641 | 51.165 | 0.6396
8 | UMD_arabic_constrained_primary | 0.3906 | 0.3784 | 10.1176 | 52.158 | 0.6553
9 | UPC_arabic_constrained_primary | 0.3743 | 0.3576 | 9.6553 | 53.260 | 0.6380
10 | columbia_arabic_constrained_primary | 0.3740 | 0.3594 | 9.4806 | 51.973 | 0.6092
9,10 | NTT_arabic_constrained_primary | 0.3671 | 0.3540 | 9.8806 | 56.077 | 0.6312
11 | CMUEBMT_arabic_constrained_primary | 0.3481 | 0.3479 | 9.2165 | 57.376 | 0.6057
12 | qmul_arabic_constrained_primary | 0.3308 | 0.3181 | 8.8124 | 55.145 | 0.5893
13 | SAKHR_arabic_constrained_primary | 0.3133 | 0.3133 | 9.1373 | 57.159 | 0.6659
14 | UPC.lsi_english_constrained_primary | 0.3021 | 0.2876 | 8.6350 | 58.228 | 0.5639
15 | BASISTECH_arabic_constrained_primary | 0.2529 | 0.2423 | 7.8781 | 63.015 | 0.5454
16 | AUC_arabic_constrained_primary | 0.1415 | 0.1359 | 6.3210 | 76.406 | 0.4468

Unconstrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
17 | google_arabic_unconstrained_primary | 0.4772 | 0.4739 | 11.1864 | 46.853 | 0.6996
18 | IBM_arabic_unconstrained_primary | 0.4717 | 0.4527 | 11.0591 | 46.755 | 0.6902
19 | apptek_arabic_unconstrained_primary | 0.4483 | 0.4474 | 10.8420 | 48.263 | 0.7160
20 | cmu-smt_arabic_unconstrained_primary | 0.4312 | 0.4114 | 10.3617 | 50.082 | 0.6672

* designates primary metric

Chinese to English (primary system) Results

Entire Current Evaluation Test Set

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | MSR-NRC-SRI_chinese_constrained_primary | 0.3089 | 0.2947 | 8.5059 | 58.460 | 0.5379
1 | bbn_chinese_constrained_primary | 0.3059 | 0.2959 | 8.2023 | 57.067 | 0.5468
1 | isi-lw_chinese_constrained_primary | 0.3041 | 0.2940 | 8.0950 | 57.734 | 0.5467
1 | google_chinese_constrained_primary | 0.2999 | 0.2887 | 8.5143 | 58.359 | 0.5567
2 | MSR-MSRA_chinese_constrained_primary | 0.2901 | 0.2766 | 8.1480 | 60.073 | 0.5171
3 | SRI_chinese_constrained_primary | 0.2697 | 0.2575 | 7.8942 | 61.622 | 0.5101
3 | Edinburgh_chinese_constrained_primary | 0.2608 | 0.2513 | 7.8117 | 60.654 | 0.5142
4 | SU_chinese_constrained_primary | 0.2547 | 0.2420 | 7.7994 | 63.288 | 0.5122
4,5 | UMD_chinese_constrained_primary | 0.2506 | 0.2387 | 7.8236 | 62.134 | 0.5167
4,5 | NTT_chinese_constrained_primary | 0.2469 | 0.2270 | 7.9511 | 63.415 | 0.5126
5 | NRC_chinese_constrained_primary | 0.2458 | 0.2373 | 7.9964 | 63.835 | 0.5362
5 | CASIA_chinese_constrained_primary | 0.2407 | 0.2310 | 7.5790 | 62.518 | 0.4999
6 | NICT-ATR_chinese_constrained_primary | 0.2269 | 0.2184 | 7.1635 | 64.524 | 0.4962
6 | ICT_chinese_constrained_primary | 0.2258 | 0.2213 | 6.1551 | 61.387 | 0.4878
7 | JHU-UMD_chinese_constrained_primary | 0.2111 | 0.2079 | 6.0509 | 61.834 | 0.4691
8 | XMU_chinese_constrained_primary | 0.1979 | 0.1938 | 6.7514 | 63.139 | 0.4780
9 | HITIRLab_chinese_constrained_primary | 0.1866 | 0.1795 | 6.5942 | 67.376 | 0.4458
10 | hkust_large_primary | 0.1678 | 0.1624 | 6.7124 | 75.803 | 0.4332
10 | ISCAS_chinese_constrained_primary | 0.1569 | 0.1520 | 5.9557 | 68.221 | 0.4354
11 | NTHU_Chinese_constrained_primary | 0.0393 | 0.0390 | 3.5096 | 93.892 | 0.3209

Unconstrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
12 | google_chinese_unconstrained_primary | 0.3195 | 0.3069 | 8.8628 | 57.009 | 0.5707
13 | cmu-smt_chinese_unconstrained_primary | 0.2597 | 0.2474 | 8.0026 | 62.411 | 0.5363
14 | NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2443 | 8.0473 | 63.002 | 0.5490
15 | UKA_chinese_unconstrained_primary | 0.2406 | 0.2323 | 7.4571 | 61.706 | 0.4916
16 | CMUXfer_chinese_unconstrained_primary | 0.1310 | 0.1309 | 6.2452 | 76.722 | 0.4614
17 | BJUT_chinese_unconstrained_primary | 0.0735 | 0.0694 | 4.7239 | 77.685 | 0.3944

* designates primary metric

Urdu to English (primary system) Results

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | google_urdu_constrained_primary | 0.2281 | 0.2280 | 7.8406 | 69.906 | 0.5693
2 | bbn_urdu_constrained_primary | 0.2028 | 0.2026 | 7.6927 | 70.885 | 0.5437
2 | IBM_urdu_constrained_primary | 0.2026 | 0.1999 | 7.7022 | 68.860 | 0.5096
2 | isi-lw_urdu_constrained_primary | 0.1983 | 0.1985 | 7.3030 | 72.749 | 0.5239
3 | UMD_urdu_constrained_primary | 0.1829 | 0.1826 | 7.2905 | 68.748 | 0.5053
4 | MITLLAFRL_urdu_constrained_primary | 0.1666 | 0.1666 | 7.0460 | 72.859 | n/a
5 | UPC_urdu_constrained_primary | 0.1614 | 0.1614 | 7.0958 | 72.839 | 0.4904
6 | columbia_urdu_constrained_primary | 0.1459 | 0.1460 | 6.5474 | 78.686 | 0.4903
6,7 | Edinburgh_urdu_constrained_primary | 0.1456 | 0.1455 | 6.4393 | 75.982 | 0.5215
7,8 | NTT_urdu_constrained_primary | 0.1394 | 0.1383 | 6.9604 | 75.605 | 0.5022
8 | qmul_urdu_constrained_primary | 0.1338 | 0.1338 | 6.2915 | 81.457 | 0.4728
8 | CMU-XFER_urdu_constrained_primary# | 0.1016 | 0.1017 | 4.1885 | 108.167 | 0.3518

* designates primary metric
# designates a system with a known alignment problem; a corrected system was submitted late.

English to Chinese (primary system) Results

Here is a description of the scores (a short tokenization sketch follows this list):

  • BLEU-4*: the primary metric, produced using mteval-v12, a language-independent version that tokenizes on every Unicode symbol.
  • BLEU-4 normalized: makes use of a mapping file to normalize both the reference and system translations to a single variant of certain symbols.
  • NIST: the Doddington improvement to BLEU, as reported by mteval-v12.
  • BLEU-4 word segmented: mteval-v12 with word scoring, using a standard word segmenter for both reference and system translations.
  • We are not identifying significance groups for this task.
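
To make the tokenization difference concrete, here is a minimal Python sketch (not mteval-v12 itself) contrasting the scoring units under the character-based and word-segmented schemes; the Chinese text is a made-up example.

    def char_tokens(text):
        # "Tokenize on every Unicode symbol": each non-space character is a unit.
        return [ch for ch in text if not ch.isspace()]

    def word_tokens(segmented_text):
        # Assumes the text was already split by a standard word segmenter.
        return segmented_text.split()

    hyp = "机器翻译评测"        # hypothetical system output
    hyp_seg = "机器 翻译 评测"  # the same output after word segmentation

    print(char_tokens(hyp))      # ['机', '器', '翻', '译', '评', '测'] -> 6 units
    print(word_tokens(hyp_seg))  # ['机器', '翻译', '评测'] -> 3 units
    # With fewer, longer units, 4-gram matches are much rarer, which is why the
    # word-segmented BLEU-4 column is far lower than the character-based BLEU-4*.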

System | BLEU-4* | BLEU-4 normalized | NIST | BLEU-4 word segmented

Constrained Training Track
google_english_constrained_primary | 0.4142 | 0.4309 | 9.7727 | 0.1643
MSRA_English_constrained_primary | 0.4099 | 0.4343 | 9.4918 | 0.1769
isi-lw_english_constrained_primary | 0.3857 | 0.4163 | 8.6810 | 0.1687
NICT-ATR_english_constrained_primary | 0.3438 | 0.3718 | 7.9608 | 0.1416
HITIRLab_english_constrained_primary | 0.3225 | 0.3436 | 7.3768 | 0.0946
ICT_english_constrained_primary | 0.3176 | 0.3411 | 7.7030 | 0.0879
CMUEBMT_english_constrained_primary | 0.2738 | 0.2954 | 7.3042 | 0.0760
XMU_english_constrained_primary | 0.2502 | 0.2664 | 6.2083 | 0.0593
UMD_english_constrained_primary | 0.1982 | 0.2391 | 3.6922 | 0.0899

Unconstrained Training Track
google_english_unconstrained_primary | 0.4710 | 0.4914 | 10.7868 | 0.1963
BJUT_english_unconstrained_primary | 0.2765 | 0.2906 | 7.8185 | 0.1046

* designates primary metric

Results by Genre

All reported scores are limited to the entire "CURRENT" data sets. All primary submissions are shown here. Scores are given over all data and by genre: newswire (NW) and web/newsgroup (WB).

Site Results (alphabetical order)

All scores are BLEU-4*.

Arabic to English

System | All data | NW | WB
AUC_arabic_constrained_primary | 0.1415 | 0.1718 | 0.0983
BASISTECH_arabic_constrained_primary | 0.2529 | 0.2951 | 0.1900
CMUEBMT_arabic_constrained_primary | 0.3481 | 0.4094 | 0.2695
CUED_arabic_constrained_primary | 0.4238 | 0.4819 | 0.3456
Edinburgh_arabic_constrained_primary | 0.4029 | 0.4675 | 0.3008
IBM-UMD_arabic_constrained_primary | 0.4525 | 0.5085 | 0.3489
IBM_arabic_constrained_primary | 0.4507 | 0.5089 | 0.3432
LIUM_arabic_constrained_primary | 0.4298 | 0.4830 | 0.3431
NTT_arabic_constrained_primary | 0.3671 | 0.4186 | 0.2923
SAKHR_arabic_constrained_primary | 0.3133 | 0.3505 | 0.2622
SRI_arabic_constrained_primary | 0.4229 | 0.4886 | 0.3171
UMD_arabic_constrained_primary | 0.3906 | 0.4452 | 0.3117
UPC.lsi_english_constrained_primary | 0.3021 | 0.3475 | 0.2292
UPC_arabic_constrained_primary | 0.3743 | 0.4281 | 0.2840
bbn_arabic_constrained_primary | 0.4340 | 0.4919 | 0.3497
columbia_arabic_constrained_primary | 0.3740 | 0.4431 | 0.2797
google_arabic_constrained_primary | 0.4557 | 0.5164 | 0.3724
isi-lw_arabic_constrained_primary | 0.4248 | 0.4870 | 0.3355
qmul_arabic_constrained_primary | 0.3308 | 0.4005 | 0.2358

UNCONSTRAINED SYSTEMS | All data | NW | WB
IBM_arabic_unconstrained_primary | 0.4717 | 0.5264 | 0.3762
apptek_arabic_unconstrained_primary | 0.4483 | 0.4900 | 0.3925
cmu-smt_arabic_unconstrained_primary | 0.4312 | 0.4884 | 0.3392
google_arabic_unconstrained_primary | 0.4772 | 0.5385 | 0.3940

Chinese to English

System | All data | NW | WB
CASIA_chinese_constrained_primary | 0.2407 | 0.2756 | 0.1936
Edinburgh_chinese_constrained_primary | 0.2608 | 0.2976 | 0.2116
HITIRLab_chinese_constrained_primary | 0.1866 | 0.2116 | 0.1529
ICT_chinese_constrained_primary | 0.2258 | 0.2760 | 0.1586
ISCAS_chinese_constrained_primary | 0.1569 | 0.1805 | 0.1257
JHU-UMD_chinese_constrained_primary | 0.2111 | 0.2502 | 0.1586
MSR-MSRA_chinese_constrained_primary | 0.2901 | 0.3435 | 0.2175
MSR-NRC-SRI_chinese_constrained_primary | 0.3089 | 0.3614 | 0.2376
NICT-ATR_chinese_constrained_primary | 0.2269 | 0.2579 | 0.1854
NRC_chinese_constrained_primary | 0.2458 | 0.2679 | 0.2150
NTHU_Chinese_constrained_primary | 0.0393 | 0.0367 | 0.0425
NTT_chinese_constrained_primary | 0.2469 | 0.2828 | 0.1991
SRI_chinese_constrained_primary | 0.2697 | 0.3154 | 0.2075
SU_chinese_constrained_primary | 0.2547 | 0.2924 | 0.2039
UMD_chinese_constrained_primary | 0.2506 | 0.2939 | 0.1871
XMU_chinese_constrained_primary | 0.1979 | 0.2401 | 0.1401
bbn_chinese_constrained_primary | 0.3059 | 0.3639 | 0.2273
google_chinese_constrained_primary | 0.2999 | 0.3489 | 0.2344
hkust_large_primary | 0.1678 | 0.1891 | 0.1377
isi-lw_chinese_constrained_primary | 0.3041 | 0.3676 | 0.2176

UNCONSTRAINED SYSTEMS | All data | NW | WB
BJUT_chinese_unconstrained_primary | 0.0735 | 0.0751 | 0.0689
CMUXfer_chinese_unconstrained_primary | 0.1310 | 0.1536 | 0.0994
NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2757 | 0.2192
UKA_chinese_unconstrained_primary | 0.2406 | 0.2846 | 0.1810
cmu-smt_chinese_unconstrained_primary | 0.2597 | 0.2909 | 0.2127
google_chinese_unconstrained_primary | 0.3195 | 0.3701 | 0.2515

Urdu to English

System | All data | NW | WB
CMU-XFER_urdu_constrained_primary# | 0.1016 | 0.1827 | 0.0183
Edinburgh_urdu_constrained_primary | 0.1456 | 0.1609 | 0.1291
IBM_urdu_constrained_primary | 0.2026 | 0.2347 | 0.1668
MITLLAFRL_urdu_constrained_primary | 0.1666 | 0.1939 | 0.1373
NTT_urdu_constrained_primary | 0.1394 | 0.1630 | 0.1155
UMD_urdu_constrained_primary | 0.1829 | 0.2160 | 0.1478
UPC_urdu_constrained_primary | 0.1614 | 0.1878 | 0.1320
bbn_urdu_constrained_primary | 0.2028 | 0.2388 | 0.1632
columbia_urdu_constrained_primary | 0.1459 | 0.1714 | 0.1195
google_urdu_constrained_primary | 0.2281 | 0.2619 | 0.1903
isi-lw_urdu_constrained_primary | 0.1983 | 0.2292 | 0.1645
qmul_urdu_constrained_primary | 0.1338 | 0.1578 | 0.1077

* designates primary metric
# designates a system with a known alignment problem; a corrected system was submitted late.