NIST 2008 Machine Translation Evaluation - (Open MT-08)

Official Evaluation Results

Date of release: Fri Jun 06, 2008

Version: mt08_official_release_v0

The NIST 2008 Machine Translation Evaluation (MT-08) is part of an ongoing series of evaluations of human language translation technology. NIST conducts these evaluations in order to support machine translation (MT) research and help advance the state-of-the-art in machine translation technology. These evaluations provide an important contribution to the direction of research efforts and the calibration of technical capabilities. The evaluation was administered as outlined in the official MT-08 evaluation plan.

Disclaimer

These results are not to be construed or represented as endorsements of any participant's system or commercial product, or as official findings on the part of NIST or the U.S. Government. Note that the results submitted by developers of commercial MT products were generally from research systems, not commercially available products. Since MT-08 was an evaluation of research algorithms, the MT-08 test design required local implementation by each participant. As such, participants were only required to submit their translation system output to NIST for uniform scoring and analysis. The systems themselves were not independently evaluated by NIST.

Certain commercial equipment, instruments, software, or materials are identified in this paper in order to specify the experimental procedure adequately. Such identification is not intended to imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor is it intended to imply that the equipment, instruments, software, or materials are necessarily the best available for the purpose.

There is ongoing discussion within the MT research community regarding the most informative metrics for machine translation. The design and implementation of these metrics are themselves very much part of the research. At the present time, there is no single metric that has been deemed to be completely indicative of all aspects of system performance.

The data, protocols, and metrics employed in this evaluation were chosen to support MT research and should not be construed as indicating how well these systems would perform in applications. Changes in the data domain or in the amount of data used to build a system can greatly influence system performance, and changes in the task protocols could reveal different performance strengths and weaknesses for these same systems.

For these reasons, this evaluation should not be interpreted as a product testing exercise, and the results should not be used to draw conclusions about which commercial products are best for a particular application.


Evaluation Tasks

The MT-08 evaluation consisted of four tasks. Each task required a system to perform translation from a given source language into the target language. The source and target language pairs that made up the four MT-08 tasks were:

  • Arabic to English
  • Chinese to English
  • Urdu to English
  • English to Chinese

Evaluation Conditions

MT research and development requires language data resources. System performance is strongly affected by the type and amount of resources used. Therefore, two different resource categories were defined as conditions of evaluation. The categories differed solely in the amount of data available for use in the training and development of the core MT engine. The evaluation conditions were called the "Constrained Data Track" and the "Unconstrained Data Track".

Submissions that do not fall into the categories described above are not reported in the final release.

Evaluation Data

Source Data

MT-08 evaluation data sets contained documents drawn from newswire text documents and web-based newsgroup documents. The source documents were encoded in UTF-8.

The test data was selected from a pool of data collected by the LDC during July 2007. The selection process sought a variety of sources (see below) and publication dates while meeting the target test set size.

Source Language | Newswire Sources | Newsgroup / Web Sources
Arabic | AAW, AFP, AHR, ASB (Assabah), HYT, NHR, QDS, XIN (Xinhua News Agency) | various web forums
Chinese | AFP, CNS, GMW, PDA, PLA, XIN | various web forums
Urdu | BBC, JNG, PTB, VOA | various web forums
English | AFP, APW, LTW, NYT, XIN | n/a

Reference Data

MT-08 reference data consists of four independently generated, high-quality translations produced by professional translation companies. Each translation agency was required to have native speaker(s) of the source and target languages working on the translations.

Current versus Progress Data Division

For those willing to abide by the strict processing rules, a "PROGRESS" test set was distributed for use as a blind benchmark across several evaluations. Teams that processed this data submitted their translations to NIST and then deleted all related files (source, translations, and any other derivative files). The scores on the progress test set were reported to the participants but are not reported here. Future Open MT evaluations will report PROGRESS test set scores from year to year.

Performance Measurement

Machine translation quality was measured automatically using an N-gram co-occurrence metric developed by IBM and referred to as BLEU. BLEU measures translation accuracy according to the N-grams, or sequences of N words, that the system output shares with one or more high-quality reference translations: the more co-occurrences, the better the score. BLEU is an accuracy metric, ranging from 0 to 1, with 1 being the best possible score. A detailed description of BLEU can be found in Papineni, Roukos, Ward, and Zhu (2001), "BLEU: a Method for Automatic Evaluation of Machine Translation" (keyword = RC22176).
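
For illustration only, here is a minimal sentence-level sketch in Python of the BLEU-4 idea (clipped n-gram precisions combined with a brevity penalty). This is not NIST's mteval implementation, which aggregates counts over the whole test set; the function names are ours.

    from collections import Counter
    from math import exp, log

    def ngrams(tokens, n):
        # All n-grams of the token sequence, with their counts.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    def bleu4(candidate, references):
        # Geometric mean of clipped 1- to 4-gram precisions, times a
        # brevity penalty (Papineni et al., 2001). Sentence-level sketch.
        cand = candidate.split()
        refs = [r.split() for r in references]
        precisions = []
        for n in range(1, 5):
            cand_counts = ngrams(cand, n)
            # Clip each candidate n-gram count by its maximum count in any reference.
            max_ref = Counter()
            for ref in refs:
                for gram, cnt in ngrams(ref, n).items():
                    max_ref[gram] = max(max_ref[gram], cnt)
            clipped = sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
            precisions.append(clipped / max(sum(cand_counts.values()), 1))
        if min(precisions) == 0:
            return 0.0  # any zero precision zeroes the geometric mean
        c = len(cand)
        r = min((abs(len(ref) - c), len(ref)) for ref in refs)[1]  # closest ref length
        brevity_penalty = 1.0 if c > r else exp(1 - r / c)
        return brevity_penalty * exp(sum(log(p) for p in precisions) / 4)

With four references per segment, as in MT-08, clipping credits an N-gram as long as any one reference confirms it.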

Although BLEU was the official metric for MT-08, measuring translation quality remains an ongoing research topic in the MT community, and, as noted above, no single metric has been deemed completely indicative of all aspects of system performance.

Automatic metrics reported:

  • BLEU-4 (primary metric)
  • IBM BLEU
  • NIST
  • TER
  • METEOR

Other metrics (to be) reported:

Evaluation Participants

The table below lists the organizations that participated in MT-08.

Site ID | Organization | Location
apptek | Applications Technology Inc. | USA
auc | The American University in Cairo | Egypt
basistech | Basis Technology | USA
bbn | BBN Technologies | USA
bjut-mtg | Beijing University of Technology, Machine Translation Group | China
cas-ia | Chinese Academy of Sciences, Institute of Automation | China
cas-ict | Chinese Academy of Sciences, Institute of Computing Technology | China
cas-is | Chinese Academy of Sciences, Institute of Software | China
cmu-ebmt | Carnegie Mellon | USA
cmu-smt | Carnegie Mellon, interACT | USA
cmu-xfer | Carnegie Mellon | USA
columbia | Columbia University | USA
cued | University of Cambridge, Dept. of Engineering | UK
edinburgh | University of Edinburgh | UK
google | Google | USA
hit-ir | Harbin Institute of Technology, Information Retrieval Laboratory | China
hkust | Hong Kong University of Science and Technology | China
ibm | IBM | USA
lium | Universite du Maine (Le Mans), Laboratoire d'Informatique | France
msra | Microsoft Research Asia | China
nrc | National Research Council | Canada
nthu | National Tsing Hua University | Taiwan
ntt | NTT Communication Science Laboratories | Japan
qmul | Queen Mary University of London | UK
sakhr | Sakhr Software Co. | Egypt
sri | SRI International | USA
stanford | Stanford University | USA
uka | Universitaet Karlsruhe | Germany
umd | University of Maryland | USA
upc-lsi | Universitat Politecnica de Catalunya, LSI | Spain
upc-talp | Universitat Politecnica de Catalunya, TALP | Spain
xmu-iai | Xiamen University, Institute of Artificial Intelligence | China

Collaborations

Site ID | Organization | Location
ibm_umd | IBM / University of Maryland | USA
jhu_umd | Johns Hopkins University / University of Maryland | USA
isi_lw | USC-ISI / Language Weaver Inc. | USA
msr_msra | Microsoft Research / Microsoft Research Asia | USA / China
msr_nrc_sri | Microsoft Research / Microsoft Research Asia / National Research Council Canada / SRI International | USA / China / Canada
nict_atr | NICT / ATR | Japan
nrc_systran | National Research Council Canada / SYSTRAN | Canada / France

Evaluation Systems

Each site/team could submit up to four systems for evaluation, with one system marked as its primary system. The primary system indicated the site/team's best effort. This official public version of the results reports results only for the primary systems. Note that these charts show an absolute ranking according to the primary metric.

Systems that failed to meet the requirements for either track are not reported here.

"significance groups*" shows areas where the wilcoxon signed rank test was not able to differenciate system performance at the 95% confidence level. That is, if two systems belong to the same significance group (by sharing the same number), then they are determined to be comparble, based n BLEU-4 scoring.


Results Section

Contains Valid On-Time Submissions

Late and corrected submissions will be linked here.


Overall System Results

Arabic to English (primary system) Results

Entire Current Evaluation Test Set

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | google_arabic_constrained_primary | 0.4557 | 0.4526 | 10.8821 | 48.535 | 0.6857
2 | IBM-UMD_arabic_constrained_primary | 0.4525 | 0.4300 | 10.6183 | 48.436 | 0.6539
3 | IBM_arabic_constrained_primary | 0.4507 | 0.4276 | 10.5904 | 48.547 | 0.6530
3 | bbn_arabic_constrained_primary | 0.4340 | 0.4290 | 10.6590 | 49.599 | 0.6784
4 | LIUM_arabic_constrained_primary | 0.4298 | 0.4105 | 10.2732 | 50.484 | 0.6490
5 | isi-lw_arabic_constrained_primary | 0.4248 | 0.4227 | 10.4077 | 51.820 | 0.6695
6 | CUED_arabic_constrained_primary | 0.4238 | 0.4018 | 9.9486 | 51.557 | 0.6274
6 | SRI_arabic_constrained_primary | 0.4229 | 0.4031 | 10.1935 | 49.780 | 0.6430
7 | Edinburgh_arabic_constrained_primary | 0.4029 | 0.3833 | 9.9641 | 51.165 | 0.6396
8 | UMD_arabic_constrained_primary | 0.3906 | 0.3784 | 10.1176 | 52.158 | 0.6553
9 | UPC_arabic_constrained_primary | 0.3743 | 0.3576 | 9.6553 | 53.260 | 0.6380
10 | columbia_arabic_constrained_primary | 0.3740 | 0.3594 | 9.4806 | 51.973 | 0.6092
9,10 | NTT_arabic_constrained_primary | 0.3671 | 0.3540 | 9.8806 | 56.077 | 0.6312
11 | CMUEBMT_arabic_constrained_primary | 0.3481 | 0.3479 | 9.2165 | 57.376 | 0.6057
12 | qmul_arabic_constrained_primary | 0.3308 | 0.3181 | 8.8124 | 55.145 | 0.5893
13 | SAKHR_arabic_constrained_primary | 0.3133 | 0.3133 | 9.1373 | 57.159 | 0.6659
14 | UPC.lsi_english_constrained_primary | 0.3021 | 0.2876 | 8.6350 | 58.228 | 0.5639
15 | BASISTECH_arabic_constrained_primary | 0.2529 | 0.2423 | 7.8781 | 63.015 | 0.5454
16 | AUC_arabic_constrained_primary | 0.1415 | 0.1359 | 6.3210 | 76.406 | 0.4468

Unconstrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
17 | google_arabic_unconstrained_primary | 0.4772 | 0.4739 | 11.1864 | 46.853 | 0.6996
18 | IBM_arabic_unconstrained_primary | 0.4717 | 0.4527 | 11.0591 | 46.755 | 0.6902
19 | apptek_arabic_unconstrained_primary | 0.4483 | 0.4474 | 10.8420 | 48.263 | 0.7160
20 | cmu-smt_arabic_unconstrained_primary | 0.4312 | 0.4114 | 10.3617 | 50.082 | 0.6672

* designates primary metric

Chinese to English (primary system) Results

Entire Current Evaluation Test Set

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | MSR-NRC-SRI_chinese_constrained_primary | 0.3089 | 0.2947 | 8.5059 | 58.460 | 0.5379
1 | bbn_chinese_constrained_primary | 0.3059 | 0.2959 | 8.2023 | 57.067 | 0.5468
1 | isi-lw_chinese_constrained_primary | 0.3041 | 0.2940 | 8.0950 | 57.734 | 0.5467
1 | google_chinese_constrained_primary | 0.2999 | 0.2887 | 8.5143 | 58.359 | 0.5567
2 | MSR-MSRA_chinese_constrained_primary | 0.2901 | 0.2766 | 8.1480 | 60.073 | 0.5171
3 | SRI_chinese_constrained_primary | 0.2697 | 0.2575 | 7.8942 | 61.622 | 0.5101
3 | Edinburgh_chinese_constrained_primary | 0.2608 | 0.2513 | 7.8117 | 60.654 | 0.5142
4 | SU_chinese_constrained_primary | 0.2547 | 0.2420 | 7.7994 | 63.288 | 0.5122
4,5 | UMD_chinese_constrained_primary | 0.2506 | 0.2387 | 7.8236 | 62.134 | 0.5167
4,5 | NTT_chinese_constrained_primary | 0.2469 | 0.2270 | 7.9511 | 63.415 | 0.5126
5 | NRC_chinese_constrained_primary | 0.2458 | 0.2373 | 7.9964 | 63.835 | 0.5362
5 | CASIA_chinese_constrained_primary | 0.2407 | 0.2310 | 7.5790 | 62.518 | 0.4999
6 | NICT-ATR_chinese_constrained_primary | 0.2269 | 0.2184 | 7.1635 | 64.524 | 0.4962
6 | ICT_chinese_constrained_primary | 0.2258 | 0.2213 | 6.1551 | 61.387 | 0.4878
7 | JHU-UMD_chinese_constrained_primary | 0.2111 | 0.2079 | 6.0509 | 61.834 | 0.4691
8 | XMU_chinese_constrained_primary | 0.1979 | 0.1938 | 6.7514 | 63.139 | 0.4780
9 | HITIRLab_chinese_constrained_primary | 0.1866 | 0.1795 | 6.5942 | 67.376 | 0.4458
10 | hkust_large_primary | 0.1678 | 0.1624 | 6.7124 | 75.803 | 0.4332
10 | ISCAS_chinese_constrained_primary | 0.1569 | 0.1520 | 5.9557 | 68.221 | 0.4354
11 | NTHU_Chinese_constrained_primary | 0.0393 | 0.0390 | 3.5096 | 93.892 | 0.3209

Unconstrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
12 | google_chinese_unconstrained_primary | 0.3195 | 0.3069 | 8.8628 | 57.009 | 0.5707
13 | cmu-smt_chinese_unconstrained_primary | 0.2597 | 0.2474 | 8.0026 | 62.411 | 0.5363
14 | NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2443 | 8.0473 | 63.002 | 0.5490
15 | UKA_chinese_unconstrained_primary | 0.2406 | 0.2323 | 7.4571 | 61.706 | 0.4916
16 | CMUXfer_chinese_unconstrained_primary | 0.1310 | 0.1309 | 6.2452 | 76.722 | 0.4614
17 | BJUT_chinese_unconstrained_primary | 0.0735 | 0.0694 | 4.7239 | 77.685 | 0.3944

* designates primary metric

Urdu to English (primary system) Results

Constrained Training Track

Significance group | System | BLEU-4* | IBM BLEU | NIST | TER | METEOR
1 | google_urdu_constrained_primary | 0.2281 | 0.2280 | 7.8406 | 69.906 | 0.5693
2 | bbn_urdu_constrained_primary | 0.2028 | 0.2026 | 7.6927 | 70.885 | 0.5437
2 | IBM_urdu_constrained_primary | 0.2026 | 0.1999 | 7.7022 | 68.860 | 0.5096
2 | isi-lw_urdu_constrained_primary | 0.1983 | 0.1985 | 7.3030 | 72.749 | 0.5239
3 | UMD_urdu_constrained_primary | 0.1829 | 0.1826 | 7.2905 | 68.748 | 0.5053
4 | MITLLAFRL_urdu_constrained_primary | 0.1666 | 0.1666 | 7.0460 | 72.859 | n/a
5 | UPC_urdu_constrained_primary | 0.1614 | 0.1614 | 7.0958 | 72.839 | 0.4904
6 | columbia_urdu_constrained_primary | 0.1459 | 0.1460 | 6.5474 | 78.686 | 0.4903
6,7 | Edinburgh_urdu_constrained_primary | 0.1456 | 0.1455 | 6.4393 | 75.982 | 0.5215
7,8 | NTT_urdu_constrained_primary | 0.1394 | 0.1383 | 6.9604 | 75.605 | 0.5022
8 | qmul_urdu_constrained_primary | 0.1338 | 0.1338 | 6.2915 | 81.457 | 0.4728
8 | CMU-XFER_urdu_constrained_primary# | 0.1016 | 0.1017 | 4.1885 | 108.167 | 0.3518

* designates primary metric
# designates a system with a known alignment problem; a corrected system was submitted late.

English to Chinese (primary system) Results

Here is a description of the scores (a short tokenization sketch follows this list):

  • BLEU-4*: the primary metric, produced using mteval-v12, a language-independent version that tokenizes on every Unicode symbol.
  • BLEU-4 normalized: makes use of a mapping file to normalize both the reference and system translations to a single variant of certain symbols.
  • NIST: the Doddington improvement to BLEU, as reported by mteval-v12.
  • BLEU-4 word segmented: mteval-v12 with word scoring, using a standard word segmenter for both reference and system translations.
  • We are not identifying significance groups for this task.
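
To make the tokenization difference concrete, here is a minimal Python sketch (not mteval-v12 itself) contrasting the scoring units under the character-based and word-segmented schemes; the Chinese text is a made-up example.

    def char_tokens(text):
        # "Tokenize on every Unicode symbol": each non-space character is a unit.
        return [ch for ch in text if not ch.isspace()]

    def word_tokens(segmented_text):
        # Assumes the text was already split by a standard word segmenter.
        return segmented_text.split()

    hyp = "机器翻译评测"        # hypothetical system output
    hyp_seg = "机器 翻译 评测"  # the same output after word segmentation

    print(char_tokens(hyp))      # ['机', '器', '翻', '译', '评', '测'] -> 6 units
    print(word_tokens(hyp_seg))  # ['机器', '翻译', '评测'] -> 3 units
    # With fewer, longer units, 4-gram matches are much rarer, which is why the
    # word-segmented BLEU-4 column is far lower than the character-based BLEU-4*.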

System | BLEU-4* | BLEU-4 normalized | NIST | BLEU-4 word segmented

Constrained Training Track
google_english_constrained_primary | 0.4142 | 0.4309 | 9.7727 | 0.1643
MSRA_English_constrained_primary | 0.4099 | 0.4343 | 9.4918 | 0.1769
isi-lw_english_constrained_primary | 0.3857 | 0.4163 | 8.6810 | 0.1687
NICT-ATR_english_constrained_primary | 0.3438 | 0.3718 | 7.9608 | 0.1416
HITIRLab_english_constrained_primary | 0.3225 | 0.3436 | 7.3768 | 0.0946
ICT_english_constrained_primary | 0.3176 | 0.3411 | 7.7030 | 0.0879
CMUEBMT_english_constrained_primary | 0.2738 | 0.2954 | 7.3042 | 0.0760
XMU_english_constrained_primary | 0.2502 | 0.2664 | 6.2083 | 0.0593
UMD_english_constrained_primary | 0.1982 | 0.2391 | 3.6922 | 0.0899

Unconstrained Training Track
google_english_unconstrained_primary | 0.4710 | 0.4914 | 10.7868 | 0.1963
BJUT_english_unconstrained_primary | 0.2765 | 0.2906 | 7.8185 | 0.1046

* designates primary metric

Results by Genre

All reported scores are limited to the entire "CURRENT" data sets. All primary submissions are shown here. Scores are given over all data and by genre: newswire (NW) and web/newsgroup (WB).

Site Results (alphabetical order)

All scores are BLEU-4*.

Arabic to English

System | All data | NW | WB
AUC_arabic_constrained_primary | 0.1415 | 0.1718 | 0.0983
BASISTECH_arabic_constrained_primary | 0.2529 | 0.2951 | 0.1900
CMUEBMT_arabic_constrained_primary | 0.3481 | 0.4094 | 0.2695
CUED_arabic_constrained_primary | 0.4238 | 0.4819 | 0.3456
Edinburgh_arabic_constrained_primary | 0.4029 | 0.4675 | 0.3008
IBM-UMD_arabic_constrained_primary | 0.4525 | 0.5085 | 0.3489
IBM_arabic_constrained_primary | 0.4507 | 0.5089 | 0.3432
LIUM_arabic_constrained_primary | 0.4298 | 0.4830 | 0.3431
NTT_arabic_constrained_primary | 0.3671 | 0.4186 | 0.2923
SAKHR_arabic_constrained_primary | 0.3133 | 0.3505 | 0.2622
SRI_arabic_constrained_primary | 0.4229 | 0.4886 | 0.3171
UMD_arabic_constrained_primary | 0.3906 | 0.4452 | 0.3117
UPC.lsi_english_constrained_primary | 0.3021 | 0.3475 | 0.2292
UPC_arabic_constrained_primary | 0.3743 | 0.4281 | 0.2840
bbn_arabic_constrained_primary | 0.4340 | 0.4919 | 0.3497
columbia_arabic_constrained_primary | 0.3740 | 0.4431 | 0.2797
google_arabic_constrained_primary | 0.4557 | 0.5164 | 0.3724
isi-lw_arabic_constrained_primary | 0.4248 | 0.4870 | 0.3355
qmul_arabic_constrained_primary | 0.3308 | 0.4005 | 0.2358

UNCONSTRAINED SYSTEMS | All data | NW | WB
IBM_arabic_unconstrained_primary | 0.4717 | 0.5264 | 0.3762
apptek_arabic_unconstrained_primary | 0.4483 | 0.4900 | 0.3925
cmu-smt_arabic_unconstrained_primary | 0.4312 | 0.4884 | 0.3392
google_arabic_unconstrained_primary | 0.4772 | 0.5385 | 0.3940

Chinese to English

System | All data | NW | WB
CASIA_chinese_constrained_primary | 0.2407 | 0.2756 | 0.1936
Edinburgh_chinese_constrained_primary | 0.2608 | 0.2976 | 0.2116
HITIRLab_chinese_constrained_primary | 0.1866 | 0.2116 | 0.1529
ICT_chinese_constrained_primary | 0.2258 | 0.2760 | 0.1586
ISCAS_chinese_constrained_primary | 0.1569 | 0.1805 | 0.1257
JHU-UMD_chinese_constrained_primary | 0.2111 | 0.2502 | 0.1586
MSR-MSRA_chinese_constrained_primary | 0.2901 | 0.3435 | 0.2175
MSR-NRC-SRI_chinese_constrained_primary | 0.3089 | 0.3614 | 0.2376
NICT-ATR_chinese_constrained_primary | 0.2269 | 0.2579 | 0.1854
NRC_chinese_constrained_primary | 0.2458 | 0.2679 | 0.2150
NTHU_Chinese_constrained_primary | 0.0393 | 0.0367 | 0.0425
NTT_chinese_constrained_primary | 0.2469 | 0.2828 | 0.1991
SRI_chinese_constrained_primary | 0.2697 | 0.3154 | 0.2075
SU_chinese_constrained_primary | 0.2547 | 0.2924 | 0.2039
UMD_chinese_constrained_primary | 0.2506 | 0.2939 | 0.1871
XMU_chinese_constrained_primary | 0.1979 | 0.2401 | 0.1401
bbn_chinese_constrained_primary | 0.3059 | 0.3639 | 0.2273
google_chinese_constrained_primary | 0.2999 | 0.3489 | 0.2344
hkust_large_primary | 0.1678 | 0.1891 | 0.1377
isi-lw_chinese_constrained_primary | 0.3041 | 0.3676 | 0.2176

UNCONSTRAINED SYSTEMS | All data | NW | WB
BJUT_chinese_unconstrained_primary | 0.0735 | 0.0751 | 0.0689
CMUXfer_chinese_unconstrained_primary | 0.1310 | 0.1536 | 0.0994
NRC-SYSTRAN_chinese_unconstrained_primary | 0.2523 | 0.2757 | 0.2192
UKA_chinese_unconstrained_primary | 0.2406 | 0.2846 | 0.1810
cmu-smt_chinese_unconstrained_primary | 0.2597 | 0.2909 | 0.2127
google_chinese_unconstrained_primary | 0.3195 | 0.3701 | 0.2515

Urdu to English

System | All data | NW | WB
CMU-XFER_urdu_constrained_primary# | 0.1016 | 0.1827 | 0.0183
Edinburgh_urdu_constrained_primary | 0.1456 | 0.1609 | 0.1291
IBM_urdu_constrained_primary | 0.2026 | 0.2347 | 0.1668
MITLLAFRL_urdu_constrained_primary | 0.1666 | 0.1939 | 0.1373
NTT_urdu_constrained_primary | 0.1394 | 0.1630 | 0.1155
UMD_urdu_constrained_primary | 0.1829 | 0.2160 | 0.1478
UPC_urdu_constrained_primary | 0.1614 | 0.1878 | 0.1320
bbn_urdu_constrained_primary | 0.2028 | 0.2388 | 0.1632
columbia_urdu_constrained_primary | 0.1459 | 0.1714 | 0.1195
google_urdu_constrained_primary | 0.2281 | 0.2619 | 0.1903
isi-lw_urdu_constrained_primary | 0.1983 | 0.2292 | 0.1645
qmul_urdu_constrained_primary | 0.1338 | 0.1578 | 0.1077

* designates primary metric
# designates a system with a known alignment problem; a corrected system was submitted late.