Haijun Li¹, Tianqi Shi¹, Zifu Shang¹, Yuxuan Han¹, Xueyu Zhao¹, Hao Wang¹, Yu Qian¹, Zhiqiang Qian¹, Linlong Xu¹, Minghao Wu¹, Longyue Wang¹, Gongbo Tang², Weihua Luo¹, Zhao Xu¹, Kaifu Zhang¹
¹ Alibaba International Digital Commerce, ² Beijing Language and Culture University
TransBench is the first industry-oriented, comprehensive multilingual translation evaluation system designed for industrial applications. It quantifies translation model performance across diverse industries and linguistic environments through carefully curated datasets aligned with universal translation standards, vertical-industry norms, and cultural localization requirements.
- 🌐 Global Language Coverage: 16+ languages, including Chinese, English, French, Japanese, and Arabic
- 🏭 Industry-Specific Evaluation: Specialized datasets for e-commerce, customer service, marketing, and cross-cultural adaptation
- 📊 Multi-Dimensional Assessment: Combines linguistic accuracy, cultural appropriateness, and industry-specific requirements
- 🔍 Robustness Testing: Includes stability-attack data (misspellings, word-order shuffling, terminology errors); see the sketch below
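The stability-attack sets are built from controlled perturbations of clean inputs. As a rough illustration only (the benchmark's actual generation tooling is not shown here), the sketch below applies character-level misspellings and local word-order swaps; all names and probabilities are illustrative assumptions.

```python
import random

def misspell(word: str, rng: random.Random) -> str:
    """Simulate a typo by swapping two adjacent characters."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def perturb(sentence: str, p: float = 0.1, seed: int = 0) -> str:
    """Apply misspellings and adjacent word swaps, each with probability p."""
    rng = random.Random(seed)
    words = [misspell(w, rng) if rng.random() < p else w
             for w in sentence.split()]
    i = 0
    while i < len(words) - 1:
        if rng.random() < p:
            words[i], words[i + 1] = words[i + 1], words[i]
            i += 1  # skip over the swapped-in word
        i += 1
    return " ".join(words)

print(perturb("please ship the wireless earbuds to my new address", p=0.3))
```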
TransBench evaluates models through three core dimensions:
1. **General Translation Standard**
   - Focus: Basic translation accuracy
   - Primary Metric: BLEU score
2. **E-Commerce Vertical Standard**
   - Focus: Industry-specific translation quality
   - Primary Metric: E-MOS (Expert Mean Opinion Score)
3. **Cultural Localization Standard**
   - Focus: Cross-cultural adaptation
   - Primary Metric: Accuracy Rate
Our datasets cover multiple domains and linguistic features:
| Category | Subdomains | Languages Covered |
|---|---|---|
| E-Commerce | Product listings, SEO texts, customer reviews | 16 languages |
| Customer Service | Q&A dialogues, knowledge base | 12 languages |
| Cultural Adaptation | Taboo mappings, honorific norms | 8 languages |
| Stress Testing | Adversarial inputs, error simulations | All languages |
Scoring Rules
Composite Score = average of the normalized scores across the three dimensions (see the sketch below)
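The exact normalization scheme is not specified above, so the following is a minimal sketch under assumed score ranges (BLEU in [0, 100], E-MOS in [1, 5], cultural accuracy already in [0, 1]); the function names are illustrative, not part of the benchmark's tooling.

```python
def normalize(score: float, lo: float, hi: float) -> float:
    """Min-max normalize a raw dimension score into [0, 1]."""
    return (score - lo) / (hi - lo)

def composite_score(bleu: float, emos: float, culture_acc: float) -> float:
    """Average of normalized scores across the three dimensions.

    Assumed ranges (not confirmed by TransBench):
    BLEU in [0, 100], E-MOS in [1, 5], culture accuracy in [0, 1].
    """
    dims = [normalize(bleu, 0.0, 100.0),
            normalize(emos, 1.0, 5.0),
            culture_acc]
    return sum(dims) / len(dims)

print(composite_score(bleu=32.5, emos=4.1, culture_acc=0.87))
```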
Latest Update: 2025-04-28
| Rank | Model | Type | Params | Release Date | Composite | General ↑ | E-Commerce ↑ | Culture ↑ |
|---|---|---|---|---|---|---|---|---|
| 1 | GPT-4o | LLM | - | 2024-11-20 | 48.408 | 4.255 | 0.303 | - |
| 2 | DeepL Translate | MT | - | 2025-04-27 | 48.371 | 4.068 | 0.245 | - |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
Notes:
- Release dates for commercial MT systems indicate evaluation dates
- ↑ indicates higher is better; ↓ indicates lower is better
- "-" / N/D = not available
- BLEU Score: Measures n-gram precision against reference translations (see the sketch after this list)
- Error Rate: Counts mistranslations and omissions
- E-MOS: Expert evaluation (1-5 scale) on:
- Product term accuracy
- SEO effectiveness
- Query understanding
- Taboo Avoidance: Religious/dietary/gender norm compliance
- Honorific Accuracy: Context-appropriate formal language
- Localization Index: Target-culture naturalness
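As an example of the general-standard metric, BLEU can be computed with the sacrebleu library. TransBench's exact configuration (tokenizer, casing, number of references) is not stated here, so this is a generic sketch with made-up sentences.

```python
import sacrebleu  # pip install sacrebleu

# One hypothesis per source sentence; references is a list of
# reference streams, each parallel to the hypotheses.
hypotheses = ["the wireless earbuds ship within two days"]
references = [["the wireless earbuds ship in two days"]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")  # corpus-level score in [0, 100]
```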
We welcome contributions through:
- Dataset Improvements: Submit high-quality translation pairs
- Model Submissions: Evaluate your translation model
- Cultural Expertise: Help refine localization criteria
See CONTRIBUTING.md for details.
If you find TransBench useful for your research and applications, please cite:
@misc{li2025transbench,
title={TransBench: Benchmarking Machine Translation for Industrial-Scale Applications},
author={Haijun Li and Tianqi Shi and Zifu Shang and Yuxuan Han and Xueyu Zhao and Hao Wang and Yu Qian and Zhiqiang Qian and Linlong Xu and Minghao Wu and Chenyang Lyu and Longyue Wang and Gongbo Tang and Weihua Luo and Zhao Xu and Kaifu Zhang},
year={2025},
eprint={2505.14244},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2505.14244},
}
The project is released under the Apache License 2.0 (http://www.apache.org/licenses/LICENSE-2.0, SPDX-License-Identifier: Apache-2.0).
We thank our industry partners and linguistic experts for their invaluable contributions to developing robust evaluation standards.