iBLEU: Interactively Debugging and Scoring Statistical Machine Translation Systems

Machine Translation (MT) systems are commonly evaluated and debugged using the automated BLEU metric. However, the current community implementation of BLEU is not ideal for MT system developers and researchers because it produces only textual output. I present a novel tool called iBLEU that organizes BLEU scoring information in a visual, easy-to-understand manner, making it easier for MT system developers and researchers to quickly locate the documents and sentences on which their system performs poorly. It also supports comparing translations from two different MT systems. Furthermore, with a single click, one can compare against publicly available MT systems such as Google Translate and Bing Translator. iBLEU runs on all major platforms and requires no setup.
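
To make the idea of locating poorly translated sentences concrete, the following is a minimal sketch in Python of a smoothed sentence-level BLEU score and a ranking of sentences by that score. It is not iBLEU's implementation: the function names (`sentence_bleu`, `worst_sentences`), the whitespace tokenization, the single reference per sentence, and the add-one smoothing are all assumptions made for illustration. Standard BLEU is defined at the corpus level and is usually computed without this smoothing.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hypothesis, reference, max_n=4):
    """Smoothed sentence-level BLEU against one reference (illustrative only).

    Assumptions: whitespace tokenization and add-one smoothing of every
    n-gram precision, so short sentences do not collapse to zero.
    """
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        # Clipped matches: an n-gram counts at most as often as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = sum(hyp_counts.values())
        log_precision += math.log((matched + 1) / (total + 1)) / max_n
    # Brevity penalty: penalize hypotheses shorter than the reference.
    brevity = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return brevity * math.exp(log_precision)


def worst_sentences(hypotheses, references, k=1):
    """Return the indices of the k lowest-scoring sentences."""
    scored = [(sentence_bleu(h, r), i)
              for i, (h, r) in enumerate(zip(hypotheses, references))]
    return [i for _, i in sorted(scored)[:k]]
```

Calling `worst_sentences(system_output, reference_translations, k=10)` on two parallel lists of strings (hypothetical variable names) would return the indices of the ten sentences most worth inspecting first.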