We frequently receive questions about what options users have when their TMs are small. It's difficult to recommend a source for third-party TMs: there are copyright and licensing issues, quality issues, appropriate-use issues, and the list goes on. The release of the first version of the UN corpus gives hope that quality data might finally be available.
We share this information without warranty of any kind. We have not used the data and we do not know how it is organized. Testing the TMs is now officially on our TODO list.
If you try these TMs and have problems, please let us know. We're eager to help you and our growing Slate Desktop community. If you have experience with pre-release versions of the data, we hope you will share it here.
---------- Moses Support list email announcing the release ----------
Date: Wed, 25 May 2016 12:32:51 +0200
From: Marcin Junczys-Dowmunt [email-redacted]
Subject: [Moses-support] Official release of the United Nations Parallel Corpus v1.0
To: moses-support <firstname.lastname@example.org>
I would like to announce the official release of the United Nations Parallel Corpus v1.0. The corpus was created as part of the United Nations commitment to multilingualism and as a reaction to the growing importance of statistical machine translation (SMT) within the Department for General Assembly and Conference Management (DGACM) translation services and the United Nations. It covers 25 years, from 1990 to 2014, and contains documents in the six official languages of the United Nations: Arabic, Chinese, English, French, Russian, and Spanish.
The purpose of the corpus is to allow access to multilingual language resources and facilitate research and progress in various natural language processing tasks, including machine translation. For convenience, the corpus is also available pre-packaged as bi-texts for each language pair.
A subset of the corpus is available as a six-language fully-parallel corpus, i.e. all sentences have equivalents in all six languages. Data from 2015 has been used to create official development sets and test sets, also fully aligned across the six official UN languages. The paper reports SMT baselines for all language pairs for this corpus.
The corpus is available at:
The corresponding publication is available at:
While registering, please leave a short description of the work for which you plan to use the corpus. In the near future we plan to set up a section with references to papers that describe research done with the UN corpus. Feel free to share links and bibliography items with us (either with me or any of the authors of the above paper).
Sorry for cross-posting,