This is the Hunglish CDROM, featuring a sentence-aligned Hungarian-English parallel corpus of about 54.2 million words in 2.07 million sentences, a monolingual corpus of about 46.4 million words in 3.00 million sentences, and ancillary software.


directory  size (MB)  contents
doc                4  Documentation
data             427  The data files sorted by genre
src               26  Source code for the ancillary software
lr               136  Language resources for Hungarian
bin               11  Precompiled binaries


The following copyright applies to all source code and executables on this CD:

Creative Commons License
This work is licensed under the Creative Commons Attribution 2.0 License.

The raw text files in the data directory are in the public domain. Papers and documentation contained on the CD remain under the copyright of their authors.


The CD was produced by the joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (Dániel Varga, Péter Halácsy, András Kornai, László Németh, and Viktor Trón), and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics (Tamás Váradi, Bálint Sass, Gergő Bottyán, Enikő Héja, Ágnes Gyarmati, Ágnes Mészáros and Dávid Labundy), who contributed all the monolingual source material and most of the bilingual source files in the magazine section.


The Hunglish project is supported by an ITEM grant from the Hungarian Ministry of Informatics and Communication. András Aklán (BUTE) provided effective project management for the production process, and Mike Maxwell (LDC) advised us on the structure of the CD and found many bugs. We thank Magyar Telekom Rt. for infrastructure support.