Hunglish

Introduction

This is the Hunglish CD-ROM. It contains a sentence-aligned Hungarian-English parallel corpus of about 54.2 million words in 2.07 million sentences, a monolingual corpus of about 46.4 million words in 3.00 million sentences, and ancillary software.

Inventory

directory	size (MB)	contents
doc	4	Documentation
data	427	The data files sorted by genre
src	26	Source code for the ancillary software
lr	136	Language resources for Hungarian
bin	11	Precompiled binaries

Copyright

The following license applies to all source code and executables on this CD:

This work is licensed under the Creative Commons Attribution 2.0 License.

The raw text files in the data directory are in the public domain. Papers and documentation contained on the CD are copyrighted by their respective authors.

Authors

The CD is the joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (Dániel Varga, Péter Halácsy, András Kornai, László Németh, and Viktor Trón) and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics (Tamás Váradi, Bálint Sass, Gergő Bottyán, Enikő Héja, Ágnes Gyarmati, Ágnes Mészáros, and Dávid Labundy). The Corpus Linguistics Department contributed all the monolingual source material and most of the bilingual source files in the magazine section.

Acknowledgements

The Hunglish project is supported by an ITEM grant from the Hungarian Ministry of Informatics and Communication. András Aklán (BUTE) provided effective project management for the production process, and Mike Maxwell (LDC) advised us on the structure of the CD and found many bugs. We thank Magyar Telekom Rt. for infrastructure support.