Hunglish

Introduction

This is the data directory of the Hunglish CDROM, containing the parallel corpus as sentence pair files and, where copyright permits, the original monolingual files as well. The corpus is sorted by genre: the main categories are as follows.

film

Movie subtitles. The raw files were provided to MOKK for research only and cannot be republished on this CD. The sentence-aligned files are given in "shuffled" format: one sentence pair per line (Hungarian sentence TAB English sentence), with the lines sorted alphabetically so that the original subtitle files cannot be reconstructed and republished. This data segment has many spelling errors owing to OCR text extraction. As an aid for selecting a higher-quality subset, sentence-by-sentence figures of alignment merit are provided in the file quality.
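The shuffled format described above can be read with a few lines of Python. This is a minimal sketch, not a tool shipped on the CD: the function names and the sample sentences are hypothetical. The layout of the quality file is not described here, so the filter below simply takes a list of scores already loaded in the same order as the sentence pairs; that ordering is an assumption.

```python
from typing import Iterable, Iterator, List, Sequence, Tuple

Pair = Tuple[str, str]


def parse_pairs(lines: Iterable[str]) -> Iterator[Pair]:
    """Yield (hungarian, english) tuples from 'Hungarian TAB English' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip empty lines
        hu, sep, en = line.partition("\t")
        if not sep:
            continue  # no TAB found: skip malformed line
        yield hu, en


def filter_by_score(pairs: Sequence[Pair], scores: Sequence[float],
                    threshold: float) -> List[Pair]:
    """Keep pairs whose score is at least `threshold`.

    ASSUMPTION: scores[i] belongs to pairs[i]; check this against the
    actual quality file before relying on it.
    """
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


# Made-up sample lines in the shuffled format (not taken from the corpus).
sample = [
    "Jó reggelt!\tGood morning!\n",
    "Köszönöm.\tThank you.\n",
    "a line without a tab\n",
]
pairs = list(parse_pairs(sample))
kept = filter_by_score(pairs, [0.9, 0.2], threshold=0.5)
```

For a real file, the lines could come from open(path, encoding=...), with the encoding chosen to match the corpus files.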

law

Legal texts. The raw files came from CELEX and are reproduced in full under the law/en (English) and law/hu (Hungarian) subdirectories. The aligned files are in the law/bi directory.

lit

Literature. For "classical" material no longer under copyright, the raw files came from Project Gutenberg and the Hungarian Electronic Library. For these, both raw (en, hu) and aligned (bi) files are available. For "modern" material still under copyright and made available to MOKK for research purposes, the sentence pair (bi) files are shuffled together in one bi/Shuffle.

mag

Magazines and news. This material is still largely in preparation at the time the CD goes to press; please visit the Hunglish website for more.

swdoc

Software documentation. The raw files come from OpenOffice.org, Mozilla, Gnome, KDE, and other major FOSS (Free and Open Source Software) projects.

mono

Monolingual Hungarian files taken from the Hungarian National Corpus: parliament and city council minutes, laws, regulations, and other "official" material, as well as the archives of the chat rooms of a major Hungarian internet portal, index.hu. Please note that the tokenization of these files differs from the rest of the corpus in that both sentence-final punctuation and trailing periods after abbreviations are given as whitespace-separated tokens. This explains the discrepancy between the number of words and sentences quoted in the summary below and the numbers reported by wc.
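The effect of this tokenization on word counts can be illustrated with a short Python sketch. The sample sentence is made up, and treating "punctuation-only tokens" as non-words is an assumption for illustration; it is not necessarily how the summary figures were computed.

```python
import string

PUNCT = set(string.punctuation)


def wc_style_count(line: str) -> int:
    """Count whitespace-separated tokens, as `wc -w` does."""
    return len(line.split())


def word_count(line: str) -> int:
    """Count tokens that are not made up entirely of punctuation."""
    return sum(
        1 for tok in line.split()
        if not all(ch in PUNCT for ch in tok)
    )


# Made-up line in the mono tokenization: the period after the abbreviation
# "Kft" and the sentence-final period are separate tokens.
sample = "A Kft . vezetője tegnap nyilatkozott ."

tokens = wc_style_count(sample)  # counts the two stand-alone periods too
words = word_count(sample)       # ignores them
```

Here wc-style counting sees seven tokens while only five of them are words, which is the kind of gap the note above refers to.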

Summary

directory	size (MB)	words	contents	raw text
film	18	3.27 m	movie subtitles	never
law	233	31.53 m	EU law (CELEX)	full
lit	85	17.24 m	literature	when (C) lapsed
mag	5	0.36 m	magazines, news	yes, research only
swdoc	8	1.27 m	software documentation	full
mono	8	46.4 m	monolingual Hungarian	full

A more detailed inventory is available in the form of a catalog and wc files in each directory, the former providing author and title information (in both languages) and the latter containing the output of wc for each file. A file conf which summarizes our confidence in the alignment is also provided for each directory so that users can select a high-confidence subset of the corpus if they wish.
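The per-directory wc files can be summarized programmatically. The sketch below assumes they hold the default output of wc (line count, word count, byte count, file name); if wc was run with other options the columns will differ. The function names and the sample file names are hypothetical.

```python
from typing import List, Tuple

WcRow = Tuple[str, int, int, int]  # (name, lines, words, bytes)


def parse_wc(text: str) -> List[WcRow]:
    """Parse default `wc` output: 'lines words bytes name' per line."""
    rows = []
    for line in text.splitlines():
        parts = line.strip().split(None, 3)
        if len(parts) != 4:
            continue  # not a four-column wc line
        n_lines, n_words, n_bytes, name = parts
        if not (n_lines.isdigit() and n_words.isdigit() and n_bytes.isdigit()):
            continue  # counts must be plain integers
        rows.append((name, int(n_lines), int(n_words), int(n_bytes)))
    return rows


def total_words(rows: List[WcRow]) -> int:
    """Sum word counts, skipping the 'total' line wc adds for several files."""
    return sum(words for name, _, words, _ in rows if name != "total")


# Made-up wc output; the file names are placeholders.
sample = """\
  120  1500  9000 a.hu
   80  1000  6100 b.hu
  200  2500 15100 total
"""
rows = parse_wc(sample)
n_words = total_words(rows)
```

Skipping the total line avoids counting every file twice when the wc output covers more than one file.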

Copyright

The raw files are either public domain or provided for research purposes with the consent of the copyright holder, Diplomacy and Trade Magazine (see mag/*/DTM*).

Documentation

Further details on corpus preparation and alignment can be found in this draft and this paper.

Acknowledgements

We gratefully acknowledge CELEX, Project Gutenberg, the Hungarian Electronic Library, Typotex, and Diplomacy and Trade Magazine.