Introduction
This is the data directory of the Hunglish CDROM, containing the parallel
corpus as sentence pair files and, where copyright permits, the original
monolingual files as well. The corpus is sorted by genre: the main categories
are as follows.
film
Movie subtitles. The raw files were provided to MOKK for
research only and cannot be republished on this CD. The sentence-aligned
files are given in "shuffled" format: one sentence pair per line (Hungarian
sentence TAB English sentence), with the lines sorted alphabetically so as
to make republishing of the original subtitle files impossible. This
data segment has many spelling errors owing to OCR text extraction. As
an aid for selecting a higher-quality subset, sentence-by-sentence
figures of alignment merit are provided in the file
quality.
law
Legal texts. The raw files came from CELEX and are reproduced in full under
the law/en (English) and law/hu
(Hungarian) subdirectories. The aligned files are in the law/bi directory.
lit
Literature. For "classical" material no longer under copyright,
the raw files came from Project
Gutenberg and the Hungarian Electronic
Library. For these, both raw (en, hu) and aligned (bi) files are
available. For "modern" material still under copyright and made available to
MOKK for research purposes, the sentence pair (bi) files are shuffled together
in one bi/Shuffle.
mag
Magazines and news. This material is still largely in preparation
at the time the CD goes to press; please visit the
Hunglish website for more.
swdoc
Software documentation. The raw files come from OpenOffice.org,
Mozilla, Gnome, KDE, and other major FOSS (Free Open Source Software) projects.
mono
Monolingual Hungarian files taken from the Hungarian National
Corpus: parliament and city council minutes, laws, regulations, and other
"official" material as well as the archives of the chat rooms of a major
Hungarian internet portal, index.hu. Please note that the tokenization of these
files differs from the rest of the corpus inasmuch as both sentence-final
punctuation and trailing periods after abbreviations are given as
whitespace-separated tokens. This explains the discrepancy between the
number of words and sentences quoted in the summary below and the number
coming from wc.
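The "shuffled" sentence pair format described above (Hungarian sentence TAB
English sentence, one pair per line) can be read with a few lines of Python.
This is only a minimal sketch: the function name read_pairs and the sample
sentences are made up for illustration, and the handling of empty or
TAB-less lines is an assumption, not documented behavior of the corpus files.

```python
from typing import Iterable, Iterator, Tuple


def read_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (hungarian, english) tuples from TAB-separated lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # Assumption: empty lines carry no data and can be skipped.
            continue
        hungarian, tab, english = line.partition("\t")
        if not tab:
            # Assumption: a line without a TAB is malformed; skip it.
            continue
        yield hungarian, english


# Illustrative sample lines, not taken from the corpus.
sample = ["Jó napot.\tGood day.\n", "Köszönöm.\tThank you.\n"]
pairs = list(read_pairs(sample))
```

To read an actual file, pass an open file object to read_pairs. The character
encoding of the files is not stated here, so it should be checked and given
explicitly when opening them.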
Summary
directory | size (MB) | words | contents | raw text |
film | 18 | 3.27 m | movie subtitles | never |
law | 233 | 31.53 m | EU law (CELEX) | full |
lit | 85 | 17.24 m | literature | when (C) lapsed |
mag | 5 | 0.36 m | magazines, news | yes, research only |
swdoc | 8 | 1.27 m | software documentation | full |
mono | 8 | 46.4 m | monolingual Hungarian | full |
A more detailed inventory is available in the form of a catalog
and wc files in each directory, the former providing author and title
information (in both languages) and the latter containing the output of wc for
each file. A file conf which summarizes our confidence in the
alignment is also provided for each directory so that users can select a
high-confidence subset of the corpus if they wish.
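Selecting a high-confidence subset amounts to keeping only those sentence
pairs whose score passes a threshold. The sketch below assumes one numeric
score per sentence pair, with higher values meaning a better alignment. The
layout of the conf and quality files is not described here, so the function
takes already-parsed scores, and the name select_confident, the sample data,
and the threshold are all hypothetical.

```python
from typing import Iterable, List, Tuple

Pair = Tuple[str, str]


def select_confident(
    pairs: Iterable[Pair], scores: Iterable[float], threshold: float
) -> List[Pair]:
    """Keep the pairs whose score is at least `threshold`.

    Assumption: `scores` holds one value per pair, in the same order,
    and a higher score means a more reliable alignment.
    """
    return [pair for pair, score in zip(pairs, scores) if score >= threshold]


# Illustrative data, not taken from the corpus.
example_pairs = [("Igen.", "Yes."), ("Nem.", "Maybe."), ("Szia.", "Hi.")]
example_scores = [0.9, 0.2, 0.7]
kept = select_confident(example_pairs, example_scores, threshold=0.5)
```

If the scores turn out to be on a different scale or ordered the other way
round, only the comparison in the list comprehension needs to change.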
Copyright
The raw files are either public domain or provided for research purposes
with the consent of the copyright holder, Diplomacy and Trade Magazine
(see mag/*/DTM*).
Documentation
Further details on corpus preparation and alignment
can be found in this draft and this
paper.
Acknowledgements
We gratefully acknowledge CELEX, Project Gutenberg, the Hungarian
Electronic Library, Typotex, and Diplomacy and Trade Magazine.