Construction and Annotation of a Maltese Corpus

Joe Caruana

This paper describes work in progress towards the creation of a morpho-syntactically annotated computer corpus of Maltese. Hitherto, the language has received little attention in either corpus linguistics or natural language processing, and it largely lacks the computational resources and tools available for work on other languages.

The texts making up the existing corpus data are drawn preponderantly from written sources, the large majority being newspaper writing. A small proportion of the texts is drawn from transcribed spoken data, largely of a broadcast nature. The tension between the availability of texts and the linguistic desiderata of "representativeness" and "balance" in corpus design is briefly discussed.

The paper then outlines the choices made with regard to encoding the corpus data. The adoption of the XCES/EAGLES corpus architecture and markup guidelines is briefly discussed and motivated. The adaptations that certain linguistic features of Maltese necessitate in both the EAGLES and XCES standards are described and illustrated with reference to the number system of Maltese nouns, the root-and-pattern morphology of Semitic Maltese, and cliticisation. The paper concludes with a prospectus of further work on the project.