TrAVaSI_GDLI-quotation corpus
Please use the following text to cite this item:
Favaro, Manuel; Guadagnini, Elisa; Sassolini, Eva; Biffi, Marco and Montemagni, Simonetta, 2022, TrAVaSI_GDLI-quotation corpus, CLARIN DSpace, http://hdl.handle.net/20.500.11752/ILC-984
Authors
Favaro, Manuel; Guadagnini, Elisa; Sassolini, Eva; Biffi, Marco; Montemagni, Simonetta
Item identifier
http://hdl.handle.net/20.500.11752/ILC-984
Project URL
Date issued
2022
Size
45,000 tokens
Language(s)
Italian
Description
The TrAVaSI_GDLI-quotation corpus (TrAVaSI_GDLI-QC) is a first nucleus of a diachronic corpus for Italian. It collects a sample of the quotations in a historical dictionary, the "Grande Dizionario della Lingua Italiana" (GDLI) by Salvatore Battaglia, whose very large collection of quotations covers the entire history of the Italian language, from the Middle Ages to the present day. Several criteria guided the composition of the corpus. Among the most cited authors, those who together cover the widest chronological span were selected. Representativeness of different text types (e.g. chronicle, literary prose, poetry, treatises) was also taken into account. The resulting TrAVaSI_GDLI-QC consists of two balanced sub-corpora of quotations from works written between the 14th and the 20th century: one collects 1,500 prose quotations from 15 authors (100 each), for a total of about 35,000 tokens; the other gathers 500 poetry quotations from 10 authors (50 each), for a total of about 10,000 tokens. TrAVaSI_GDLI-QC is morpho-syntactically annotated and lemmatized. The annotation conforms to the Universal Dependencies standard (UD, de Marneffe et al. 2021) and was carried out semi-automatically: both sub-corpora were first annotated automatically with the Stanza “combined” model for Italian, and the automatic annotation was then manually revised. The resulting corpus has also been used to retrain Stanza on historical varieties of Italian, with encouraging results.
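The composition figures in the description can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the class and variable names (SubCorpus, PROSE, POETRY, summarize) are hypothetical and not part of the corpus distribution, and the token counts are the rounded "about" figures quoted above, so the derived averages are approximate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubCorpus:
    """One sub-corpus, described only by the figures given in the description."""
    name: str
    authors: int
    quotes_per_author: int
    approx_tokens: int  # rounded figure from the description, not an exact count

    @property
    def quotes(self) -> int:
        # Each author contributes the same number of quotations.
        return self.authors * self.quotes_per_author

    @property
    def tokens_per_quote(self) -> float:
        # Approximate average quotation length in tokens.
        return self.approx_tokens / self.quotes


# Figures taken from the description; the names are illustrative assumptions.
PROSE = SubCorpus("prose", authors=15, quotes_per_author=100, approx_tokens=35_000)
POETRY = SubCorpus("poetry", authors=10, quotes_per_author=50, approx_tokens=10_000)


def summarize(parts):
    """Return total quotations and approximate total tokens over the sub-corpora."""
    return {
        "quotes": sum(p.quotes for p in parts),
        "tokens": sum(p.approx_tokens for p in parts),
    }


if __name__ == "__main__":
    for part in (PROSE, POETRY):
        print(f"{part.name}: {part.quotes} quotes, "
              f"~{part.tokens_per_quote:.1f} tokens per quote")
    print(summarize((PROSE, POETRY)))
```

The totals agree with the stated size of about 45,000 tokens, and the two sub-corpora come out at a similar average quotation length (roughly 20 to 23 tokens).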
Acknowledgement
Regione Toscana (POR FSE 2014-2020 - Asse A - Priorità A.2 – Obiettivo A.2.1 – Azione A.2.1.7)
Project code: 249795
Project name: Trattamento Automatico di Varietà Storiche di Italiano (TrAVaSI)
Collections
This item is Publicly Available
and licensed under: