el Corpus del Español REAL
This project is maintained by CEREAL-es
[01.2025] CEREALv2 is out. This version is extracted from Colossal OSCAR and contains almost 10 times more data than v1: 1 TB of Spanish text with annotations for country of origin.
[01.2025] Data from the Basque Country (.eus), Catalonia (.cat) and Galicia (.gal) have been added to enlarge the amount of text written in the Spanish spoken in the Iberian Peninsula.
Spanish is one of the most widespread languages in the world: it is an official language in 20 countries and the language with the second-largest number of native speakers. Its contact with different coexisting languages and its rich regional and cultural diversity have produced varieties that diverge from one another to different extents. Still, available corpora, and the models trained on them, generally treat Spanish as one monolithic language, which hampers prediction and generation when dealing with different varieties. CEREAL aims to alleviate this situation by making available documents from the Web with annotations for 24 countries of origin.
CEREAL is a Spanish document-level corpus extracted from OSCAR with documents classified according to their country of origin. It covers 24 countries where Spanish is widely spoken. The base corpus, CEREAL (Corpus del Español REAL), contains 13.5 million documents with gold annotations, where the country of origin has been extracted from the information available in the URL of the document. The extended corpus, CEREALex, contains 28 million additional documents with silver annotations, where the country of origin has been automatically assigned by docTransformer, our document-level classifier. Following OSCAR, we release our annotations under a CC0 license, but we do not hold the copyright of the text content, which comes from OSCAR and therefore from Common Crawl.
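For illustration, a gold country label can be read off a document's URL via its country-code top-level domain. A minimal sketch, assuming a ccTLD-based mapping (the dictionary below covers only a few of the 24 countries and is not the exact extraction logic used to build CEREAL):

```python
# Sketch: derive a country label from a document URL via its ccTLD.
# The ccTLD-to-country mapping is illustrative, not the full CEREAL table.
from urllib.parse import urlparse

CCTLD_TO_COUNTRY = {
    "ar": "Argentina", "cl": "Chile", "es": "España", "mx": "México",
}

def country_from_url(url):
    """Return the country label implied by the URL's ccTLD, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_COUNTRY.get(tld)

print(country_from_url("https://www.example.mx/noticias/elote"))  # México
```

Generic TLDs such as .com carry no country signal, which is why only a subset of documents receive gold labels and the rest are classified automatically.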
Differences in background culture, lexicon and grammatical structure present in country-dependent textual corpora leave their imprint on the semantic representations learned from them. In embeddings learned from a monolithic Spanish corpus, these nuances are erased. This is evident when estimating the strength of biases (the effect size) in the semantic spaces and the performance in bilingual lexicon induction (BLI accuracy).
Human biases are non-pejorative indications of human preferences. Through Implicit Association Tests (IAT), psychologists have shown that humans have positive biases towards, for example, flowers (vs. insects) and musical instruments (vs. weapons). We extend this analysis to word embeddings through our in-house CA-WEAT tests and apply it to CEREAL embeddings [1]. As the example in Figure 2 shows, there are indeed differences in preferences, and they are rooted in culture.
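The effect size in such tests follows the standard WEAT formulation: the difference between the mean target-attribute associations, normalised by the pooled standard deviation. A minimal sketch with toy random vectors standing in for the curated CA-WEAT word lists and the CEREAL embeddings:

```python
# Sketch: WEAT effect size over two target sets (X, Y; e.g. flowers vs
# insects) and two attribute sets (A, B; e.g. pleasant vs unpleasant).
# All vectors here are random toy data, not real embeddings.
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def s(w, A, B):
    """Association of word vector w with attributes A versus B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def effect_size(X, Y, A, B):
    """Difference of mean associations over the pooled standard deviation."""
    sX = [s(x, A, B) for x in X]
    sY = [s(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)

rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(5, 50)) for _ in range(4))
print(round(effect_size(X, Y, A, B), 3))
```

A positive effect size indicates that the X targets are more strongly associated with the A attributes, which is how the culture-dependent preferences are quantified.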
The lexicon can differ substantially across Spanish-speaking countries. For instance, elote, choclo and mazorca all mean “corn” in different regions. The word itself also differs in usage, being much more frequent in America than in Europe. As Figure 3 shows, in the Mexican embedding space elote appears close to dishes where it is an essential ingredient (e.g., tamales and esquites) but also close to other vegetables (e.g., chiles and calabaza). In the Chilean embedding space, choclo appears surrounded only by other vegetables. This behaviour makes the topologies of the embedding spaces differ [2], which is relevant for NLP tasks such as bilingual lexicon induction [1].
Figure 3. t-SNE projections of the embedding spaces for Mexico, Chile and Spain. The Spanish words equivalent to the English "corn" differ across countries, and so do the neighbouring words and their positions.
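Neighbourhoods like those projected in Figure 3 can be inspected with plain cosine similarity. A sketch with random toy vectors standing in for a country-specific embedding model (the words come from the Mexican example; with the real CEREAL embeddings one would load the per-country vectors and query elote, choclo, etc.):

```python
# Sketch: retrieve the nearest neighbours of a word by cosine similarity.
# The vectors below are random stand-ins for a country-specific model.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["elote", "tamales", "esquites", "chiles", "calabaza", "coche"]
vectors = {w: rng.normal(size=32) for w in vocab}

def nearest(word, k=3):
    """Return the k nearest neighbours of `word` by cosine similarity."""
    q = vectors[word]
    sims = {w: q @ v / (np.linalg.norm(q) * np.linalg.norm(v))
            for w, v in vectors.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest("elote"))
```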
As an example of its importance, we induce the VARILEX word dictionaries using VecMap on CEREAL embeddings [1]:
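Once VecMap has mapped the two monolingual spaces into a shared space, induction reduces to cosine nearest-neighbour retrieval. A sketch with toy stand-ins for the mapped embedding matrices and the target vocabulary (the real experiments use the VARILEX entries and retrieval refinements such as CSLS):

```python
# Sketch: induce a bilingual lexicon from already-mapped embedding spaces
# by nearest-neighbour retrieval.  Matrices and vocabulary are toy data.
import numpy as np

def induce_lexicon(src_vecs, tgt_vecs, tgt_vocab):
    """Return, for each source row, the cosine-nearest target word."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return [tgt_vocab[i] for i in (src @ tgt.T).argmax(axis=1)]

# Toy check: identical vectors retrieve themselves.
rng = np.random.default_rng(2)
tgt_vecs = rng.normal(size=(4, 16))
tgt_vocab = ["elote", "choclo", "mazorca", "maíz"]
print(induce_lexicon(tgt_vecs[:2], tgt_vecs, tgt_vocab))
```

BLI accuracy is then the fraction of induced pairs that match a gold dictionary entry.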
The topologies of the embedding spaces differ enough that distances between the spaces allow us to infer differences (and similarities) among varieties. We estimate these distances with several isomorphism metrics and derive dendrograms showing the proximity of the varieties [2].
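The clustering step can be sketched with SciPy's hierarchical clustering, assuming a pairwise distance matrix between varieties has already been computed with an isomorphism metric (the values below are illustrative, not the measured distances from [2]):

```python
# Sketch: from a pairwise distance matrix between variety embedding spaces
# to a dendrogram.  The matrix values are illustrative toy data.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

varieties = ["ar", "cl", "es", "mx"]
D = np.array([[0.0, 0.2, 0.6, 0.3],   # symmetric matrix with zero diagonal
              [0.2, 0.0, 0.7, 0.4],
              [0.6, 0.7, 0.0, 0.5],
              [0.3, 0.4, 0.5, 0.0]])

# Condense the square matrix and cluster with average linkage.
Z = linkage(squareform(D), method="average")
tree = dendrogram(Z, labels=varieties, no_plot=True)
print(tree["ivl"])  # leaf order places the closest varieties together
```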
More examples, experiments and the technical details of the document-level classifier, measuring isomorphism between semantic spaces, and the creation and adaptation of our multi-variety resources can be found in [1,2].
Note: Download CEREALv2 from Zenodo. The links below correspond to CEREALv1 and are kept here to reproduce the results of the papers [1,2].
The table shows the statistics (number of documents and unique sentences per country) for CEREAL and CEREALex. All country-specific datasets are available through Zenodo. Click on the red numbers to download the collections for the different language varieties at the document and the sentence level. Links to the word embeddings built with the sentence-level corpora, together with their vocabularies, are also provided. Note that the embeddings are estimated after cleaning the sentence-level corpus, which is distributed only deduplicated and in alphabetical order, without any further cleaning.
Country | Code | CEREAL (#docs) | CEREALex (#docs) | CEREAL (#frag) | CEREALex (#frag) | CEREAL (vocab) | CEREALex (vocab)
---|---|---|---|---|---|---|---
Andorra | ad | 1,551 | — | 13,023 | — | 2,672 [tsne] | — |
Argentina | ar | 1,969,559 | 2,713,759 | 20,958,972 | 33,854,130 | 284,192 [tsne] | 532,890 |
Bolivia | bo | 74,673 | — | 976,031 | — | 53,800 [tsne] | — |
Chile | cl | 1,115,516 | 1,095,185 | 12,100,443 | 10,077,118 | 199,494 [tsne] | 307,846 |
Colombia | co | 649,991 | — | 8,331,461 | — | 163,213 [tsne] | — |
Costa Rica | cr | 59,069 | — | 826,332 | — | 45,894 [tsne] | — |
Cuba | cu | 116,390 | — | 1,921,505 | — | 82,276 [tsne] | — |
República Dominicana | do | 113,676 | — | 1,184,014 | — | 52,410 [tsne] | — |
Ecuador | ec | 157,755 | — | 1,624,840 | — | 64,313 [tsne] | — |
España | es | 5,714,316 | 15,689,557 | 70,458,818 | 192,199,885 | 596,843 [tsne] | 1,428,724 |
Guinea Ecuatorial | gq | 801 | — | 4,055 | — | 1,699 [tsne] | — |
Guatemala | gt | 51,273 | — | 561,899 | — | 35,861 [tsne] | — |
Honduras | hn | 59,662 | — | 656,485 | — | 35,708 [tsne] | — |
México | mx | 2,443,404 | 3,314,396 | 20,883,245 | 39,410,541 | 250,314 [tsne] | 489,705 |
Nicaragua | ni | 36,880 | — | 405,986 | — | 31,346 [tsne] | — |
Panamá | pa | 39,027 | — | 449,172 | — | 31,269 [tsne] | — |
Perú | pe | 441,513 | — | 5,069,664 | — | 122,885 [tsne] | — |
Filipinas | ph | 109 | — | — | — | 406 [tsne] | — |
Puerto Rico | pr | 11,972 | — | 128,110 | — | 15,063 [tsne] | — |
Paraguay | py | 66,438 | — | 775,578 | — | 46,514 [tsne] | — |
El Salvador | sv | 41,037 | — | 401,553 | — | 29,434 [tsne] | — |
United States | us | 21,746 | — | 378,458 | — | 34,369 [tsne] | — |
Uruguay | uy | 153,713 | — | 1,805,013 | — | 75,492 [tsne] | — |
Venezuela | ve | 109,084 | — | 1,202,227 | — | 59,335 [tsne] | — |
Mix | mix | — | 4,866,901 | — | 61,908,112 | — | — |
All | all | 13,449,155 | 27,679,798 | 151,116,884 | 337,449,786 | 736,896 | — |
Those interested in replicating our experiments and the models used to produce CEREALex can download the corpora used for training the 3-class, 4-class and 5-class classifiers.
Training corpora (docTransformer classifier) | Training | Validation | Test
---|---|---|---
3-class (es, mx, mix) | training | validation | test |
4-class (cl, es, mx, mix) | training | validation | test |
5-class (ar, cl, es, mx, mix) | training | validation | test |
The classification models trained with our document-level classifier are hosted on Hugging Face.
The table above links to the word embedding models per country and configuration. To reproduce the work in [2], we also provide embeddings for the 24 Spanish varieties with two additional seeds (seed 2, seed 3), and five embedding models for Peninsular Spanish differing in training data or seed.
We collect CA-WEAT1 and CA-WEAT2 lists from volunteers in Bolivia, Colombia, Cuba, Ecuador, Mexico and Spain. These lists are used in [1] to quantify human biases in CEREAL embeddings.
We adapt the VARILEX-R bilingual lexicons, pairing English with 21 Spanish varieties. We provide a subset of entries both at the phrase and at the word level. This resource is used in [1] for the bilingual lexicon induction experiments.
Visit the GitHub repositories containing the code for the document-level classifier, the stylistic analysis and the analysis of human biases with CA-WEAT lists.
Please use the following entries when citing this research work.
@InProceedings{espana-bonet-barron-cedeno-2024-elote-naacl,
title = "Elote, Choclo and Mazorca: on the Varieties of {S}panish",
author = "Espa{\~n}a-Bonet, Cristina and
Barr{\'o}n-Cede{\~n}o, Alberto",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.204",
pages = "3689--3711"
}
@inproceedings{espana-bonet-etal-2024-elote,
title = "When Elote, Choclo and Mazorca are not the Same. Isomorphism-Based Perspective to the {S}panish Varieties Divergences",
author = "Espa{\~n}a-Bonet, Cristina and
Bhatt, Ankur and
Dutta Chowdhury, Koel and
Barr{\'o}n-Cede{\~n}o, Alberto",
editor = {Scherrer, Yves and
Jauhiainen, Tommi and
Ljube{\v{s}}i{\'c}, Nikola and
Zampieri, Marcos and
Nakov, Preslav and
Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.vardial-1.5",
pages = "56--77"
}