el Corpus del Español REAL




This project is maintained by CEREAL-es

Elote, Choclo and Mazorca: on the Varieties of Spanish

Spanish is one of the most widespread languages in the world: it is an official language in 20 countries and the second most spoken native language. Its contact with coexisting languages and its rich regional and cultural diversity have produced varieties that diverge from each other to different extents. Still, available corpora, and the models trained on them, generally treat Spanish as one monolithic language, which hampers prediction and generation power when dealing with different varieties. CEREAL aims to alleviate this situation by making available documents from the Web with annotations for 24 countries of origin.

Countries where Spanish is spoken and the proportion of online data

Figure 1. Countries where Spanish is spoken. Orange bubbles represent the proportion of documents in CEREAL, while coloured bubbles represent each country's population. Mexico, the country with the highest number of Spanish speakers, is taken as the unit of measure: countries with a larger documents-to-population ratio appear in orange, while countries with a lower ratio show the superposition of the two bubbles.


CEREAL & CEREALex

CEREAL is a Spanish document-level corpus extracted from OSCAR, with documents classified according to their country of origin. It covers 24 countries where Spanish is widely spoken. The base corpus, CEREAL (Corpus del Español REAL), contains 13.5 million documents with gold annotations, where the country of origin has been extracted from the information available in the URL of the document. The extended corpus, CEREALex, contains 28 million additional documents with silver annotations, where the country of origin has been automatically assigned using docTransformer, our document-level classifier. Following OSCAR, we release our annotations under a CC0 license, but we do not hold the copyright of the text content, which comes from OSCAR and therefore from Common Crawl.


Cultural effects in CEREAL embeddings

The different background cultures, lexicons and grammatical structures present in country-dependent textual corpora leave their imprint on the semantic representations learned from them. In embeddings learned from a monolithic Spanish corpus, these nuances are erased. This becomes evident when estimating the strength of biases (the effect size) in the semantic spaces and the performance in bilingual lexicon induction (accuracy in BLI).

Human biases are non-pejorative indications of human preferences. Psychologists have shown through Implicit Association Tests (IAT) that humans have positive biases towards, for example, flowers (vs. insects) and musical instruments (vs. weapons). We extend this analysis to word embeddings through our in-house CA-WEAT tests and apply it to CEREAL embeddings [1]. As the example in Figure 2 shows, there are indeed differences in preferences that are rooted in culture.
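CA-WEAT scores follow the standard WEAT effect size: each target word is associated with the two attribute sets via mean cosine similarity, and the difference between the two target sets is normalised by the pooled standard deviation. A minimal sketch with toy vectors (the names and data are ours, not the paper's implementation):

```python
import numpy as np

def _cos(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def association(w, A, B):
    """s(w, A, B): mean cosine to attribute set A minus mean cosine to B."""
    return np.mean([_cos(w, a) for a in A]) - np.mean([_cos(w, b) for b in B])

def weat_effect_size(X, Y, A, B):
    """WEAT/CA-WEAT effect size (Cohen's d style) between target word sets
    X, Y and attribute word sets A, B; all arguments are lists of vectors."""
    sx = [association(x, A, B) for x in X]
    sy = [association(y, A, B) for y in Y]
    return (np.mean(sx) - np.mean(sy)) / np.std(sx + sy, ddof=1)
```

A positive value indicates that the X targets (e.g. flowers) lean towards the A attributes (e.g. pleasant words) more than the Y targets do.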

Embeddings + CA-WEATs from Spain and from Mexico:

Figure 2. These results for IAT1 (flowers vs. insects preference) show that the bias is stronger in Spain than in Mexico. The difference may have cultural roots (some insects are edible in Mexico but not in Spain, for instance) and would be diluted if the texts from Spain and Mexico were considered together.


The lexicon can differ substantially across Spanish-speaking countries. For instance, elote, choclo and mazorca all mean "corn" in different regions. The concept itself is also used differently, being much more frequent in America than in Europe. As Figure 3 shows, in the Mexican embedding space elote appears close to dishes in which it is an essential ingredient (e.g., tamales and esquites) but also close to other vegetables (e.g., chiles and calabaza). In the Chilean embedding space, choclo appears surrounded by other vegetables only. This behaviour makes the topology of the embedding spaces differ [2] and is therefore relevant for NLP tasks such as bilingual lexicon induction [1].
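The neighbourhoods behind Figure 3 come from plain cosine nearest-neighbour queries over each country's embedding space; a self-contained sketch with toy vectors (real CEREAL embeddings would be loaded from the released models):

```python
import numpy as np

def nearest_neighbours(word, vocab_vectors, k=3):
    """Return the k words whose vectors are closest (cosine) to `word`'s.
    `vocab_vectors` maps words to numpy vectors."""
    q = vocab_vectors[word]
    q = q / np.linalg.norm(q)
    scored = []
    for w, v in vocab_vectors.items():
        if w == word:
            continue  # skip the query word itself
        scored.append((float(q @ (v / np.linalg.norm(v))), w))
    return [w for _, w in sorted(scored, reverse=True)[:k]]
```

Running the same query word (or its regional equivalent) against the Mexican, Chilean and Spanish spaces surfaces the differing neighbourhoods discussed above.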

t-SNE projections for Mexico (mx), Chile (cl) and Spain (es):

Figure 3. t-SNE projection of the embedding spaces. The Spanish words equivalent to the English "corn" differ across countries, and so do the neighbouring words and their positions.

As an example of its practical importance, we induce the VARILEX word dictionaries using VecMap on CEREAL embeddings [1]:

Accuracy on BLI



Figure 4. Choosing the embedding space corresponding to the variety of the dictionary we want to induce achieves the highest accuracy, even higher than using the embeddings built with all the available data in any Spanish variety (all).
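The BLI accuracy in Figure 4 is precision@1: after mapping the two embedding spaces into a shared space (with VecMap in [1]), each source word's nearest target neighbour is compared against the gold translation. A sketch of the evaluation step only, with toy data and our own function names:

```python
import numpy as np

def bli_precision_at_1(src_emb, tgt_emb, gold):
    """Precision@1 for bilingual lexicon induction. `src_emb`/`tgt_emb` map
    words to vectors already projected into a shared space; `gold` maps each
    source word to its reference translation."""
    tgt_words = list(tgt_emb)
    # Normalised target matrix: one row per target word.
    T = np.stack([tgt_emb[w] / np.linalg.norm(tgt_emb[w]) for w in tgt_words])
    hits = 0
    for s, t_gold in gold.items():
        q = src_emb[s] / np.linalg.norm(src_emb[s])
        pred = tgt_words[int(np.argmax(T @ q))]  # cosine nearest neighbour
        hits += pred == t_gold
    return hits / len(gold)
```

The mapping step itself (learning the orthogonal projection between spaces) is VecMap's job and is omitted here.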


The topology of the embedding spaces differs enough that distances between the spaces allow us to infer differences (and similarities) among varieties. We estimate the distances with several isomorphism metrics and derive dendrograms showing the proximity of the varieties [2].
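One such metric is Eigenvalue (EV) similarity, which compares the Laplacian spectra of nearest-neighbour graphs built from each embedding space. The sketch below is a simplified version (a fixed k instead of the energy-based cut-off used in the literature, and adjacency matrices as input):

```python
import numpy as np

def laplacian_eigenvalues(adj):
    """Eigenvalues of the unnormalised graph Laplacian L = D - A,
    sorted in descending order. `adj` is a symmetric adjacency matrix."""
    L = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(L))[::-1]

def ev_similarity(adj1, adj2, k=10):
    """Simplified EV divergence between two nearest-neighbour graphs:
    sum of squared differences of the k largest Laplacian eigenvalues.
    0 means identical spectra; larger values mean less isomorphic spaces."""
    e1 = laplacian_eigenvalues(adj1)[:k]
    e2 = laplacian_eigenvalues(adj2)[:k]
    return float(np.sum((e1 - e2) ** 2))
```

Computed for every pair of varieties, such scores give the distance matrix the dendrograms are built from.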

Dendrogram visualisation

Dendrogram with the EV (100 freq) metric

Figure 5. The figure shows the (visual) dendrogram obtained with the scores given by the Eigenvalue similarity metric applied to every pair of varieties. According to these results, voseo is the strongest characteristic derived from the CEREAL word embeddings.
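A dendrogram like the one in Figure 5 can be derived by feeding the pairwise isomorphism scores to agglomerative clustering. The sketch below uses invented placeholder distances for four varieties, not the actual EV scores from [2]:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Invented pairwise distances between four varieties (placeholders).
varieties = ["ar", "cl", "es", "mx"]
dist = np.array([[0.0, 0.2, 0.8, 0.7],
                 [0.2, 0.0, 0.9, 0.6],
                 [0.8, 0.9, 0.0, 0.5],
                 [0.7, 0.6, 0.5, 0.0]])

# Condense the symmetric matrix and build the hierarchical tree;
# scipy.cluster.hierarchy.dendrogram(Z) would plot it.
Z = linkage(squareform(dist), method="average")
clusters = fcluster(Z, t=2, criterion="maxclust")  # cut into two groups
```

With these toy distances, the two voseo-like varieties (ar, cl) end up in one cluster and (es, mx) in the other, mirroring the kind of grouping the figure shows.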


More examples, experiments and the technical details of the document-level classifier, measuring isomorphism between semantic spaces, and the creation and adaptation of our multi-variety resources can be found in [1,2].


Download the data

The table shows the statistics (number of documents and unique sentences per variety) for CEREAL and CEREALex. All country-specific datasets are available through Zenodo. Click on the red numbers to download the collections for the different language varieties at the document and the sentence level. Links to the word embeddings built with the sentence-level corpora, together with their vocabularies, are also provided. Notice that the embeddings are estimated after cleaning the sentence-level corpus, which is itself released only deduplicated and in alphabetical order, without any further cleaning.
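Conceptually, the sentence-level release is just the set of unique fragments in alphabetical order; a trivial illustration of that preparation (not the actual pipeline):

```python
def dedupe_and_sort(sentences):
    """Unique sentences in alphabetical order, as in the sentence-level release."""
    return sorted(set(sentences))
```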

| Country | Code | #docs CEREAL | #docs CEREALex | #frag CEREAL | #frag CEREALex | vocab CEREAL | vocab CEREALex |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Andorra | ad | 1,551 | — | 13,023 | — | 2,672 [tsne] | — |
| Argentina | ar | 1,969,559 | 2,713,759 | 20,958,972 | 33,854,130 | 284,192 [tsne] | 532,890 |
| Bolivia | bo | 74,673 | — | 976,031 | — | 53,800 [tsne] | — |
| Chile | cl | 1,115,516 | 1,095,185 | 12,100,443 | 10,077,118 | 199,494 [tsne] | 307,846 |
| Colombia | co | 649,991 | — | 8,331,461 | — | 163,213 [tsne] | — |
| Costa Rica | cr | 59,069 | — | 826,332 | — | 45,894 [tsne] | — |
| Cuba | cu | 116,390 | — | 1,921,505 | — | 82,276 [tsne] | — |
| República Dominicana | do | 113,676 | — | 1,184,014 | — | 52,410 [tsne] | — |
| Ecuador | ec | 157,755 | — | 1,624,840 | — | 64,313 [tsne] | — |
| España | es | 5,714,316 | 15,689,557 | 70,458,818 | 192,199,885 | 596,843 [tsne] | 1,428,724 |
| Guinea Ecuatorial | gq | 801 | — | 4,055 | — | 1,699 [tsne] | — |
| Guatemala | gt | 51,273 | — | 561,899 | — | 35,861 [tsne] | — |
| Honduras | hn | 59,662 | — | 656,485 | — | 35,708 [tsne] | — |
| México | mx | 2,443,404 | 3,314,396 | 20,883,245 | 39,410,541 | 250,314 [tsne] | 489,705 |
| Nicaragua | ni | 36,880 | — | 405,986 | — | 31,346 [tsne] | — |
| Panamá | pa | 39,027 | — | 449,172 | — | 31,269 [tsne] | — |
| Perú | pe | 441,513 | — | 5,069,664 | — | 122,885 [tsne] | — |
| Filipinas | ph | 109 | — | — | — | 406 [tsne] | — |
| Puerto Rico | pr | 11,972 | — | 128,110 | — | 15,063 [tsne] | — |
| Paraguay | py | 66,438 | — | 775,578 | — | 46,514 [tsne] | — |
| El Salvador | sv | 41,037 | — | 401,553 | — | 29,434 [tsne] | — |
| United States | us | 21,746 | — | 378,458 | — | 34,369 [tsne] | — |
| Uruguay | uy | 153,713 | — | 1,805,013 | — | 75,492 [tsne] | — |
| Venezuela | ve | 109,084 | — | 1,202,227 | — | 59,335 [tsne] | — |
| Mix | mix | — | 4,866,901 | — | 61,908,112 | — | — |
| All | all | 13,449,155 | 27,679,798 | 151,116,884 | 337,449,786 | 736,896 | — |

Those interested in replicating our experiments and the models used to produce CEREALex can download the corpora used for training the 3-class, 4-class and 5-class classifiers.

Training corpora (docTransformer classifier)
3-class (es, mx, mix): training, validation, test
4-class (cl, es, mx, mix): training, validation, test
5-class (ar, cl, es, mx, mix): training, validation, test


Download the models

The classification models trained with our document-level classifier are hosted by HuggingFace.

The table above links to the word-embedding models per country and configuration. To reproduce the work in [2], we also provide embeddings for the 24 Spanish varieties with two additional seeds (seed 2, seed 3), and five embedding models for Peninsular Spanish that differ in the training data or the seed.


Download the additional resources

We collect CA-WEAT1 and CA-WEAT2 lists from volunteers in Bolivia, Colombia, Cuba, Ecuador, Mexico and Spain. These lists are used in [1] to quantify human biases in CEREAL embeddings.

We adapt the VARILEX-R bilingual lexicons, pairing English with 21 Spanish varieties. We provide a subset of entries at both the phrase and the word level. This resource is used in [1] for the bilingual lexicon induction experiments.


Download the code

Visit the GitHub repositories containing the code for the document-level classifier, the stylistic analysis and the analysis of human biases with CA-WEAT lists.


Citation

Please use the following entries when citing this research work.

[1] Cristina España-Bonet and Alberto Barrón-Cedeño. Elote, Choclo and Mazorca: on the Varieties of Spanish. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics (NAACL'24). Mexico City, Mexico, June 2024.
@InProceedings{espana-bonet-barron-cedeno-2024-elote-naacl,
    title = "Elote, Choclo and Mazorca: on the Varieties of {S}panish",
    author = "Espa{\~n}a-Bonet, Cristina  and
      Barr{\'o}n-Cede{\~n}o, Alberto",
    editor = "Duh, Kevin  and
      Gomez, Helena  and
      Bethard, Steven",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-long.204",
    pages = "3689--3711"
}
[2] Cristina España-Bonet, Ankur Bhatt, Koel Dutta Chowdhury and Alberto Barrón-Cedeño. When Elote, Choclo and Mazorca are not the Same. Isomorphism-Based Perspective to the Spanish Varieties Divergences. In Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial'24). Mexico City, Mexico, June 2024.
@inproceedings{espana-bonet-etal-2024-elote,
    title = "When Elote, Choclo and Mazorca are not the Same. Isomorphism-Based Perspective to the {S}panish Varieties Divergences",
    author = "Espa{\~n}a-Bonet, Cristina  and
      Bhatt, Ankur  and
      Dutta Chowdhury, Koel  and
      Barr{\'o}n-Cede{\~n}o, Alberto",
    editor = {Scherrer, Yves  and
      Jauhiainen, Tommi  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      Zampieri, Marcos  and
      Nakov, Preslav  and
      Tiedemann, J{\"o}rg},
    booktitle = "Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.vardial-1.5",
    pages = "56--77"
}