el Corpus del Español REAL
This project is maintained by CEREAL-es
[01.2025] CEREALv2 is out. This version is extracted from Colossal OSCAR and contains almost 10 times more data than v1: 1 TB of Spanish text with annotations for country of origin.
[01.2025] Data from the Basque Country (.eus), Catalonia (.cat) and Galicia (.gal) have been added to enlarge the amount of text written in the Spanish spoken in the Iberian Peninsula.
Spanish is one of the most widespread languages in the world: it is an official language in 20 countries and the language with the second-largest number of native speakers. Its contact with different coexisting languages and its rich regional and cultural diversity have produced varieties that diverge from one another to different extents. Still, available corpora, and the models trained on them, generally treat Spanish as one monolithic language, which hampers prediction and generation when dealing with different varieties. CEREAL aims to alleviate this situation by making available documents from the Web with annotations for 24 countries of origin.
CEREAL is a Spanish document-level corpus extracted from OSCAR with documents classified according to their country of origin. It covers 24 countries where Spanish is widely spoken. The base corpus, CEREAL (Corpus del Español REAL), contains 13.5 million documents with gold annotations, where the country of origin has been extracted from the information available in the URL of the document. The extended corpus, CEREALex, contains 28 million additional documents with silver annotations, where the country of origin has been automatically assigned by docTransformer, our document-level classifier. Following OSCAR, we release our annotations under a CC0 license, but we do not hold the copyright of the text content, which comes from OSCAR and therefore from Common Crawl.
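For illustration, a gold country label can be read off a document's URL via its country-code top-level domain. A minimal sketch, assuming a ccTLD-based mapping (the dictionary below covers only a few of the 24 countries and is not the exact extraction logic used to build CEREAL):

```python
# Sketch: derive a country label from a document URL via its ccTLD.
# The ccTLD-to-country mapping is illustrative, not the full CEREAL table.
from urllib.parse import urlparse

CCTLD_TO_COUNTRY = {
    "ar": "Argentina", "cl": "Chile", "es": "España", "mx": "México",
}

def country_from_url(url):
    """Return the country label implied by the URL's ccTLD, or None."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1]
    return CCTLD_TO_COUNTRY.get(tld)

print(country_from_url("https://www.example.mx/noticias/elote"))  # México
```

Generic TLDs such as .com carry no country signal, which is why only a subset of documents receive gold labels and the rest are classified automatically.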
Differences in background culture, lexicon and grammatical structure present in country-dependent textual corpora leave their imprint on the semantic representations learned from them. In embeddings learned from a monolithic Spanish corpus, these nuances are erased. This is evident when estimating the strength of biases (the effect size) in the semantic spaces and the performance in bilingual lexicon induction (BLI accuracy).
Human biases are non-pejorative indications of human preferences. Through Implicit Association Tests (IAT), psychologists have shown that humans have positive biases towards, for example, flowers (vs. insects) and musical instruments (vs. weapons). We extend this analysis to word embeddings through our in-house CA-WEAT tests and apply it to CEREAL embeddings [1]. As the example in Figure 2 shows, there are indeed differences in preferences, and they are rooted in culture.
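The effect size in such tests follows the standard WEAT formulation: the difference between the mean target-attribute associations, normalised by the pooled standard deviation. A minimal sketch with toy random vectors standing in for the curated CA-WEAT word lists and the CEREAL embeddings:

```python
# Sketch: WEAT effect size over two target sets (X, Y; e.g. flowers vs
# insects) and two attribute sets (A, B; e.g. pleasant vs unpleasant).
# All vectors here are random toy data, not real embeddings.
import numpy as np

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def s(w, A, B):
    """Association of word vector w with attributes A versus B."""
    return np.mean([cos(w, a) for a in A]) - np.mean([cos(w, b) for b in B])

def effect_size(X, Y, A, B):
    """Difference of mean associations over the pooled standard deviation."""
    sX = [s(x, A, B) for x in X]
    sY = [s(y, A, B) for y in Y]
    return (np.mean(sX) - np.mean(sY)) / np.std(sX + sY, ddof=1)

rng = np.random.default_rng(0)
X, Y, A, B = (rng.normal(size=(5, 50)) for _ in range(4))
print(round(effect_size(X, Y, A, B), 3))
```

A positive effect size indicates that the X targets are more strongly associated with the A attributes, which is how the culture-dependent preferences are quantified.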
The lexicon can differ substantially across Spanish-speaking countries. For instance, elote, choclo and mazorca all mean “corn” in different regions. The word itself also differs in usage, being much more frequent in America than in Europe. As Figure 3 shows, in the Mexican embedding space elote appears close to dishes where it is an essential ingredient (e.g., tamales and esquites) but also close to other vegetables (e.g., chiles and calabaza). In the Chilean embedding space, choclo appears surrounded only by other vegetables. This behaviour makes the topologies of the embedding spaces differ [2], which is relevant for NLP tasks such as bilingual lexicon induction [1].
Figure 3. t-SNE projections of the embedding spaces for Mexico, Chile and Spain. The Spanish words equivalent to the English "corn" differ across countries, and so do the neighbouring words and their positions.
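Neighbourhoods like those projected in Figure 3 can be inspected with plain cosine similarity. A sketch with random toy vectors standing in for a country-specific embedding model (the words come from the Mexican example; with the real CEREAL embeddings one would load the per-country vectors and query elote, choclo, etc.):

```python
# Sketch: retrieve the nearest neighbours of a word by cosine similarity.
# The vectors below are random stand-ins for a country-specific model.
import numpy as np

rng = np.random.default_rng(1)
vocab = ["elote", "tamales", "esquites", "chiles", "calabaza", "coche"]
vectors = {w: rng.normal(size=32) for w in vocab}

def nearest(word, k=3):
    """Return the k nearest neighbours of `word` by cosine similarity."""
    q = vectors[word]
    sims = {w: q @ v / (np.linalg.norm(q) * np.linalg.norm(v))
            for w, v in vectors.items() if w != word}
    return sorted(sims, key=sims.get, reverse=True)[:k]

print(nearest("elote"))
```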
As an example of its importance, we induce the VARILEX word dictionaries using VecMap on CEREAL embeddings [1]:
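Once VecMap has mapped the two monolingual spaces into a shared space, induction reduces to cosine nearest-neighbour retrieval. A sketch with toy stand-ins for the mapped embedding matrices and the target vocabulary (the real experiments use the VARILEX entries and retrieval refinements such as CSLS):

```python
# Sketch: induce a bilingual lexicon from already-mapped embedding spaces
# by nearest-neighbour retrieval.  Matrices and vocabulary are toy data.
import numpy as np

def induce_lexicon(src_vecs, tgt_vecs, tgt_vocab):
    """Return, for each source row, the cosine-nearest target word."""
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    return [tgt_vocab[i] for i in (src @ tgt.T).argmax(axis=1)]

# Toy check: identical vectors retrieve themselves.
rng = np.random.default_rng(2)
tgt_vecs = rng.normal(size=(4, 16))
tgt_vocab = ["elote", "choclo", "mazorca", "maíz"]
print(induce_lexicon(tgt_vecs[:2], tgt_vecs, tgt_vocab))
```

BLI accuracy is then the fraction of induced pairs that match a gold dictionary entry.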
The topologies of the embedding spaces differ enough that distances between the spaces allow us to infer differences (and similarities) among varieties. We estimate these distances with several isomorphism metrics and derive dendrograms showing the proximity of the varieties [2].
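The clustering step can be sketched with SciPy's hierarchical clustering, assuming a pairwise distance matrix between varieties has already been computed with an isomorphism metric (the values below are illustrative, not the measured distances from [2]):

```python
# Sketch: from a pairwise distance matrix between variety embedding spaces
# to a dendrogram.  The matrix values are illustrative toy data.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
from scipy.spatial.distance import squareform

varieties = ["ar", "cl", "es", "mx"]
D = np.array([[0.0, 0.2, 0.6, 0.3],   # symmetric matrix with zero diagonal
              [0.2, 0.0, 0.7, 0.4],
              [0.6, 0.7, 0.0, 0.5],
              [0.3, 0.4, 0.5, 0.0]])

# Condense the square matrix and cluster with average linkage.
Z = linkage(squareform(D), method="average")
tree = dendrogram(Z, labels=varieties, no_plot=True)
print(tree["ivl"])  # leaf order places the closest varieties together
```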
More examples, experiments and the technical details of the document-level classifier, measuring isomorphism between semantic spaces, and the creation and adaptation of our multi-variety resources can be found in [1,2].
Note: Download CEREALv2 from Zenodo. The links below correspond to CEREALv1 and are kept here to reproduce the results of the papers [1,2].
The table shows the statistics (number of documents and unique sentences per country) for CEREAL and CEREALex. All country-specific datasets are available through Zenodo. Click on the red numbers to download the collections for the different language varieties at the document and the sentence level. Links to the word embeddings built with the sentence-level corpora, together with their vocabularies, are also provided. Note that the embeddings are estimated after cleaning the sentence-level corpus, which is distributed only deduplicated and in alphabetical order, without any further cleaning.
Country | Code | CEREAL (#docs) | CEREALex (#docs) | CEREAL (#frag) | CEREALex (#frag) | CEREAL (vocab) | CEREALex (vocab)
---|---|---|---|---|---|---|---
Andorra | ad | 1,551 | — | 13,023 | — | 2,672 [tsne] | — |
Argentina | ar | 1,969,559 | 2,713,759 | 20,958,972 | 33,854,130 | 284,192 [tsne] | 532,890 |
Bolivia | bo | 74,673 | — | 976,031 | — | 53,800 [tsne] | — |
Chile | cl | 1,115,516 | 1,095,185 | 12,100,443 | 10,077,118 | 199,494 [tsne] | 307,846 |
Colombia | co | 649,991 | — | 8,331,461 | — | 163,213 [tsne] | — |
Costa Rica | cr | 59,069 | — | 826,332 | — | 45,894 [tsne] | — |
Cuba | cu | 116,390 | — | 1,921,505 | — | 82,276 [tsne] | — |
República Dominicana | do | 113,676 | — | 1,184,014 | — | 52,410 [tsne] | — |
Ecuador | ec | 157,755 | — | 1,624,840 | — | 64,313 [tsne] | — |
España | es | 5,714,316 | 15,689,557 | 70,458,818 | 192,199,885 | 596,843 [tsne] | 1,428,724 |
Guinea Ecuatorial | gq | 801 | — | 4,055 | — | 1,699 [tsne] | — |
Guatemala | gt | 51,273 | — | 561,899 | — | 35,861 [tsne] | — |
Honduras | hn | 59,662 | — | 656,485 | — | 35,708 [tsne] | — |
México | mx | 2,443,404 | 3,314,396 | 20,883,245 | 39,410,541 | 250,314 [tsne] | 489,705 |
Nicaragua | ni | 36,880 | — | 405,986 | — | 31,346 [tsne] | — |
Panamá | pa | 39,027 | — | 449,172 | — | 31,269 [tsne] | — |
Perú | pe | 441,513 | — | 5,069,664 | — | 122,885 [tsne] | — |
Filipinas | ph | 109 | — | — | — | 406 [tsne] | — |
Puerto Rico | pr | 11,972 | — | 128,110 | — | 15,063 [tsne] | — |
Paraguay | py | 66,438 | — | 775,578 | — | 46,514 [tsne] | — |
El Salvador | sv | 41,037 | — | 401,553 | — | 29,434 [tsne] | — |
United States | us | 21,746 | — | 378,458 | — | 34,369 [tsne] | — |
Uruguay | uy | 153,713 | — | 1,805,013 | — | 75,492 [tsne] | — |
Venezuela | ve | 109,084 | — | 1,202,227 | — | 59,335 [tsne] | — |
Mix | mix | — | 4,866,901 | — | 61,908,112 | — | — |
All | all | 13,449,155 | 27,679,798 | 151,116,884 | 337,449,786 | 736,896 | — |
Those interested in replicating our experiments and the models used to produce CEREALex can download the corpora used for training the 3-class, 4-class and 5-class classifiers.
Training corpora (docTransformer classifier) | Training | Validation | Test
---|---|---|---
3-class (es, mx, mix) | training | validation | test |
4-class (cl, es, mx, mix) | training | validation | test |
5-class (ar, cl, es, mx, mix) | training | validation | test |
The classification models trained with our document-level classifier are hosted on Hugging Face.
The table above links to the word embedding models per country and configuration. To reproduce the work in [2], we also provide embeddings for the 24 Spanish varieties with two additional seeds (seed 2, seed 3), and five embedding models for Peninsular Spanish differing in training data or seed.
We collect CA-WEAT1 and CA-WEAT2 lists from volunteers in Bolivia, Colombia, Cuba, Ecuador, Mexico and Spain. These lists are used in [1] to quantify human biases in CEREAL embeddings.
We adapt the VARILEX-R bilingual lexicons, pairing English with 21 Spanish varieties. We provide a subset of entries both at the phrase and at the word level. This resource is used in [1] for the bilingual lexicon induction experiments.
Visit the GitHub repositories containing the code for the document-level classifier, the stylistic analysis and the analysis of human biases with CA-WEAT lists.
Please use the following entries when citing this research work.
@InProceedings{espana-bonet-barron-cedeno-2024-elote-naacl,
title = "Elote, Choclo and Mazorca: on the Varieties of {S}panish",
author = "Espa{\~n}a-Bonet, Cristina and
Barr{\'o}n-Cede{\~n}o, Alberto",
editor = "Duh, Kevin and
Gomez, Helena and
Bethard, Steven",
booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.naacl-long.204",
pages = "3689--3711"
}
@inproceedings{espana-bonet-etal-2024-elote,
title = "When Elote, Choclo and Mazorca are not the Same. Isomorphism-Based Perspective to the {S}panish Varieties Divergences",
author = "Espa{\~n}a-Bonet, Cristina and
Bhatt, Ankur and
Dutta Chowdhury, Koel and
Barr{\'o}n-Cede{\~n}o, Alberto",
editor = {Scherrer, Yves and
Jauhiainen, Tommi and
Ljube{\v{s}}i{\'c}, Nikola and
Zampieri, Marcos and
Nakov, Preslav and
Tiedemann, J{\"o}rg},
booktitle = "Proceedings of the Eleventh Workshop on NLP for Similar Languages, Varieties, and Dialects (VarDial 2024)",
month = jun,
year = "2024",
address = "Mexico City, Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.vardial-1.5",
pages = "56--77"
}