WiLI-2018: Wikipedia Language Identification Database

Resource type:

The dataset contains 1,000 paragraphs in Faroese, and the same number in each of the other 234 languages. For every language, 500 paragraphs are in the training set and the other 500 are in the test set. You can download the files for all 235 languages below.

Download

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced (1,000 paragraphs per language), and a train-test split is provided.

Download page: https://zenodo.org/records/841984
Hugging Face Dataset: https://huggingface.co/datasets/wili_2018
Article: https://arxiv.org/pdf/1801.07779.pdf
Release: 2018
Contact: info@martin-thoma.de
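
The balance and the 500/500 split described above can be checked once the archive is unpacked. The following is a minimal sketch, not part of the dataset's own tooling. It assumes that the label files are named `y_train.txt` and `y_test.txt`, that they hold one language code per line, that the files sit in a hypothetical `wili-2018` directory, and that Faroese is labelled `fao`. Adjust these names if your copy differs.

```python
from collections import Counter
from pathlib import Path


def count_paragraphs_per_language(label_file: Path) -> Counter:
    """Count how many paragraphs carry each language label.

    Assumes one label per line, with blank lines ignored.
    """
    with label_file.open(encoding="utf-8") as handle:
        return Counter(line.strip() for line in handle if line.strip())


def is_balanced(counts: Counter) -> bool:
    """Return True when every language has the same number of paragraphs."""
    return len(set(counts.values())) <= 1


if __name__ == "__main__":
    # Hypothetical directory holding the unpacked files (an assumption).
    data_dir = Path("wili-2018")
    train = count_paragraphs_per_language(data_dir / "y_train.txt")
    test = count_paragraphs_per_language(data_dir / "y_test.txt")

    print(len(train), "languages in the training set; balanced:", is_balanced(train))
    # "fao" is the assumed label for Faroese.
    print("Faroese paragraphs (train/test):", train["fao"], test["fao"])
```

If the description holds, each split should report 235 languages with 500 paragraphs apiece.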

Publisher

Usage

Format

Language

License
