WiLI-2018: Wikipedia Language Identification Database

Resource Type:

This dataset contains 1,000 paragraphs in Faroese, and the same number in each of the other 234 languages. For every language, 500 paragraphs are in the training set and the other 500 are in the test set. The files for all 235 languages can be downloaded below.

Download

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced (1,000 paragraphs per language), and a train-test split is provided.
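
The balance claim can be checked after download with a short script. The following is a minimal Python sketch, not an official loader. It assumes each split is stored as two parallel UTF-8 text files, one paragraph per line in the first and the matching language label on the same line of the second. That layout, the file names in the usage note, and the helper names `read_lines`, `load_split` and `is_balanced` are assumptions made for illustration and should be checked against the downloaded archive.

```python
from collections import Counter


def read_lines(path):
    """Return the lines of a UTF-8 file, splitting only on '\\n'."""
    # newline="\n" keeps other Unicode line separators inside a paragraph
    # from being treated as line breaks, which would misalign the labels.
    with open(path, encoding="utf-8", newline="\n") as handle:
        return [line.rstrip("\r\n") for line in handle]


def load_split(text_path, label_path):
    """Load one split as (paragraphs, labels); the file layout is assumed."""
    texts = read_lines(text_path)
    labels = read_lines(label_path)
    if len(texts) != len(labels):
        raise ValueError(
            f"{len(texts)} paragraphs but {len(labels)} labels; files are misaligned"
        )
    return texts, labels


def is_balanced(labels, expected_per_language):
    """True if every label occurs exactly expected_per_language times."""
    counts = Counter(labels)
    return all(n == expected_per_language for n in counts.values())
```

If the training split were stored as `x_train.txt` and `y_train.txt` (hypothetical names), `load_split("x_train.txt", "y_train.txt")` followed by `is_balanced(labels, 500)` should hold, with `len(set(labels))` equal to 235, given the figures stated above.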

Download page: https://zenodo.org/records/841984
Hugging Face Dataset: https://huggingface.co/datasets/wili_2018
Article: https://arxiv.org/pdf/1801.07779.pdf
Release: 2018
Contact: info@martin-thoma.de

Publisher

Uses

Format

Language(s)


License
