WiLI-2018, the Wikipedia Language Identification benchmark dataset, contains 235,000 paragraphs in 235 languages. The dataset is balanced (1,000 paragraphs per language), and a train-test split is provided.
Download page: https://zenodo.org/records/841984
Hugging Face Dataset: https://huggingface.co/datasets/wili_2018
Article: https://arxiv.org/pdf/1801.07779.pdf
Release: 2018
Contact: info@martin-thoma.de
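
The "balanced" property described above can be checked with a few lines of Python. This is a minimal sketch, not part of the dataset's tooling: the helper name `is_balanced` and the toy label list are hypothetical, and reading the labels from the downloaded files is left out because the file layout is not described here.

```python
from collections import Counter


def is_balanced(labels):
    """Return True if every distinct label occurs equally often."""
    counts = Counter(labels)
    # A balanced label list has exactly one distinct count value
    # (an empty list is treated as trivially balanced).
    return len(set(counts.values())) <= 1


# Toy labels for illustration only; these are NOT taken from the dataset.
toy_labels = ["eng", "deu", "fra", "eng", "deu", "fra"]
toy_is_balanced = is_balanced(toy_labels)

# With the figures stated above, a balanced dataset implies
# 235000 / 235 paragraphs per language.
total_paragraphs = 235000
num_languages = 235
per_language = total_paragraphs // num_languages
```

Running `is_balanced` on the full list of labels from the dataset should return `True` if the labels were read correctly. A `False` result would more likely point to a loading problem than to the dataset itself.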



