WiLI-2018: Wikipedia Language Identification Database

Resource Type: Text

This resource contains 1,000 paragraphs in Faroese, and the same number in each of the other 234 languages. For every language, 500 paragraphs are in the training set and the other 500 are in the test set. You can download the files for all 235 languages below.

Description
Overview

WiLI-2018, the Wikipedia language identification benchmark dataset, contains 235,000 paragraphs across 235 languages. The dataset is balanced (1,000 paragraphs per language), and a fixed train-test split of 500 training and 500 test paragraphs per language is provided.
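
The counts above can be checked with a few lines of Python. This is a minimal sketch, not part of the official distribution: it assumes that each split ships with a plain-text label file holding one language label per line, and the helper names (count_labels, is_balanced, check_split) are hypothetical, introduced only for illustration.

```python
from collections import Counter
from pathlib import Path
from typing import Iterable

# Figures taken from the description: 235 languages, 1,000 paragraphs each,
# split evenly between training and test.
N_LANGUAGES = 235
PER_LANGUAGE = 1_000
TOTAL = N_LANGUAGES * PER_LANGUAGE              # total number of paragraphs
PER_SPLIT = N_LANGUAGES * (PER_LANGUAGE // 2)   # paragraphs in each split


def count_labels(labels: Iterable[str]) -> Counter:
    """Count how many paragraphs carry each language label (blank lines skipped)."""
    return Counter(label.strip() for label in labels if label.strip())


def is_balanced(counts: Counter, per_language: int, n_languages: int) -> bool:
    """Return True if exactly n_languages labels occur, each per_language times."""
    return len(counts) == n_languages and all(
        n == per_language for n in counts.values()
    )


def check_split(label_file: Path, per_language: int = 500,
                n_languages: int = N_LANGUAGES) -> bool:
    """Check one split. Assumes one label per line (assumed file layout)."""
    with label_file.open(encoding="utf-8") as handle:
        return is_balanced(count_labels(handle), per_language, n_languages)


if __name__ == "__main__":
    # Toy data standing in for a real label file: two labels, two paragraphs each.
    toy = count_labels(["fao\n", "eng\n", "fao\n", "eng\n"])
    print(TOTAL, PER_SPLIT, is_balanced(toy, per_language=2, n_languages=2))
```

With the real data, check_split would be called once per split on whichever label file the download provides. The path passed in is up to the reader, since the file names are not stated here.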

Download page: https://zenodo.org/records/841984
Hugging Face Dataset: https://huggingface.co/datasets/wili_2018
Article: https://arxiv.org/pdf/1801.07779.pdf
Release: 2018
Contact: info@martin-thoma.de