FineWeb-2-fao

Resource type:

This is a deduplicated, reproducible dataset of about 95 million words, built from the Faroese text in the Common Crawl corpus. The dataset is split into a training set and a test set.


This is the second iteration of the popular FineWeb dataset, bringing high-quality pretraining data to over 1,000 languages.

The FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. This dataset originates from Common Crawl. The use of this dataset is also subject to CommonCrawl’s Terms of Use.

The data was sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using datatrove, a large-scale data processing library. For PII and opt-out, see "Personal and Sensitive Information and opt-out".

For Faroese, this carefully deduplicated and filtered dataset comprises around 250 megabytes of compressed text data, with roughly 95 million words (95,066,797) in 261,937 documents.
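The word and document counts quoted above could be spot-checked with a sketch like the one below. It rests on several assumptions that this entry does not state: that the Hugging Face `datasets` library is used, that the data is published under the repository id `HuggingFaceFW/fineweb-2` with the config name `fao_Latn`, that each record carries its content in a `text` field, and that a "word" is a whitespace-separated token. The published figure may have been computed with a different tokenizer, so an exact match is not expected.

```python
from typing import Dict, Iterable, Tuple


def count_docs_and_words(
    records: Iterable[Dict[str, str]], text_field: str = "text"
) -> Tuple[int, int]:
    """Count documents and whitespace-separated words in an iterable of records."""
    docs = 0
    words = 0
    for record in records:
        docs += 1
        # str.split() with no arguments splits on any run of whitespace.
        words += len(record[text_field].split())
    return docs, words


def stream_faroese(split: str = "train"):
    """Stream the Faroese subset without downloading it in full.

    The repository id and config name are assumptions, not taken from this entry.
    """
    # Imported lazily so the counting helper works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb-2",  # assumed repository id
        name="fao_Latn",            # assumed config name for Faroese
        split=split,                # "train" or "test"
        streaming=True,
    )
```

Under those assumptions, `count_docs_and_words(stream_faroese("train"))` would iterate over the whole training split. Wrapping the stream in `itertools.islice(..., 1000)` gives a quick estimate from the first thousand documents instead.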

Release: 08.01.2025

Publisher

Usage


Format

Language

License
