This is the second iteration of the popular FineWeb dataset, bringing high-quality pretraining data to over 1,000 languages.
The FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license, and extensively validated through hundreds of ablation experiments. The dataset originates from Common Crawl, so its use is also subject to Common Crawl's Terms of Use.
The data was sourced from 96 Common Crawl snapshots, spanning the summer of 2013 to April 2024, and processed using datatrove, a large-scale data processing library. For PII and opt-out, see Personal and Sensitive Information and opt-out.
For Faroese, this carefully deduplicated and filtered dataset comprises around 250 megabytes of compressed text data, with roughly 95 million words (95.066.7973) across 261,937 documents.
Release: 08.01.2025




