FineWeb-2-fao

Resource type: Text

This is a deduplicated, reproducible dataset of about 95 million words, built from the Faroese text in the Common Crawl corpus. The dataset is split into a training set and a test set.

Description
Overview

This is the second iteration of the popular FineWeb dataset, bringing high-quality pretraining data to over 1,000 languages.

The FineWeb2 dataset is fully reproducible, available under the permissive ODC-By 1.0 license and extensively validated through hundreds of ablation experiments. This dataset originates from Common Crawl. The use of this dataset is also subject to CommonCrawl’s Terms of Use.

The data was sourced from 96 CommonCrawl snapshots, spanning the summer of 2013 to April 2024, and processed using datatrove, a large-scale data processing library. For details on PII handling and opting out, see "Personal and Sensitive Information and opt-out".

For Faroese, this carefully deduplicated and filtered dataset comprises around 250 megabytes of compressed text data, with roughly 95 million words (95.066.7973) within 261.937 documents.
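The figures above can be turned into a rough sense of scale. The following is a minimal sketch, not part of the dataset tooling: it uses the rounded values quoted in the description (95 million words, 261,937 documents, 250 megabytes compressed) and assumes that one megabyte means 10**6 bytes. The function and constant names are illustrative only.

```python
# Approximate figures quoted in the description above.
TOTAL_WORDS = 95_000_000        # "roughly 95 million words" (rounded)
TOTAL_DOCUMENTS = 261_937       # documents in the Faroese subset
COMPRESSED_BYTES = 250 * 10**6  # "around 250 megabytes"; assumes 1 MB = 10**6 bytes


def words_per_document(total_words: int, total_documents: int) -> float:
    """Average number of words per document."""
    if total_documents <= 0:
        raise ValueError("total_documents must be positive")
    return total_words / total_documents


def compressed_bytes_per_word(compressed_bytes: int, total_words: int) -> float:
    """Average number of compressed bytes spent per word."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return compressed_bytes / total_words


if __name__ == "__main__":
    avg_words = words_per_document(TOTAL_WORDS, TOTAL_DOCUMENTS)
    avg_bytes = compressed_bytes_per_word(COMPRESSED_BYTES, TOTAL_WORDS)
    print(f"Average words per document: {avg_words:.1f}")
    print(f"Compressed bytes per word:  {avg_bytes:.2f}")
```

With these rounded inputs the average comes out at a few hundred words per document, which suggests that the typical document is a short web page rather than a long-form text.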

Release: 08.01.2025

Publisher	FineWeb
Usage	Language model, Tagging, Spell checker, Parsing, Speech recognition, Machine translation, Speech synthesis
Format	parquet
Language	Faroese
License	ODC-By 1.0

Related

FC3: Faroese Common Crawl Corpus

Faroese BLARK small – Corpus

Tatoeba Parallel Sentences