There are two versions of this dataset: one deduplicated and one cleaned. The text was collected from various sources on the internet:
| Domain | Docs | % of total |
|---|---|---|
| wikipedia.org | 25K | 10.43 |
| snar.fo | 8K | 3.32 |
| kvf.fo | 6.6K | 2.76 |
| portal.fo | 5.6K | 2.31 |
| vp.fo | 5.4K | 2.26 |
| in.fo | 5.3K | 2.20 |
| blogspot.com | 3.8K | 1.59 |
| sangtekstir.com | 3.6K | 1.51 |
| fsf.fo | 3.4K | 1.41 |
| dimma.fo | 3.1K | 1.30 |
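
As a rough consistency check on the table, dividing each row's document count by its share gives an implied total. The sketch below uses the rounded figures from the table, so the results are approximate. The variable names are illustrative and not part of any dataset tooling.

```python
# Rows copied from the table: (domain, documents, percent of total).
# Document counts are rounded in the source (e.g. "25K"), so results are approximate.
rows = [
    ("wikipedia.org", 25_000, 10.43),
    ("snar.fo", 8_000, 3.32),
    ("kvf.fo", 6_600, 2.76),
    ("portal.fo", 5_600, 2.31),
    ("vp.fo", 5_400, 2.26),
    ("in.fo", 5_300, 2.20),
    ("blogspot.com", 3_800, 1.59),
    ("sangtekstir.com", 3_600, 1.51),
    ("fsf.fo", 3_400, 1.41),
    ("dimma.fo", 3_100, 1.30),
]

# Implied total document count per row: docs / (pct / 100).
implied_totals = {domain: docs / (pct / 100) for domain, docs, pct in rows}

# Combined share of the ten listed domains.
top10_share = sum(pct for _, _, pct in rows)

for domain, total in implied_totals.items():
    print(f"{domain:16s} implied total ~ {total:,.0f} documents")
print(f"Listed domains cover about {top10_share:.2f}% of documents")
```

The implied totals fall between roughly 238K and 243K documents, close to the document count given for the cleaned version. This suggests, but does not confirm, that the percentages are relative to the cleaned dataset. The ten listed domains together account for about 29% of documents.
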
The deduplicated dataset contains:
772.73k documents, 166.00M words, 1.18B characters, and 19.69M segments.
The cleaned dataset contains:
239.92k documents, 93.45M words, 582.04M characters, and 4.53M segments.
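
To see how much of the deduplicated version survives cleaning, the two sets of figures can be compared directly. This is a minimal sketch using only the rounded numbers stated above. The dictionary names are illustrative.

```python
# Figures as reported for the two versions (rounded in the source).
deduplicated = {
    "documents": 772.73e3,
    "words": 166.00e6,
    "characters": 1.18e9,
    "segments": 19.69e6,
}
cleaned = {
    "documents": 239.92e3,
    "words": 93.45e6,
    "characters": 582.04e6,
    "segments": 4.53e6,
}

# Fraction of each quantity that remains after cleaning.
retained = {key: cleaned[key] / deduplicated[key] for key in deduplicated}

# Average document length in words for each version.
words_per_doc = {
    "deduplicated": deduplicated["words"] / deduplicated["documents"],
    "cleaned": cleaned["words"] / cleaned["documents"],
}

for key, fraction in retained.items():
    print(f"{key:10s} retained: {fraction:.1%}")
for version, avg in words_per_doc.items():
    print(f"{version:12s} average words per document: {avg:.0f}")
```

By these figures, cleaning keeps about 31% of documents but about 56% of words, so the documents that remain are longer on average than those in the deduplicated version.
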
The packaging of this text data is licensed under the Creative Commons CC0 license, but it is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.
Released: September 2024
Downloads: Cleaned (626 MB), Deduplicated (1.25 GB)





