HPLT – Faroese

Resource Type: Text

This is a text dataset developed by HPLT (High Performance Language Technology). You can visit the download page by clicking the link below. For further information, see the description below.

Open Page

Description
Overview

There are two versions of this dataset: one deduplicated and one cleaned. The text was collected from various sources on the internet, including the following domains:

Domain	Docs	% of total
wikipedia.org	25K	10.43
snar.fo	8K	3.32
kvf.fo	6.6K	2.76
portal.fo	5.6K	2.31
vp.fo	5.4K	2.26
in.fo	5.3K	2.20
blogspot.com	3.8K	1.59
sangtekstir.com	3.6K	1.51
fsf.fo	3.4K	1.41
dimma.fo	3.1K	1.30

The deduplicated dataset contains:
772.73k documents, 166.00M words, 1.18B characters, and 19.69M segments.

The cleaned dataset contains:
239.92k documents, 93.45M words, 582.04M characters, and 4.53M segments.
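
To put the two sets of figures side by side, the following is a minimal Python sketch that computes how much of the deduplicated version remains after cleaning, the average document length, and a domain's share of documents. The counts are copied from this page. The function names are illustrative only and are not part of any HPLT tooling.

```python
# Counts copied from the dataset description on this page.
DEDUPLICATED = {
    "documents": 772_730,
    "words": 166_000_000,
    "characters": 1_180_000_000,
    "segments": 19_690_000,
}
CLEANED = {
    "documents": 239_920,
    "words": 93_450_000,
    "characters": 582_040_000,
    "segments": 4_530_000,
}


def retention(cleaned, deduplicated):
    """Percentage of each count that remains in the cleaned version."""
    return {
        key: round(100 * cleaned[key] / deduplicated[key], 1)
        for key in cleaned
    }


def words_per_document(stats):
    """Average number of words per document."""
    return round(stats["words"] / stats["documents"], 1)


def share_of_documents(domain_docs, total_docs):
    """A domain's document count as a percentage of a total."""
    return round(100 * domain_docs / total_docs, 2)


if __name__ == "__main__":
    print("Retained after cleaning (%):", retention(CLEANED, DEDUPLICATED))
    print("Words per document, deduplicated:", words_per_document(DEDUPLICATED))
    print("Words per document, cleaned:", words_per_document(CLEANED))
    # wikipedia.org is listed with roughly 25K documents (a rounded figure).
    print("wikipedia.org share of cleaned:",
          share_of_documents(25_000, CLEANED["documents"]))
    print("wikipedia.org share of deduplicated:",
          share_of_documents(25_000, DEDUPLICATED["documents"]))
```

Under these figures, roughly a third of the documents and a little over half of the words remain after cleaning. The wikipedia.org share computed against the cleaned document count is close to the 10.43% given in the domain table, which suggests that the table's percentages refer to the cleaned version. The page does not state this explicitly.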

The actual packaging of these text data is licensed under the Creative Commons CC0 license, but it is your responsibility to ensure that any use of the data complies with any applicable legal framework, such as, among others, the EU Copyright Directive 2019/790 and the General Data Protection Regulation 2018, as amended.

Released: September 2024

Downloads: Cleaned (626 MB), Deduplicated (1.25 GB)

Publisher	HPLT
Uses	Language Model, Spell Checker, Speech Recognition, Machine Translation
Format	JSON
Language(s)	Faroese, Multilingual, Other languages
License	CC0, License not specified
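
Since the format is listed as JSON, the following is a minimal Python sketch for iterating over documents and counting words and characters. It assumes that the files are newline-delimited JSON (one object per line) and that each object stores its text under a `text` key. Neither assumption is confirmed on this page, so check the download page for the actual layout. The sample records are invented for illustration.

```python
import io
import json


def iter_documents(lines):
    """Yield one dict per non-empty line of newline-delimited JSON."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def corpus_stats(documents, text_key="text"):
    """Count documents, whitespace-separated words, and characters.

    `text_key` is an assumed field name; adjust it to the real schema.
    """
    stats = {"documents": 0, "words": 0, "characters": 0}
    for doc in documents:
        text = doc.get(text_key, "")
        stats["documents"] += 1
        stats["words"] += len(text.split())
        stats["characters"] += len(text)
    return stats


if __name__ == "__main__":
    # Invented sample records standing in for a downloaded file.
    sample = io.StringIO(
        '{"text": "Føroyar eru átjan oyggjar."}\n'
        '{"text": "Tórshavn er høvuðsstaður."}\n'
    )
    print(corpus_stats(iter_documents(sample)))
```

For a real file, pass an open file handle instead of the `io.StringIO` sample, for example `open(path, encoding="utf-8")`. Streaming line by line avoids loading a file of several hundred megabytes into memory at once. Word counts from whitespace splitting may differ from the figures published by HPLT, which may use a different tokenizer.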

Related

New Testament: Audio and Text

Faroese BLARK small – Corpus

FO-EN Translated Sentences