Faroese BLARK small – Corpus

This dataset is a filtered version of the corpus (35.6 M tokens) first published as BLARK – Basic Language Resource Kit for Faroese.

This pre-processed and filtered version of the BLARK 1.0 corpus was produced with the following steps:

  • Normalize encoding to UTF-8
  • Remove short sentences (fewer than 10 units, where units are separated by spaces)
  • Remove archaic Faroese
  • Remove separators (‘\r’, ‘\t’, ‘\n’)
  • Remove non-standard formatting. Examples: ‘§§’, ‘ | ’, ‘**’, ‘ • ’, ‘ • ’, ‘.- ’, ‘: ?’, ‘.?’, ‘\xa0’, ‘\xad’, ‘_ _’, ‘. .’, etc.
  • Remove (most) numbered lists, of formats: 1), 1:, Stk. 1 etc.
  • Replace any run of question marks, exclamation marks, or full stops with a single one. Example: !!!!!! -> !
  • Remove URLs that start with http
  • Remove sentences with little or no linguistic content. In practice: all sentences where more than half of the characters (excluding spaces) are digits, punctuation marks, or uppercase letters (acronyms and initials)
  • Remove duplicates
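
For readers who want to apply comparable filtering to their own text, several of the steps above can be sketched in Python. This is a minimal illustration and not the script used to build the dataset: the function names, the regular expressions, the order of the steps, and the choice to count Unicode punctuation categories as "punctuation" are all assumptions. Steps that need language-specific knowledge (archaic Faroese, numbered lists) are left out.

```python
import re
import unicodedata

# Threshold taken from the description above; everything else is an assumption.
MIN_UNITS = 10

URL_RE = re.compile(r"http\S+")                # anything starting with "http"
REPEATED_PUNCT_RE = re.compile(r"([!?.])\1+")  # runs of the same mark, e.g. "!!!!"
SEPARATORS_RE = re.compile(r"[\r\t\n]+")


def clean_sentence(sentence: str) -> str:
    """Drop separators and URLs, collapse repeated marks, tidy spaces."""
    sentence = SEPARATORS_RE.sub(" ", sentence)
    sentence = URL_RE.sub("", sentence)
    sentence = REPEATED_PUNCT_RE.sub(r"\1", sentence)
    return " ".join(sentence.split())


def is_low_content(sentence: str) -> bool:
    """True if more than half of the non-space characters are digits,
    punctuation, or uppercase letters."""
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return True
    noise = sum(
        1
        for c in chars
        if c.isdigit() or c.isupper() or unicodedata.category(c).startswith("P")
    )
    return noise > len(chars) / 2


def filter_corpus(sentences):
    """Clean each sentence, then drop short, low-content, and duplicate ones."""
    seen = set()
    kept = []
    for raw in sentences:
        sentence = clean_sentence(raw)
        if len(sentence.split()) < MIN_UNITS:
            continue
        if is_low_content(sentence):
            continue
        if sentence in seen:
            continue
        seen.add(sentence)
        kept.append(sentence)
    return kept
```

Calling `filter_corpus` on a list of raw sentence strings returns the cleaned sentences that pass every check, in their original order. Duplicates are detected after cleaning, so two sentences that differ only in repeated punctuation or whitespace count as the same sentence in this sketch.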

Release: 28.06.2023
Updated: 07.08.2023
Contact: mtd@setur.fo
