This pre-processed and filtered version of the BLARK 1.0 corpus was produced with the following steps:
- Normalize the encoding to UTF-8
- Remove short sentences (fewer than 10 units, where units are separated by spaces)
- Remove archaic Faroese
- Remove separators (‘\r’, ‘\t’, ‘\n’)
- Remove non-standard formatting. Examples: ‘§§’, ‘ | ‘, ‘**’, ‘ • ‘, ‘ • ‘, ‘.- ‘, ‘: ?’, ‘.?’, ‘\xa0’, ‘\xad’, ‘_ _’, ‘. .’, etc.
- Remove (most) numbered-list markers, in formats such as 1), 1:, Stk. 1, etc.
- Collapse any run of repeated question marks, exclamation marks, or full stops into a single one. Example: !!!!!! -> !
- Remove web addresses (URLs) that start with http
- Remove sentences with little or no linguistic content. In practice: all sentences where more than half of the characters (excluding spaces) are digits, punctuation, or uppercase letters (acronyms and initials)
- Remove duplicates
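
Some of the steps above can be sketched in Python as follows. This is a minimal illustration, not the script used to build the corpus: the function names, the regular expressions, the order of the steps, and the choice to replace separators with a space are all assumptions. It covers only separator removal, URL removal, punctuation collapsing, the length filter, the linguistic-content filter, and deduplication.

```python
import re

MIN_UNITS = 10  # threshold from the step list: fewer than 10 units -> removed


def strip_separators(text):
    # Assumption: separators are replaced by a space rather than deleted outright.
    return re.sub(r"[\r\t\n]+", " ", text)


def strip_urls(text):
    # Assumption: only the URL token itself is removed, not the whole sentence.
    return re.sub(r"http\S+", "", text)


def collapse_punctuation(text):
    # Collapse a run of the same mark: "!!!!!!" -> "!".
    # Assumption: mixed runs such as "?!" are left untouched.
    return re.sub(r"([!?.])\1+", r"\1", text)


def long_enough(sentence):
    # Units are whitespace-separated tokens.
    return len(sentence.split()) >= MIN_UNITS


def has_linguistic_content(sentence):
    # Keep the sentence unless more than half of its non-space characters
    # are digits, uppercase letters, or non-alphanumeric symbols.
    # Assumption: every non-alphanumeric character counts as punctuation.
    chars = [c for c in sentence if not c.isspace()]
    if not chars:
        return False
    noise = sum(1 for c in chars if c.isdigit() or c.isupper() or not c.isalnum())
    return noise <= len(chars) / 2


def filter_sentences(sentences):
    seen = set()
    kept = []
    for s in sentences:
        s = collapse_punctuation(strip_urls(strip_separators(s)))
        s = " ".join(s.split())  # tidy leftover whitespace
        if not long_enough(s) or not has_linguistic_content(s):
            continue
        if s in seen:  # exact-match deduplication
            continue
        seen.add(s)
        kept.append(s)
    return kept
```

Calling `filter_sentences` on a list of raw sentences returns the cleaned sentences that survive all of the sketched filters, in their original order.
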
Release: 28.06.2023
Updated: 07.08.2023
Contact: mtd@setur.fo




