This is a synthetic Faroese Semantic Textual Similarity dataset. Labels range from 0 (no similarity) to 5 (the two sentences are completely equivalent). The dataset was generated by:
- Translating sentences from the Basic Faroese Language Resource Kit (BLARK) corpus to English using GPT-SW3, a Nordic LLM.
- Comparing the sentences to each other in terms of semantic similarity with Sentence-BERT (SBERT).
- Sampling pairs of sentences uniformly across similarity scores to compile a balanced dataset.
The dataset contains 200 sentence pairs for each class (similarity = 0, 1, 2, 3, 4, 5), for a total of around 1,200 sentence pairs.
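
The balanced sampling step can be illustrated with a minimal sketch. This is not the script used to build the dataset. The function names `score_to_label` and `balanced_sample` are hypothetical, and the mapping from an SBERT similarity score to a 0-5 label (clamp to [0, 1], scale by 5, round) is an assumption made for illustration only.

```python
import random
from collections import defaultdict


def score_to_label(score: float) -> int:
    """Map a similarity score to an integer label from 0 to 5.

    Assumption: the score is a cosine similarity; it is clamped to
    [0, 1], scaled to the 0..5 range, and rounded.
    """
    clamped = max(0.0, min(1.0, score))
    return int(round(clamped * 5))


def balanced_sample(scored_pairs, per_class, seed=0):
    """Draw up to `per_class` sentence pairs for each label 0..5.

    `scored_pairs` is an iterable of (sentence_a, sentence_b, score).
    Returns a list of (sentence_a, sentence_b, label) tuples.
    """
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible

    # Group candidate pairs by their integer label.
    by_label = defaultdict(list)
    for sent_a, sent_b, score in scored_pairs:
        by_label[score_to_label(score)].append((sent_a, sent_b))

    # Sample the same number of pairs from every label bucket.
    sample = []
    for label in range(6):
        candidates = by_label[label]
        k = min(per_class, len(candidates))  # a bucket may be too small
        for sent_a, sent_b in rng.sample(candidates, k):
            sample.append((sent_a, sent_b, label))
    return sample
```

Calling `balanced_sample(scored_pairs, per_class=200)` on a large enough pool of scored pairs would yield 6 x 200 = 1,200 labelled pairs, matching the class sizes described above.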
Release: 08.03.2024
Contact: barbaras@setur.fo





