Synthetic Faroese STS Dataset

Resource Type:

A Faroese Semantic Textual Similarity (STS) dataset created using machine translation.

Open

This is a synthetic Faroese Semantic Textual Similarity (STS) dataset. Labels range from 0 (no similarity) to 5 (the two sentences are completely equivalent). The dataset was generated in three steps:

  1. Sentences from the Basic Faroese Language Resource Kit (BLARK) corpus were translated to English using GPT-SW3, a Nordic LLM.
  2. The sentences were compared to each other for semantic similarity using Sentence-BERT (SBERT).
  3. Pairs of sentences were then sampled uniformly across similarity scores to compile a balanced dataset.
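
The balanced sampling in step 3 can be sketched roughly as follows. This is a minimal illustration and not the authors' code: the function name `sample_balanced_pairs`, the assumption that similarity scores lie in [0, 1], and the equal-width mapping of scores onto the labels 0 to 5 are all hypothetical choices made for the example. The description above does not say how SBERT scores were converted to labels.

```python
import random
from collections import defaultdict


def sample_balanced_pairs(scored_pairs, per_class, num_classes=6, seed=0):
    """Draw the same number of sentence pairs from every similarity class.

    scored_pairs: iterable of (sentence_a, sentence_b, similarity) tuples,
                  where similarity is assumed to lie in [0, 1].
    Returns a list of (sentence_a, sentence_b, label) tuples.
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)

    for sent_a, sent_b, similarity in scored_pairs:
        # Assumption: clip the score to [0, 1] and split that range into
        # equal-width bins, one bin per label (0..num_classes-1).
        clipped = min(max(similarity, 0.0), 1.0)
        label = min(int(clipped * num_classes), num_classes - 1)
        buckets[label].append((sent_a, sent_b, label))

    sample = []
    for label in range(num_classes):
        candidates = buckets[label]
        if len(candidates) < per_class:
            raise ValueError(
                f"class {label} has only {len(candidates)} pairs, "
                f"need {per_class}"
            )
        # Sample without replacement so no pair appears twice.
        sample.extend(rng.sample(candidates, per_class))
    return sample
```

Under these assumptions, calling the function with `per_class=200` and six classes yields 1,200 pairs, which matches the size reported for the dataset.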

The dataset contains 200 sentence pairs for each class (similarity = 0, 1, 2, 3, 4, 5), for about 1,200 sentence pairs in total.

Release: 08.03.2024
Contact: barbaras@setur.fo

Publisher

Uses

Format

Language(s)

License
