Synthetic Faroese STS Dataset

Resource type:

A Faroese Semantic Textual Similarity (STS) dataset created with machine translation.

This is a synthetic Faroese Semantic Textual Similarity (STS) dataset. Labels range from 0 (no similarity) to 5 (the two sentences are completely equivalent). The dataset was generated in three steps:

  1. Sentences from the Basic Faroese Language Resource Kit (BLARK) corpus were translated to English with GPT-SW3, a Nordic LLM.
  2. The sentences were compared with each other for semantic similarity using Sentence-BERT (SBERT).
  3. Sentence pairs were then sampled uniformly across similarity scores to compile a balanced dataset.
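
The balanced sampling in step 3 can be illustrated with a minimal Python sketch. It is not the authors' actual pipeline: it assumes the pairs have already been scored with a similarity value in [0, 1] (for example an SBERT cosine similarity), and the names `to_label` and `sample_balanced`, as well as the equal-width mapping from score to the 0-5 label, are hypothetical choices made only for illustration.

```python
import random
from collections import defaultdict


def to_label(similarity: float) -> int:
    """Map a similarity score to an integer label 0..5.

    Assumption: six equal-width bins over [0, 1]. The source does not
    state how scores were converted to labels.
    """
    clipped = min(max(similarity, 0.0), 1.0)
    return min(int(clipped * 6), 5)


def sample_balanced(scored_pairs, per_class, seed=0):
    """Draw up to `per_class` pairs for each label 0..5.

    `scored_pairs` is an iterable of (sentence_a, sentence_b, similarity).
    Returns a list of (sentence_a, sentence_b, label) tuples.
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_label = defaultdict(list)
    for sent_a, sent_b, similarity in scored_pairs:
        by_label[to_label(similarity)].append((sent_a, sent_b))

    balanced = []
    for label in range(6):
        pool = by_label[label]
        # A class with too few candidates contributes everything it has.
        chosen = rng.sample(pool, min(per_class, len(pool)))
        balanced.extend((a, b, label) for a, b in chosen)
    return balanced
```

Calling `sample_balanced(scored_pairs, per_class=200)` on a sufficiently large pool would return 200 pairs for each of the six labels.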

The dataset contains 200 sentence pairs for each similarity class (0, 1, 2, 3, 4, 5), for a total of around 1,200 sentence pairs.

Release: 08.03.2024
Contact: barbaras@setur.fo

Publisher

Usage

Format

Language

License
