Synthetic Faroese STS Dataset

Resource Type: Text

A Faroese Semantic Textual Similarity (STS) dataset created with the help of machine translation.

Description
Overview

This is a synthetic Faroese Semantic Textual Similarity dataset. Labels range from 0 (no similarity) to 5 (the two sentences are completely equivalent). The dataset was generated by:

1. Translating sentences from the Basic Faroese Language Resource Kit (BLARK) corpus to English with a Nordic LLM, GPT-SW3.
2. Comparing the sentences with each other for semantic similarity using Sentence-BERT (SBERT).
3. Sampling sentence pairs uniformly across similarity scores to compile a balanced dataset.
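
The balanced sampling step can be illustrated with a minimal sketch. It is not the publisher's actual pipeline: the function name `sample_balanced`, the assumption that similarity scores are already on the 0 to 5 scale, and the rounding of scores to the nearest integer label are all hypothetical choices made for illustration, and the scored pairs are toy placeholders rather than real BLARK sentences.

```python
import random
from collections import Counter, defaultdict


def sample_balanced(scored_pairs, per_class, n_classes=6, seed=0):
    """Draw up to `per_class` pairs from each similarity class.

    `scored_pairs` is an iterable of (sentence_a, sentence_b, score) tuples.
    Assumption: scores are already on the 0..5 scale; each score is rounded
    to the nearest integer label and clamped to the valid range.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for sent_a, sent_b, score in scored_pairs:
        label = min(n_classes - 1, max(0, round(score)))
        bins[label].append((sent_a, sent_b, label))

    sample = []
    for label in range(n_classes):
        candidates = bins[label]
        # Take the same number of pairs per class, or all of them if a class is short.
        k = min(per_class, len(candidates))
        sample.extend(rng.sample(candidates, k))
    return sample


# Toy input: 600 placeholder pairs with scores spread evenly over 0.0..5.0.
toy_pairs = [
    (f"sentence {i}", f"sentence {i + 1}", (i % 51) / 10)
    for i in range(600)
]

balanced = sample_balanced(toy_pairs, per_class=20)
counts = Counter(label for _, _, label in balanced)
print(sorted(counts.items()))
```

Sampling a fixed number of pairs per class, rather than sampling pairs at random, keeps rare classes (typically the highly similar pairs) from being swamped by the many unrelated pairs in a corpus.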

The dataset contains 200 sentence pairs for each class (similarity = 0, 1, 2, 3, 4, 5), for a total of about 1,200 sentence pairs.
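
A quick way to verify the class balance after download is to count the labels in the CSV. The sketch below is only illustrative: the column names `sentence_1`, `sentence_2`, and `label` are assumptions (the actual header of the published file is not documented here), and the input is a small in-memory stand-in rather than the real dataset.

```python
import csv
import io
from collections import Counter


def label_distribution(csv_file, label_column="label"):
    """Count how many sentence pairs carry each similarity label.

    Assumption: the file has a header row and `label_column` names the
    column holding the 0..5 similarity label.
    """
    reader = csv.DictReader(csv_file)
    return Counter(int(float(row[label_column])) for row in reader)


# Toy stand-in for the real CSV: 2 placeholder pairs per class.
rows = ["sentence_1,sentence_2,label"]
for label in range(6):
    for i in range(2):
        rows.append(f"placeholder a{label}{i},placeholder b{label}{i},{label}")
toy_csv = io.StringIO("\n".join(rows))

distribution = label_distribution(toy_csv)
is_balanced = len(set(distribution.values())) == 1

# With 6 classes of 200 pairs each, the full dataset should hold 1,200 pairs.
expected_total = 6 * 200
print(dict(distribution), is_balanced, expected_total)
```

For the real file, the same function could be called on `open(path, newline="", encoding="utf-8")`, adjusting `label_column` to match the actual header.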

Release: 08.03.2024
Contact: barbaras@setur.fo

Publisher	Barbara Scalvini, Centre for Language Technology
Uses	Sentiment Analysis, Speech Recognition, Machine Translation, Speech Synthesis
Format	CSV
Language(s)	English, Faroese, Multilingual
License	CC BY 4.0

Related

Ravnur BLARK Textgrids

Ravnursson Faroese Speech and Transcripts

Tatoeba Parallel Sentences