ScandiBERT: Scandinavian Language Model

A Scandinavian language model created using resources for five Nordic languages.

Open

This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text.

The model was trained on the data shown in the table below. The batch size was 8.8k, and the model was trained for 72 epochs on 24 V100 GPUs over about two weeks.
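As a rough illustration of what these figures imply, the sketch below works out the per-GPU share of each batch and the approximate compute budget. It assumes that "8.8k" means exactly 8,800 sequences per optimizer step, that "about 2 weeks" is 14 full days, and that all 24 cards ran for the whole period; none of these details is stated in this entry.

```python
# Figures quoted in the text above.
global_batch_size = 8_800   # assumption: "8.8k" = 8,800 sequences per step
num_gpus = 24               # V100 cards
training_days = 14          # assumption: "about 2 weeks" = 14 full days

# Sequences each GPU must contribute to one optimizer step.
per_gpu_batch = global_batch_size / num_gpus

# Total accelerator time, assuming every card ran for the whole period.
gpu_hours = num_gpus * training_days * 24

print(f"Per-GPU share of each batch: {per_gpu_batch:.0f} sequences")
print(f"Approximate compute budget: {gpu_hours:,} GPU-hours")
```

Under these assumptions each card handles roughly 367 sequences per step, and the run comes to about 8,064 GPU-hours. A per-device share of that size is often reached through gradient accumulation, but the entry does not say how it was done here.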

Language    Data                                    Size
Icelandic   See IceBERT paper                       16 GB
Danish      Danish Gigaword Corpus (incl. Twitter)  4.7 GB
Norwegian   NCC corpus                              42 GB
Swedish     Swedish Gigaword Corpus                 3.4 GB
Faroese     FC3 + Sosialurinn + Bible               69 MB
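To make the balance between languages easier to see, the following sketch computes each language's share of the total training text from the sizes listed above. It assumes decimal units (69 MB = 0.069 GB) and treats the listed sizes as exact, which they may not be.

```python
# Corpus sizes in GB, copied from the table above.
# Assumption: decimal units, so 69 MB = 0.069 GB.
corpus_gb = {
    "Icelandic": 16.0,
    "Danish": 4.7,
    "Norwegian": 42.0,
    "Swedish": 3.4,
    "Faroese": 0.069,
}

total_gb = sum(corpus_gb.values())

# Share of the total training text contributed by each language.
shares = {lang: size / total_gb for lang, size in corpus_gb.items()}

print(f"Total: {total_gb:.1f} GB")
for lang, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{lang:<10} {share:6.1%}")
```

With these numbers the total is about 66.2 GB. Norwegian accounts for roughly 63% of the text and Faroese for about 0.1%, so the training data is heavily skewed toward the larger corpora.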

Release: 12.03.2023
Contact: vesteinn.snaebjarnarson@gmail.com

Language(s)

Danish, Faroese, Icelandic, Norwegian, Swedish
