This is a Scandinavian BERT model trained on a large collection of Danish, Faroese, Icelandic, Norwegian and Swedish text.
The model was trained on the data shown in the table below, with a batch size of 8.8k, for 72 epochs on 24 V100 GPUs over roughly two weeks.
| Language | Data | Size |
|---|---|---|
| Icelandic | See the IceBERT paper | 16 GB |
| Danish | Danish Gigaword Corpus (incl. Twitter) | 4.7 GB |
| Norwegian | NCC corpus | 42 GB |
| Swedish | Swedish Gigaword Corpus | 3.4 GB |
| Faroese | FC3 + Sosialurinn + Bible | 69 MB |
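The model can be used like any other BERT-style masked language model with the `transformers` library. Below is a minimal usage sketch, assuming the model is published on the Hugging Face Hub; the repository id is a placeholder, not something stated on this card.

```python
from transformers import AutoModelForMaskedLM, AutoTokenizer, pipeline

# Hypothetical repository id; replace with the actual one for this model.
model_id = "vesteinn/ScandiBERT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Fill-mask is the natural probe for a BERT-style masked language model.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)

# Use the tokenizer's own mask token in case it differs from "[MASK]".
text = f"Ísland er {tokenizer.mask_token} land."
for prediction in fill_mask(text):
    print(prediction["token_str"], round(prediction["score"], 3))
```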
Release: 12.03.2023
Contact: vesteinn.snaebjarnarson@gmail.com