FaroeseOCR is a dataset of parallel Optical Character Recognition (OCR) data for Faroese, created to support the development and refinement of OCR technology for Faroese texts. The repository includes 588 lines from the Good Templars newspaper Dúgvan (1941) and 213 lines from the poetry paper Várskot (1904). All of the text was obtained from Timarit.is, recognized with ABBYY FineReader, and proofread by a native Faroese speaker.
In addition to the parallel text data, FaroeseOCR includes a compiled list of OCR errors commonly found in Faroese texts, along with their frequencies. The list shows which recognition errors are typical and can be used as a tool for improving OCR accuracy for Faroese documents.
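As an illustration of how an error list with frequencies relates to parallel data, the sketch below counts character-level differences between OCR output and corrected text. It assumes that each parallel pair consists of an OCR line and its proofread counterpart, already loaded as Python strings; the function name `count_ocr_errors` and the sample lines are hypothetical and are not taken from the dataset, and this is not necessarily how the list in FaroeseOCR was compiled.

```python
from collections import Counter
from difflib import SequenceMatcher


def count_ocr_errors(pairs):
    """Count character-level differences between OCR output and corrected text.

    `pairs` is an iterable of (ocr_line, corrected_line) tuples.
    Returns a Counter keyed by (ocr_fragment, corrected_fragment).
    """
    errors = Counter()
    for ocr_line, corrected_line in pairs:
        # autojunk=False keeps the alignment stable on long lines.
        matcher = SequenceMatcher(None, ocr_line, corrected_line, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":
                # An empty fragment on one side marks an insertion or deletion.
                errors[(ocr_line[i1:i2], corrected_line[j1:j2])] += 1
    return errors


# Made-up example pairs for illustration only (assumption: not dataset content).
sample_pairs = [
    ("Foroyar", "Føroyar"),
    ("maour", "maður"),
    ("vio", "við"),
]

for (ocr, fixed), freq in count_ocr_errors(sample_pairs).most_common():
    print(f"{ocr!r} -> {fixed!r}: {freq}")
```

Sorting the counts with `most_common()` puts the most frequent confusions first, which is the form in which such a list is most useful for post-correction rules.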
Release: 2023
Contact: annika@hi.is



