FO-Tokenizer is a specialized tokenizer for Faroese text, based on Miðeind’s Icelandic Tokenizer. It splits text into tokens such as words, punctuation, numbers, and dates, segments sentences, and recognizes Faroese-specific elements such as abbreviations sourced from Málráðið (“The Faroese Language Committee”). Modifications to the upstream tokenizer handle Faroese conventions for dates, metrics, and other text features, so tokenization is tailored to the language.
FO-Tokenizer offers both shallow and deep tokenization. Shallow tokenization splits sentences into space-delimited tokens, while deep tokenization annotates each token with detailed information such as type, value, and context, making it well suited to natural language processing (NLP) tasks. The tool can be used from the command line or integrated into Python projects, supporting applications such as text analysis, corpus generation, and grammar checking.
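To make the shallow/deep distinction concrete, here is a minimal, self-contained sketch in plain Python. It is purely illustrative and does not use FO-Tokenizer’s actual API: the `Token` structure, the regex-based splitting, and the token kinds (`WORD`, `NUMBER`, `PUNCT`) are assumptions for demonstration, far simpler than what the real tokenizer does with dates, abbreviations, and metrics.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str    # illustrative token type: WORD, NUMBER, or PUNCT
    txt: str     # the surface text of the token
    val: object  # a normalized value, where one applies

def shallow_tokenize(text: str) -> str:
    # Shallow tokenization: return the text as space-delimited tokens.
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

def deep_tokenize(text: str):
    # Deep tokenization: annotate each token with a kind and value.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        t = match.group()
        if t.isdigit():
            yield Token("NUMBER", t, int(t))
        elif t.isalpha():
            yield Token("WORD", t, None)
        else:
            yield Token("PUNCT", t, None)

print(shallow_tokenize("Eg eri 25 ára gamal."))
# -> Eg eri 25 ára gamal .
for tok in deep_tokenize("25 kr."):
    print(tok)
```

The shallow form is convenient for producing whitespace-tokenized corpora, while the deep form (kind plus normalized value per token) is the shape typically consumed by downstream NLP pipelines.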
The tokenizer is available under the MIT license and can be installed directly from its GitHub repository. Documentation and examples are included to guide users in applying it to Faroese language processing.
Release: 2024
Contact: annika@hi.is




