FO-Tokenizer is a specialized tokenizer for Faroese text, based on Miðeind’s Icelandic Tokenizer. It splits text into tokens such as words, punctuation, numbers, and dates, segments sentences, and recognizes Faroese-specific elements such as abbreviations sourced from Málráðið (“The Faroese Language Committee”). Modifications to the upstream tokenizer handle Faroese conventions for dates, metrics, and other text features, so tokenization is tailored to the language.
FO-Tokenizer offers both shallow and deep tokenization. Shallow tokenization splits sentences into space-delimited tokens, while deep tokenization annotates each token with detailed information such as type, value, and context, making it well suited to natural language processing (NLP) tasks. The tool can be used from the command line or integrated into Python projects, supporting applications such as text analysis, corpus generation, and grammar checking.
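To make the shallow/deep distinction concrete, here is a minimal, self-contained sketch in plain Python. It is purely illustrative and does not use FO-Tokenizer’s actual API: the `Token` structure, the regex-based splitting, and the token kinds (`WORD`, `NUMBER`, `PUNCT`) are assumptions for demonstration, far simpler than what the real tokenizer does with dates, abbreviations, and metrics.

```python
import re
from typing import NamedTuple

class Token(NamedTuple):
    kind: str    # illustrative token type: WORD, NUMBER, or PUNCT
    txt: str     # the surface text of the token
    val: object  # a normalized value, where one applies

def shallow_tokenize(text: str) -> str:
    # Shallow tokenization: return the text as space-delimited tokens.
    return " ".join(re.findall(r"\w+|[^\w\s]", text))

def deep_tokenize(text: str):
    # Deep tokenization: annotate each token with a kind and value.
    for match in re.finditer(r"\w+|[^\w\s]", text):
        t = match.group()
        if t.isdigit():
            yield Token("NUMBER", t, int(t))
        elif t.isalpha():
            yield Token("WORD", t, None)
        else:
            yield Token("PUNCT", t, None)

print(shallow_tokenize("Eg eri 25 ára gamal."))
# -> Eg eri 25 ára gamal .
for tok in deep_tokenize("25 kr."):
    print(tok)
```

The shallow form is convenient for producing whitespace-tokenized corpora, while the deep form (kind plus normalized value per token) is the shape typically consumed by downstream NLP pipelines.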
The tokenizer is available under the MIT license and can be installed directly from its GitHub repository. Documentation and examples are included to guide users in applying it to Faroese language processing.
Release: 2024
Contact: annika@hi.is




