
Exploring NLP libraries for Norwegian

One of the many branches of machine learning is Natural Language Processing, or NLP, where computers are trained to understand and generate text and speech. However, while discussing potential use cases for the technology, we realized the need for an overview of language processing resources for Norwegian.

To explore this, we did a conversation analysis PoC – analyzing and presenting the contents and metadata of a conversation between two or more people talking in Norwegian. This blog post will give a quick intro to some of the most useful resources for Norwegian NLP, to help you get started with your own NLP project!

[Figure: Analysis output of a test discussion revolving around NLP product development.]

Transcription

Our first task was transcribing the recorded speech. After evaluating several options, the Google Speech-to-Text API emerged as the best alternative, scoring as low as 7 % normalized Levenshtein distance and 17 % Word Error Rate on test samples. This is in line with the results of recent tests¹ evaluating different speech-to-text APIs in English.
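For readers unfamiliar with the two metrics, here is a minimal sketch of how they can be computed. It is not the evaluation code behind the numbers above: the function names and the example sentences are illustrative, and dividing the character-level distance by the length of the longer string is an assumption, since several normalizations are in common use.

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return levenshtein(ref_words, hypothesis.split()) / len(ref_words)


def normalized_levenshtein(reference, hypothesis):
    """Character-level edit distance divided by the longer string's length
    (assumed normalization)."""
    longest = max(len(reference), len(hypothesis))
    return levenshtein(reference, hypothesis) / longest if longest else 0.0


# Made-up example: a correct transcript and a hypothetical faulty one
reference = "i år skal bekk publisere tolv julekalendere"
hypothesis = "i år skal bek publisere to julekalendere"

print(f"WER: {word_error_rate(reference, hypothesis):.2f}")
print(f"Normalized Levenshtein: {normalized_levenshtein(reference, hypothesis):.2f}")
```

Word Error Rate punishes every misrecognized word equally, while the character-level distance gives partial credit for near misses, which is why the two numbers can differ considerably on the same transcript.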

Diarization

A key part of speech analysis is working out who spoke when, known in the field as speaker diarization. As opposed to most NLP methods, this should in theory be language-independent. However, as Google only supported diarization of English (as of Dec 2019), we instead employed one of the many available Python libraries, pyAudioAnalysis, for diarization, achieving at best 96 % accuracy on recordings with speakers of different genders.
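One subtlety when scoring diarization is that the cluster ids a tool returns are arbitrary: "speaker 0" in the output may be either person in the recording. The sketch below shows one plausible way to compute an accuracy figure under that constraint, by trying every one-to-one mapping from cluster ids to true speakers and keeping the best one. This is an assumption about how such a number can be obtained, not a description of our evaluation; the function name and the label lists are hypothetical.

```python
from itertools import permutations


def diarization_accuracy(true_labels, predicted_labels):
    """Fraction of segments given the right speaker, using the best
    one-to-one mapping from predicted cluster ids to true speakers."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("label lists must not be empty")

    speakers = list(dict.fromkeys(true_labels))       # unique, order kept
    clusters = list(dict.fromkeys(predicted_labels))  # unique, order kept
    # Pad with None so that surplus clusters can be mapped to "no speaker"
    padded = speakers + [None] * max(0, len(clusters) - len(speakers))

    best = 0
    for assignment in permutations(padded, len(clusters)):
        mapping = dict(zip(clusters, assignment))
        correct = sum(mapping[p] == t
                      for t, p in zip(true_labels, predicted_labels))
        best = max(best, correct)
    return best / len(true_labels)


# Hypothetical per-segment labels: manual annotation vs. tool output
truth = ["A", "A", "B", "B", "A", "B"]
predicted = [1, 1, 0, 0, 1, 1]
print(f"Accuracy: {diarization_accuracy(truth, predicted):.2f}")
```

Trying all permutations is only practical for a handful of speakers, which is the typical case for a recorded conversation.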

Summarization

For summarizing the text, we used the Gensim library, which offers an extractive summarization model based on the TextRank algorithm. In our experience, the function is effective when the input text is of high quality, but its output degrades just as quickly when fed low-grade transcriptions. We also experimented with abstractive summarization models for English (translating the transcribed speech back and forth), but any meaningful insight was clearly lost in translation and/or transcription.
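To give an idea of what extractive TextRank does, here is a small self-contained sketch. It is not Gensim's implementation: it uses the word-overlap similarity from the original TextRank paper and a plain power iteration, and all function names, parameters and the sample text are illustrative. (Note also that Gensim removed its summarization module in version 4.0, so the library route requires a 3.x release.)

```python
import math
import re


def split_sentences(text):
    """Naive sentence splitter: break after '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def similarity(words_a, words_b):
    """Shared words, normalized by the log of the sentence lengths."""
    if not words_a or not words_b:
        return 0.0
    denominator = math.log(len(words_a)) + math.log(len(words_b))
    if denominator == 0:  # both sentences consist of a single word
        return 0.0
    return len(set(words_a) & set(words_b)) / denominator


def textrank_summary(text, n_sentences=2, damping=0.85, iterations=50):
    """Return the n highest-ranked sentences in their original order."""
    sentences = split_sentences(text)
    tokens = [re.findall(r"\w+", s.lower()) for s in sentences]
    n = len(sentences)
    # Weighted, undirected sentence graph
    weights = [[similarity(tokens[i], tokens[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]
    strength = [sum(row) for row in weights]

    scores = [1.0] * n
    for _ in range(iterations):  # weighted PageRank by power iteration
        scores = [
            (1 - damping) + damping * sum(
                weights[j][i] / strength[j] * scores[j]
                for j in range(n) if strength[j] > 0)
            for i in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:n_sentences]
    return [sentences[i] for i in sorted(top)]


text = ("Bekk publiserer julekalendere hvert år. "
        "Julekalendere fra Bekk handler om teknologi og design. "
        "Været i Oslo var fint i går. "
        "Årets julekalendere fra Bekk handler om maskinlæring.")
summary = textrank_summary(text, n_sentences=2)
print(summary)
```

Because the ranking depends entirely on words shared between sentences, transcription errors that garble those words break the links in the graph, which matches our experience with low-quality input.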

Keyword extraction

To extract keywords from the transcription, we implemented a modified tf-idf algorithm, aided by a Norwegian Snowball stemmer provided by NLTK². With careful tuning of the TF and IDF weights, stopword removal and outlier removal, the program was able to extract highly relevant keywords (for Norwegian speakers: top left box in the analysis output above).
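The stem-grouping rule from footnote 2 (the stem's TF is the sum of its words' TFs, and its IDF is the lowest of its words' IDFs) can be sketched as follows. This is a simplified illustration rather than our actual implementation: raw counts as TF and a smoothed logarithmic IDF are assumptions, and `toy_stem` is a hypothetical stand-in so the example runs without extra dependencies. In practice you would pass in `SnowballStemmer("norwegian").stem` from `nltk.stem.snowball`.

```python
import math
from collections import Counter, defaultdict


def toy_stem(word):
    """Hypothetical stand-in stemmer: strips two common Norwegian endings."""
    for suffix in ("en", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def stem_grouped_tfidf(document, corpus, stem):
    """Score each stem in `document` (a list of tokens) against `corpus`
    (a list of token lists), grouping word forms by their stem."""
    tf = Counter(document)
    doc_sets = [set(doc) for doc in corpus]
    n_docs = len(doc_sets)

    def idf(word):
        df = sum(word in doc for doc in doc_sets)
        return math.log((1 + n_docs) / (1 + df)) + 1  # smoothed (assumed)

    groups = defaultdict(list)
    for word in tf:
        groups[stem(word)].append(word)

    scores = {}
    for stem_form, words in groups.items():
        stem_tf = sum(tf[w] for w in words)      # sum of the words' TFs
        stem_idf = min(idf(w) for w in words)    # lowest of the words' IDFs
        scores[stem_form] = stem_tf * stem_idf
    return scores


# Made-up background corpus and transcript, already tokenized and lowercased
corpus = [["kalender", "og", "bekk"], ["kalenderen", "og"], ["og", "vær"]]
document = ["kalender", "kalenderen", "kalendere", "og", "og"]

scores = stem_grouped_tfidf(document, corpus, toy_stem)
for stem_form, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{stem_form}: {score:.2f}")
```

Taking the lowest IDF within a group is the conservative choice: a rare inflection of an otherwise common word does not get to inflate the score of the whole stem.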

POS tagging

One of the most sophisticated libraries we encountered was spaCy. It has a trained convolutional neural network model for Norwegian which enables context-based recognition of named entities, part-of-speech tagging and even dependency parsing. To demonstrate its POS tagging abilities, we let spaCy analyze the following sentence:

i år skal bekk publisere tolv julekalendere

or in English, "This year, Bekk is publishing twelve Advent calendars". The main challenge posed here is the fact that bekk, apart from being our company name, also means brook or creek in Norwegian. While brook and creek are clearly common nouns, bekk, in this case, is intended as a company name – that is, a proper noun. We run the following code:

import spacy
nlp = spacy.load("nb_core_news_sm")

def get_pos(sentence):
    """Return a [word, POS tag] pair for every token in the sentence."""
    doc = nlp(sentence)  # run the Norwegian pipeline on the raw text
    return [[token.text, token.pos_] for token in doc]

sentence = "i år skal bekk publisere tolv julekalendere"
for word, pos in get_pos(sentence):
    print(word, "|", pos, "\n-------")

receiving the output:

i | ADP
-------
år | NOUN
-------
skal | AUX
-------
bekk | PROPN
-------
publisere | VERB
-------
tolv | NUM
-------
julekalendere | NOUN
-------

As we see, spaCy understands from the context that bekk is in fact a proper noun! This is very helpful for extracting named entities, and also for structural sentence analysis or, eventually, abstractive approaches.

This has hopefully been a useful intro to some of the many available resources for Norwegian NLP! We tackled several other challenges, including sentiment analysis, speech time mapping and sociogram generation, which may be covered some other time. Feel free to drop me an email if you're interested in hearing more.

¹ https://www.rev.ai/blog/how-to-calculate-word-error-rate/, https://medium.com/descript/which-automatic-transcription-service-is-the-most-accurate-2018-2e859b23ed19

² When combining stemming with tf-idf, we recommend grouping words on their stem, setting the stem's TF to the sum of each word's TF, and the stem's IDF to the lowest of each word's IDF.
