ACADEMIC JOURNAL
PROCEEDINGS
OF PETROZAVODSK
STATE
UNIVERSITY
(1947-1975, 2008-present)

ISSN 2542-1077 (Print)
ISSN 1994-5973 (Online)

Kipyatkova, I. S., Rodionova, A. P., Kagirov, I. A., Krizhanovsky, A. A. SPEECH AND TEXT DATA PREPARATION FOR DEVELOPMENT OF AN AUTOMATIC SPEECH RECOGNITION SYSTEM FOR THE KARELIAN LANGUAGE. Proceedings of Petrozavodsk State University. 2023;45(5):89–98. DOI: 10.15393/uchz.art.2023.924

Theoretical, Applied and Comparative Linguistics

SPEECH AND TEXT DATA PREPARATION FOR DEVELOPMENT OF AN AUTOMATIC SPEECH RECOGNITION SYSTEM FOR THE KARELIAN LANGUAGE

Kipyatkova
I. S.

St. Petersburg Federal Research Center of the Russian Academy of Sciences

Rodionova
A. P.

St. Petersburg Federal Research Center of the Russian Academy of Sciences

Kagirov
I. A.

St. Petersburg Federal Research Center of the Russian Academy of Sciences

Krizhanovsky
A. A.
St. Petersburg, Russian Federation

St. Petersburg Federal Research Center of the Russian Academy of Sciences

Keywords:
Karelian language
Livvi-Karelian dialect
automatic natural language processing
training of speech recognition systems
datasets
corpus linguistics

Summary: This paper addresses the collection and preparation of language data for the Livvi dialect of the Karelian language, which are needed to train an automatic speech-to-text system. Such technologies matter for Karelian because it is a low-resource language, and the lack of resources is a serious obstacle to its study and preservation. The main tasks at the current stage of the research are to collect and annotate speech and text corpora and to create a transcription dictionary. The speech corpus includes audio recordings of 15 speakers (6 men and 9 women). All recordings were transcribed and segmented into individual utterances. After the removal of “junk” fragments, the total duration of the recordings was 3.5 hours. After the removal of repeated sentences, the text corpus contained over 5 million word tokens. A dictionary was built from the collected text corpus; it will subsequently be used as part of the Karelian speech recognition system. All words in the dictionary were automatically given a phonemic transcription. In further research, the collected text and speech data will be used for training and testing the Livvi-Karelian speech recognition system.
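The text-side steps named in the summary (removing repeated sentences, counting word tokens, and deriving a word list for the dictionary) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names `dedupe_sentences` and `build_vocabulary`, the case-insensitive comparison, and the tokenization rule (runs of letters, optionally joined by an apostrophe) are all assumptions made for illustration, and the phonemic transcription step is deliberately left out because its rules are not given here.

```python
import re
from collections import Counter

# Assumption: a word is a run of Unicode letters, optionally joined by
# apostrophes. This rule is illustrative, not taken from the paper.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’][^\W\d_]+)*")


def dedupe_sentences(sentences):
    """Drop repeated sentences, keeping the first occurrence in order.

    Sentences are compared after collapsing whitespace and case-folding
    (an assumption; the paper does not state its comparison rule).
    """
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.split()).casefold()
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence.strip())
    return unique


def build_vocabulary(sentences):
    """Count case-folded word tokens; the keys form the word list."""
    counts = Counter()
    for sentence in sentences:
        counts.update(w.casefold() for w in WORD_RE.findall(sentence))
    return counts


if __name__ == "__main__":
    # Placeholder strings, not real Karelian data.
    corpus = ["aa bb.", "AA  bb.", "cc aa"]
    unique = dedupe_sentences(corpus)
    vocab = build_vocabulary(unique)
    print(len(unique), "unique sentences")
    print(sum(vocab.values()), "word tokens,", len(vocab), "word types")
```

The total of the counter values corresponds to the corpus size in word tokens, while its keys give the distinct word forms that a transcription dictionary would need to cover.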


The journal is on the list of peer-reviewed academic journals recommended for PhD and doctoral students (January 10, 2019).

Indexed in RSCI (Russian Science Citation Index) since 2008.

Registration Certificate PI No FS77-69487 was issued on April 25, 2017, by the Federal Service for Media Law and Cultural Heritage Protection Law Compliance.

Editorial office address:
Petrozavodsk State University
33 Lenin Ave., Petrozavodsk,
185910, Russian Federation
Telephone: +7 (8142) 769711
E-mail: uchzap@petrsu.ru
