ACADEMIC JOURNAL
|
ISSN 2542-1077 (Print) ISSN 1994-5973 (Online) |
Theoretical, Applied and Comparative Linguistics |
Kipyatkova I. S. | St. Petersburg Federal Research Center of the Russian Academy of Sciences |
Rodionova A. P. | St. Petersburg Federal Research Center of the Russian Academy of Sciences |
Kagirov I. A. | St. Petersburg Federal Research Center of the Russian Academy of Sciences |
Krizhanovsky A. A. | St. Petersburg Federal Research Center of the Russian Academy of Sciences, St. Petersburg, Russian Federation |
Keywords: Karelian language; Livvi-Karelian dialect; automatic natural language processing; speech recognition systems; training datasets; corpus linguistics |
Summary: This paper addresses the collection and preparation of language data for the Livvi dialect of the
Karelian language, needed to train a system for automatic speech-to-text conversion. The importance of such
technologies for Karelian derives from its status as a low-resource language, which seriously hinders
its study and preservation. The main tasks at the current stage of the research are to collect and annotate speech and text
corpora and to create a transcription dictionary. The speech corpus comprises audio recordings of 15 speakers (6
men and 9 women). All recordings were transcribed and segmented into single utterances. After the removal of
“junk” fragments, the recordings total 3.5 hours. After the removal of repeated sentences, the text corpus
contains over 5 million word tokens. A dictionary was built from the collected text corpus and will
subsequently be used as part of the Karelian speech recognition system. All words in the dictionary were automatically given a phonemic transcription. In further research, the collected text and speech data will be used to train and test the Livvi-Karelian speech recognition system. |
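The two text-corpus steps mentioned in the summary, removing repeated sentences and deriving a word list, can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names, the normalization (lowercasing, whitespace collapsing), and the tokenization rule are assumptions made for illustration, and the sample strings are placeholders rather than corpus data.

```python
import re
from collections import Counter

# Assumed tokenization rule: a word is a run of letters, optionally joined
# by an internal apostrophe or hyphen. This is an illustrative choice only.
WORD_RE = re.compile(r"[^\W\d_]+(?:['’\-][^\W\d_]+)*")


def deduplicate_sentences(sentences):
    """Drop repeated sentences, keeping the first occurrence of each.

    Sentences are compared after lowercasing and collapsing whitespace
    (an assumed normalization); empty strings are skipped.
    """
    seen = set()
    unique = []
    for sentence in sentences:
        key = " ".join(sentence.lower().split())
        if key and key not in seen:
            seen.add(key)
            unique.append(sentence.strip())
    return unique


def build_vocabulary(sentences):
    """Count lowercase word tokens across the given sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(WORD_RE.findall(sentence.lower()))
    return counts


if __name__ == "__main__":
    # Placeholder strings, not real corpus sentences.
    sample = ["Alpha beta.", "alpha  beta.", "Beta d'gamma", ""]
    unique = deduplicate_sentences(sample)
    vocab = build_vocabulary(unique)
    print(len(unique), "unique sentences;", sum(vocab.values()), "word tokens")
    for word, freq in sorted(vocab.items()):
        print(word, freq)
```

The keys of the resulting counter form the word list that a separate grapheme-to-phoneme step would then transcribe. That step is not sketched here because the summary does not describe its rules.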