Russian Word Lists

A list of Russian lemmas and their inflected forms can be downloaded from http://www.speakrus.ru/dict/all_forms.rar. This file was made by Andrei Usachev in 2004, from the data at the StarLing website of Sergei Starostin, which in turn is an on-line implementation of A. A. Zaliznjak’s morphological dictionary (Grammaticeskij slovar’ russkogo jazyka [A grammatical dictionary of the Russian language], Moscow: Russkij Jazyk, 1977).

Each line of Usachev's file contains a lemma followed by its derived forms. The letter “ё” is distinguished from “е”, and the forms (but not the lemmas) have primary (') and secondary (`) stress marked; however, there are numerous deviations from these principles. The file covers 86,839 lemmas and 1,558,781 forms (the latter figure is slightly inflated because homographs with different stress positions are counted as different forms).

Usachev's file uses the Windows CP1251 encoding. It has been automatically converted to Unicode UTF-8 encoding (with byte-order mark, and CR/LF line endings), using the CpConverter utility, v.0.1.5, and the result is contained in the download below, under the name of zaliznjak.txt (Russian lemmas and derived forms — utf8 encoding, 61.4 MB).
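For readers who would rather repeat the conversion themselves, the following is a minimal Python sketch of an equivalent re-encoding. It is not how zaliznjak.txt was actually produced (that was done with CpConverter); the function names and the file names in the usage note are assumptions for illustration only.

```python
import codecs


def cp1251_to_utf8_bom(raw: bytes) -> bytes:
    """Re-encode CP1251 bytes as UTF-8 with a byte-order mark and CR/LF line endings."""
    text = raw.decode("cp1251")
    # Normalise whatever line endings are present to LF, then emit CR/LF throughout.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    return codecs.BOM_UTF8 + text.encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a whole file; both paths are supplied by the caller."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(cp1251_to_utf8_bom(data))
```

A call such as convert_file("all_forms.txt", "zaliznjak.txt") would then write the converted file, assuming the unpacked archive contains a file of that name.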

The data has been further converted by program so that each line contains a form followed by its possible lemmas. The result is contained in the download below, under the name of zaliznjak_forms.txt (Russian forms and their lemmas — utf8 encoding, 68.6 MB).
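The inversion from "lemma followed by forms" to "form followed by lemmas" can be sketched as follows. This is an illustration only, not the program actually used: the line syntax of the files is not specified here, so the sketch starts from already parsed (lemma, forms) pairs, and the sample data is hypothetical apart from the pairing of тёртый with the lemmas тереть and тёртый, which appears in the output examples further down.

```python
from collections import defaultdict


def invert(entries):
    """Map each form to the sorted list of lemmas under which it is listed.

    `entries` is an iterable of (lemma, forms) pairs, already parsed
    from the lemma file by some other means.
    """
    lemmas_by_form = defaultdict(set)
    for lemma, forms in entries:
        for form in forms:
            lemmas_by_form[form].add(lemma)
    return {form: sorted(lemmas) for form, lemmas in lemmas_by_form.items()}


# Illustrative sample, not taken verbatim from the file.
sample = [
    ("тереть", ["тёртый"]),
    ("тёртый", ["тёртый", "тёртого"]),
]
index = invert(sample)
```

Using a set per form means a lemma is recorded once even if the same form is listed under it several times.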

As a further development, the previous file has been filtered, leaving only forms containing the letter “ё”. For each such form, we add (on the same line) any near-homographs, that is, forms differing at most by replacing the “ё” by “е”. For example, всем is a near-homograph of всём. This file is intended to assist with manual enrichment of digitised Russian texts by adding this diacritic where appropriate (see section below). There are 81,264 forms containing “ё”. The result is contained in the download below, under the name of zaliznjak_forms_e.txt (Russian forms with “ё” and their lemmas, and near-homographs — utf8 encoding, 3.6 MB).
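One way to find near-homographs is to reduce every form to the spelling it would have in a text without stress marks or “ё”, and to group forms that share that spelling. The sketch below illustrates the idea; it is an assumption about a workable method, not a description of the program that produced zaliznjak_forms_e.txt, and the function names are hypothetical.

```python
from collections import defaultdict

STRESS_MARKS = "'`"   # primary and secondary stress marks used in the list
YO, YE = "\u0451", "\u0435"   # Cyrillic ё and е


def key(form: str) -> str:
    """Strip stress marks and fold ё to е, giving the plain-text spelling."""
    bare = form.translate(str.maketrans("", "", STRESS_MARKS))
    return bare.replace(YO, YE)


def yo_groups(forms):
    """Group forms by plain spelling, keeping only groups that contain a ё-form."""
    groups = defaultdict(list)
    for form in forms:
        groups[key(form)].append(form)
    return {k: v for k, v in groups.items() if any(YO in f for f in v)}


groups = yo_groups(["чёрту", "черту'", "жёстко", "стол"])
```

Here чёрту and черту' fall into one group, жёстко forms a group on its own, and стол is discarded because no form with that spelling contains “ё”. The sketch handles lower-case forms only.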

All three files are available in this download.


Enrichment of digitised Russian text by replacing “e” by “ё”.

The above download also contains a program for MS Windows, called udarjenie.exe, which is intended to help with the enrichment of digitised Russian text by replacing “е” by “ё”. The actual enrichment has to be performed manually, e.g. using a text editor; the program only supplies some information to help limit the labour involved.

The program reads a text file and lists the words in it that contain the letter “е” and for which the file zaliznjak_forms_e.txt (which must be available) contains the corresponding word with “ё” instead of “е”. If zaliznjak_forms_e.txt also contains the word with “е”, this is shown beside the form with “ё”. The result is a list of words in the text which may be considered as candidates for changing “е” to “ё”.

If zaliznjak_forms_e.txt contains the word with “ё”, but not the word with “е”, it strongly suggests a change to the text (but see below for caution, due to possible incompleteness of the zaliznjak_forms_e.txt file). If zaliznjak_forms_e.txt contains the words both with “ё” and with “е”, each individual token in the text must be inspected to see whether a change is appropriate.

The file containing the Russian text is expected to be encoded in UTF-8, with byte-order mark, and with CR/LF line endings. An option is offered to treat words containing a hyphen by dividing them into separate words. This option is active by default, and should improve the results in most cases. However, if the text already has hyphenated words explicitly separated into their parts, either by a space or by a plus-sign (e.g. по-+моему or что+-нибудь), the option should be unchecked.
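The lookup described above can be approximated in a few lines of Python. This is only a sketch of the behaviour as described, not the source of udarjenie.exe: the function names, the word pattern and the in-memory list of forms are all assumptions, and reading the text file and zaliznjak_forms_e.txt is left out.

```python
import re
from collections import Counter

YO, YE = "\u0451", "\u0435"   # Cyrillic ё and е
# Lower-case Cyrillic words, optionally joined by hyphens (assumed word pattern).
WORD = re.compile(r"[\u0430-\u044f\u0451]+(?:-[\u0430-\u044f\u0451]+)*")


def fold(form: str) -> str:
    """Remove stress marks and replace ё by е."""
    return form.replace("'", "").replace("`", "").replace(YO, YE)


def candidates(text, yo_forms, split_hyphens=True):
    """Return {word: (count, listed forms)} for words that may need ё.

    `yo_forms` holds forms as listed in the ё file: the ё-forms
    together with their near-homographs.
    """
    listed = {}
    for form in yo_forms:
        listed.setdefault(fold(form.lower()), []).append(form)
    counts = Counter()
    for match in WORD.finditer(text.lower()):
        word = match.group()
        parts = word.split("-") if split_hyphens else [word]
        counts.update(p for p in parts if YE in p)
    return {w: (n, listed[w]) for w, n in counts.items() if w in listed}


report = candidates("Он жестко ответил. К черту! Провел черту.",
                    ["жёстко", "чёрту", "черту'"])
```

With this input, жестко is reported once with the single listed form жёстко, and черту twice with both чёрту and черту', mirroring the two situations distinguished above. Words such as провел are not reported because the sample list has no matching entry.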

We show some typical lines of output:

жестко                                 1: жёстко, жёсткий

This means: the text contains жестко, once. zaliznjak_forms_e.txt contains жёстко (lemma: жёсткий), but not жестко. This suggests changing жестко to жёстко in the text.

тертый                                 1: тёртый, тереть, тёртый

This means: the text contains тертый, once. zaliznjak_forms_e.txt contains тёртый (lemmas: тереть or тёртый), but not тертый. This suggests changing тертый to тёртый in the text, but if it is desired to add the lemma, each individual token must be examined.

черту                                  2: чёрту, чёрт — черту', черта

This means: the text contains черту, twice. zaliznjak_forms_e.txt contains чёрту (lemma: чёрт), but also черту' (lemma: черта). This suggests changing some tokens of черту in the text to чёрту, viz. those which belong to lemma чёрт.

летом                                  5: лётом, лёт — ле'том, лето — летом, летом

This means: the text contains летом, five times. zaliznjak_forms_e.txt contains лётом (lemma: лёт), but also ле'том (lemmas: лето or летом). This suggests changing some tokens of летом in the text to лётом, viz. those which belong to lemma лёт.

Such choices may be between different lemmas, as in the previous two examples; or between inflectional categories of the same lemma, as:

звезды                                 4: звёзды [nom pl], звезда — звезды' [gen sg], звезда

or between variant spellings of the same inflectional category of the same lemma, as:

кием                                   2: киём [inst sg], кий — ки'ем [inst sg], кий


Comments on the data

The results produced by the program depend on the contents of zaliznjak_forms_e.txt, which in turn reflect the contents of Usachev's file. A number of features of this file may adversely affect the results, and are noted here. It would be advisable to check these points against the printed dictionary.

  • Some expected forms containing “ё” are absent from the list, among which we have noticed всё, вперёд, вдвоём (and вдвоем), втроём, трёхсот, дёрнув, насчёт, задёрнутым, передёрнув, продаёшь, поперёк, почём, причём, перевёртывать, отдаёт, съёжившись, оскорблённой, оскорблёнием, возбуждённо, щёлкнув, напролёт, живьём, отдёргивал, издаёт, незащищённым, развёртывал, ручьёв, нашёптывал and ещё. (Some may be correctly omitted, others are present in reflexive form only.) Their absence means that the program will not report the corresponding forms with “е” as candidates for addition of the accent. For example, a line such as the following will not appear in the program output:

    все                                  257: всё, весь — все', весь

  • In the marking of stress, the same form may occur in the list marked with both primary and secondary stress, or with primary stress only, or with no stress marked at all. For example, the lemma авиасъёмка has listed forms including both а`виасъёмки and авиасъёмки. If a text contains the form авиасъемки, this will produce the line:

    авиасъемки                             1: авиасъёмки, авиасъёмка — а`виасъёмки, авиасъёмка

    The choice implied between the two (form, lemma) pairs is not relevant for our purposes, even if there is a genuine difference of stress pattern between the two forms, and we should treat this as a single pair.

  • The list contains non-reflexive forms connected with reflexive lemmas, and vice versa. For example, the form толчётся is listed under both lemmas толочь and толочься. So if a text contains the form толчется, the program will produce a line like this:

    толчется                               1: толчётся, толочь, толочься

    The implied choice between two lemmas is spurious for our purposes, and only the reflexive lemma (in this case) should be considered.

  • In a number of places, the list gives not a complete derived form but only an ending, without specifying the stem to which the ending should be attached (see the lemmas искренний, развесёлый, compounds of ехаться, and compounds of толочь; some of these hanging endings can even be seen on the StarLing site). Hopefully the stems have been guessed correctly.
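The second and third points above (duplicate pairs differing only in stress marking, and spurious choices between reflexive and non-reflexive lemmas) could be reduced by post-processing the (form, lemma) pairs reported for a word. The following Python sketch shows one crude way to do so. It is a suggestion only, not something the program does, and the test for reflexivity (a final ся or сь) is a rough heuristic which will occasionally misfire on nouns.

```python
def strip_stress(form: str) -> str:
    """Remove primary (') and secondary (`) stress marks."""
    return form.replace("'", "").replace("`", "")


def is_reflexive(word: str) -> bool:
    """Rough heuristic: treat a final ся or сь as the reflexive suffix."""
    return word.endswith(("ся", "сь"))


def tidy(pairs):
    """Collapse stress variants, then drop lemmas whose reflexivity mismatches the form."""
    # 1. Keep the first pair seen for each (unstressed form, lemma) combination.
    unique = {}
    for form, lemma in pairs:
        unique.setdefault((strip_stress(form), lemma), (form, lemma))
    kept = list(unique.values())
    # 2. For each form, prefer lemmas that agree with it in reflexivity;
    #    if none agree, leave all of its lemmas in place.
    result = []
    for form, lemma in kept:
        same_form = [l for f, l in kept if strip_stress(f) == strip_stress(form)]
        matching = [l for l in same_form if is_reflexive(l) == is_reflexive(form)]
        if not matching or lemma in matching:
            result.append((form, lemma))
    return result


merged = tidy([("авиасъёмки", "авиасъёмка"), ("а`виасъёмки", "авиасъёмка")])
reflexive_only = tidy([("толчётся", "толочь"), ("толчётся", "толочься")])
```

Applied to the two examples given above, the first call leaves the single pair авиасъёмки, авиасъёмка, and the second leaves only толчётся, толочься. Pairs that differ in “ё” versus “е”, such as чёрту and черту', are left untouched.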



Russian morphology on-line

    The StarLing site. An alternative way to fetch paradigms from this site is exemplified by
          http://starling.rinet.ru/cgi-bin/morph.cgi?flags=endnnnnn&root=config&word=%F0%E0%E7%E2%E5%F1%E5%EB%FB%E9 for развеселый
          http://starling.rinet.ru/cgi-bin/morph.cgi?flags=endnnnnn&root=config&word=%F1%FA%E5%F5%E0%F2%FC%F1%FF for съехаться
          http://starling.rinet.ru/cgi-bin/morph.cgi?flags=endnnnnn&root=config&word=%E8%F1%EA%F0%E5%ED%ED%E8%E9 for искренний
          http://starling.rinet.ru/cgi-bin/morph.cgi?flags=endnnnnn&root=config&word=%F2%EE%EB%EE%F7%FC for толочь
    This method requires the word to be supplied as percent-encoded CP1251 byte values (for which see, e.g., here). The character “ё” cannot be included in a search request, and should be replaced by “е”.
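Such request URLs need not be composed by hand. The Python sketch below builds one from a word using the standard library's urllib.parse.quote with the CP1251 encoding, replacing “ё” by “е” first as advised above. The function name is hypothetical, the base URL is simply copied from the examples above, and whether the site still answers such requests has not been checked.

```python
from urllib.parse import quote

# Base URL copied from the examples above.
BASE = ("http://starling.rinet.ru/cgi-bin/morph.cgi"
        "?flags=endnnnnn&root=config&word=")
YO, YE = "\u0451", "\u0435"   # Cyrillic ё and е


def morph_url(word: str) -> str:
    """Build a StarLing paradigm request for a lower-case Russian word."""
    word = word.replace(YO, YE)   # the request should not contain ё
    # Percent-encode each CP1251 byte of the word.
    return BASE + quote(word, safe="", encoding="cp1251")


url = morph_url("толочь")
```

For толочь this yields the same address as the last of the four examples above.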

    More up-to-date than the on-line StarLing system are the downloadable versions for installation on local machines, for a variety of operating systems, and with English-language interfaces. However, I am unaware of any correspondingly updated version of Usachev's file.

    Another on-line site for Russian morphology, with a more modern interface, is Морфология.ру, but the material there is not marked for stress.

    A further useful source is the Russian Wiktionary.

    There are online services which insert primary stress into Russian text, and which change “е” to “ё”; one example is Морфер. Ambiguous tokens are treated by listing the alternatives, e.g. все is replaced by всё|все and господа by го́спода|господа́, and such ambiguities can only be resolved manually.


    Thanks

    Thanks are due to Maria Gouskova for advice on the use of Usachev's file.


    Ciarán Ó Duibhín
    2017/08/20