Russian Word Lists
A list of Russian lemmas and their inflected forms can be downloaded from http://www.speakrus.ru/dict/all_forms.rar. This file was made by Andrei Usachev in 2004, from the data at the StarLing website of Sergei Starostin, which in turn is an on-line implementation of A. A. Zaliznjak’s morphological dictionary (Grammaticeskij slovar’ russkogo jazyka [A grammatical dictionary of the Russian language], Moscow: Russkij Jazyk, 1977).
Each line of Usachev's file contains a lemma followed by its derived forms. The letter “ё” is distinguished from “e”, and the forms (but not the lemmas) have primary (') and secondary (`) stress marked; however, there are numerous deviations from these principles. The file covers 86,839 lemmas and 1,558,781 forms (the latter figure is slightly inflated because homographs with different stress positions are counted as different forms).
Usachev's file uses the Windows CP1251 encoding. It has been automatically converted to Unicode UTF-8 encoding (with byte-order mark, and CR/LF line endings), using the CpConverter utility, v.0.1.5, and the result is contained in the download below, under the name of zaliznjak.txt (Russian lemmas and derived forms — utf8 encoding, 61.4 MB).
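The conversion described above was done with CpConverter, but it could be reproduced along the following lines. This is only a sketch: the function names and file paths are hypothetical, and it assumes the source contains only bytes that are defined in CP1251.

```python
import codecs


def cp1251_to_utf8_bom(raw: bytes) -> bytes:
    """Decode CP1251 bytes and re-encode as UTF-8 with BOM and CR/LF endings."""
    text = raw.decode("cp1251")
    # Normalise any mixture of line endings to CR/LF.
    text = text.replace("\r\n", "\n").replace("\r", "\n").replace("\n", "\r\n")
    return codecs.BOM_UTF8 + text.encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a whole file; both paths are supplied by the caller."""
    with open(src_path, "rb") as src:
        raw = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(cp1251_to_utf8_bom(raw))
```

Working on bytes rather than text-mode files keeps the byte-order mark and the line endings under explicit control.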
The data has been further converted by program so that each line contains a form followed by its possible lemmas. The result is contained in the download below, under the name of zaliznjak_forms.txt (Russian forms and their lemmas — utf8 encoding, 68.6 MB).
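The inversion from "lemma followed by forms" to "form followed by lemmas" can be sketched as below. The exact line format of the files is not described here, so the separators used ("#" after the lemma, "," between forms) are assumptions to be checked against the actual data, and the function name is hypothetical.

```python
from collections import defaultdict


def invert(lines, lemma_sep="#", form_sep=","):
    """Map each form to the list of lemmas under which it occurs.

    Assumes each line looks like 'lemma#form1,form2,...' (an assumption,
    not a documented format).
    """
    forms_to_lemmas = defaultdict(list)
    for line in lines:
        line = line.strip().lstrip("\ufeff")  # drop a byte-order mark if present
        if not line or lemma_sep not in line:
            continue
        lemma, _, rest = line.partition(lemma_sep)
        for form in rest.split(form_sep):
            form = form.strip()
            if form and lemma not in forms_to_lemmas[form]:
                forms_to_lemmas[form].append(lemma)
    return dict(forms_to_lemmas)
```

A form that belongs to several lemmas, such as ле'том, ends up with all of them listed in the order in which they were met.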
As a further development, the previous file has been filtered, leaving only forms containing the letter “ё”. For each such form, we add (on the same line) any near-homographs, that is, forms which differ from it only in having “e” in place of “ё”. For example, всем is a near-homograph of всём. This file is intended to assist with manual enrichment of digitised Russian texts by adding this diacritic where appropriate (see section below). There are 81,264 forms containing “ё”. The result is contained in the download below, under the name of zaliznjak_forms_e.txt (Russian forms with “ё” and their lemmas, and near-homographs; utf8 encoding, 3.6 MB).
All three files are available in this download.
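The search for near-homographs can be illustrated as follows. This is a sketch, not the program actually used: the function names are hypothetical, and it assumes that stress marks (' and `) are ignored when comparing forms, as the sample output further down suggests.

```python
from collections import defaultdict

STRESS_MARKS = "'`"


def plain_key(form):
    """Remove stress marks and replace ё by е, giving a comparison key."""
    for mark in STRESS_MARKS:
        form = form.replace(mark, "")
    return form.replace("ё", "е").replace("Ё", "Е")


def yo_forms_with_near_homographs(forms):
    """For each form containing ё, list the forms that match it once ё is read as е."""
    by_key = defaultdict(list)
    for form in forms:
        by_key[plain_key(form)].append(form)
    result = {}
    for form in forms:
        if "ё" in form.lower():
            result[form] = [g for g in by_key[plain_key(form)] if "ё" not in g.lower()]
    return result
```

Forms without “ё” never become keys of the result; they appear only as near-homographs of some form that has it.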
The above download also contains a program for MS Windows, called udarjenie.exe, which is intended to help with the enrichment of digitised Russian text by replacing “e” by “ё”. The actual enrichment has to be performed manually, e.g. using a text editor; the program only supplies some information to help limit the labour involved.
The program reads a text file and lists the words in the text that contain the letter “e” and for which the file zaliznjak_forms_e.txt (which must be available) contains the corresponding word with “ё” instead of “e”. If zaliznjak_forms_e.txt also contains the word with “e”, this is shown beside the form with “ё”. The result is a list of words in the text which may be considered as candidates for changing “e” to “ё”.
If zaliznjak_forms_e.txt contains the word with “ё”, but not the word with “e”, it strongly suggests a change to the text (but see below for caution, due to possible incompleteness of the zaliznjak_forms_e.txt file). If zaliznjak_forms_e.txt contains the words both with “ё” and with “e”, each individual token in the text must be inspected to see whether a change is appropriate.
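The logic of the two cases above can be sketched as follows. This is not the source of udarjenie.exe: the function name and the two input sets are hypothetical, and the forms are assumed to have had their stress marks removed already.

```python
from collections import Counter


def find_candidates(words, yo_forms, e_forms):
    """List text words that could be respelled with ё.

    words    -- tokens of the text
    yo_forms -- forms written with ё (stress marks removed)
    e_forms  -- their near-homographs written with е (stress marks removed)
    """
    counts = Counter(w.lower() for w in words if "е" in w.lower())
    by_plain = {}
    for form in yo_forms:
        by_plain.setdefault(form.replace("ё", "е"), []).append(form)
    report = []
    for word, n in sorted(counts.items()):
        if word in by_plain:
            # A listed е-spelling means each token has to be checked by hand.
            status = "inspect each token" if word in e_forms else "change suggested"
            report.append((word, n, sorted(by_plain[word]), status))
    return report
```

Each entry of the report gives the word as found in the text, its frequency, the possible spellings with “ё”, and which of the two cases applies.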
The file containing the Russian text is expected to be encoded in utf8, with byte-order mark, and with CR/LF line endings. An option is offered to treat words containing a hyphen by dividing them into separate words. This option is active by default, and should improve the results in most cases. However, if the text already has hyphenated words explicitly separated into their parts, either by a space or by a plus-sign (e.g. по-+моему or что+-нибудь), the option should be unchecked.
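The effect of the hyphen option can be illustrated by a small tokeniser. This is a sketch under assumptions: the function names are hypothetical, and a word is taken to be a run of Russian letters, optionally joined by single hyphens.

```python
import re

# A run of Russian letters, optionally continued by hyphen + letters.
WORD_RE = re.compile(r"[А-Яа-яЁё]+(?:-[А-Яа-яЁё]+)*")


def read_text(path):
    """Read a utf8 file; 'utf-8-sig' silently drops the byte-order mark."""
    with open(path, encoding="utf-8-sig") as f:
        return f.read()


def tokenize(text, split_hyphens=True):
    """Return the words of the text, dividing hyphenated words if requested."""
    words = WORD_RE.findall(text)
    if split_hyphens:
        words = [part for word in words for part in word.split("-")]
    return words
```

With the option switched off, a word such as по-моему is kept whole, while по-+моему still yields two words because the plus-sign already separates its parts.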
We show some typical lines of output:
жестко 1: жёстко, жёсткий
This means: the text contains жестко, once. zaliznjak_forms_e.txt contains жёстко (lemma: жёсткий), but not жестко. This suggests changing жестко to жёстко in text.
тертый 1: тёртый, тереть, тёртый
This means: the text contains тертый, once. zaliznjak_forms_e.txt contains тёртый (lemmas: тереть or тёртый), but not тертый. This suggests changing тертый to тёртый in text, but if it is desired to add the lemma, each individual token must be examined.
черту 2: чёрту, чёрт — черту', черта
This means: the text contains черту, twice. zaliznjak_forms_e.txt contains чёрту (lemma: чёрт), but also черту' (lemma: черта). This suggests changing some tokens of черту in the text to чёрту, viz. those which belong to lemma чёрт.
летом 5: лётом, лёт — ле'том, лето — летом, летом
This means: the text contains летом, five times. zaliznjak_forms_e.txt contains лётом (lemma: лёт), but also ле'том (lemmas: лето or летом). This suggests changing some tokens of летом in the text to лётом, viz. those which belong to lemma лёт.
Such choices may be between different lemmas, as in the previous two examples; or between inflectional categories of the same lemma, as:
звезды 4: звёзды [nom pl], звезда — звезды' [gen sg], звезда
or between variant spellings of the same inflectional category of the same lemma, as:
кием 2: киём [inst sg], кий — ки'ем [inst sg], кий
The results produced by the program depend on the contents of zaliznjak_forms_e.txt, which in turn reflect the contents of Usachev's file. A number of features of this file may adversely affect the results, and are noted here. It would be advisable to check these points against the printed dictionary.
The StarLing site. An alternative way to fetch paradigms from this site is exemplified by
http://starling.rinet.ru/cgi-bin/morph.cgi?flags=endnnnnn&root=config&word=%F0%E0%E7%E2%E5%F1%E5%EB%FB%E9 for развеселый
http://starling.rinet.ru/cgi-bin/morph.cgi?flags=endnnnnn&root=config&word=%F1%FA%E5%F5%E0%F2%FC%F1%FF for съехаться
http://starling.rinet.ru/cgi-bin/morph.cgi?flags=endnnnnn&root=config&word=%E8%F1%EA%F0%E5%ED%ED%E8%E9 for искренний
http://starling.rinet.ru/cgi-bin/morph.cgi?flags=endnnnnn&root=config&word=%F2%EE%EB%EE%F7%FC for толочь
This method requires the word to be supplied as percent-encoded CP1251 byte values (for which see, e.g., here). The character “ё” cannot be included in a search request, and should be replaced by “е”.
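Such request addresses can be built with the Python standard library, as in the following sketch. The function name is hypothetical; the query parameters are copied from the examples above, and no claim is made about how the site itself processes them.

```python
from urllib.parse import quote

BASE = "http://starling.rinet.ru/cgi-bin/morph.cgi"


def starling_url(word):
    """Build a request address with the word percent-encoded as CP1251 bytes."""
    # As advised above, ё is replaced by е before encoding.
    word = word.replace("ё", "е").replace("Ё", "Е")
    encoded = quote(word, safe="", encoding="cp1251")
    return f"{BASE}?flags=endnnnnn&root=config&word={encoded}"


print(starling_url("толочь"))
```

For the words развеселый and толочь this reproduces the addresses given above.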
More up-to-date than the on-line StarLing system are the downloadable versions for installation on local machines, for a variety of operating systems, and with English-language interfaces. However, I am unaware of any correspondingly updated version of Usachev's file.
Another on-line site for Russian morphology, with a more modern interface, is Морфология.ру, but the material there is not marked for stress.
A further useful source is the Russian Wiktionary.
There are online services which insert primary stress into Russian text and change “e” to “ё”, for example Морфер. Ambiguous tokens are treated by listing the alternatives, e.g. все is replaced by всё|все and господа by го́спода|господа́; such ambiguities can only be resolved manually.
Thanks are due to Maria Gouskova for advice on the use of Usachev's file.