RLC Subcorpus


The Russian Learner Corpus of Academic Writing, or RULEC, is the product of a collaboration between language and pedagogy researchers Olesya Kisselev and Anna Alsufieva of Portland State University and Ekaterina Rakhilina and Timofey Arkhangelskiy of the National Research University Higher School of Economics. RULEC is a longitudinal corpus of Russian learner language that includes written papers produced by advanced American students of Russian as a Foreign or Heritage Language. The corpus represents a new tool for those interested in linguistics, Second Language Acquisition, and language pedagogy.

What is RULEC?

The materials for RULEC were collected over a period of four years from students studying Russian at an American university in a special program designed for advanced-level Second Language (L2) or Heritage Language (HL) speakers of Russian. The corpus is relatively small: the texts were authored by 36 learners. Of these, 17 are mainstream American learners who started learning Russian as adults, and 19 are heritage speakers of Russian, either born in a Russian-speaking country and brought to the US as children, or born in the US and raised in a Russian-speaking family. Each author, however, is well represented in the body of the corpus. RULEC now includes approximately 3,800 written papers, ranging from a short paragraph to 8-page research papers (no grammatical or lexical exercises are included in the corpus).


Although RULEC was originally compiled chiefly with local needs in mind, the richness of its design criteria should allow for a variety of research questions and analyses. The following meta-categories are assigned to each text:

  1. student’s name (pseudonym),
  2. gender,
  3. language background and language experience of the student (L2 or HL),
  4. student’s linguistic level (established through external tests),
  5. time stamp (week and academic year when the paper was written),
  6. time limit under which the paper was written (timed or non-timed),
  7. text type (one paragraph or a long research paper),
  8. text function (e.g. narration, argumentation), and
  9. whether a paper was written individually or in a group.

These categories are reflected in the Header Identification Box (Header ID) of each text in the corpus. Sub-corpora can be created automatically on the basis of these meta-categories.
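
To illustrate how selecting a sub-corpus by meta-category might work, here is a minimal Python sketch. It does not reproduce the actual RULEC interface or Header ID format: the `TextHeader` class, its field names, the `select_subcorpus` helper, and the sample records are all hypothetical and only mirror the nine meta-categories listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextHeader:
    """Hypothetical stand-in for a Header ID; field names are assumptions."""
    author: str         # pseudonym
    gender: str
    background: str     # "L2" or "HL"
    level: str          # level established through external tests
    week: int           # week of the academic year
    academic_year: str
    timed: bool         # True if written under a time limit
    text_type: str      # e.g. "paragraph" or "research paper"
    function: str       # e.g. "narration" or "argumentation"
    group_work: bool    # True if written in a group


def select_subcorpus(headers, **criteria):
    """Return the headers whose fields equal every given criterion."""
    return [
        h for h in headers
        if all(getattr(h, name) == value for name, value in criteria.items())
    ]


# Invented sample records, for illustration only (not real corpus data).
SAMPLE_HEADERS = [
    TextHeader("Anna", "F", "HL", "advanced", 3, "2010-2011",
               True, "paragraph", "narration", False),
    TextHeader("Ben", "M", "L2", "advanced", 3, "2010-2011",
               False, "research paper", "argumentation", False),
    TextHeader("Anna", "F", "HL", "advanced", 9, "2010-2011",
               False, "research paper", "argumentation", True),
]

# Timed texts written by heritage speakers.
heritage_timed = select_subcorpus(SAMPLE_HEADERS, background="HL", timed=True)
```

Combining criteria as keyword arguments keeps each selection readable, for example all texts by one author for a longitudinal comparison: `select_subcorpus(SAMPLE_HEADERS, author="Anna")`.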

All words in RULEC are assigned morphological tags that contain information such as part of speech, gender, case, and aspect. When a word is misspelled, its tag also contains the mark “bastard” (i.e. irregular form). The corpus interface allows searching by grammatical and lexical categories.
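
As a rough illustration of searching by lexical and grammatical categories, the Python sketch below filters tagged tokens by lemma and by required tags. It is an assumption-laden example, not the corpus's own search engine: the `Token` class, the `search` function, the tag labels other than “bastard”, and the sample tokens are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """Hypothetical tagged token; the tag labels are assumptions."""
    form: str                      # word as written by the learner
    lemma: str                     # dictionary form
    tags: frozenset = frozenset()  # morphological tags


def search(tokens, lemma=None, tags=()):
    """Return tokens matching the lemma (if given) and carrying all tags."""
    wanted = frozenset(tags)
    return [
        t for t in tokens
        if (lemma is None or t.lemma == lemma) and wanted <= t.tags
    ]


# Invented sample tokens, for illustration only (not real corpus data).
TOKENS = [
    Token("книгу", "книга", frozenset({"noun", "fem", "acc"})),
    # A misspelled form additionally carries the "bastard" mark.
    Token("кнегу", "книга", frozenset({"noun", "fem", "acc", "bastard"})),
    Token("читал", "читать", frozenset({"verb", "ipf", "masc"})),
]

# All forms marked as irregular (misspelled).
misspelled = search(TOKENS, tags=["bastard"])
```

Treating the “bastard” mark as one more tag means a query for misspelled forms can be combined with any other category, for example `search(TOKENS, lemma="книга", tags=["acc"])`.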


As mentioned above, RULEC has a small number of authors. Although this may preclude researchers from drawing general conclusions about an “average” advanced learner of Russian as a Foreign or Heritage Language, the relatively large number of works representing each learner may be an advantage for longitudinal studies, ethnographic studies, or studies that require close tracking of interlanguage development. Moreover, insights drawn from analyses of RULEC materials may serve as hypotheses for studies on larger corpora. RULEC is thus a versatile resource that may be used in a variety of studies and for a variety of purposes.