This directory contains electronic resources related to the paper:

  Vlado Keselj and Danko Sipka. A Suffix Subsumption-based Approach to
  Building Stemmers and Lemmatizers for Highly Inflectional Languages
  with Sparse Resources. In INFOTHECA, Journal of Informatics and
  Librarianship, No 1-2, Volume IX, May 2008.

Contents:

0_README.TXT - This file, containing descriptions of the files and some
    basic information in English.

0_PROCXITAJ.TXT - A Serbian translation of the file 0_README.TXT.

all.zip - All files listed here, except all.zip, zipped in one package.

Basic-Serbian-Lexical-Resource.zip
    The basic lexical resource for the Serbian language, used as the
    starting resource. The zipped file contains the following files:
      list-w-l - list of word-form/lemma pairs
      list-w   - list of word-forms
      list-l   - list of lemmas
    The files contain the words that were added in section 5.2 of the
    paper. The current statistics of the resource are:
      list-l:   47489 lemmas (0.47 MB)
      list-w:   675140 word-forms (7.3 MB)
      list-w-l: 696454 word-form/lemma pairs (14.6 MB)

vreme-words.zip
    Text corpus of the news magazine "Vreme" covering the five-year
    period 2001-2005. The corpus is processed so that the file contains
    only the words of the corpus, in their original order, one word per
    line. The file contains 6.6 million words (42 MB).

stem-classes.zip
    Produced stem classes, as described in Step 4.1 (Sec 5.3) of the
    paper. 677,868 words in 41,681 stem-classes.

out-word-stem.zip
    Generated word-stem pairs (677,868 pairs, 12.6 MB).

out-stems.zip
    Generated stems with dictionary frequencies (39,322 stems).

out-suffixes.zip
    Generated suffixes with dictionary frequencies (17840 suffixes).

out-greedy-rules.zip
    Generated suffix greedy rules (1000 rules, based on method 4.4b in
    the paper). The rules are applied as described in the paper
    (subsumption precedence, i.e., longer suffixes = higher precedence).

out-opt-rules.zip
    Generated suffix optimal rules (17839 rules, based on method 4.4c in
    the paper).
    The "optimality" does not necessarily imply that the stemmer would
    have optimal performance in general; see the paper. The rules are
    applied as described in the paper (subsumption precedence, i.e.,
    longer suffixes = higher precedence).

stemmer-greedy.pl
    Greedy stemmer, as described in method 4.4b of the paper, written in
    Perl. It is a standalone Perl program that reads the standard input
    (or files whose names are provided on the command line) and prints
    stemmed output.

stemmer-opt.pl
    Optimal suffix stemmer, as described in method 4.4c of the paper,
    written in Perl. It is a standalone Perl program that reads the
    standard input (or files whose names are provided on the command
    line) and prints stemmed output.

---
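
Illustration of the rule precedence (not part of the distributed files):
the sketch below shows one way the "longer suffixes = higher precedence"
principle mentioned above could be applied. It is written in Python for
brevity and is not the code of stemmer-greedy.pl or stemmer-opt.pl. The
function name strip_longest_suffix, the representation of rules as a plain
set of suffixes to remove, and the three toy suffixes are all assumptions
made for this example; the actual rule format is the one described in the
paper and found in the rule files.

```python
def strip_longest_suffix(word, suffixes):
    """Remove the longest suffix of `word` found in `suffixes`.

    `suffixes` is a hypothetical rule set: a plain set of strings to strip.
    At least one character of the word is always kept.
    """
    # Candidate suffixes are examined from longest to shortest, so a longer
    # suffix wins over any shorter suffix that it subsumes.
    for start in range(1, len(word)):
        if word[start:] in suffixes:
            return word[:start]
    # No rule matched: the word is returned unchanged.
    return word


# Toy rule set invented for illustration; not taken from the rule files.
TOY_SUFFIXES = {"a", "e", "ama"}

for w in ["knjiga", "knjigama", "grad"]:
    print(w, strip_longest_suffix(w, TOY_SUFFIXES))
```

With these toy rules, "knjigama" matches both "ama" and "a"; the longer
suffix "ama" is chosen, so "knjiga" and "knjigama" end up with the same
stem.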