This directory contains electronic resources related to the paper:

    Vlado Keselj and Danko Sipka. A Suffix Subsumption-based Approach
    to Building Stemmers and Lemmatizers for Highly Inflectional
    Languages with Sparse Resources.  In INFOTHECA, Journal of
    Informatics and Librarianship, No 1-2, Volume IX, May 2008.

Contents:

0_README.TXT - This file, containing descriptions of the files and
    some basic information in English.

0_PROCXITAJ.TXT - A Serbian translation of the file 0_README.TXT.

all.zip - All files listed here, except all.zip, zipped in one
    package.

Basic-Serbian-Lexical-Resource.zip
    The basic lexical resource for the Serbian language, used as the
    starting resource.  The zipped file contains the following files:
      list-w-l - list of word-form/lemma pairs
      list-w   - list of word-forms
      list-l   - list of lemmas
    The files contain the words that were added in section 5.2 of the
    paper.  The current statistics of the resource are:
      list-l:    47489 lemmas (0.47 MB)
      list-w:   675140 word-forms (7.3 MB)
      list-w-l: 696454 word-form/lemma pairs (14.6 MB)
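
    Since list-w-l has more pairs than list-w has word-forms, some
    word-forms evidently correspond to more than one lemma.  A minimal
    Python sketch of loading the pairs into a word-form -> lemmas map
    follows.  It assumes one pair per line with the word-form and the
    lemma separated by whitespace; the actual file layout and encoding
    are not described here, and the function name load_pairs and the
    sample pairs are illustrative only, not taken from the resource.

```python
from collections import defaultdict


def load_pairs(lines):
    """Map each word-form to the set of lemmas listed for it.

    Assumption: each line holds "word-form lemma", whitespace-separated.
    """
    lemmas_of = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) != 2:  # skip blank or malformed lines
            continue
        word, lemma = fields
        lemmas_of[word].add(lemma)
    return lemmas_of


# Illustrative sample (not from the resource): "sam" maps to two lemmas.
sample = ["sam biti", "sam sam", "knjige knjiga"]
pairs = load_pairs(sample)
ambiguous = sorted(w for w, ls in pairs.items() if len(ls) > 1)
print(ambiguous)
```

    With the real file, the same function can be given an open file
    object instead of the sample list, once the encoding is known.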

vreme-words.zip
    Text corpus of the news magazine "Vreme" covering the five years
    2001-2005.  The corpus is processed so that the file contains only
    the words of the corpus, in their original order, one word per line.
    The file contains 6.6 million words (42 MB).
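
    Because the corpus is stored one word per line, corpus frequencies
    can be counted in a single pass.  The Python sketch below shows
    the idea on a short made-up word list; the function name
    count_words is illustrative, and the name and encoding of the file
    inside vreme-words.zip are not assumed here.

```python
from collections import Counter


def count_words(lines):
    """Count occurrences of each word in a one-word-per-line stream."""
    counts = Counter()
    for line in lines:
        word = line.strip()
        if word:  # ignore empty lines
            counts[word] += 1
    return counts


# Made-up sample standing in for the unzipped corpus file.
sample = ["vreme", "je", "novac", "vreme"]
counts = count_words(sample)
print(counts.most_common(1))
```

    Passing an open file object instead of the sample list processes
    the corpus line by line without loading it into memory at once.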

stem-classes.zip
    Stem classes produced as described in Step 4.1 (Sec 5.3) of the
    paper: 677,868 words in 41,681 stem classes.

out-word-stem.zip
    Generated word-stem pairs (677,868 pairs, 12.6 MB)

out-stems.zip
    Generated stems with frequencies (dictionary frequencies)
    (39,322 stems).

out-suffixes.zip
    Generated suffixes with dictionary frequencies (17840 suffixes).

out-greedy-rules.zip
    Generated greedy suffix rules (1000 rules, based on method 4.4b
    in the paper).  The rules are applied as described in the paper
    (subsumption precedence, i.e., longer suffixes = higher
    precedence).
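
    The precedence described above (a longer matching suffix wins over
    any shorter suffix it subsumes) can be sketched in Python as shown
    below.  This is an illustration under stated assumptions, not the
    code of the distributed stemmers: it assumes each rule maps a
    suffix to a replacement string, which may differ from the actual
    rule format in the zip file, and the function name apply_rules and
    the sample rules are made up.

```python
def apply_rules(word, rules):
    """Rewrite the longest suffix of `word` that has a rule.

    `rules` maps suffix -> replacement (assumed format).  Suffixes are
    tried from longest to shortest, so a longer suffix takes precedence
    over any shorter suffix it subsumes.
    """
    for start in range(len(word)):
        suffix = word[start:]
        if suffix in rules:
            return word[:start] + rules[suffix]
    return word  # no rule matched: leave the word unchanged


# Made-up rules: both "a" and "ama" match "knjigama"; "ama" is longer.
rules = {"a": "", "ama": ""}
for w in ["knjigama", "knjiga", "grad"]:
    print(w, apply_rules(w, rules))
```

    With only the rule for "a", the word "knjigama" would be reduced to
    "knjigam"; the longer rule for "ama" takes precedence and yields
    "knjig".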

out-opt-rules.zip
    Generated optimal suffix rules (17839 rules, based on method 4.4c
    in the paper).  The "optimality" does not necessarily imply that
    the stemmer would have the best performance in general;
    see the paper.  The rules are applied as described in the paper
    (subsumption precedence, i.e., longer suffixes = higher
    precedence).

stemmer-greedy.pl
    Greedy stemmer, as described in method 4.4b of the paper, written
    in Perl.  It is a standalone Perl program that reads the standard
    input (or files whose names are given on the command line)
    and prints the stemmed output.

stemmer-opt.pl
    Optimal suffix stemmer, as described in method 4.4c of the
    paper, written in Perl.  It is a standalone Perl program that reads
    the standard input (or files whose names are given on the
    command line) and prints the stemmed output.

---