The effectiveness of stemming for natural-language access to Slovene textual data

✍ Scribed by Popovič, Mirko; Willett, Peter

Publisher: John Wiley and Sons
Year: 1992
Tongue: English
Volume: 43
Category: Article
ISSN: 0002-8231
DOI: 10.1002/(sici)1097-4571(199206)43:5<384::aid-asi6>3.0.co;2-l

✦ Synopsis

There have been several studies of the use of stemming algorithms for conflating morphological variants in free-text retrieval systems. Comparison of stemmed and nonconflated searches suggests that there are no significant increases in the effectiveness of retrieval when stemming is applied to English-language documents and queries. This article reports the use of stemming on Slovene-language documents and queries, and demonstrates that the use of an appropriate stemming algorithm results in a large, and statistically significant, increase in retrieval effectiveness when compared with nonconflated processing; similar comments apply to the use of manual, right-hand truncation. A comparison is made with stemming of English versions of the same documents and queries, and it is concluded that the effectiveness of a stemming algorithm is determined by the morphological complexity of the language that it is designed to process.
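As a rough illustration of the two conflation techniques named in the synopsis, the following Python sketch contrasts manual right-hand truncation with automatic suffix stripping. It is not the algorithm evaluated in the article: the suffix list, the `MIN_STEM_LEN` threshold, and all function names are hypothetical, and the example words are English for readability only.

```python
from collections import defaultdict
from typing import Dict, List, Sequence

# Hypothetical toy parameters, chosen only for this illustration.
SUFFIXES = ("ATIONALLY", "ERS", "ING", "ER", "E")
MIN_STEM_LEN = 3


def truncate_match(stem: str, word: str) -> bool:
    """Right-hand truncation: a word matches if it begins with the stem."""
    return word.upper().startswith(stem.upper())


def strip_suffix(word: str, suffixes: Sequence[str] = SUFFIXES) -> str:
    """Remove the longest listed suffix, keeping at least MIN_STEM_LEN letters."""
    word = word.upper()
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            return word[: -len(suffix)]
    return word


def conflate(words: Sequence[str]) -> Dict[str, List[str]]:
    """Group word forms that reduce to the same stem."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for w in words:
        groups[strip_suffix(w)].append(w.upper())
    return dict(groups)


if __name__ == "__main__":
    forms = ["computers", "computing", "computationally", "retrieval"]
    print([w for w in forms if truncate_match("COMPUT", w)])
    print(conflate(forms))
```

With truncation the searcher chooses the stem by hand, whereas the suffix stripper derives it automatically from a suffix list. A language with richer inflection than English would need a far longer list and additional rules, which this sketch does not attempt.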

Introduction: Use of Stemming Algorithms

Morphological variation is one of the many characteristics of natural language that must be taken into account when designing a free-text retrieval system, since there may be some, or many, different forms of a given word, these forms resulting from the addition of different suffixes to a common word stem according to the dictates of grammar. For example, the stem COMPUT* can give rise to COMPUTERS, COMPUTING, and COMPUTATIONALLY, inter alia (where the symbol "*" denotes a variable-length don't-care match). Such variant word forms are likely to be of comparable importance in determining the relevance of a document to a user query that specifies just a single form, and this