Snowball stemming library collection for Python
===============================================

Python 3 (>= 3.3) is supported. We no longer support Python 2 as the Python
developers stopped supporting it at the start of 2020. Snowball 2.1.0 was the
last release to officially support Python 2; Snowball 3.0.0 was the last
release which had the code to support Python 2, but we were no longer testing
it.

What is Stemming?
-----------------

Stemming maps different forms of the same word to a common "stem" - for
example, the English stemmer maps *connection*, *connections*, *connective*,
*connected*, and *connecting* to *connect*. So a search for *connected* would
also find documents which only have the other forms.

This stem form is often a word itself, but this is not always the case as this
is not a requirement for text search systems, which are the intended field of
use. We also aim to conflate words with the same meaning, rather than all
words with a common linguistic root (so *awe* and *awful* don't have the same
stem), and over-stemming is more problematic than under-stemming, so we tend
not to stem in cases that are hard to resolve. If you want to always reduce
words to a root form and/or get a root form which is itself a word, then
Snowball's stemming algorithms likely aren't the right answer.

How to use library
------------------

The stemming algorithms generally expect the input text to use composed
accents (Unicode NFC or NFKC) and to have been folded to lower case already.

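For example, the standard library's ``unicodedata`` module can do this
preparation before stemming (a minimal sketch; the ``prepare`` helper here is
purely illustrative, not part of this library):

.. code-block:: python

   import unicodedata

   def prepare(text):
       # Compose accents (NFC) and fold to lower case before stemming.
       return unicodedata.normalize("NFC", text).lower()

   words = prepare("Connected Re\u0301sume\u0301").split()
   print(words)  # -> ['connected', 'résumé']
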
The ``snowballstemmer`` module has two functions.

The ``snowballstemmer.algorithms`` function returns a list of available
algorithm names.

The ``snowballstemmer.stemmer`` function takes an algorithm name and returns a
``Stemmer`` object.

``Stemmer`` objects have a ``Stemmer.stemWord(word)`` method and a
``Stemmer.stemWords(word[])`` method.

.. code-block:: python

   import snowballstemmer

   stemmer = snowballstemmer.stemmer('english')
   print(stemmer.stemWords("We are the world".split()))

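The other function and the single-word method work the same way; a small
sketch (the exact list of algorithm names depends on the release you have
installed):

.. code-block:: python

   import snowballstemmer

   # List the available algorithm names, e.g. 'english', 'french', ...
   print(snowballstemmer.algorithms())

   # Stem a single word with the English algorithm.
   stemmer = snowballstemmer.stemmer('english')
   print(stemmer.stemWord('connected'))  # -> 'connect'
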
Generally you should create a stemmer object and reuse it rather than creating
a fresh object for each word stemmed (since there's some cost to creating and
destroying the object).

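For instance, a single stemmer object can be shared across all the words of
all your documents (a minimal sketch):

.. code-block:: python

   import snowballstemmer

   stemmer = snowballstemmer.stemmer('english')

   documents = [
       "we are connecting the world",
       "the connections were connected",
   ]

   # Reuse the same stemmer object for every document and word.
   stemmed = [stemmer.stemWords(doc.split()) for doc in documents]
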
The stemmer code is re-entrant, but not thread-safe if the same stemmer object
is used concurrently in different threads.

If you want to perform stemming concurrently in different threads, we suggest
creating a new stemmer object for each thread. The alternative is to share
stemmer objects between threads and protect access using a mutex or similar
(e.g. ``threading.Lock`` in Python) but that's liable to slow your program
down as threads can end up waiting for the lock.

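One possible pattern for the per-thread approach (an illustrative sketch; the
helper names are not part of the library's API) is to keep a stemmer in
``threading.local()`` storage so each worker thread lazily creates its own:

.. code-block:: python

   import threading
   from concurrent.futures import ThreadPoolExecutor

   import snowballstemmer

   _local = threading.local()

   def _get_stemmer():
       # Lazily create one stemmer per worker thread.
       if not hasattr(_local, "stemmer"):
           _local.stemmer = snowballstemmer.stemmer('english')
       return _local.stemmer

   def stem_document(words):
       return _get_stemmer().stemWords(words)

   documents = [["connections", "connected"], ["running", "runs"]]
   with ThreadPoolExecutor(max_workers=4) as pool:
       results = list(pool.map(stem_document, documents))
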
Automatic Acceleration
----------------------

**PyStemmer** is a wrapper module for Snowball's ``libstemmer_c`` and should
provide results 100% compatible with **snowballstemmer**.

**PyStemmer** is faster because it wraps generated C versions of the stemmers;
**snowballstemmer** uses generated Python code and is slower but offers a pure
Python solution.

If PyStemmer is installed, ``snowballstemmer.stemmer`` returns a PyStemmer
``Stemmer`` object which provides the same ``Stemmer.stemWord()`` and
``Stemmer.stemWords()`` methods.

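If you want to confirm which implementation you are getting, plain Python
introspection is enough (an illustrative check, not an official API):

.. code-block:: python

   import snowballstemmer

   stemmer = snowballstemmer.stemmer('english')

   # The module and class of the returned object indicate whether the
   # pure-Python stemmer or a PyStemmer-backed one is in use.
   print(type(stemmer).__module__, type(stemmer).__name__)
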
Benchmark
~~~~~~~~~