CRAN Task View: Natural Language Processing
| Maintainer: | Ingo Feinerer and Fridolin Wild |
| Contact: | Fridolin.Wild at wu-wien.ac.at |
| Version: | 2008-09-16 |
This CRAN Task View contains a list of packages useful for
natural language processing.
Lexical Databases:
- wordnet provides an R interface to WordNet, a large lexical database
of English.
Keyword Extraction and General String Manipulation:
- R's base package already provides a rich set of character manipulation
routines. See help.search(keyword = "character", package = "base")
for more information on these capabilities.
- RKEA provides an R interface to KEA (Version 5.0). KEA (for
Keyphrase Extraction Algorithm) allows for extracting keyphrases from
text documents. It can be used either for free indexing or for indexing
with a controlled vocabulary.
- gsubfn can be used for certain parsing tasks such as
extracting words from strings by content rather than by delimiters.
demo("gsubfn-gries") shows an example of this in a natural language
processing context.
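
The difference between splitting on delimiters and extracting by content,
mentioned in the gsubfn entry above, can be sketched with base R character
routines alone. This is a minimal illustration and does not use or
reproduce the gsubfn API; the sample sentence and the variable names are
made up for the example.

```r
# Illustrative input (an assumption, not taken from any package demo).
text <- "The quick brown fox jumps over the lazy dog. The dog sleeps."
lower <- tolower(text)

# Approach 1: split on delimiters, i.e. on runs of non-letter characters.
by_delim <- strsplit(lower, "[^a-z]+")[[1]]
by_delim <- by_delim[nzchar(by_delim)]  # drop empty strings, if any

# Approach 2: extract by content, i.e. pull out runs of letters directly.
by_content <- regmatches(lower, gregexpr("[a-z]+", lower))[[1]]

# Word frequencies, most frequent first.
freq <- sort(table(by_content), decreasing = TRUE)
print(freq)
```

For plain ASCII text both approaches yield the same tokens. Extraction by
content tends to be easier to adapt when the definition of a "word" is
more specific than "anything between separators".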
Natural Language Processing:
- openNLP provides an R interface to OpenNLP, a
collection of natural language processing tools including a
sentence detector, tokenizer, part-of-speech tagger, shallow and full
syntactic parser, and named-entity detector, using the Maxent
Java package for training and using maximum entropy
models.
- openNLPmodels ships trained models for English and Spanish to be used
with openNLP.
- RWeka is an interface to Weka, a collection of machine learning
algorithms for data mining tasks written in Java. Especially useful in
the context of natural language processing is its functionality for
tokenization and stemming.
- Snowball provides the Snowball stemmers which contain the Porter
stemmer and several other stemmers for different
languages. See the Snowball webpage for details.
- Alternatively, the Omegahat package Rstem
provides an R interface to a C version of Porter's word
stemming algorithm.
String Kernels:
- kernlab allows creating and computing with string kernels, such as
full string, spectrum, or bounded range string kernels. It can directly
use the document format used by tm as input.
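
To make the idea of a spectrum string kernel concrete, the following
base R sketch computes the inner product of the substring count vectors
of two strings. It is not kernlab's implementation or API; the helper
names kmer_counts and spectrum_kernel are hypothetical and introduced
only for illustration.

```r
# Count all contiguous substrings of length k in the string s.
kmer_counts <- function(s, k) {
  n <- nchar(s)
  if (n < k) return(integer(0))  # no substring of length k exists
  starts <- seq_len(n - k + 1)
  table(substring(s, starts, starts + k - 1))
}

# Spectrum kernel of order k: sum over shared substrings of the
# product of their counts in x and y.
spectrum_kernel <- function(x, y, k = 3) {
  cx <- kmer_counts(x, k)
  cy <- kmer_counts(y, k)
  shared <- intersect(names(cx), names(cy))
  if (length(shared) == 0) return(0)
  sum(as.numeric(cx[shared]) * as.numeric(cy[shared]))
}

# Usage: compare two short strings on their length-2 substrings.
print(spectrum_kernel("banana", "bandana", k = 2))
```

Strings that share many short substrings obtain a large kernel value,
which is what lets kernel methods such as support vector machines work
on raw text without an explicit feature matrix.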
Text Mining:
- tm provides a comprehensive text mining framework for
R. The Journal of Statistical Software article "Text Mining
Infrastructure in R" gives a detailed overview and presents
techniques for count-based analysis methods, text clustering,
text classification and string kernels.
- lsa provides routines for performing a latent semantic analysis with R.
The basic idea of latent semantic analysis (LSA) is that texts have a
higher-order (latent semantic) structure which, however, is obscured by
word usage (e.g. through the use of synonyms or polysemy). By using
conceptual indices that are derived statistically via a truncated
singular value decomposition (a two-mode factor analysis) over a given
document-term matrix, this variability problem can be overcome.
The article "Investigating Unstructured Texts with Latent Semantic
Analysis" gives a detailed overview and demonstrates the use of the
package with examples from the area of technology-enhanced learning.
- corpora offers utility functions for the statistical analysis of
corpus frequency data.
- languageR provides data sets and functions exemplifying statistical
methods, and some facilitatory utility functions used in the book by
R. H. Baayen: "Analyzing Linguistic Data: A Practical Introduction to
Statistics Using R", Cambridge University Press, 2008.
- zipfR offers some statistical models for word frequency distributions.
The utilities include functions for loading, manipulating and
visualizing word frequency data and vocabulary growth curves. The
package also implements several statistical models for the distribution
of word frequencies in a population. (The name of this package derives
from the most famous word frequency distribution, Zipf's law.)
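
The truncated singular value decomposition described in the lsa entry
above can be sketched with base R's svd() alone. This is a minimal
illustration and does not use the lsa package; the toy document-term
matrix, its counts, and the choice of k = 2 dimensions are assumptions
made for the example.

```r
# Toy document-term matrix (made-up counts): rows are documents,
# columns are terms. Documents 1-2 and 3-4 cover two different topics,
# but within a topic they prefer different words.
dtm <- matrix(
  c(2, 1, 0, 0,
    1, 2, 0, 0,
    0, 0, 1, 2,
    0, 0, 2, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4),
                  c("car", "automobile", "flower", "petal"))
)

k <- 2         # number of latent dimensions to keep (an assumption)
s <- svd(dtm)  # full singular value decomposition

# Keep only the k largest singular values and their vectors.
u_k <- s$u[, 1:k, drop = FALSE]
d_k <- diag(s$d[1:k], nrow = k)
v_k <- s$v[, 1:k, drop = FALSE]

# Rank-k approximation of the original matrix.
dtm_k <- u_k %*% d_k %*% t(v_k)
dimnames(dtm_k) <- dimnames(dtm)

# Document coordinates in the latent space.
doc_coords <- u_k %*% d_k
rownames(doc_coords) <- rownames(dtm)

cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

# Similarity of doc1 and doc2 before and after the reduction.
print(cosine(dtm["doc1", ], dtm["doc2", ]))
print(cosine(doc_coords["doc1", ], doc_coords["doc2", ]))
```

In this toy case, documents that use different words for the same topic
end up closer together in the reduced space than in the raw count
space, which is the effect the entry above attributes to LSA.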