CRAN Task View: Machine Learning & Statistical Learning
| Maintainer: | Torsten Hothorn |
| Contact: | Torsten.Hothorn at R-project.org |
| Version: | 2008-10-01 |
Several add-on packages implement ideas and methods developed at the
borderline between computer science and statistics - this field of research
is usually referred to as machine learning.
The packages can be roughly structured into the following topics:
-
Neural Networks
: Single-hidden-layer neural networks are
implemented in package nnet as part of the
VR
bundle (shipped with base R).
-
Recursive Partitioning
: Tree-structured models for
regression, classification and survival analysis, following the
ideas in the CART book, are
implemented in
rpart
(shipped with base R) and
tree.
Package
rpart
is recommended for computing CART-like
trees.
A rich toolbox of partitioning algorithms is available in
Weka;
package
RWeka
provides an interface to this
implementation, including the J4.8-variant of C4.5 and M5.
Two recursive partitioning algorithms with unbiased variable
selection and a statistical stopping criterion are implemented in
package
party. Function
ctree()
is based on
nonparametric conditional inference procedures for testing
independence between the response and each input variable, whereas
mob()
can be used to partition parametric models.
Extensible tools for visualizing binary trees
and node distributions of the response are available in package
party
as well.
An adaptation of
rpart
for multivariate responses
is available in package
mvpart. A tree algorithm fitting
nearest neighbors in each node is implemented in package
knnTree. For problems with binary input variables
the package
LogicReg
implements logic regression.
Graphical tools for the visualization of
trees are available in packages
maptree
and
pinktoe.
-
Random Forests
: The reference implementation of the random
forest algorithm for regression and classification is available in
package
randomForest. Package
ipred
has bagging
for regression, classification and survival analysis as well as
bundling, a combination of multiple models via
ensemble learning. In addition, a random forest variant for
response variables measured at arbitrary scales based on
conditional inference trees is implemented in package
party.
randomSurvivalForest
offers a random forest algorithm for
censored data.
The
varSelRF
package focuses on variable selection by means
of random forest algorithms.
-
Regularized and Shrinkage Methods
: Regression models with some
constraint on the parameter estimates can be fitted with the
lasso2
and
lars
packages.
The L1 regularization path for generalized linear models and
Cox models can be obtained from functions available in package
glmpath. The
penalized
package provides
an alternative implementation of lasso (L1) and ridge (L2)
penalized regression models (both GLM and Cox models).
The shrunken
centroids classifier and utilities for gene expression analyses are
implemented in package
pamr. An implementation
of multivariate adaptive regression splines is available
in package
earth.
-
Boosting
: Various forms of gradient boosting are
implemented in packages
gbm
(tree-based functional gradient
descent boosting) and
boost
(including LogitBoost
and L2Boost). Package
GAMBoost
can be used to fit generalized additive models
by a boosting algorithm. An extensible boosting framework for
generalized linear, additive and nonparametric models is available in
package
mboost.
-
Support Vector Machines and Kernel Methods
: The function
svm()
from
e1071
offers an interface to the LIBSVM library and
package
kernlab
implements a flexible framework
for kernel learning (including SVMs, RVMs and other kernel
learning algorithms). An interface to the SVMlight implementation
(only for one-against-all classification) is provided in package
klaR.
The relevant dimension in kernel feature spaces can be estimated
using
rdetools
which also offers procedures for model selection
and prediction.
-
Bayesian Methods
: Bayesian Additive Regression Trees (BART),
where the final model is defined in terms of the sum over
many weak learners (not unlike ensemble methods),
are implemented in package
BayesTree.
Bayesian nonstationary, semiparametric nonlinear regression
and design by treed Gaussian processes including Bayesian CART and
treed linear models are made available by package
tgp.
Bayesian logistic regression models that consider high-order interactions
are available from package
BPHO
and Bayesian naive Bayes models
for binary classification with bias-corrected feature selection are implemented in
package
predbayescor.
-
Optimization using Genetic Algorithms
: Packages
gafit
and
rgenoud
offer optimization routines based on genetic algorithms.
-
Association Rules
: Package
arules
provides both data structures for efficient
handling of sparse binary data and interfaces to
implementations of Apriori and Eclat for mining
frequent itemsets, maximal frequent itemsets, closed
frequent itemsets and association rules.
-
Model Selection and Validation
: Package
e1071
has function
tune()
for hyperparameter tuning and
function
errorest()
(ipred) can be used for
error rate estimation. The cost parameter C for support vector
machines can be chosen utilizing the functionality of package
svmpath.
Functions for ROC analysis and other visualisation techniques
for comparing candidate classifiers are available from package
ROCR.
Package
caret
provides miscellaneous functions for
building predictive models, including parameter tuning and
variable importance measures. The
caretLSF
and
caretNWS
packages provide parallel processing
implementations of
caret.
-
Elements of Statistical Learning
: Data sets, functions and
examples from the book
The Elements of Statistical Learning: Data Mining,
Inference, and Prediction
by Trevor Hastie, Robert Tibshirani and
Jerome Friedman have been packaged and are available as
ElemStatLearn.
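As a small illustration of the recursive partitioning and validation topics listed above, the following sketch fits a CART-like classification tree with rpart and estimates its misclassification rate on held-out data. The iris data set, the 70/30 split and the seed are illustrative assumptions, not part of this task view, and a plain holdout split stands in for the more elaborate resampling schemes offered by the packages in the validation topic.

```r
# A minimal sketch, assuming the recommended package rpart is installed.
library(rpart)

set.seed(1)                                    # make the random split reproducible
n <- nrow(iris)
train_idx <- sample(n, size = round(0.7 * n))  # hypothetical 70/30 holdout split
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit a classification tree for Species using all other columns as inputs.
fit <- rpart(Species ~ ., data = train, method = "class")

# Predict class labels for the held-out rows and compute the
# proportion of misclassified observations.
pred <- predict(fit, newdata = test, type = "class")
err  <- mean(pred != test$Species)
print(err)
```

Calling print(fit) shows the fitted splits in text form, and plot(fit) followed by text(fit) draws the tree.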