Title: Random Forest Prediction Decomposition and Feature Importance Measure
Description: An R re-implementation of the 'treeinterpreter' package on PyPI <https://pypi.org/project/treeinterpreter/>. Each prediction can be decomposed as 'prediction = bias + feature_1_contribution + ... + feature_n_contribution'. This decomposition is then used to calculate the Mean Decrease Impurity (MDI) and Mean Decrease Impurity using out-of-bag samples (MDI-oob) feature importance measures based on the work of Li et al. (2019) <arXiv:1906.10845>.
Authors: Qingyao Sun
Maintainer: Qingyao Sun <[email protected]>
License: MIT + file LICENSE
Version: 0.1.1
Built: 2025-01-20 03:56:02 UTC
Source: https://github.com/nalzok/tree.interpreter
Contribution of each feature to the prediction.
featureContribTree(tidy.RF, tree, X)
featureContrib(tidy.RF, X)
tidy.RF: A tidy random forest. The random forest to make predictions with.
tree: An integer. The index of the tree to look at.
X: A data frame. Features of samples to be predicted.
Recall that each node in a decision tree has a prediction associated with it. For regression trees, it's the average response in that node, whereas in classification trees, it's the frequency of each response class, or the most frequent response class in that node.
For a tree in the forest, the contribution of each feature to the prediction of a sample is the sum of differences between the predictions of nodes which split on the feature and those of their children, i.e. the sum of changes in node prediction caused by splitting on the feature. This is calculated by featureContribTree.

For a forest, the contribution of each feature to the prediction of a sample is the average contribution across all trees in the forest. This is because the prediction of a forest is the average of the predictions of its trees. This is calculated by featureContrib.

Together with trainsetBias(Tree), they can decompose the prediction by feature importance:

prediction(MODEL, X) = trainsetBias(MODEL) + featureContrib_1(MODEL, X) + ... + featureContrib_p(MODEL, X)

where MODEL can be either a tree or a forest.
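As a sketch of this identity (using the iris setup from the examples below; the reconstruction should match the forest's averaged per-class node frequencies up to floating-point error):

library(ranger)
library(tree.interpreter)

test.id <- 50 * seq(3)
rfobj <- ranger(Species ~ ., iris[-test.id, ], keep.inbag = TRUE)
tidy.RF <- tidyRF(rfobj, iris[-test.id, -5], iris[-test.id, 5])

contrib <- featureContrib(tidy.RF, iris[test.id, -5])  # P x D x N cube
bias <- trainsetBias(tidy.RF)                          # 1 x D matrix

# Bias plus contributions summed over features reconstructs the
# forest's per-class prediction for each test sample (D x N).
reconstructed <- as.vector(bias) + apply(contrib, 3, colSums)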
A cube (3D array). The content depends on the type of the response.

Regression: A P-by-1-by-N array, where P is the number of features in X, and N is the number of samples in X. The pth row of the nth slice stands for the contribution of feature p to the prediction for response n.

Classification: A P-by-D-by-N array, where P is the number of features in X, D is the number of response classes, and N is the number of samples in X. The pth row of the nth slice stands for the contribution of feature p to the prediction of each response class for response n.
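Continuing the sketch above, individual slices of the cube can be read off directly:

contrib[, , 1]   # P x D matrix: all feature contributions for the first sample
contrib[1, , 1]  # contributions of the first feature to each response class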
featureContribTree: Feature contribution to prediction within a single tree
featureContrib: Feature contribution to prediction within the whole forest
Interpreting random forests http://blog.datadive.net/interpreting-random-forests/
Random forest interpretation with scikit-learn http://blog.datadive.net/random-forest-interpretation-with-scikit-learn/
library(ranger)
test.id <- 50 * seq(3)
rfobj <- ranger(Species ~ ., iris[-test.id, ], keep.inbag=TRUE)
tidy.RF <- tidyRF(rfobj, iris[-test.id, -5], iris[-test.id, 5])
featureContribTree(tidy.RF, 1, iris[test.id, -5])
featureContrib(tidy.RF, iris[test.id, -5])
Calculate the MDI-oob feature importance measure.
MDIoobTree(tidy.RF, tree, trainX, trainY)
MDIoob(tidy.RF, trainX, trainY)
tidy.RF: A tidy random forest. The random forest to calculate MDI-oob from.
tree: An integer. The index of the tree to look at.
trainX: A data frame. Train set features, such that the t-th tree is trained on the in-bag rows given by inbag.counts[[t]].
trainY: A data frame. Train set responses, such that the t-th tree is trained on the in-bag rows given by inbag.counts[[t]].
It has long been known that MDI incorrectly assigns high importance to noisy features, leading to systematic bias in feature selection. To address this issue, Li et al. proposed a debiased MDI feature importance measure using out-of-bag samples, called MDI-oob, which has achieved state-of-the-art performance in feature selection for both simulated and real data.
See vignette('MDI', package='tree.interpreter')
for more context.
A matrix. The content depends on the type of the response.

Regression: A P-by-1 matrix, where P is the number of features in trainX. The pth row contains the MDI-oob of feature p.

Classification: A P-by-D matrix, where P is the number of features in trainX and D is the number of response classes. The dth column of the pth row contains the MDI-oob of feature p to class d. You can get the MDI-oob of each feature by calling rowSums on the result.
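A minimal sketch of that aggregation, reusing the ranger fit from the examples below:

library(ranger)
library(tree.interpreter)

rfobj <- ranger(Species ~ ., iris, keep.inbag = TRUE)
tidy.RF <- tidyRF(rfobj, iris[, -5], iris[, 5])

imp <- MDIoob(tidy.RF, iris[, -5], iris[, 5])  # P x D matrix
sort(rowSums(imp), decreasing = TRUE)          # one MDI-oob value per feature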
MDIoobTree: Debiased mean decrease in impurity within a single tree
MDIoob: Debiased mean decrease in impurity within the whole forest
A Debiased MDI Feature Importance Measure for Random Forests https://arxiv.org/abs/1906.10845
vignette('MDI', package='tree.interpreter')
library(ranger)
rfobj <- ranger(Species ~ ., iris, keep.inbag=TRUE)
tidy.RF <- tidyRF(rfobj, iris[, -5], iris[, 5])
MDIoobTree(tidy.RF, 1, iris[, -5], iris[, 5])
MDIoob(tidy.RF, iris[, -5], iris[, 5])
Calculate the MDI feature importance measure.
MDITree(tidy.RF, tree, trainX, trainY)
MDI(tidy.RF, trainX, trainY)
tidy.RF: A tidy random forest. The random forest to calculate MDI from.
tree: An integer. The index of the tree to look at.
trainX: A data frame. Train set features, such that the t-th tree is trained on the in-bag rows given by inbag.counts[[t]].
trainY: A data frame. Train set responses, such that the t-th tree is trained on the in-bag rows given by inbag.counts[[t]].
MDI stands for Mean Decrease in Impurity. It is a widely adopted measure of feature importance in random forests. In this package, we calculate MDI with a new analytical expression derived by Li et al. (see References).
See vignette('MDI', package='tree.interpreter')
for more context.
A matrix. The content depends on the type of the response.

Regression: A P-by-1 matrix, where P is the number of features in trainX. The pth row contains the MDI of feature p.

Classification: A P-by-D matrix, where P is the number of features in trainX and D is the number of response classes. The dth column of the pth row contains the MDI of feature p to class d. You can get the MDI of each feature by calling rowSums on the result.
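As a sketch, the two measures can be put side by side; rankings may differ because MDI-oob corrects the selection bias of plain MDI:

library(ranger)
library(tree.interpreter)

rfobj <- ranger(Species ~ ., iris, keep.inbag = TRUE)
tidy.RF <- tidyRF(rfobj, iris[, -5], iris[, 5])

# Per-feature totals for both measures, one row per feature.
cbind(MDI    = rowSums(MDI(tidy.RF, iris[, -5], iris[, 5])),
      MDIoob = rowSums(MDIoob(tidy.RF, iris[, -5], iris[, 5])))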
MDITree: Mean decrease in impurity within a single tree
MDI: Mean decrease in impurity within the whole forest
A Debiased MDI Feature Importance Measure for Random Forests https://arxiv.org/abs/1906.10845
vignette('MDI', package='tree.interpreter')
library(ranger)
rfobj <- ranger(Species ~ ., iris, keep.inbag=TRUE)
tidy.RF <- tidyRF(rfobj, iris[, -5], iris[, 5])
MDITree(tidy.RF, 1, iris[, -5], iris[, 5])
MDI(tidy.RF, iris[, -5], iris[, 5])
Converts random forest objects from various libraries into a common
structure, i.e. a “tidy” random forest, calculating absent auxiliary
information on demand, in order to provide a uniform interface for other
tree.interpreter
functions. Note that the output is of a private
format, and is subject to change.
tidyRF(rfobj, trainX, trainY)
rfobj: A random forest object. Currently, supported libraries include randomForest and ranger.
trainX: A data frame. Train set features.
trainY: A data frame. Train set responses.
A list of class tidyRF, with entries:

num.trees: An integer. Number of trees.
feature.names: A vector. Names of features.
num.classes: An integer. For classification trees, the number of classes of the response. For regression trees, this field is always 1.
class.names: A vector. Names of response classes.
inbag.counts: A list. For each tree, a vector giving the number of times each observation is in-bag in that tree.
left.children: A list. For each tree, a vector of the left children of its nodes.
right.children: A list. For each tree, a vector of the right children of its nodes.
split.features: A list. For each tree, a vector of the indices of features used at its nodes. Indices start from 0. A value of 0 means the node is terminal. This does not cause ambiguity, because the root node will never be a child of other nodes.
split.values: A list. For each tree, a vector of the values of features used at its nodes.
node.sizes: A list. For each tree, a vector of the sizes of its nodes.
node.resp: A list. For each tree, a vector of the responses of its nodes.
delta.node.resp.left: A list. For each tree, a vector of the differences between the responses of the left children of its nodes and the nodes themselves.
delta.node.resp.right: A list. For each tree, a vector of the differences between the responses of the right children of its nodes and the nodes themselves.
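A small sketch of inspecting these parallel per-node vectors (assuming, as the entries above suggest, that the root is stored first and that terminal nodes carry a split.features value of 0):

library(ranger)
library(tree.interpreter)

rfobj <- ranger(Species ~ ., iris, keep.inbag = TRUE)
tidy.RF <- tidyRF(rfobj, iris[, -5], iris[, 5])

tidy.RF$num.trees                      # number of trees in the forest
tidy.RF$node.sizes[[1]][1]             # size of the first tree's root node
sum(tidy.RF$split.features[[1]] == 0)  # terminal nodes in the first tree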
library(ranger)
rfobj <- ranger(Species ~ ., iris, keep.inbag=TRUE)
tidy.RF <- tidyRF(rfobj, iris[, -5], iris[, 5])
str(tidy.RF, max.level=1)
For a tree in the forest, trainset bias is the prediction of its root node, or the unconditional prediction of the tree, or the average response of the samples used to train the tree.
trainsetBiasTree(tidy.RF, tree)
trainsetBias(tidy.RF)
tidy.RF: A tidy random forest. The random forest to extract train set bias from.
tree: An integer. The index of the tree to look at.
For a forest, the trainset bias is simply the average trainset bias across all trees. This is because the prediction of a forest is the average of the predictions of its trees.

Together with featureContrib(Tree), they can decompose the prediction by feature importance:

prediction(MODEL, X) = trainsetBias(MODEL) + featureContrib_1(MODEL, X) + ... + featureContrib_p(MODEL, X)

where MODEL can be either a tree or a forest.
A matrix. The content depends on the type of the response.

Regression: A 1-by-1 matrix. The trainset bias for the prediction of the response.

Classification: A 1-by-D matrix, where D is the number of response classes. Each column of the matrix stands for the trainset bias for the prediction of the corresponding response class.
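Because the trainset bias is the average root-node prediction, for a classification forest it should sit close to the class frequencies of the training data; a quick sketch of that check:

library(ranger)
library(tree.interpreter)

rfobj <- ranger(Species ~ ., iris, keep.inbag = TRUE)
tidy.RF <- tidyRF(rfobj, iris[, -5], iris[, 5])

trainsetBias(tidy.RF)         # 1 x D matrix of per-class trainset bias
prop.table(table(iris[, 5]))  # class frequencies in the training data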
trainsetBiasTree: Trainset bias within a single tree
trainsetBias: Trainset bias within the whole forest
Interpreting random forests http://blog.datadive.net/interpreting-random-forests/
Random forest interpretation with scikit-learn http://blog.datadive.net/random-forest-interpretation-with-scikit-learn/
library(ranger)
rfobj <- ranger(Species ~ ., iris, keep.inbag=TRUE)
tidy.RF <- tidyRF(rfobj, iris[, -5], iris[, 5])
trainsetBiasTree(tidy.RF, 1)
trainsetBias(tidy.RF)
An R re-implementation of the 'treeinterpreter' package on PyPI <https://pypi.org/project/treeinterpreter/>. Each prediction can be decomposed as 'prediction = bias + feature_1_contribution + ... + feature_n_contribution'. This decomposition is then used to calculate the Mean Decrease Impurity (MDI) and Mean Decrease Impurity using out-of-bag samples (MDI-oob) feature importance measures based on the work of Li et al. (2019) <arXiv:1906.10845>.
tidyRF

The function tidyRF can turn a randomForest or ranger object into a package-agnostic random forest object. All other functions in this package operate on such a tidyRF object.
featureContrib and trainsetBias families

The featureContrib and trainsetBias families can decompose the prediction of regression/classification trees/forests into bias and feature contribution components.
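A minimal sketch of such a decomposition for a regression forest, mirroring the mtcars example below; the reconstruction should match the forest's averaged tree predictions up to floating-point error:

library(ranger)
library(tree.interpreter)

rfobj <- ranger(mpg ~ ., mtcars, keep.inbag = TRUE)
tidy.RF <- tidyRF(rfobj, mtcars[, -1], mtcars[, 1])

contrib <- featureContrib(tidy.RF, mtcars[, -1])  # P x 1 x N cube
bias <- trainsetBias(tidy.RF)[1, 1]               # scalar trainset bias
reconstructed <- bias + apply(contrib, 3, sum)    # one prediction per sample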
MDI and MDIoob families

The MDI family can calculate the good old MDI feature importance measure, which unfortunately has some feature selection bias. MDI-oob is a debiased MDI feature importance measure that has achieved state-of-the-art performance in feature selection for both simulated and real data. It can be calculated with functions from the MDIoob family.
library(ranger)
rfobj <- ranger(mpg ~ ., mtcars, keep.inbag = TRUE)
tidy.RF <- tidyRF(rfobj, mtcars[, -1], mtcars[, 1])
MDIoob(tidy.RF, mtcars[, -1], mtcars[, 1])