\name{evimp}
\alias{evimp}
\title{Estimate variable importances in an "earth" object}
\description{
Estimate variable importances in an \code{\link{earth}} object.
}
\usage{
evimp(obj, trim=TRUE)
}
\arguments{
  \item{obj}{
    An \code{\link{earth}} object.
  }
  \item{trim}{
    If \code{TRUE} (default), delete rows in the returned matrix for
    variables that do not appear in any subset.
  }
}
\value{
A matrix showing the relative importances of the variables in the model.
There is a row for each variable.
The row name is the variable name, but with \code{-unused} appended 
if the variable does not appear in the final model.\cr

The columns of the matrix are:\cr
\code{col}: column index of the variable in the \code{x} argument to \code{earth}.\cr
\code{used}: 1 if the variable is used in the final model, else 0.
Equivalently, 0 if the row name has a \code{-unused} suffix. \cr
\code{nsubsets}: variable importance using the "number of subsets" criterion.
This is the number of subsets that include the variable (see below).\cr
\code{gcv}: variable importance using the GCV criterion (see below).\cr
\code{rss}: ditto but for the RSS criterion.\cr

The rows are sorted on the \code{nsubsets} criterion.
This means that values in the \code{nsubsets} column decrease as you go down the column
(more accurately, they are non-increasing).
The values in the \code{gcv} and \code{rss} columns decrease except where the
\code{gcv} or \code{rss} ranking differs from the \code{nsubsets} ranking.\cr

Additionally, there are unnamed columns after the \code{gcv} column and the \code{rss} column.
These have a 0 where the ranking using the \code{gcv} or \code{rss} criteria differs from
that using the \code{nsubsets} criterion.
In other words, there is a 0 for values that increase as you go 
down the \code{gcv} or \code{rss} column.
}
\note{
\bold{Estimating variable importance}

Establishing predictor importance is in general a tricky and even controversial problem.
There is no completely reliable way to estimate the importance of the variables
in a standard MARS model,
unless you make further lengthy tests after the model is built
(lengthy tests such as leave-one-out techniques).
The \code{evimp} function just makes an educated (and in practice useful)
guess as described below.

\bold{Three criteria for estimating variable importance}

The \code{evimp} function uses three criteria for estimating variable importance.

1. The \code{nsubsets} criterion counts the number of model subsets that include the variable.
Variables that are included in more subsets are considered more important.

By "subsets" we mean the subsets of terms generated by the pruning pass.
There is one subset for each model size,
and each subset is the best set of terms for that model size.
(These subsets are specified by \code{$prune.terms} in earth's return value.)
Only subsets that are smaller than or equal in size to the final model are used
for estimating variable importance.

2. The \code{rss} criterion first calculates the decrease in the RSS
for each subset relative to the previous subset.
(For multiple response models, RSS's are calculated over all responses.)
Then for each variable it sums these decreases over all subsets that include the variable.
Finally it scales these summed decreases so the largest summed decrease is 100.
Variables which cause larger net decreases in the RSS are considered more important.

3. The \code{gcv} criterion is the same, but using the GCV instead of the RSS.
Adding a variable can sometimes \emph{increase} the GCV.
When this happens, the variable could even have a negative total importance,
and thus appear less important than unused variables.

Note that using RSq's and GRSq's instead of RSS's and GCV's
would give identical estimates of variable importance.
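
As a rough illustration of the \code{nsubsets} and \code{rss} criteria described above,
the following sketch recomputes them directly from the components of an earth object.
It is not the code used by \code{evimp}, and its results may differ from
those of \code{evimp} (for example in scaling or in the handling of ties).
It assumes that the final model has more than one term.
The names \code{nsel}, \code{rss.dec}, \code{nsub} and \code{rss.imp} are
introduced here only for illustration.
\preformatted{
library(earth)
data(ozone1)
a <- earth(O3 ~ ., data=ozone1, degree=2)

nsel <- length(a$selected.terms)            # number of terms in the final model
rss.dec <- -diff(a$rss.per.subset[1:nsel])  # RSS decrease relative to the previous subset

nsub <- rss.imp <- numeric(ncol(a$dirs))    # one entry per predictor
names(nsub) <- names(rss.imp) <- colnames(a$dirs)

for(i in 2:nsel) {                          # subset 1 is the intercept-only model
    terms <- a$prune.terms[i, ]
    terms <- terms[terms != 0]              # drop the zero padding
    used <- colSums(a$dirs[terms, , drop=FALSE] != 0) > 0 # predictors in this subset
    nsub[used] <- nsub[used] + 1
    rss.imp[used] <- rss.imp[used] + rss.dec[i-1]
}
rss.imp <- 100 * rss.imp / max(rss.imp)     # scale so the largest value is 100
print(cbind(nsub, rss.imp)[order(-nsub), ])
}
Replacing \code{rss.per.subset} with \code{gcv.per.subset} in this sketch gives the
corresponding calculation for the \code{gcv} criterion.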

\bold{Example}

\preformatted{
a <- earth(O3 ~ ., data=ozone1, degree=2)
evimp(a, trim=FALSE)
}
This yields the following matrix:
\preformatted{
              col used nsubsets    gcv      rss
    temp        4    1       10 100.00 1 100.00 1
    humidity    3    1        8  12.68 1  14.78 1
    ibt         7    1        8  12.68 1  14.78 1
    doy         9    1        7  11.26 1  12.93 1
    dpg         6    1        5   6.75 1   7.84 1
    ibh         5    1        4   9.58 0  10.46 0
    vis         8    1        4   4.38 1   5.30 1
    wind        2    1        1   0.74 1   0.98 1
    vh-unused   1    0        0   0.00 1   0.00 1
}
The rows are sorted on \code{nsubsets}.
We see that \code{temp} is considered the most important variable,
followed by \code{humidity}, and so on.
We see that \code{vh} is unused in the final model, 
and thus is given a \code{-unused} suffix and a 0 in the \code{used} column.

The \code{col} column gives the column indices of the variables
in the \code{x} argument to \code{earth} after factors have been expanded.

The \code{nsubsets} column is the number of subsets that included the corresponding variable.
For example, \code{temp} appears in 10 subsets and \code{humidity} in 8.

The \code{gcv} and \code{rss} columns are scaled so
the largest net decrease is 100.

The unnamed columns after the \code{gcv} and \code{rss}
columns have a 0 if the corresponding criterion increases instead of decreasing
(i.e. the ranking disagrees with the \code{nsubsets} ranking).
We see that \code{ibh} is considered less important than \code{dpg} using the \code{nsubsets}
criterion, but not with the \code{gcv} and \code{rss} criteria.

\bold{Other techniques}

Running \code{\link{plotmo}} with \code{ylim=NULL} (the default)
gives an idea of which predictors make the largest changes to the predicted value
(but only with all other predictors at their median values).

You can also use \code{\link{drop1}} (assuming you are using the formula interface to earth).
Calling \code{drop1(my.earth.model)} will delete each predictor in turn from your model,
rebuild the model from scratch each time, and calculate the GCV each time.
You will get warnings that the earth package function \code{extractAIC.earth} is
returning GCVs instead of AICs, but that is what you want, so you can
ignore the warnings.
(You can turn off just these warnings by passing \code{warn=FALSE} to \code{\link{drop1}}.)
The column labeled \code{AIC} in the printed response
from \code{\link{drop1}} will actually be a column of GCVs not AICs.
The \code{Df} column is not much use in this context.

Note that \code{\link{drop1}} drops \emph{predictors} from the model
while earth's pruning pass drops \emph{terms}.

Remember that this technique only tells you how important
a variable is with the other variables already in the model.
It does not tell you the effect of a variable in isolation.

You will get lots of output from \code{\link{drop1}} if you built your original earth
model with \code{trace>0}.
You can set \code{trace=0} by updating your model before calling \code{\link{drop1}}.
Do it like this:\cr
\code{my.model <- \link{update.earth}(my.model, trace=0)}.
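
A minimal sketch of the \code{\link{drop1}} approach described above,
assuming the \code{ozone1} data used in the example earlier in this note
(the variable names \code{a} and \code{d} are arbitrary):
\preformatted{
library(earth)
data(ozone1)
a <- earth(O3 ~ ., data=ozone1, degree=2, trace=0)
d <- drop1(a, warn=FALSE)  # the column labeled AIC actually holds GCVs
print(d)
}
Predictors whose removal raises the GCV the most, relative to the \code{<none>} row,
are the ones that matter most given the other predictors in the model.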

\bold{Remarks}

This function is useful in practice but the following issues can make it misleading.

MARS models have a high variance --- if the data changes a little,
the set of basis terms created by the forward pass can change a lot.
So estimates of predictor importance can be unreliable
because they can vary with even slightly different training data.
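
One informal way to gauge this instability is to refit the model on bootstrap
samples and see how often the top-ranked variable changes.
The following is only a sketch: the number of replications (20) and the seed
are arbitrary choices, and the names \code{top} and \code{boot} are introduced
here for illustration.
\preformatted{
library(earth)
data(ozone1)
set.seed(2024)                      # arbitrary seed, for reproducibility only
top <- replicate(20, {
    boot <- ozone1[sample(nrow(ozone1), replace=TRUE), ]  # bootstrap sample of the rows
    rownames(evimp(earth(O3 ~ ., data=boot, degree=2)))[1] # top-ranked variable
})
table(top)                          # how often each variable was ranked first
}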

Collinear (or related) variables can mask each other's importance, just as in linear models.
This means that if two predictors are closely related, the forward pass will
somewhat arbitrarily choose one over the other.
The chosen predictor will incorrectly appear more important.

For interaction terms, each variable gets credit for the entire term.
Interaction terms are thus counted more than once
and receive a higher total weighting than additive terms, which is questionable.
Each variable gets equal credit in interaction terms even though
one variable in that term may be far more important than the other.

One can question if it is valid
to estimate variable importance using model subsets that are not part of the final model.
It is even possible for a variable to be rated as important yet not appear in the final model.

An example of conflicting importances
(however, the results are fine with the default \code{pmethod}):\cr
\code{evimp(earth(mpg~., data=mtcars, pmethod="none"))}

\bold{Acknowledgment}

Thanks to Max Kuhn for the original \code{evimp} code and for helpful discussions.
}
\seealso{
  \code{\link{earth}},
  \code{\link{plot.evimp}}
}
\examples{
data(ozone1)
a <- earth(O3 ~ ., data=ozone1, degree=2)
ev <- evimp(a, trim=FALSE)
plot(ev)
print(ev)
}
\keyword{models}
