% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/SMLE.R
\name{SMLE}
\alias{SMLE}
\title{Joint feature screening via sparse maximum likelihood estimation for GLMs}
\usage{
SMLE(
  Y,
  X,
  k = NULL,
  family = c("gaussian", "binomial", "poisson"),
  categorical = NULL,
  keyset = NULL,
  intercept = TRUE,
  group = TRUE,
  codingtype = NULL,
  maxit = 50,
  tol = 10^(-2),
  selection = F,
  standardize = TRUE,
  fast = FALSE,
  U_rate = 0.5,
  penalize_mod = TRUE
)
}
\arguments{
\item{Y}{The response vector of dimension \eqn{n \times 1}. Quantitative for
\code{family ='gaussian'}, non-negative counts for \code{family ='poisson'},
binary (0-1) for \code{family ='binomial'}. Input Y should be \code{'numeric'}.}

\item{X}{The \eqn{n \times p} feature matrix with each column denoting a feature
(covariate) and each row denoting an observation vector. The input should be
a "matrix" object for numerical data, and a "data.frame" for categorical
data (or a mixture of numerical and categorical data). The algorithm treats
covariates of class "factor" as categorical data and extends the data
frame by the dummy columns needed to code the categorical features.}

\item{k}{Total number of features (including 'keyset') to be retained after screening.
Default is \eqn{\frac{1}{2}\log(n)n^{1/3}}.}

\item{family}{Model assumption between Y and X; the default model is Gaussian
linear.}

\item{categorical}{Logical flag indicating whether the input feature matrix
includes categorical features. If \code{categorical=TRUE}, a model intercept
will be used in the screening process. Default is \code{NULL}.}

\item{keyset}{A vector indicating a set of key features that do not
participate in feature screening and are forced to remain in the model.
Default is \code{NULL}.}

\item{intercept}{A logical flag indicating whether an intercept is used in
the model. The intercept does not participate in screening. Default is TRUE.}

\item{group}{Logical flag for whether to treat the dummy covariates of a
categorical feature as a group. (Only for categorical data, see details).
Default is TRUE.}

\item{codingtype}{Coding types for categorical features; default is "DV".
\code{codingtype = "all"} converts each level to a 0-1 vector.
\code{codingtype = "DV"} conducts deviation coding for each level in
comparison with the grand mean.
\code{codingtype = "standard"} conducts standard dummy coding for each level
in comparison with the reference level (first level).}

\item{maxit}{Maximum number of iteration steps. Default is 50. Set
\code{maxit=NULL} to loosen this protective stopping criterion.}

\item{tol}{A tolerance level to stop the iteration, when the squared sum of
differences between two successive coefficient updates is below it.
Default is \eqn{10^{-2}}. Set \code{tol= NULL} to loosen this stopping criterion.}

\item{selection}{A logical flag to indicate whether an elaborate selection
is to be conducted by \code{smle_select} (using its default arguments) after
screening. Default is FALSE.}

\item{standardize}{Logical flag for feature standardization, prior to
performing (iterative) feature screening.  The resulting coefficients are
always returned on the original scale. Default is \code{standardize=TRUE}.
If features are in the same units already, you might not wish to
standardize.}

\item{fast}{Set to TRUE to enable early stopping for SMLE screening. This may
improve screening efficiency at a small cost in accuracy.
Default is FALSE; see details.}

\item{U_rate}{Decreasing rate used in tuning the step parameter \eqn{u^{-1}}
in the IHT algorithm. Default is 0.5. See details.}

\item{penalize_mod}{A logical flag to indicate whether an adjustment is used in
ranking groups of features. This argument is applicable only when
\code{categorical=TRUE} with \code{group=TRUE}. Default is TRUE: the
\eqn{L_2} effect of a group with \eqn{J} members is divided by \eqn{\sqrt{J}}.}
}
\value{
Returns a '\code{smle}' object with
\item{I}{A list of iteration information.

\code{Y}: Same as input Y.

\code{CM}: Design matrix of class \code{matrix} for numeric features (or data.frame with categorical features).

\code{DM}: A matrix with dummy variable features added (only if there are categorical features).

\code{IM}: Iteration path matrix with columns recording IHT coefficient updates.

\code{nlevel}: Number of levels for all categorical features.

\code{CI}: Indices of categorical features in \code{CM}.

\code{Beta0}: Initial value of the regression coefficients for IHT.

\code{DFI}: Indices of categorical features in \code{IM}.

\code{codingtype}: Same as input.
 }

\item{ID_Retained}{A vector indicating the features retained after SMLE screening.
The output includes both features retained by SMLE and the features specified in \code{keyset}.}

\item{Coef_Retained}{The vector of coefficients for the retained features.}

\item{Path_Retained}{Iteration path matrix with columns recording the coefficient updates over the IHT procedure.}

\item{Num_Retained}{Number of retained features after screening.}

\item{Intercept}{The estimated intercept value, if \code{intercept = TRUE}.}

\item{steps}{Number of iterations.}

\item{LH}{A list of log-likelihood updates over the IHT iterations.}

\item{Uchecks}{Number of searches for a proper \eqn{u^{-1}} at each step over the IHT iterations.}
}
\description{
Given an \eqn{n \times 1} response Y and an \eqn{n \times p} feature matrix X,
the function uses SMLE to retain only a set of \eqn{k<n} features that seem
to be most relevant for a GLM. It thus serves as a pre-processing step for a
more elaborate analysis. SMLE naturally accounts for the joint effects between
features, which makes the screening more reliable. The function uses the
efficient iterative hard thresholding (IHT) algorithm with the step parameter
adaptively tuned for fast convergence. Users can choose to further conduct
an elaborate selection after SMLE screening. See \code{smle_select} for more details.
}
\details{
With the input Y and X, \code{SMLE} conducts joint feature screening by running
the iterative hard thresholding (IHT) algorithm, where the initial value is set
to the Lasso estimate with the sparsity closest to the sample size minus one.
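
As a minimal illustration of the hard thresholding operation that gives IHT
its name, the following R sketch keeps the \code{k} largest coefficients in
absolute value and sets the rest to zero. The helper name
\code{hard_threshold} is hypothetical and is not part of the package.

\preformatted{
# Hypothetical helper, not exported by the package:
# keep the k largest entries of beta in absolute value, zero out the rest.
hard_threshold <- function(beta, k) {
  keep <- order(abs(beta), decreasing = TRUE)[seq_len(k)]
  out <- numeric(length(beta))
  out[keep] <- beta[keep]
  out
}

hard_threshold(c(0.5, -3, 2, 0.1), k = 2)
# returns c(0, -3, 2, 0)
}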

In \code{SMLE}, the step parameter \eqn{u^{-1}} in IHT is adaptively tuned in
the same way as described in Xu and Chen (2014). Specifically, at each step,
we set the initial \eqn{u} to the max row sum of X and recursively decrease
the value of \eqn{u^{-1}} by \code{U_rate} to guarantee an increase in the likelihood.
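
The tuning loop can be pictured with the following R sketch for the Gaussian
case. It is an illustration under stated assumptions, not the code used by
\code{SMLE}: the function name \code{iht_step}, the cap \code{max_checks},
and the use of absolute values in the row sums are all hypothetical choices.

\preformatted{
# Hypothetical sketch of one IHT update with step tuning (Gaussian case only).
iht_step <- function(X, Y, beta, k, U_rate = 0.5, max_checks = 30) {
  fitted <- function(b) drop(crossprod(t(X), b))        # X times b
  loglik <- function(b) -0.5 * sum((Y - fitted(b))^2)   # up to constants
  u <- max(rowSums(abs(X)))                  # assumed form of the initial u
  grad <- drop(crossprod(X, Y - fitted(beta)))          # score at beta
  for (check in seq_len(max_checks)) {
    cand <- beta + grad / u                  # gradient step of size 1/u
    keep <- order(abs(cand), decreasing = TRUE)[seq_len(k)]
    new_beta <- numeric(length(beta))
    new_beta[keep] <- cand[keep]             # hard thresholding to k features
    if (loglik(new_beta) >= loglik(beta)) break
    u <- u / U_rate                          # shrinks 1/u by the rate U_rate
  }
  list(beta = new_beta, checks = check)
}
}

The returned \code{checks} value in this sketch counts how many step sizes
were tried, which is similar in spirit to the \code{Uchecks} output.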

\code{SMLE} terminates IHT iterations when either the \code{tol} or the
\code{maxit} criterion is satisfied. When \code{fast=TRUE}, the algorithm also
stops when the non-zero members of the coefficient estimates remain the same
for 10 successive iterations.

In \code{SMLE}, categorical features are coded by dummy covariates with the
method specified in \code{codingtype}. Users can use \code{group} to specify
whether to treat those dummy covariates as a single group feature or as
individual features.
When \code{group=TRUE} with \code{penalize_mod=TRUE}, the effect for a group
of \eqn{J} dummy covariates is computed by

\deqn{ \beta_i = \frac{1}{\sqrt{J}} \cdot \sqrt{(\beta_1)^2+\cdots+(\beta_J)^2}}

which will be treated as a single feature in IHT iterations.
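
As a worked example with illustrative numbers, for a group of \eqn{J = 4}
dummy covariates with coefficients \eqn{(3, 4, 0, 0)}, the adjusted group
effect is

\deqn{ \frac{1}{\sqrt{4}} \cdot \sqrt{3^2 + 4^2 + 0^2 + 0^2} = \frac{5}{2} = 2.5 }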

Since feature screening is usually a preprocessing step, users may wish to
conduct a more elaborate feature selection after screening. This can
be done by setting \code{selection=TRUE} in \code{SMLE} or by applying any
existing selection method to the output of \code{SMLE}.
}
\examples{

# Example: screen 5000 features down to k = 9
set.seed(123)
Data<-Gen_Data(n=100, p=5000, family = "gaussian", correlation="ID")
Data
fit<-SMLE(Data$Y, Data$X, k=9, family = "gaussian")
fit
## Important features missed by the screening:
setdiff(Data$index,fit$ID_Retained)
## Check whether the important features are retained.
Data$index \%in\% fit$ID_Retained
plot(fit)


}
\references{
UCLA Statistical Consulting Group. \emph{Coding systems for categorical
variables in regression analysis}.
\url{https://stats.idre.ucla.edu/spss/faq/coding-systems-for-categorical-variables-in-regression-analysis-2/}.
Retrieved May 28, 2020.

Xu, C. and Chen, J. (2014). The Sparse MLE for Ultrahigh-Dimensional Feature
Screening, \emph{Journal of the American Statistical Association}.
}
