% Generated by roxygen2: do not edit by hand
% Please edit documentation in R/rnndescent.R
\name{nnd_knn}
\alias{nnd_knn}
\title{Find nearest neighbors using nearest neighbor descent}
\usage{
nnd_knn(
  data,
  k = NULL,
  metric = "euclidean",
  init = "rand",
  init_args = NULL,
  n_iters = NULL,
  max_candidates = NULL,
  delta = 0.001,
  low_memory = TRUE,
  weight_by_degree = FALSE,
  use_alt_metric = TRUE,
  n_threads = 0,
  verbose = FALSE,
  progress = "bar",
  obs = "R",
  ret_forest = FALSE
)
}
\arguments{
\item{data}{Matrix of \code{n} items to generate neighbors for, with observations
in the rows and features in the columns. Optionally, input can be passed
with observations in the columns, by setting \code{obs = "C"}, which should be
more efficient. Possible formats are \code{\link[base:data.frame]{base::data.frame()}}, \code{\link[base:matrix]{base::matrix()}}
or \code{\link[Matrix:sparseMatrix]{Matrix::sparseMatrix()}}. Sparse matrices should be in \code{dgCMatrix}
format. Data frames will be converted to \code{numeric} matrix format
internally, so if your data columns are \code{logical} and intended to be used
with the specialized binary \code{metric}s, you should convert the data to a
logical matrix first (otherwise you will get the slower dense numeric version).}

\item{k}{Number of nearest neighbors to return. Optional if \code{init} is
specified.}

\item{metric}{Type of distance calculation to use. One of:
\itemize{
\item \code{"braycurtis"}
\item \code{"canberra"}
\item \code{"chebyshev"}
\item \code{"correlation"} (1 minus the Pearson correlation)
\item \code{"cosine"}
\item \code{"dice"}
\item \code{"euclidean"}
\item \code{"hamming"}
\item \code{"hellinger"}
\item \code{"jaccard"}
\item \code{"jensenshannon"}
\item \code{"kulsinski"}
\item \code{"sqeuclidean"} (squared Euclidean)
\item \code{"manhattan"}
\item \code{"rogerstanimoto"}
\item \code{"russellrao"}
\item \code{"sokalmichener"}
\item \code{"sokalsneath"}
\item \code{"spearmanr"} (1 minus the Spearman rank correlation)
\item \code{"symmetrickl"} (symmetric Kullback-Leibler divergence)
\item \code{"tsss"} (Triangle Area Similarity-Sector Area Similarity or TS-SS
metric)
\item \code{"yule"}
}

For non-sparse data, the following variants are available with
preprocessing: this trades memory for a potential speed up during the
distance calculation. Some minor numerical differences should be expected
compared to the non-preprocessed versions:
\itemize{
\item \code{"cosine-preprocess"}: \code{cosine} with preprocessing.
\item \code{"correlation-preprocess"}: \code{correlation} with preprocessing.
}

For non-sparse binary data passed as a \code{logical} matrix, the following
metrics have specialized variants which should be substantially faster than
the non-binary variants (in other cases the logical data will be treated as
a dense numeric vector of 0s and 1s):
\itemize{
\item \code{"dice"}
\item \code{"hamming"}
\item \code{"jaccard"}
\item \code{"kulsinski"}
\item \code{"matching"}
\item \code{"rogerstanimoto"}
\item \code{"russellrao"}
\item \code{"sokalmichener"}
\item \code{"sokalsneath"}
\item \code{"yule"}
}}

\item{init}{Name of the initialization strategy or initial \code{data} neighbor
graph to optimize. One of:
\itemize{
\item \code{"rand"} random initialization (the default).
\item \code{"tree"} use the random projection tree method of Dasgupta and Freund
(2008).
\item a pre-calculated neighbor graph. A list containing:
\itemize{
\item \code{idx} an \code{n} by \code{k} matrix containing the nearest neighbor indices.
\item \code{dist} (optional) an \code{n} by \code{k} matrix containing the nearest
neighbor distances. If the input distances are omitted, they will be
calculated for you.
}
}

If \code{k} and \code{init} are specified as arguments to this function, and the
number of neighbors provided in \code{init} is not equal to \code{k} then:
\itemize{
\item if \code{k} is smaller, only the \code{k} closest values in \code{init} are retained.
\item if \code{k} is larger, then random neighbors will be chosen to fill \code{init} to
the size of \code{k}. Note that the random neighbors are not checked for
duplicates of what is already in \code{init}, so effectively fewer than \code{k}
neighbors may be chosen for some observations under these circumstances.
}}

\item{init_args}{a list containing arguments to pass to the random partition
forest initialization. See \code{\link[=rpf_knn]{rpf_knn()}} for possible arguments. To avoid
inconsistencies between the tree calculation and the subsequent nearest
neighbor descent optimization, if you attempt to provide a \code{metric} or
\code{use_alt_metric} option in this list it will be ignored.}

\item{n_iters}{Number of iterations of nearest neighbor descent to carry out.
By default, this will be chosen based on the number of observations in
\code{data}.}

\item{max_candidates}{Maximum number of candidate neighbors to try for each
item in each iteration. Set this relative to \code{k} to emulate the "rho"
sampling parameter in the nearest neighbor descent paper. By default, this
is set to \code{k} or \code{60}, whichever is smaller.}

\item{delta}{The minimum relative change in the neighbor graph required to
continue iterating. Should be a value between 0 and 1. If the fraction of the
graph updated in an iteration falls below this value, the optimization stops
early, so smaller values let the descent continue with less progress per
iteration. The default value of \code{0.001} means that at least 0.1\% of the
neighbor graph must be updated at each iteration.}

\item{low_memory}{If \code{TRUE}, use a lower memory, but more
computationally expensive approach to index construction. If set to
\code{FALSE}, you should see a noticeable speed improvement, especially
when using a smaller number of threads, so this is worth trying if you have
the memory to spare.}

\item{weight_by_degree}{If \code{TRUE}, then candidates for the local join are
weighted according to their in-degree, so that if there are more than
\code{max_candidates} in a candidate list, candidates with a smaller degree are
favored for retention. This prevents items with large numbers of edges from
crowding out other items and, for high-dimensional data, is likely to provide
a small improvement in accuracy. Because this incurs a small extra cost of
counting the degree of each node, and because it tends to delay early
convergence, by default this is \code{FALSE}.}

\item{use_alt_metric}{If \code{TRUE}, use faster metrics that maintain the
ordering of distances internally (e.g. squared Euclidean distances if using
\code{metric = "euclidean"}), then apply a correction at the end. Probably
the only reason to set this to \code{FALSE} is if you suspect that some
sort of numeric issue is occurring with your data in the alternative code
path.}

\item{n_threads}{Number of threads to use.}

\item{verbose}{If \code{TRUE}, log information to the console.}

\item{progress}{Determines the type of progress information logged if
\code{verbose = TRUE}. Options are:
\itemize{
\item \code{"bar"}: a simple text progress bar.
\item \code{"dist"}: the sum of the distances in the approximate knn graph at the
end of each iteration.
}}

\item{obs}{set to \code{"C"} to indicate that the input \code{data} orientation stores
each observation as a column. The default \code{"R"} means that observations are
stored in each row. Storing the data by row is usually more convenient, but
internally your data will be converted to column storage. Passing it
already column-oriented will save some memory and (a small amount of) CPU
usage.}

\item{ret_forest}{If \code{TRUE} and \code{init = "tree"} then the RP forest used to
initialize the nearest neighbors will be returned with the nearest neighbor
data. See the \code{Value} section for details. The returned forest can be used
as part of initializing the search for new data: see \code{\link[=rpf_knn_query]{rpf_knn_query()}} and
\code{\link[=rpf_filter]{rpf_filter()}} for more details.}
}
\value{
the approximate nearest neighbor graph as a list containing:
\itemize{
\item \code{idx} an n by k matrix containing the nearest neighbor indices.
\item \code{dist} an n by k matrix containing the nearest neighbor distances.
\item \code{forest} (if \code{init = "tree"} and \code{ret_forest = TRUE} only): the RP forest
used to initialize the neighbor data.
}
}
\description{
Uses the Nearest Neighbor Descent method due to Dong and co-workers (2011)
to optimize an approximate nearest neighbor graph.
}
\details{
If no initial graph is provided, a random graph is generated. Alternatively,
setting \code{init = "tree"} initializes from a graph generated by a forest of
random projection trees, using the method of Dasgupta and Freund (2008).
}
\examples{
# Find 4 (approximate) nearest neighbors using Euclidean distance
# If you pass a data frame, non-numeric columns are removed
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean")
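
# Column-oriented input: a sketch of the obs = "C" option described in the
# arguments. The numeric columns are selected by hand (the column range 1:4
# is specific to iris) and transposed so that each observation is a column.
iris_cols <- t(as.matrix(iris[, 1:4]))
iris_nn_cols <- nnd_knn(iris_cols, k = 4, metric = "euclidean", obs = "C")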

# Manhattan (l1) distance
iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan")
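
# Binary metrics have specialized versions for logical matrices. As an
# illustration only, each numeric column is thresholded at its mean to make
# a logical matrix (the thresholding is an assumption made for this example,
# not a recommended way to binarize data), then Hamming distance is used.
iris_num <- as.matrix(iris[, 1:4])
iris_lgl <- sweep(iris_num, 2, colMeans(iris_num)) > 0
iris_nn_lgl <- nnd_knn(iris_lgl, k = 4, metric = "hamming")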

# Multi-threading: you can choose the number of threads to use. In real
# usage, you will want to set n_threads to at least 2
iris_nn <- nnd_knn(iris, k = 4, metric = "manhattan", n_threads = 1)

# Use verbose flag to see information about progress
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", verbose = TRUE)
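
# With verbose = TRUE, setting progress = "dist" logs the sum of the distances
# in the approximate knn graph at the end of each iteration, instead of
# drawing a progress bar
iris_nn_dist <- nnd_knn(iris, k = 4, metric = "euclidean", verbose = TRUE,
                        progress = "dist")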

# Nearest neighbor descent uses random initialization by default, but you can pass any
# approximation using the init argument (as long as the metrics used to
# calculate the initialization are compatible with the metric options used
# by nnd_knn).
iris_nn <- random_knn(iris, k = 4, metric = "euclidean")
iris_nn <- nnd_knn(iris, init = iris_nn, metric = "euclidean", verbose = TRUE)
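
# A sketch of two init behaviors described in the arguments. It assumes that
# when k is not given, it is taken from the number of columns of idx.
# 1. Pass only the indices: the distances are then calculated for you
set.seed(1337)
iris_init <- random_knn(iris, k = 6, metric = "euclidean")
iris_nn_idx <- nnd_knn(iris, init = list(idx = iris_init$idx),
                       metric = "euclidean")
# 2. Ask for fewer neighbors than init holds: only the k closest are kept
iris_nn_k4 <- nnd_knn(iris, k = 4, init = iris_init, metric = "euclidean")
dim(iris_nn_k4$idx)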

# Number of iterations controls how much optimization is attempted. A smaller
# value will run faster but give poorer results
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 2)

# You can also control the amount of work done within an iteration by
# setting max_candidates
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", max_candidates = 50)
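
# A sketch of emulating the "rho" sampling parameter by scaling
# max_candidates relative to k. The values k_rho = 10 and rho = 0.5 are
# illustrative choices for this example, not package defaults.
k_rho <- 10
rho <- 0.5
iris_nn_rho <- nnd_knn(iris, k = k_rho, metric = "euclidean",
                       max_candidates = ceiling(rho * k_rho))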

# Optimization may also stop early if not much progress is being made. This
# convergence criterion can be controlled via delta. A larger value will
# stop progress earlier. The verbose flag will provide some information if
# convergence is occurring before all iterations are carried out.
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0.5)

# To ensure that descent only stops if no improvements are made, set delta = 0
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", n_iters = 5, delta = 0)
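
# Rough arithmetic for the stopping rule, assuming it follows the criterion
# in Dong et al. (2011): stop when fewer than delta * n * k neighbor entries
# change in an iteration. The internal bookkeeping in the package may differ.
n_entries <- nrow(iris) * 4                # 150 observations * 4 neighbors
min_updates_default <- 0.001 * n_entries   # default delta: under 1 update
min_updates_half <- 0.5 * n_entries        # delta = 0.5: half of all entries
c(default = min_updates_default, half = min_updates_half)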

# A faster version of the algorithm is available that avoids repeated
# distance calculations at the cost of using more RAM. Set low_memory to
# FALSE to try it.
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, metric = "euclidean", low_memory = FALSE)
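
# weight_by_degree = TRUE favors candidates with a smaller in-degree when a
# candidate list is longer than max_candidates. Any accuracy benefit is
# expected mainly for high-dimensional data, so with iris this only shows
# the usage.
set.seed(1337)
iris_nn_wbd <- nnd_knn(iris, k = 4, metric = "euclidean",
                       weight_by_degree = TRUE)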

# Using init = "tree" is usually more efficient than random initialization.
# Arguments to the tree initialization method can be passed via the init_args
# list
set.seed(1337)
iris_nn <- nnd_knn(iris, k = 4, init = "tree", init_args = list(n_trees = 5))
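
# With init = "tree", set ret_forest = TRUE to also get back the RP forest
# used for initialization. It is stored in the forest item of the returned
# list, alongside idx and dist.
set.seed(1337)
iris_nn_forest <- nnd_knn(iris, k = 4, init = "tree", ret_forest = TRUE)
names(iris_nn_forest)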
}
\references{
Dasgupta, S., & Freund, Y. (2008, May).
Random projection trees and low dimensional manifolds.
In \emph{Proceedings of the fortieth annual ACM symposium on Theory of computing}
(pp. 537-546).
\doi{10.1145/1374376.1374452}.

Dong, W., Moses, C., & Li, K. (2011, March).
Efficient k-nearest neighbor graph construction for generic similarity measures.
In \emph{Proceedings of the 20th international conference on World Wide Web}
(pp. 577-586).
ACM.
\doi{10.1145/1963405.1963487}.
}
