"cosine"
,
"jaccard"
or "hellinger"
metrics could give
incorrect results if maximally-dissimilar items were in the nearest
neighbors. Thank you to Maciej
Beręsewicz for the report (https://github.com/jlmelville/rnndescent/issues/14).nnd_knn
and rnnd_build
:
weight_by_degree
. If set to TRUE
, then the
candidate list in nearest neighbor descent is weighted in favor of
low-degree items, which should make for a more diverse local join. There
is a minor increase in computation but also a minor increase in
accuracy.rnnd_build
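  For example, a minimal sketch (the dataset is a placeholder and the usual `data`/`k` calling convention is assumed):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[, 1:4])
  # Favor low-degree items when building the candidate list during descent.
  nn <- nnd_knn(X, k = 15, weight_by_degree = TRUE)
  ```
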
- `rnnd_build` generated an error when preparing the search graph for some metrics (notably `cosine` and `jaccard`).
- New parameter for `prepare_search_graph`: `use_alt_metric`. This behaves like the existing `use_alt_metric` parameters in other functions and may speed up the index preparation step in e.g. `rnnd_build`.
- New parameter for `rnnd_build` and `prepare_search_graph`: `prune_reverse`. If set to `TRUE`, the reverse graph will be pruned using the `pruning_degree_multiplier` parameter before any diversification. This can help prevent an excessive amount of time being spent in the diversification step when an item has a large number of neighbors (in the reverse graph this can be as large as the number of items in the dataset). Pruning of the merged graph still occurs, so this is an additional pruning step. This should have little effect on search results, but for backwards compatibility, the default is `FALSE`.
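  A short sketch of how this might look via `rnnd_build` (placeholder data; other arguments left at their defaults):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[, 1:4])
  # Prune the reverse graph before diversification while building the index.
  index <- rnnd_build(X, k = 15, metric = "cosine", prune_reverse = TRUE)
  ```
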
- CRAN resubmission to fix lingering UBSAN errors.
- Initial CRAN submission.
- `rnnd_build` now always prepares the search graph.
- The `rnnd_prepare` function has been removed. The option to not prepare the search graph during index building only made sense if you were only interested in the k-nearest neighbor graph. Now that `rnnd_knn` exists for that purpose (see below), the logic of index building has been substantially simplified.
- The `nn_to_sparse` function has been removed.
- The `merge_knn` function has been removed, and `merge_knnl` has been renamed to `merge_knn`. If you were running e.g. `merge_knn(nn1, nn2)`, you must now use `merge_knn(list(nn1, nn2))`. Also, the parameter `nn_graphs` has been renamed `graphs`.
- New function: `rnnd_knn`. Behaves a lot like `rnnd_build`, but only returns the knn graph with no index built. The index can be very large for high-dimensional or large datasets, so this function is useful if you only care about the knn graph and won't ever want to query new data.
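  A minimal sketch (placeholder data):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[, 1:4])
  # Only the knn graph is returned; no (potentially large) index is kept.
  knn <- rnnd_knn(X, k = 15)
  head(knn$idx)
  ```
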
- New function: `neighbor_overlap`. Measures the overlap of two knn graphs via their shared indices. A similar function was used extensively in some vignettes, so it may have sufficient utility to be useful to others.
- New parameter for `rnnd_query` and `graph_knn_query`: `max_search_fraction`. This parameter controls the maximum number of nodes that can be searched during querying. If the number of nodes searched exceeds this fraction of the total number of nodes in the graph, the search is terminated. This can be used in combination with `epsilon` to avoid excessive search times.
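  A sketch of how the two might be combined (placeholder data; the `index`/`query` argument names are assumptions):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[1:100, 1:4])
  Xq <- as.matrix(iris[101:150, 1:4])
  index <- rnnd_build(X, k = 15)
  # Stop a query early once more than 10% of the nodes have been visited.
  res <- rnnd_query(index, query = Xq, k = 15,
                    epsilon = 0.1, max_search_fraction = 0.1)
  ```
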
- The `spearmanr` distance has been fixed.
- With `n_threads = 0`, progress/interrupt monitoring was not occurring.
- The `init` parameter of `rnnd_query`.
- In `rnnd_query`, if `verbose = TRUE`, a summary of the min, max and average number of distance queries will be logged. This can help tune `epsilon` and `max_search_fraction`.
- `local_scale_nn` has been removed, for similar reasons to the removal of the standalone distance functions. It remains in the `localscale` branch of the github repo.
- The search graph returned by `prepare_search_graph` is now transposed. This prevents having to repeatedly transpose inside every call to `graph_knn_query` if multiple queries are being made. You will need to either regenerate any saved search graphs or transpose them with `Matrix::t(search_graph)`.
- New functions: `rnnd_build`, `rnnd_query` and `rnnd_prepare`. These functions streamline the process of building a k-nearest neighbor graph, preparing a search graph and querying it.
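  A sketch of the streamlined build/query workflow (placeholder data; argument names beyond `k` are assumptions):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[1:100, 1:4])
  Xq <- as.matrix(iris[101:150, 1:4])
  # Build an index over X, then search it with new data.
  index <- rnnd_build(X, k = 15)
  res <- rnnd_query(index, query = Xq, k = 15)
  ```
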
- The `bhamming` metric no longer exists as a specialized metric. Instead, if you pass a `logical` matrix to the `data`, `reference` or `query` parameter (depending on the function) and specify `metric = "hamming"`, you will automatically get the binary-specific version of the hamming metric.
- The `hamming` and `bhamming` metrics are now normalized with respect to the number of features, to be consistent with the other binary-style metrics (and PyNNDescent). If you need the old distances, multiply the distance matrix by the number of columns, e.g. do something like:

  ```r
  res <- brute_force_knn(X, metric = "hamming")
  res$dist <- res$dist * ncol(X)
  ```

- The metric `l2sqr` has been renamed `sqeuclidean` to be consistent with PyNNDescent.
- The `metric` parameter now accepts a much larger number of metrics. See the function documentation for the full list of supported metrics. Currently, most of the metrics from PyNNDescent which don't require extra parameters are supported. The number of specialized binary metrics has also been expanded.
- New parameter for `rpf_knn` and `rpf_build`: `max_tree_depth`. This controls the depth of the tree and was set to 100 internally. This default has been doubled to 200 and can now be user-controlled. If `verbose = TRUE` and the largest leaf in the forest exceeds the `leaf_size` parameter, a message warning you about this will be logged, indicating that the maximum tree depth has been exceeded. Increasing `max_tree_depth` may not be the answer: it's more likely there is something unusual about the distribution of the distances in your dataset, and a random initialization might be a better use of your time.
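  A quick sketch of the parameter in use (placeholder data and values):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[, 1:4])
  # Allow deeper trees if the verbose log warns that leaves exceed leaf_size.
  knn <- rpf_knn(X, k = 15, leaf_size = 30, max_tree_depth = 400,
                 verbose = TRUE)
  ```
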
- Sparse data support: you can now pass a `dgCMatrix` to the `data`, `reference` or `query` parameters where you would usually use a dense matrix or data frame. `cosine`, `euclidean`, `manhattan`, `hamming` and `correlation` are all available, but the alternative versions found in the dense case, e.g. `cosine-preprocess` or the binary-specific `bhamming`, are not.
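  A sketch using `brute_force_knn` for illustration (placeholder data; other functions accepting `data` should behave similarly):

  ```r
  library(rnndescent)
  library(Matrix)

  X <- as.matrix(iris[, 1:4])
  Xsp <- as(X, "CsparseMatrix")  # a dgCMatrix
  # Sparse input uses the plain (non-"preprocess") metric variants.
  nn <- brute_force_knn(Xsp, k = 15, metric = "cosine")
  ```
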
- New `init` option for `graph_knn_query`: you can now pass an RP forest and initialize with that, e.g. from `rpf_build`, or by setting `ret_forest = TRUE` on `nnd_knn` or `rpf_knn`. You may want to cut down the size of the forest used for initialization with `rpf_filter` first, though (a single tree may be enough). This will also use the metric data stored in the forest, so the `metric` (or `use_alt_metric`) setting in the function itself will be ignored.
- If the data passed to `prepare_search_graph` or to `graph_knn_query` contains missing data, this will no longer cause an error (it still might not be the best idea, though).
- New function: `rpf_knn`. Calculates the approximate k-nearest neighbors using a random partition forest.
- New function: `rpf_build`. Builds a random partition forest.
- New function: `rpf_knn_query`. Queries a random partition forest (built with `rpf_build`) to find the approximate nearest neighbors for the query points.
- New function: `rpf_filter`. Retains only the best "scoring" trees in a forest, where each tree is scored based on how well it reproduces a given knn.
- New option for `nnd_knn`: `init = "tree"`. Uses the RP forest initialization method.
- New parameter for `nnd_knn`: `ret_forest`. Returns the search forest used if `init = "tree"`, so it can be used for future searching or filtering.
- New parameter for `nnd_knn`: `init_opts`. Options that can be passed to the RP forest initialization (same as in `rpf_knn`).
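  Pulling the last few items together, a minimal sketch (placeholder data; where the returned forest is stored in the result is not shown here):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[, 1:4])
  # Initialize nearest neighbor descent from an RP forest and keep the
  # forest used, so it can later be filtered or reused for searching.
  nn <- nnd_knn(X, k = 15, init = "tree", ret_forest = TRUE)
  ```
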
- `nnd_knn` with `n_threads > 0` was reporting double the actual number of iterations. This made the progress bar way too optimistic.
- Two values of `metric`, `"cosine"` and `"correlation"`, have been renamed `"cosine-preprocess"` and `"correlation-preprocess"` respectively. This reflects that they do some preprocessing of the data up front to make subsequent distance calculations faster. I have endeavored to avoid unnecessary allocations or copying in this preprocessing, but there is still a chance of more memory usage.
- The `cosine` and `correlation` metrics are still available as an option, but now use an implementation that doesn't do any preprocessing. The preprocessing and non-preprocessing versions should give the same numerical results, give or take some minor numerical differences, but when the distance should be zero, the preprocessing versions may give values which are slightly different from zero (e.g. 1e-7).
- New functions: `correlation_distance`, `cosine_distance`, `euclidean_distance`, `hamming_distance`, `l2sqr_distance` and `manhattan_distance`, for calculating the distance between two vectors. These may be useful for more arbitrary distance calculations than the nearest neighbor routines here, although they won't be as efficient (they do call the same C++ code, though). The cosine and correlation calculations here use the non-preprocessing implementations.
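  A small sketch, assuming the two-vector calling convention described above:

  ```r
  library(rnndescent)

  x <- c(1, 0, 2)
  y <- c(0, 1, 2)
  # Pairwise distances between two individual vectors.
  euclidean_distance(x, y)
  cosine_distance(x, y)
  ```
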
- Changed the `hamming` metric to a standard definition. The old implementation of the `hamming` metric, which worked on binary data only, has been renamed to `bhamming` (contributed by Vitalie Spinu).
- A new parameter, `obs`, has been added to most functions: set `obs = "C"` and you can pass the input data in column-oriented format.
- The `random_knn` function used to always return each item as its own neighbor, so that only `n_nbrs - 1` of the returned neighbors were actually selected at random. Even I forgot it did that and it doesn't make a lot of sense, so now you really do just get back `n_nbrs` random selections.
- The `init` parameter to `nnd_knn` or `graph_knn_query`: previously, if `k` was specified and larger than the number of neighbors included in `init`, this gave an error. Now, `init` will be augmented with random neighbors to reach the desired `k`. This could be useful as a way to "restart" a neighbor search from a better-than-random location if `k` has been found to have been too small initially. Note that the random selection does not take into account the identities of the already-chosen neighbors, so duplicates may be included in the augmented result, which will reduce the effective size of the initialized number of neighbors.
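  A sketch of the "restart with a larger k" pattern (placeholder data):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[, 1:4])
  nn15 <- nnd_knn(X, k = 15)
  # The init graph has only 15 neighbors per item; the remaining 15 are
  # filled in randomly (possibly with duplicates) before descent refines them.
  nn30 <- nnd_knn(X, k = 30, init = nn15)
  ```
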
- Removed the `block_size` and `grain_size` parameters from functions. These were related to the amount of work done per thread, but it's not obvious to an outside user how to set these.
- Functions now report progress (with `verbose = TRUE`) and respond to user-requested cancellation.
- `nnd_knn_query` has been renamed to `graph_knn_query` and now more closely follows the current pynndescent graph search method (including backtracking search).
- New function: `prepare_search_graph`, for preparing a search graph from a neighbor graph for use in `graph_knn_query`, by using reverse nearest neighbors, occlusion pruning and truncation.
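  A sketch of the prepare-then-query workflow (placeholder data; the argument names used below, particularly for `graph_knn_query`, are assumptions, so check the function documentation):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[1:100, 1:4])
  Xq <- as.matrix(iris[101:150, 1:4])
  nn <- nnd_knn(X, k = 15)
  # Turn the knn graph into a search graph, then query it with new data.
  sg <- prepare_search_graph(X, nn)
  res <- graph_knn_query(query = Xq, reference = X,
                         reference_graph = sg, k = 15)
  ```
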
- `graph_knn_query`.
- There was a major rewrite of the internal organization of the C++ to be less R-specific.
- The license for rnndescent has changed from GPLv3 to GPLv3 or later.
"correlation"
. This is (1 minus) the
Pearson correlation.k_occur
which counts the k-occurrences of
each item in the idx
matrix, which is the number of times
an item appears in the k-nearest neighbor list in the dataset. The
distribution of the k-occurrences can be used to diagnose the “hubness”
of a dataset. Items with a large k-occurrence (>> k, e.g. 10k),
may indicate low accuracy of the approximate nearest neighbor
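  For example (placeholder data; passing the `idx` matrix directly is assumed to be sufficient):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[, 1:4])
  nn <- brute_force_knn(X, k = 15)
  # Count how often each item shows up in other items' neighbor lists.
  ko <- k_occur(nn$idx)
  summary(ko)
  ```
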
- To avoid undefined behavior issues, rnndescent now uses an internal implementation of RcppParallel's `parallelFor` loop that works with `std::thread` and does not load Intel's TBB library.
- Previously, the `dqrng` sample routines were being called from inside a thread, despite `dqrng` clearly using the R API extensively. That's not ok and causes lots of crashes. There is now a re-implementation of `dqrng`'s sample routines using plain `std::vector`s in `src/rnn_sample.h`. That file is licensed under the AGPL (`rnndescent` as a whole remains GPL3).
- New function: `merge_knn`, to combine two nearest neighbor graphs. Useful for combining the results of multiple runs of `nnd_knn` or `random_knn`. Also, `merge_knnl`, which operates on a list of multiple neighbor graphs, and can provide a speed-up over `merge_knn` if you don't mind storing multiple graphs in memory at once.
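  A sketch of the two-graph form as it existed at this point (argument names follow the more recent API; see the rename described in a newer entry above):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[, 1:4])
  nn1 <- random_knn(X, k = 15)
  nn2 <- nnd_knn(X, k = 15)
  # Keep the better of the two neighbor sets for each item.
  merged <- merge_knn(nn1, nn2)
  ```
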
- `nnd_knn` with `n_threads > 1`, and `random_knn` with `n_threads > 1` and `order_by_distance = TRUE`.
- `nnd_knn` with `n_threads > 1`, due to the use of a mutex pool.
- Mainly an internal clean-up to reduce duplication.
- `nnd_knn` and `nnd_knn_query` use the same progress bar as the brute force and random neighbor functions. The old per-iteration logging, which also showed the current distance sum of the knn, is back with the `progress = "dist"` option.
- In `random_knn` and `random_knn_query`, when `order_by_distance = TRUE` and `n_threads > 0`, the final sorting of the knn graph will be multi-threaded.
- `n_threads > 0`.
- Query versions of the knn functions have been added, with `nnd_knn_query` being the most useful, but `brute_force_knn_query` and `random_knn_query` are also available. These allow `query` data to search `reference` data, i.e. the returned indices and distances are relative to the `reference` data, not any other member of `query`. These methods are also available in multi-threaded mode, and `nnd_knn_query` has a low and high memory version.
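  A sketch of the query-vs-reference semantics (placeholder data; the `query`/`reference` argument names are assumptions):

  ```r
  library(rnndescent)

  X <- as.matrix(iris[1:100, 1:4])
  Xq <- as.matrix(iris[101:150, 1:4])
  # Indices and distances in res refer to rows of the reference X, not of Xq.
  res <- brute_force_knn_query(query = Xq, reference = X, k = 15)
  ```
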
- The `l2` metric has been renamed to `l2sqr` to more accurately reflect what it is: the square of the L2 (Euclidean) metric.
- New parameter: `use_alt_metric`. Set to `FALSE` if you don't want alternative, faster metrics (which keep the distance ordering of `metric`) to be used in internal calculations. Currently this only applies to `metric = "euclidean"`, where the squared Euclidean distance is used internally. It is only worth setting this to `FALSE` if you think the alternative is causing numerical issues (which is a bug, so please report it!).
- New parameter: `block_size` for parallel methods, which determines the amount of work done in parallel before checking for a user interrupt request and updating any progress.
- `random_knn` now returns its results in sorted order. You can turn this off with `order_distances = FALSE` if you don't need the sorting (e.g. you are using the results as input to something else).
- `brute_force` and `random` methods should now be correct.
- `brute_force_knn`.
- `random_knn`.
- `verbose = TRUE`.
- The `fast_rand` option has been removed, as it only applied to single-threading and had a negligible effect.

Also, a number of changes inspired by recent work in https://github.com/lmcinnes/pynndescent:
- The `rho` sampling parameter has been removed. The size of the candidates (general neighbors) list is now controlled entirely by `max_candidates`.
- The default `max_candidates` has been reduced to 20.
- The `use_set` logical flag has been replaced by `low_memory`, which has the opposite meaning. It now also works when using multiple threads. While it follows the pynndescent implementation, it's still experimental, so `low_memory = TRUE` by default for the moment.
- The `low_memory = FALSE` implementation for `n_threads = 0` (originally equivalent to `use_set = TRUE`) is faster.
- A `block_size` parameter, which balances interleaving of queuing updates versus applying them to the current graph.