The philentropy
package has several mechanisms to calculate distances between
probability density functions. The main one is to use the the
distance() function, which enables to compute 46 different
distances/similarities between probability density functions (see
?philentropy::distance and a
companion vignette for details). Alternatively, it is possible to
call each distance/dissimilarity function directly. For example, the
euclidean() function will compute the euclidean distance,
while jaccard - the Jaccard distance. The complete list of
available distance measures are available with the
philentropy::getDistMethods() function.
Both of the above approaches have their pros and cons. The
distance() function is more flexible as it allows users to
use any distance measure and can return either a matrix or
a dist object. It also has several defensive programming
checks implemented, and thus, it is more appropriate for regular users.
Single distance functions, such as euclidean() or
jaccard(), can be, on the other hand, slightly faster as
they directly call the underlining C++ code.
Now, we introduce three new low-level functions that are
intermediaries between distance() and single distance
functions. They are fairly flexible, allowing to use of any implemented
distance measure, but also usually faster than calling the
distance() functions (especially, if it is needed to use
many times). These functions are:
dist_one_one() - expects two vectors (probability
density functions), returns a single valuedist_one_many() - expects one vector (a probability
density function) and one matrix (a set of probability density
functions), returns a vector of valuesdist_many_many() - expects two matrices (two sets of
probability density functions), returns a matrix of valuesLet’s start testing them by attaching the philentropy package.
dist_one_one()dist_one_one() is a lower level equivalent to
distance(). However, instead of accepting a numeric
data.frame or matrix, it expects two vectors
representing probability density functions. In this example, we create
two vectors, P and Q.
To calculate the euclidean distance between them we can use several
approaches - (a) build-in R dist() function, (b)
philentropy::distance(), (c)
philentropy::euclidean(), or the new
dist_one_one().
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
dist(rbind(P, Q), method = "euclidean"),
distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
euclidean(P, Q, FALSE),
dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)## Unit: microseconds
## expr
## dist(rbind(P, Q), method = "euclidean")
## distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE)
## euclidean(P, Q, FALSE)
## dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 11.908 12.5430 15.07808 12.9645 13.5800 173.768 100
## 52.398 53.9905 58.13930 54.9065 55.9840 311.013 100
## 1.082 1.1920 1.49392 1.3070 1.4775 15.834 100
## 1.743 1.9020 2.75477 2.0180 2.1935 70.415 100
All of them return the same, single value. However, as you can see in the benchmark above, some are more flexible, and others are faster.
dist_one_many()The role of dist_one_many() is to calculate distances
between one probability density function (in a form of a
vector) and a set of probability density functions (as rows
in a matrix).
Firstly, let’s create our example data.
P is our input vector and M is our input
matrix.
Distances between the P vector and probability density
functions in M can be calculated using several approaches.
For example, we could write a for loop (adding a new code)
or just use the existing distance() function and extract
only one row (or column) from the results. The
dist_one_many() allows for this calculation directly as it
goes through each row in M and calculates a given distance
measure between P and values in this row.
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)## Unit: microseconds
## expr
## as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1]
## distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1]
## dist_one_many(P, M, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 151.385 196.0320 220.4645 218.6200 231.3345 432.814 100
## 198.596 232.0705 271.5624 250.4025 314.5280 427.946 100
## 13.911 19.6845 41.2512 22.0430 25.1470 1613.981 100
The dist_one_many() returns a vector of values. It is,
in this case, much faster than distance(), and visibly
faster than dist() while allowing for more possible
distance measures to be used.
dist_many_many()dist_many_many() calculates distances between two sets
of probability density functions (as rows in two matrix
objects). dist_many_many() calculates distances between two
sets of probability density functions (as rows in two
matrix objects). This is useful when you have two different
sets of distributions, say M1 and M2, and you
want to compute the distance from every distribution in M1
to every distribution in M2. The main
distance() function cannot do this, as it only computes
pairwise distances within a single matrix.
Let’s create two new matrix example data. Let’s create
two matrix examples.
set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))M1 is our first input matrix and M2 is our
second input matrix. I am not aware of any function build-in R that
allows calculating distances between rows of two matrices, and thus, to
solve this problem, we can create our own -
many_dists()…
many_dists = function(m1, m2){
r = matrix(nrow = nrow(m1), ncol = nrow(m2))
for (i in seq_len(nrow(m1))){
for (j in seq_len(nrow(m2))){
x = rbind(m1[i, ], m2[j, ])
r[i, j] = distance(x, method = "euclidean", mute.message = TRUE)
}
}
r
}… and compare it to dist_many_many().
dist_many_many() is fully implemented in C++ and can use
multiple threads. For this comparison we will use the default
num.threads = NULL which will use 2 threads unless the
RCPP_PARALLEL_NUM_THREADS environment variable is set.
There are trade-offs with selecting the number of threads. Using too
many threads, more than needed for the workload, will incur additional
overhead and may not get to the result any faster than a sequential
approach.
# install.packages("microbenchmark")
bm <- microbenchmark::microbenchmark(
`many_dists` = many_dists(M1, M2),
`dist_many_many` = dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
bm## Unit: microseconds
## expr min lq mean median uq max neval
## many_dists 5663.396 5812.151 6355.48201 5890.1925 5966.622 11975.816 100
## dist_many_many 13.430 20.130 32.35226 28.2475 42.884 106.839 100
Both many_dists()and dist_many_many()
return a matrix.
The above benchmark concludes that dist_many_many() is
about 200 times faster than our custom many_dists()
approach. If we were to calculate the distance manually, as opposed to
using the optimized distance() function which calls
compiled code, we would see an even bigger difference.