The philentropy package has several mechanisms to calculate distances
between probability density functions. The main one is the
distance() function, which computes 46 different
distances/similarities between probability density functions (see
?philentropy::distance and a
companion vignette for details). Alternatively, it is possible to
call each distance/dissimilarity function directly. For example, the
euclidean() function computes the Euclidean distance,
while jaccard() computes the Jaccard distance. The complete list of
available distance measures is returned by the
philentropy::getDistMethods() function.
Both of the above approaches have their pros and cons. The
distance() function is more flexible, as it allows users to
use any distance measure and can return either a matrix or
a dist object. It also has several defensive programming
checks implemented, and thus it is more appropriate for regular users.
Single distance functions, such as euclidean() or
jaccard(), can, on the other hand, be slightly faster, as
they directly call the underlying C++ code.
Now, we introduce three new low-level functions that are
intermediaries between distance() and single distance
functions. They are fairly flexible, allowing the use of any implemented
distance measure, but are also usually faster than calling the
distance() function (especially if it needs to be called
many times). These functions are:
dist_one_one() - expects two vectors (probability density functions), returns a single value
dist_one_many() - expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
dist_many_many() - expects two matrices (two sets of probability density functions), returns a matrix of values

Let’s start testing them by attaching the philentropy package.
library(philentropy)

dist_one_one()

dist_one_one() is a lower-level equivalent to
distance(). However, instead of accepting a numeric
data.frame or matrix, it expects two vectors
representing probability density functions. In this example, we create
two vectors, P and Q.
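The vectors themselves are not shown here, so the following is a minimal sketch with assumed values; any two non-negative vectors of equal length that each sum to 1 would serve. It also computes the Euclidean distance by hand in base R, which gives a reference value for the function calls discussed in this section.

```r
# Assumed example data (not taken from the text): two discrete
# probability vectors of equal length, each summing to 1
P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

# Reference value: Euclidean distance computed by hand in base R
d_manual <- sqrt(sum((P - Q)^2))

# Same distance via the built-in dist(); the rows of the matrix are the vectors
d_base <- as.numeric(dist(rbind(P, Q), method = "euclidean"))

# Both routes should agree up to floating-point error
all.equal(d_manual, d_base)
```

Having a hand-computed value makes it easy to confirm that every approach returns the same number before comparing their speed.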
To calculate the Euclidean distance between them, we can use several
approaches: (a) the built-in R dist() function, (b)
philentropy::distance(), (c)
philentropy::euclidean(), or (d) the new
dist_one_one().
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  dist(rbind(P, Q), method = "euclidean"),
  distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
  euclidean(P, Q, FALSE),
  dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
## expr
## dist(rbind(P, Q), method = "euclidean")
## distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE)
## euclidean(P, Q, FALSE)
## dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 11.458 12.1340 14.53845 12.4950 13.2325 147.751 100
## 51.686 52.9215 58.60634 53.7725 55.6740 328.100 100
## 1.070 1.1745 1.75758 1.2880 1.3560 45.614 100
## 1.641 1.8265 1.97002 1.9105 2.0390 3.434 100
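All of them return the same single value. However, as the benchmark above shows, they differ in speed: by median time, euclidean() and dist_one_one() take under 2 microseconds, while dist() takes about 12 and distance() about 54. The slower functions are, in turn, more flexible.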

dist_one_many()

The role of dist_one_many() is to calculate distances
between one probability density function (in the form of a
vector) and a set of probability density functions (as rows
in a matrix).
Firstly, let’s create our example data.
P is our input vector and M is our input
matrix.
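The data-generating code is not shown here. A sketch consistent with how M1 and M2 are built later in this document might look as follows; the seed and the number of rows (100) are assumptions.

```r
set.seed(1992)  # assumed seed, used only for reproducibility

# P: one probability vector of length 10
P <- 1:10 / sum(1:10)

# M: 100 probability vectors stored as rows; each row is a random
# permutation of 1:10 divided by 55 (= sum(1:10)), so every row sums to 1
M <- t(replicate(100, sample(1:10, size = 10) / 55))

dim(M)             # number of rows and columns
range(rowSums(M))  # both ends should equal 1 up to floating-point error
```

replicate() returns one sample per column, so the transpose is needed to place each distribution in a row, which is the layout the functions in this document expect.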
Distances between the P vector and the probability density
functions in M can be calculated using several approaches.
For example, we could write a for loop (which means writing new code)
or just use the existing distance() function and extract
only one row (or column) from the results.
dist_one_many() performs this calculation directly: it
goes through each row in M and calculates a given distance
measure between P and the values in this row.
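For comparison, the for-loop alternative mentioned above could be sketched in base R as follows. The helper name one_many_loop() is hypothetical, the data are regenerated with assumed values so that the block runs on its own, and the philentropy comparison only runs if the package is installed.

```r
# Hypothetical helper: Euclidean distance from vector p to every row of m
one_many_loop <- function(p, m) {
  stopifnot(length(p) == ncol(m))
  out <- numeric(nrow(m))
  for (i in seq_len(nrow(m))) {
    out[i] <- sqrt(sum((p - m[i, ])^2))
  }
  out
}

# Assumed example data
set.seed(1992)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))

d_loop <- one_many_loop(P, M)

# Cross-check against a dist()-based extraction: first row, minus the
# zero distance from P to itself
d_base <- unname(as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1])
all.equal(d_loop, d_base)

# If philentropy is installed, dist_one_many() should give the same vector
if (requireNamespace("philentropy", quietly = TRUE)) {
  d_pkg <- philentropy::dist_one_many(P, M, method = "euclidean", testNA = FALSE)
  print(all.equal(d_loop, as.numeric(d_pkg)))
}
```

The loop is easy to read but only handles one distance measure; supporting others would mean rewriting its body each time.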
# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
  distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
  dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
## expr
## as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1]
## distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1]
## dist_one_many(P, M, method = "euclidean", testNA = FALSE)
## min lq mean median uq max neval
## 138.424 163.5530 182.02943 178.6330 195.5315 349.417 100
## 192.726 221.5960 250.48492 231.9745 250.2710 458.468 100
## 13.195 16.1285 32.36888 18.3050 21.9075 1190.795 100
dist_one_many() returns a vector of values. In this case, it is,
by median time, about 13 times faster than distance() and about
10 times faster than dist(), while allowing more
distance measures to be used.

dist_many_many()

dist_many_many() calculates distances between two sets
of probability density functions (as rows in two matrix
objects). This is useful when you have two different
sets of distributions, say M1 and M2, and you
want to compute the distance from every distribution in M1
to every distribution in M2. The main
distance() function cannot do this, as it only computes
pairwise distances within a single matrix.
Let’s create two example matrices.
set.seed(2020-08-20) # evaluated as arithmetic, so the seed is 1992
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))
M1 is our first input matrix and M2 is our
second input matrix. I am not aware of any function built into R that
allows calculating distances between rows of two matrices, and thus, to
solve this problem, we can create our own,
many_dists()…
many_dists <- function(m1, m2) {
  # one row per row of m1, one column per row of m2
  r <- matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))) {
    for (j in seq_len(nrow(m2))) {
      # stack the two rows and let distance() compare them
      x <- rbind(m1[i, ], m2[j, ])
      r[i, j] <- distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}
… and compare it to dist_many_many().
dist_many_many() is fully implemented in C++ and can use
multiple threads. For this comparison, we will use the default,
num.threads = NULL, which uses 2 threads unless the
RCPP_PARALLEL_NUM_THREADS environment variable is set.
There are trade-offs with selecting the number of threads. Using too
many threads, more than needed for the workload, will incur additional
overhead and may not get to the result any faster than a sequential
approach.
# install.packages("microbenchmark")
bm <- microbenchmark::microbenchmark(
  `many_dists` = many_dists(M1, M2),
  `dist_many_many` = dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
bm
## Unit: microseconds
##            expr      min        lq       mean   median       uq       max neval
##      many_dists 5464.810 5641.6020 6208.95565 5807.343 5936.113 11633.295   100
##  dist_many_many   13.141   18.4075   35.66887   34.030   51.046   103.687   100
Both many_dists() and dist_many_many()
return a matrix.
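As a sanity check on what that matrix contains, the same Euclidean distances can be computed in base R. This is a sketch and not the package's implementation; cross_euclidean() is a hypothetical helper, and the philentropy comparison only runs if the package is installed.

```r
# Hypothetical helper: Euclidean distance between every row of m1
# and every row of m2
cross_euclidean <- function(m1, m2) {
  stopifnot(ncol(m1) == ncol(m2))
  # ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b, for all row pairs at once
  sq <- outer(rowSums(m1^2), rowSums(m2^2), "+") - 2 * tcrossprod(m1, m2)
  sqrt(pmax(sq, 0))  # clamp tiny negative values caused by rounding
}

set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))

D <- cross_euclidean(M1, M2)
dim(D)  # one row per row of M1, one column per row of M2

# Spot-check one entry against the direct definition
all.equal(D[3, 7], sqrt(sum((M1[3, ] - M2[7, ])^2)))

# Optional comparison with the package result
if (requireNamespace("philentropy", quietly = TRUE)) {
  D_pkg <- philentropy::dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
  print(all.equal(D, D_pkg, check.attributes = FALSE))
}
```

Entry [i, j] is the distance between row i of the first matrix and row j of the second, so the result does not have to be square or symmetric.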
The above benchmark shows that dist_many_many() is
about 170 times faster (by median time) than our custom many_dists()
approach. If we were to calculate the distance manually, as opposed to
using the optimized distance() function, which calls
compiled code, we would see an even bigger difference.