Comparing many probability density functions

The philentropy package has several mechanisms to calculate distances between probability density functions. The main one is to use the the distance() function, which enables to compute 46 different distances/similarities between probability density functions (see ?philentropy::distance and a companion vignette for details). Alternatively, it is possible to call each distance/dissimilarity function directly. For example, the euclidean() function will compute the euclidean distance, while jaccard - the Jaccard distance. The complete list of available distance measures are available with the philentropy::getDistMethods() function.

Both of the above approaches have their pros and cons. The distance() function is more flexible as it allows users to use any distance measure and can return either a matrix or a dist object. It also has several defensive programming checks implemented, and thus, it is more appropriate for regular users. Single distance functions, such as euclidean() or jaccard(), can be, on the other hand, slightly faster as they directly call the underlining C++ code.

Now, we introduce three new low-level functions that are intermediaries between distance() and single distance functions. They are fairly flexible, allowing to use of any implemented distance measure, but also usually faster than calling the distance() functions (especially, if it is needed to use many times). These functions are:

  • dist_one_one() - expects two vectors (probability density functions), returns a single value
  • dist_one_many() - expects one vector (a probability density function) and one matrix (a set of probability density functions), returns a vector of values
  • dist_many_many() - expects two matrices (two sets of probability density functions), returns a matrix of values

Let’s start testing them by attaching the philentropy package.

library(philentropy)

dist_one_one()

dist_one_one() is a lower level equivalent to distance(). However, instead of accepting a numeric data.frame or matrix, it expects two vectors representing probability density functions. In this example, we create two vectors, P and Q.

P <- 1:10 / sum(1:10)
Q <- 20:29 / sum(20:29)

To calculate the euclidean distance between them we can use several approaches - (a) build-in R dist() function, (b) philentropy::distance(), (c) philentropy::euclidean(), or the new dist_one_one().

# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  dist(rbind(P, Q), method = "euclidean"),
  distance(rbind(P, Q), method = "euclidean", test.na = FALSE, mute.message = TRUE),
  euclidean(P, Q, FALSE),
  dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
##                                                                                    expr
##                                                 dist(rbind(P, Q), method = "euclidean")
##  distance(rbind(P, Q), method = "euclidean", test.na = FALSE,      mute.message = TRUE)
##                                                                  euclidean(P, Q, FALSE)
##                                dist_one_one(P, Q, method = "euclidean", testNA = FALSE)
##     min      lq     mean  median      uq     max neval
##  11.908 12.5430 15.07808 12.9645 13.5800 173.768   100
##  52.398 53.9905 58.13930 54.9065 55.9840 311.013   100
##   1.082  1.1920  1.49392  1.3070  1.4775  15.834   100
##   1.743  1.9020  2.75477  2.0180  2.1935  70.415   100

All of them return the same, single value. However, as you can see in the benchmark above, some are more flexible, and others are faster.

dist_one_many()

The role of dist_one_many() is to calculate distances between one probability density function (in a form of a vector) and a set of probability density functions (as rows in a matrix).

Firstly, let’s create our example data.

set.seed(2020-08-20)
P <- 1:10 / sum(1:10)
M <- t(replicate(100, sample(1:10, size = 10) / 55))

P is our input vector and M is our input matrix.

Distances between the P vector and probability density functions in M can be calculated using several approaches. For example, we could write a for loop (adding a new code) or just use the existing distance() function and extract only one row (or column) from the results. The dist_one_many() allows for this calculation directly as it goes through each row in M and calculates a given distance measure between P and values in this row.

# install.packages("microbenchmark")
microbenchmark::microbenchmark(
  as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1],
  distance(rbind(P, M), method = "euclidean", test.na = FALSE, mute.message = TRUE)[1, ][-1],
  dist_one_many(P, M, method = "euclidean", testNA = FALSE)
)
## Unit: microseconds
##                                                                                             expr
##                                      as.matrix(dist(rbind(P, M), method = "euclidean"))[1, ][-1]
##  distance(rbind(P, M), method = "euclidean", test.na = FALSE,      mute.message = TRUE)[1, ][-1]
##                                        dist_one_many(P, M, method = "euclidean", testNA = FALSE)
##      min       lq     mean   median       uq      max neval
##  151.385 196.0320 220.4645 218.6200 231.3345  432.814   100
##  198.596 232.0705 271.5624 250.4025 314.5280  427.946   100
##   13.911  19.6845  41.2512  22.0430  25.1470 1613.981   100

The dist_one_many() returns a vector of values. It is, in this case, much faster than distance(), and visibly faster than dist() while allowing for more possible distance measures to be used.

dist_many_many()

dist_many_many() calculates distances between two sets of probability density functions (as rows in two matrix objects). dist_many_many() calculates distances between two sets of probability density functions (as rows in two matrix objects). This is useful when you have two different sets of distributions, say M1 and M2, and you want to compute the distance from every distribution in M1 to every distribution in M2. The main distance() function cannot do this, as it only computes pairwise distances within a single matrix.

Let’s create two new matrix example data. Let’s create two matrix examples.

set.seed(2020-08-20)
M1 <- t(replicate(10, sample(1:10, size = 10) / 55))
M2 <- t(replicate(10, sample(1:10, size = 10) / 55))

M1 is our first input matrix and M2 is our second input matrix. I am not aware of any function build-in R that allows calculating distances between rows of two matrices, and thus, to solve this problem, we can create our own - many_dists()

many_dists = function(m1, m2){
  r = matrix(nrow = nrow(m1), ncol = nrow(m2))
  for (i in seq_len(nrow(m1))){
    for (j in seq_len(nrow(m2))){
      x = rbind(m1[i, ], m2[j, ])
      r[i, j] = distance(x, method = "euclidean", mute.message = TRUE)
    }
  }
  r
}

… and compare it to dist_many_many().

dist_many_many() is fully implemented in C++ and can use multiple threads. For this comparison we will use the default num.threads = NULL which will use 2 threads unless the RCPP_PARALLEL_NUM_THREADS environment variable is set. There are trade-offs with selecting the number of threads. Using too many threads, more than needed for the workload, will incur additional overhead and may not get to the result any faster than a sequential approach.

# install.packages("microbenchmark")
bm <- microbenchmark::microbenchmark(
  `many_dists` = many_dists(M1, M2),
  `dist_many_many` = dist_many_many(M1, M2, method = "euclidean", testNA = FALSE)
)
bm
## Unit: microseconds
##            expr      min       lq       mean    median       uq       max neval
##      many_dists 5663.396 5812.151 6355.48201 5890.1925 5966.622 11975.816   100
##  dist_many_many   13.430   20.130   32.35226   28.2475   42.884   106.839   100

Both many_dists()and dist_many_many() return a matrix.

The above benchmark concludes that dist_many_many() is about 200 times faster than our custom many_dists() approach. If we were to calculate the distance manually, as opposed to using the optimized distance() function which calls compiled code, we would see an even bigger difference.