Exercises

(1) The trim argument of the mean function is two-sided. Let's build a one-sided trimmed mean function, one that uses counts instead of percentiles. Call it mean.minus.top.n: for example, mean.minus.top.n(x, 5) will throw out the highest 5 values of x before computing the average. HINT: you can sort x using the sort function.

mean.minus.top.n(c(1, 5, 3, 99), 1) # should return 3
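One possible sketch of such a function (assuming n is smaller than the length of x):

```r
# Drop the highest n values of x, then average what remains.
mean.minus.top.n <- function(x, n) {
  x_sorted <- sort(x)                        # sort in increasing order
  mean(head(x_sorted, length(x_sorted) - n)) # keep all but the last n values
}

mean.minus.top.n(c(1, 5, 3, 99), 1) # returns 3
```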

We just learned that the probs argument of quantile can be a vector. So instead of getting multiple quantiles separately, such as

c(quantile(nyc_taxi$trip_distance, probs = .9),
  quantile(nyc_taxi$trip_distance, probs = .6),
  quantile(nyc_taxi$trip_distance, probs = .3))

we can get them all at once by passing the percentiles we want as a single vector to probs:

quantile(nyc_taxi$trip_distance, probs = c(.3, .6, .9))

As it turns out, there's a considerable difference in efficiency between the first and second approach, which we explore in the next two exercises:

There are two important tools we can use when considering efficiency:

  • profiling is a helpful tool if we need to understand what a function does under the hood (good for finding bottlenecks)
  • benchmarking is the process of comparing multiple functions to see which is faster

Both of these tools can be slow when working with large datasets (especially the benchmarking tool), so instead we create a vector of random numbers and use that for testing (alternatively, we could use a sample of the data). We want the vector to be big enough that test results are stable (not due to chance), but small enough that the tests run within a reasonable time frame.

random.vec <- rnorm(10^6) # a million random numbers generated from a standard normal distribution

Let's begin by profiling, for which we rely on the profr library:

library(profr)
my_test_function <- function(){
  quantile(random.vec, probs = seq(0, 1, by = .01)) # every percentile from 0 to 100
}
p <- profr(my_test_function())
plot(p)

(Figure: profiling plot for my_test_function)

(2) Describe what the plot is telling us: what is the bottleneck in getting quantiles?


Now on to benchmarking: we compare two functions, first and second. first finds the 30th, 60th, and 90th percentiles of the data in one function call, while second uses three separate function calls, one per percentile. From the profiling tool, we now know that every time we compute percentiles we need to sort the data, and that sorting the data is the most time-consuming part of the calculation. The benchmarking tool should show that first is three times more efficient than second, because first sorts the data once and finds all three percentiles, whereas second sorts the data three separate times, finding one percentile each time.

first <- function(x) quantile(x, probs = c(.3, .6, .9)) # get all percentiles at the same time
second <- function(x) {
  c(
    quantile(x, probs = .9),
    quantile(x, probs = .6),
    quantile(x, probs = .3))
}

library(microbenchmark) # makes benchmarking easy
print(microbenchmark(
  first(random.vec), # vectorized version
  second(random.vec), # non-vectorized
  times = 10))
Unit: milliseconds
               expr   min    lq  mean median    uq   max neval
  first(random.vec)  62.7  68.9  78.2   75.8  89.8  97.4    10
 second(random.vec) 119.8 130.3 139.3  140.7 146.9 157.6    10

(3) What do the results say? Do the runtimes bear out our intuition?
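A quick way to probe this intuition (a sketch using base R's system.time, not part of the exercise): if sorting dominates, a full sort of the vector by itself should take an amount of time comparable to a single quantile call.

```r
# Time the suspected bottleneck (sorting) against one quantile() call.
random.vec <- rnorm(10^6)                             # same setup as above
print(system.time(sort(random.vec)))                  # time for a full sort
print(system.time(quantile(random.vec, probs = .5)))  # quantile sorts internally
```

Note that quantile() may use a partial sort internally, so the two timings need not match exactly.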
