Writing summary functions

As we use R more and more, we will see that a lot of R functions return a list as output (or something that is fundamentally a list but looks cosmetically different). In fact, as it happens a data.frame is also just a kind a list, with each element of the list corresponding to a column of the data.frame, and all elements having the same length. Why would a data.frame be a list and not a matrix? Because like a vector, a matirx or any array is atomic, meaning that its elements must be of the same type (usually numeric). Notice what happens if we try to force a vector to have one character element and one numeric one:

c("one", 1)
[1] "one" "1"

The second element was coerced into the string "1". A list will not complain about this:

list("one", 1)
[[1]]
[1] "one"

[[2]]
[1] 1

Since columns of a data.frame can be of different types, it makes sense that under the hood a data.frame is really just a list. We can check that a data.frame is a kind of list under the hood by using the typeof function instead of class:

class(nyc_taxi)
[1] "data.frame"
typeof(nyc_taxi)
[1] "list"

This flexibility is the reason functions that return lots of loosely-related results return them as a single list. This includes most functions that perform various statistical tests, such as the lm function.

We can also write our own summary functions and demonstrate this. In section 6.1, we focused on single summaries (such as mean), or multiple related ones (such as quantile), but now we want to write a function that combines different summaries and returns all of them at once. The trick is basically to wrap everything into a list and return the list. The function my.summary shown here is an example of such a function. It consists of mostly of separate but related summaries that are calculated piece-wise and then put together into a list and returned by the function.

my.summary <- function(grp_1, grp_2, resp) {
  # `grp_1` and `grp_2` are `character` or `factor` columns
  # `resp` is a numeric column

  mean <- mean(resp, na.rm = TRUE) # mean
  sorted_resp <- sort(resp)
  n <- length(resp)
  mean_minus_top = mean(sorted_resp[1:(n-19)], na.rm = TRUE) # average after throwing out highest 20 values

  tt_1 <- table(grp_1, grp_2) # the total count
  ptt_1 <- prop.table(tt_1, 1) # proportions for each level of the response
  ptt_2 <- prop.table(tt_1, 2) # proportions for each level of the response

  tt_2 <- tapply(resp, list(grp_1, grp_2), mean, na.rm = TRUE)

  # return everything as a list:
  list(mean = mean, 
       trimmed_mean = mean_minus_top,
       row_proportions = ptt_1,
       col_proportions = ptt_2,
       average_by_group = tt_2
  )
}

my.summary(nyc_taxi$pickup_dow, nyc_taxi$pickup_hour, nyc_taxi$tip_amount) # test the function
$mean
[1] 2.1

$trimmed_mean
[1] 2.1

$row_proportions
     grp_2
grp_1 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
  Sun  0.1183  0.0796   0.1498   0.2099  0.1056   0.1818   0.1550
  Mon  0.0360  0.1849   0.1484   0.2002  0.1194   0.2331   0.0780
  Tue  0.0295  0.1855   0.1427   0.1907  0.1157   0.2539   0.0820
  Wed  0.0326  0.1878   0.1421   0.1833  0.1092   0.2536   0.0913
  Thu  0.0407  0.1784   0.1402   0.1796  0.1066   0.2492   0.1054
  Fri  0.0468  0.1677   0.1363   0.1744  0.1067   0.2451   0.1230
  Sat  0.0925  0.0872   0.1393   0.1920  0.1063   0.2239   0.1588

$col_proportions
     grp_2
grp_1 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
  Sun  0.2809  0.0710   0.1427   0.1503  0.1306   0.1051   0.1833
  Mon  0.0797  0.1541   0.1320   0.1338  0.1379   0.1258   0.0860
  Tue  0.0701  0.1660   0.1362   0.1368  0.1435   0.1471   0.0972
  Wed  0.0796  0.1723   0.1392   0.1349  0.1389   0.1507   0.1110
  Thu  0.1064  0.1755   0.1471   0.1417  0.1453   0.1588   0.1374
  Fri  0.1263  0.1702   0.1476   0.1419  0.1501   0.1611   0.1654
  Sat  0.2569  0.0910   0.1552   0.1607  0.1537   0.1513   0.2196

$average_by_group
    1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun    2.08    2.05     1.86     2.09    2.08     2.10     2.08
Mon    2.44    2.06     2.06     2.11    2.06     2.13     2.34
Tue    2.37    2.02     2.10     2.15    2.11     2.14     2.34
Wed    2.33    2.00     2.13     2.28    2.21     2.18     2.29
Thu    2.32    2.02     2.12     2.28    2.20     2.19     2.29
Fri    2.29    2.06     2.09     2.24    2.13     2.05     2.19
Sat    2.12    1.97     1.77     1.87    1.85     1.90     2.08

Looking at the above result, we can see that something went wrong with the trimmed mean: the trimmed mean and the mean appear to be the same, which is very unlikely. It's not obvious what the bug is. Take a moment and try find out what the problem is and propose a fix.

One thing that makes it hard to debug the function is that we do not have direct access to its environment. We need a way to "step inside" the function and run it line by line so we can see where the problem is. This is what debug is for.

debug(my.summary) # puts the function in debug mode

Now, anytime we run the function we leave our current "global" environment and step into the function's environment, where we have access to all the local variables in the function as we run the code line-by-line.

my.summary(nyc_taxi$pickup_dow, nyc_taxi$pickup_hour, nyc_taxi$tip_amount)

We start at the beginning, where the only things evaluated are the function's arguments. We can press ENTER to run the next line. After running each line, we can query the object to see if it looks like it should. We can always go back to the global environment by pressing Q and ENTER. If you were unsuccessful at fixing the bug earlier, take a second stab at it now. (HINT: it has something to do with NAs.)

Once we resolve the issue, we run undebug so the function can now run normally.

undebug(my.summary)

To run my.summary on multiple numeric columns at once, we can use lapply:

res <- lapply(nyc_taxi[ , trip_metrics], my.summary, grp_1 = nyc_taxi$pickup_dow, grp_2 = nyc_taxi$pickup_hour)

res is just a nested list and we can 'drill into' any individual piece we want with the right query. At the first level are the column names.

res$tip_amount
$mean
[1] 2.1

$trimmed_mean
[1] 2.1

$row_proportions
     grp_2
grp_1 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
  Sun  0.1183  0.0796   0.1498   0.2099  0.1056   0.1818   0.1550
  Mon  0.0360  0.1849   0.1484   0.2002  0.1194   0.2331   0.0780
  Tue  0.0295  0.1855   0.1427   0.1907  0.1157   0.2539   0.0820
  Wed  0.0326  0.1878   0.1421   0.1833  0.1092   0.2536   0.0913
  Thu  0.0407  0.1784   0.1402   0.1796  0.1066   0.2492   0.1054
  Fri  0.0468  0.1677   0.1363   0.1744  0.1067   0.2451   0.1230
  Sat  0.0925  0.0872   0.1393   0.1920  0.1063   0.2239   0.1588

$col_proportions
     grp_2
grp_1 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
  Sun  0.2809  0.0710   0.1427   0.1503  0.1306   0.1051   0.1833
  Mon  0.0797  0.1541   0.1320   0.1338  0.1379   0.1258   0.0860
  Tue  0.0701  0.1660   0.1362   0.1368  0.1435   0.1471   0.0972
  Wed  0.0796  0.1723   0.1392   0.1349  0.1389   0.1507   0.1110
  Thu  0.1064  0.1755   0.1471   0.1417  0.1453   0.1588   0.1374
  Fri  0.1263  0.1702   0.1476   0.1419  0.1501   0.1611   0.1654
  Sat  0.2569  0.0910   0.1552   0.1607  0.1537   0.1513   0.2196

$average_by_group
    1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
Sun    2.08    2.05     1.86     2.09    2.08     2.10     2.08
Mon    2.44    2.06     2.06     2.11    2.06     2.13     2.34
Tue    2.37    2.02     2.10     2.15    2.11     2.14     2.34
Wed    2.33    2.00     2.13     2.28    2.21     2.18     2.29
Thu    2.32    2.02     2.12     2.28    2.20     2.19     2.29
Fri    2.29    2.06     2.09     2.24    2.13     2.05     2.19
Sat    2.12    1.97     1.77     1.87    1.85     1.90     2.08
res$tip_amount$col_proportions # the next level has the statistics that the function outputs.
     grp_2
grp_1 1AM-5AM 5AM-9AM 9AM-12PM 12PM-4PM 4PM-6PM 6PM-10PM 10PM-1AM
  Sun  0.2809  0.0710   0.1427   0.1503  0.1306   0.1051   0.1833
  Mon  0.0797  0.1541   0.1320   0.1338  0.1379   0.1258   0.0860
  Tue  0.0701  0.1660   0.1362   0.1368  0.1435   0.1471   0.0972
  Wed  0.0796  0.1723   0.1392   0.1349  0.1389   0.1507   0.1110
  Thu  0.1064  0.1755   0.1471   0.1417  0.1453   0.1588   0.1374
  Fri  0.1263  0.1702   0.1476   0.1419  0.1501   0.1611   0.1654
  Sat  0.2569  0.0910   0.1552   0.1607  0.1537   0.1513   0.2196
res$tip_amount$col_proportions["Mon", "9AM-12PM"]
[1] 0.132

Since res contains a lot of summaries, it might be a good idea to save it using save. Any R object can be saved with save. This way if our R session crashes we can reload the object with the load function.

save(res, file = "res.RData") # save this result
rm(res) # it is now safe to delete `res` from the current session
load(file = "res.RData") # we can use `load` to reopen it anytime we need it again

results matching ""

    No results matching ""