Data summary in base R

One of the most important sets of functions in base R is the apply family: we learned about apply earlier, and in this section we learn about sapply, lapply, and tapply (there are more of them, but we won't cover them all).

  • We already learned how apply runs a summary function across any dimension of an array.
  • sapply and lapply allow us to apply a summary function to multiple columns of the data at once. Using them means we can type less and avoid writing loops.
  • tapply is used to run a summary function on a column of the data, but group the result by other columns of the data.

Say we were interested in obtaining summary statistics for all the columns listed in the vector trip_metrics:

trip_metrics <- c('passenger_count', 'trip_distance', 'fare_amount', 'tip_amount', 'trip_duration', 'tip_percent')

We can use either sapply or lapply for this task. The two functions have identical syntax; the difference is in the type of output they return. Let's first look at sapply, which generally organizes the results in a tidy format (usually a vector or a matrix):

s_res <- sapply(nyc_taxi[ , trip_metrics], mean)
s_res
passenger_count   trip_distance     fare_amount      tip_amount   trip_duration 
           1.68           15.31           12.71              NA              NA 
    tip_percent 
             NA

One of the great advantages of the apply family of functions is that, in addition to the summary function itself, we can pass along any additional argument that function takes. Notice how we pass na.rm = TRUE to sapply here: sapply hands it on to mean, which then removes missing values from each column before computing the average.

s_res <- sapply(nyc_taxi[ , trip_metrics], mean, na.rm = TRUE)
s_res
passenger_count   trip_distance     fare_amount      tip_amount   trip_duration 
           1.68           15.31           12.71            2.10          929.62 
    tip_percent 
          13.87
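To make the argument forwarding concrete without needing the taxi data, here is a minimal sketch using airquality, a dataset that ships with base R and is not part of the original example. Its Ozone and Solar.R columns contain missing values. The variable names are illustrative only.

```r
# airquality ships with base R; Ozone and Solar.R contain missing values
cols <- c("Ozone", "Solar.R", "Wind", "Temp")

# without na.rm, any column with a missing value has a mean of NA
raw_means <- sapply(airquality[ , cols], mean)

# na.rm = TRUE is not an argument of sapply; it is forwarded to mean()
clean_means <- sapply(airquality[ , cols], mean, na.rm = TRUE)

# the same computation with the forwarding written out explicitly
explicit_means <- sapply(airquality[ , cols], function(x) mean(x, na.rm = TRUE))

raw_means
clean_means
all.equal(clean_means, explicit_means)
```

The anonymous-function form is handy when the extra arguments need to depend on the column, but for a fixed argument such as na.rm the shorter form is usually easier to read.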

The object sapply returns in this case is a vector: mean is a summary function that returns a single number, and sapply applies mean to multiple columns, returning a named vector with the means as its elements and the original column names preserved. Because s_res is a named vector, we can query it by name:

s_res["passenger_count"] # we can query the result object by name
passenger_count 
           1.68
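sapply returns a vector here only because mean returns a single number per column. As a rough illustration of the matrix case mentioned earlier, the sketch below uses quantile, which returns several numbers per column, on the built-in airquality data (an assumption made for the example; it is not the taxi data).

```r
# two columns of the built-in airquality data with no missing values
cols <- c("Wind", "Temp")

# quantile() returns three numbers per column, so sapply builds a matrix:
# one row per quantile, one column per input column
q_res <- sapply(airquality[ , cols], quantile, probs = c(0.25, 0.5, 0.75))

is.matrix(q_res)
dim(q_res)
q_res

# a matrix is queried by row name and column name
q_res["50%", "Temp"]
```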

Now let's see what lapply does: unlike sapply, lapply makes no attempt to organize the results. Instead, it always returns a list as its output. A list is a very "flexible" data type, in that anything can be "dumped" into it.

l_res <- lapply(nyc_taxi[ , trip_metrics], mean)
l_res
$passenger_count
[1] 1.68

$trip_distance
[1] 15.3

$fare_amount
[1] 12.7

$tip_amount
[1] NA

$trip_duration
[1] NA

$tip_percent
[1] NA

In this case, we can 'flatten' the list with the unlist function to get the same result as sapply.

unlist(l_res) # this 'flattens' the `list` and returns what `sapply` returns
passenger_count   trip_distance     fare_amount      tip_amount   trip_duration 
           1.68           15.31           12.71              NA              NA 
    tip_percent 
             NA
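The equivalence between flattening an lapply result and calling sapply directly can be checked on any data frame. Here is a small sketch on the built-in airquality data (chosen only for illustration), under the assumption that the function returns a single value per column:

```r
cols <- c("Wind", "Temp")

l_demo <- lapply(airquality[ , cols], mean)  # a list with one mean per column
s_demo <- sapply(airquality[ , cols], mean)  # a named numeric vector

# flattening the list gives the same named vector that sapply returns
identical(unlist(l_demo), s_demo)
```

When the function returns more than one value per column, unlist still produces a flat vector while sapply builds a matrix, so the two results no longer match.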

Querying a list is a bit more complicated. We can use a single bracket to query a list, but the object returned is still a list. In other words, a single bracket gives us a sublist.

l_res["passenger_count"] # this is still a `list`
$passenger_count
[1] 1.68

If we want to return the object itself, we use two brackets.

l_res[["passenger_count"]] # this is the average count itself
[1] 1.68
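A quick way to see the difference between the two kinds of brackets is to inspect the class of what each returns. The sketch below builds a small list from the built-in airquality data; the object names are illustrative and not part of the original example.

```r
demo_list <- lapply(airquality[ , c("Wind", "Temp")], mean)

sub_list <- demo_list["Wind"]    # single bracket: a list of length 1
value    <- demo_list[["Wind"]]  # double bracket: the number stored inside

class(sub_list)  # "list"
class(value)     # "numeric"

# `$` is shorthand for double brackets with a name
identical(demo_list$Wind, value)

# arithmetic works on the extracted number, not on the sublist
value + 1
```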

The above distinction is not very important when all we want to do is look at the result. But when we need to perform more computations on the results we obtained, the distinction is crucial. For example, recall that both s_res and l_res store column averages for the data. Say now that we wanted to take the average for passenger count and add 1 to it, so that the count includes the driver too. With s_res we do the following:

s_res["passenger_count"] <- s_res["passenger_count"] + 1
s_res
passenger_count   trip_distance     fare_amount      tip_amount   trip_duration 
           2.68           15.31           12.71            2.10          929.62 
    tip_percent 
          13.87

With l_res, using a single bracket fails because l_res["passenger_count"] is still a list, and we can't add 1 to a list.

l_res["passenger_count"] <- l_res["passenger_count"] + 1
Error in l_res["passenger_count"] + 1 : 
  non-numeric argument to binary operator

So we need to use two brackets to perform the same operation on l_res.

l_res[["passenger_count"]] <- l_res[["passenger_count"]] + 1
l_res
$passenger_count
[1] 2.68

$trip_distance
[1] 15.3

$fare_amount
[1] 12.7

$tip_amount
[1] NA

$trip_duration
[1] NA

$tip_percent
[1] NA

Let's now look at our last function in the apply family, namely tapply. We use tapply to apply a function to a column, but group the results by the values of other columns.

tapply(nyc_taxi$tip_amount, nyc_taxi$pickup_nhood, mean, trim = 0.1, na.rm = TRUE) # trimmed average tip, by pickup neighborhood
       West Village        East Village        Battery Park       Carnegie Hill 
               1.70                1.71                2.29                1.49 
           Gramercy                Soho         Murray Hill        Little Italy 
               1.59                1.75                1.61                1.72 
       Central Park   Greenwich Village             Midtown Morningside Heights 
               1.50                1.61                1.62                1.72 
...
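For a version of the same pattern that runs without the taxi data, here is a minimal sketch on the built-in airquality dataset (an assumption made for illustration), averaging ozone readings by month:

```r
# mean ozone reading for each month; na.rm = TRUE is forwarded to mean()
ozone_by_month <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ozone_by_month

# the group values become the names of the result
names(ozone_by_month)

# one group can be pulled out by name; the name must be quoted
ozone_by_month[["7"]]
```

tapply takes the data column first, the grouping column second, and the function third, with any further arguments passed on to that function.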

We can group the results by pickup and dropoff neighborhood pairs by combining those two columns into one. For example, the paste function concatenates the pick-up and drop-off neighborhoods into a single string. The result is a flat vector with one element for each pick-up and drop-off neighborhood combination that occurs in the data.

flat_array <- tapply(nyc_taxi$tip_amount, 
           paste(nyc_taxi$pickup_nhood, nyc_taxi$dropoff_nhood, sep = " to "), 
           mean, trim = 0.1, na.rm = TRUE)

head(flat_array)
 Battery Park to Battery Park Battery Park to Carnegie Hill 
                         1.07                          4.39 
 Battery Park to Central Park       Battery Park to Chelsea 
                         3.55                          2.10 
    Battery Park to Chinatown       Battery Park to Clinton 
                         1.50                          2.37

By putting both grouping columns in a list we can get an array (a 2D array or matrix in this case) instead of the flat vector we got earlier.

square_array <- tapply(nyc_taxi$tip_amount, 
           list(nyc_taxi$pickup_nhood, nyc_taxi$dropoff_nhood), 
           mean, trim = 0.1, na.rm = TRUE)

square_array[1:5, 1:5]
              West Village East Village Battery Park Carnegie Hill Gramercy
West Village         0.975        1.635         1.59         3.491     1.64
East Village         1.559        0.988         2.28         2.918     1.23
Battery Park         1.574        2.417         1.07         4.392     2.84
Carnegie Hill        3.724        3.281         4.49         0.922     2.64
Gramercy             1.559        1.276         2.53         2.354     1.06
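The two grouping styles differ in more than shape. The sketch below, which uses the built-in mtcars data rather than the taxi data (an assumption made for illustration), shows that pasting the keys together keeps only the combinations that occur, while grouping by a list lays out every combination and fills the empty ones with NA.

```r
# flat result: one element per (cyl, gear) combination observed in the data
flat <- tapply(mtcars$mpg, paste(mtcars$cyl, mtcars$gear, sep = " x "), mean)

# square result: one cell for every cyl value crossed with every gear value
square <- tapply(mtcars$mpg, list(cyl = mtcars$cyl, gear = mtcars$gear), mean)

length(flat)   # number of observed combinations
dim(square)    # rows are cyl values, columns are gear values
square

# mtcars has no 8-cylinder car with 4 gears, so this cell is NA
square["8", "4"]
```

The matrix form is convenient for looking up a pair by row and column name, while the flat form avoids carrying around cells for pairs that never occur.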
