Data summary in base R
One of the most important sets of functions in base R is the `apply` family of functions: we learned about `apply` earlier, and in this section we learn about `sapply`, `lapply`, and `tapply` (there are more of them, but we won't cover them all).
- We already learned how `apply` runs a summary function across any dimension of an `array`.
- `sapply` and `lapply` allow us to apply a summary function to multiple columns of the data at once. Using them means we can type less and avoid writing loops.
- `tapply` is used to run a summary function on a column of the data, but group the result by other columns of the data.
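To make the "avoid writing loops" point concrete, here is a minimal sketch that computes column means twice, once with an explicit loop and once with `sapply`. The data frame `toy` and its columns are hypothetical stand-ins, not part of the `nyc_taxi` data.

```r
# Hypothetical toy data (not the nyc_taxi data used in this section)
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30), c = c(5, 5, 5))

# Explicit loop: pre-allocate a named result, then fill it column by column
loop_res <- numeric(ncol(toy))
names(loop_res) <- names(toy)
for (col in names(toy)) {
  loop_res[col] <- mean(toy[[col]])
}

# The same computation with sapply, in a single call
sapply_res <- sapply(toy, mean)

# Both approaches give the same named vector of column means
all.equal(loop_res, sapply_res)
```

The loop needs a pre-allocated result and explicit bookkeeping of names, while `sapply` handles both for us.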
Say we were interested in obtaining summary statistics for all the columns listed in the vector `trip_metrics`:
trip_metrics <- c('passenger_count', 'trip_distance', 'fare_amount', 'tip_amount', 'trip_duration', 'tip_percent')
We can use either `sapply` or `lapply` for this task. In fact, `sapply` and `lapply` have identical syntax; the difference is in the type of output they return. Let's first look at `sapply`, which generally organizes the results in a tidy format (usually a vector or a matrix):
s_res <- sapply(nyc_taxi[ , trip_metrics], mean)
s_res
passenger_count trip_distance fare_amount tip_amount trip_duration
1.68 15.31 12.71 NA NA
tip_percent
NA
One of the great advantages of the `apply` family of functions is that, in addition to the summary function itself, we can pass along any additional argument that the summary function takes. Notice how we pass `na.rm = TRUE` to `sapply` here so that missing values are removed from the data before the means are computed.
s_res <- sapply(nyc_taxi[ , trip_metrics], mean, na.rm = TRUE)
s_res
passenger_count trip_distance fare_amount tip_amount trip_duration
1.68 15.31 12.71 2.10 929.62
tip_percent
13.87
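As a small illustration of how the extra argument travels, the sketch below shows that passing `na.rm = TRUE` through `sapply` behaves the same as wrapping `mean` in an anonymous function. The data frame `toy` is a hypothetical example, not part of `nyc_taxi`.

```r
# Hypothetical toy data with one missing value in each column
toy <- data.frame(x = c(1, 2, NA), y = c(4, NA, 6))

# Without na.rm, a single NA makes the whole mean NA
no_rm <- sapply(toy, mean)

# Extra arguments after the function name are handed on to mean()
with_dots <- sapply(toy, mean, na.rm = TRUE)

# Equivalent call that spells the argument out inside an anonymous function
with_wrapper <- sapply(toy, function(col) mean(col, na.rm = TRUE))

no_rm       # both entries are NA
with_dots   # x = 1.5, y = 5
identical(with_dots, with_wrapper)
```

The first form is shorter; the anonymous-function form is useful when the extra argument needs to depend on the column itself.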
The object `sapply` returns in this case is a vector: `mean` is a summary function that returns a single number, and `sapply` applies `mean` to multiple columns, returning a named vector with the means as its elements and the original column names preserved. Because `s_res` is a named vector, we can query it by name:
s_res["passenger_count"] # we can query the result object by name
passenger_count
1.68
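When the summary function returns more than one number per column, `sapply` typically arranges the results as a matrix rather than a vector. The following sketch shows this with `range`, using a hypothetical data frame `toy` in place of `nyc_taxi`.

```r
# Hypothetical toy data (not the nyc_taxi data)
toy <- data.frame(a = c(1, 2, 3), b = c(10, 20, 30))

# range() returns two numbers (min and max) for each column,
# so sapply binds the results into a matrix with one column per input column
m_res <- sapply(toy, range)

is.matrix(m_res)   # TRUE
dim(m_res)         # 2 rows (min, max) by 2 columns (a, b)
m_res[ , "b"]      # columns can still be queried by name: 10 30
```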
Now let's see what `lapply` does: unlike `sapply`, `lapply` makes no attempt to organize the results. Instead, it always returns a `list` as its output. A `list` is a very flexible data type: its elements can be objects of any type and any length.
l_res <- lapply(nyc_taxi[ , trip_metrics], mean)
l_res
$passenger_count
[1] 1.68
$trip_distance
[1] 15.3
$fare_amount
[1] 12.7
$tip_amount
[1] NA
$trip_duration
[1] NA
$tip_percent
[1] NA
In this case, we can 'flatten' the `list` with the `unlist` function to get the same result as `sapply`.
unlist(l_res) # this 'flattens' the `list` and returns what `sapply` returns
passenger_count trip_distance fare_amount tip_amount trip_duration
1.68 15.31 12.71 NA NA
tip_percent
NA
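Flattening works neatly here because every element of the `list` is a single number. The sketch below, built on a hypothetical input `toy`, shows a case where the results have different lengths: `lapply` stores them without complaint, and `sapply` cannot simplify them, so it also returns a `list`.

```r
# Hypothetical input: two numeric vectors of different lengths
toy <- list(a = c(1, 2, 2, 3), b = c(5, 5, 5))

# unique() returns 3 values for a and 1 value for b
u_res <- lapply(toy, unique)
lengths(u_res)   # a = 3, b = 1

# With results of unequal length, sapply falls back to returning a list
s_try <- sapply(toy, unique)
is.list(s_try)   # TRUE

# unlist() still flattens, but element boundaries are no longer obvious
unlist(u_res)
```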
Querying a `list` is a bit more complicated. We can use one bracket to query a `list`, but the returned object is still a `list`. In other words, with a single bracket we get a sublist.
l_res["passenger_count"] # this is still a `list`
$passenger_count
[1] 1.68
If we want to return the object itself, we use two brackets.
l_res[["passenger_count"]] # this is the average count itself
[1] 1.68
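One way to see the difference between the two kinds of brackets is to check the class of what each returns. This is a minimal sketch on a hand-typed `list` called `toy_list` (a hypothetical stand-in for `l_res`, with values copied from the output above).

```r
# Hypothetical stand-in for l_res, typed by hand
toy_list <- list(passenger_count = 1.68, trip_distance = 15.31)

single <- toy_list["passenger_count"]     # a list of length 1
double <- toy_list[["passenger_count"]]   # the number stored inside

class(single)   # "list"
class(double)   # "numeric"

# For named elements, $ extracts the element itself, much like [[ ]]
toy_list$passenger_count
```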
The above distinction is not very important when all we want to do is look at the result. But when we need to perform more computations on the results we obtained, the distinction is crucial. For example, recall that both `s_res` and `l_res` store column averages for the data. Say now that we wanted to take the average for passenger count and add 1 to it, so that the count includes the driver too. With `s_res` we do the following:
s_res["passenger_count"] <- s_res["passenger_count"] + 1
s_res
passenger_count trip_distance fare_amount tip_amount trip_duration
2.68 15.31 12.71 2.10 929.62
tip_percent
13.87
With `l_res`, using a single bracket fails, because `l_res["passenger_count"]` is still a `list` and we can't add 1 to a `list`.
l_res["passenger_count"] <- l_res["passenger_count"] + 1
Error in l_res["passenger_count"] + 1 :
non-numeric argument to binary operator
So we need to use two brackets to perform the same operation on `l_res`.
l_res[["passenger_count"]] <- l_res[["passenger_count"]] + 1
l_res
$passenger_count
[1] 2.68
$trip_distance
[1] 15.3
$fare_amount
[1] 12.7
$tip_amount
[1] NA
$trip_duration
[1] NA
$tip_percent
[1] NA
Let's now look at the last function in the `apply` family that we cover, namely `tapply`. We use `tapply` to apply a function to a column, but group the results by the values of other columns.
tapply(nyc_taxi$tip_amount, nyc_taxi$pickup_nhood, mean, trim = 0.1, na.rm = TRUE) # trimmed average tip, by pickup neighborhood
West Village East Village Battery Park Carnegie Hill
1.70 1.71 2.29 1.49
Gramercy Soho Murray Hill Little Italy
1.59 1.75 1.61 1.72
Central Park Greenwich Village Midtown Morningside Heights
1.50 1.61 1.62 1.72
...
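To see what `trim = 0.1` and `na.rm = TRUE` each contribute, here is a small sketch on made-up data. The vectors `tips` and `nhood` are hypothetical and only mimic the shape of the `tip_amount` and `pickup_nhood` columns.

```r
# Hypothetical data: group "A" has one large outlier, group "B" has a missing value
tips  <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100, 2, 2, 2, 2, NA)
nhood <- c(rep("A", 10), rep("B", 5))

# Plain mean by group: the outlier pulls the average for A up to 14.5
plain <- tapply(tips, nhood, mean, na.rm = TRUE)

# Trimmed mean by group: for A, the lowest and highest 10% of values
# (one value on each side here) are dropped, leaving the mean of 2..9, i.e. 5.5
trimmed <- tapply(tips, nhood, mean, trim = 0.1, na.rm = TRUE)

plain     # A = 14.5, B = 2
trimmed   # A = 5.5,  B = 2
```

As with `sapply`, the arguments after the function name (`trim` and `na.rm`) are handed on to `mean`.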
We can group the results by pickup and dropoff neighborhood pairs by combining those two columns into one. For example, the `paste` function concatenates the pickup and dropoff neighborhoods into a single string. The result is a flat vector with one element for each pickup and dropoff neighborhood combination that occurs in the data.
flat_array <- tapply(nyc_taxi$tip_amount,
                     paste(nyc_taxi$pickup_nhood, nyc_taxi$dropoff_nhood, sep = " to "),
                     mean, trim = 0.1, na.rm = TRUE)
head(flat_array)
Battery Park to Battery Park Battery Park to Carnegie Hill
1.07 4.39
Battery Park to Central Park Battery Park to Chelsea
3.55 2.10
Battery Park to Chinatown Battery Park to Clinton
1.50 2.37
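The same idea can be traced by hand on a few rows. In this sketch the vectors `pickup`, `dropoff`, and `tips` are hypothetical, and a plain `mean` is used to keep the arithmetic easy to follow.

```r
# Hypothetical trips: five rows, four distinct pickup/dropoff pairs
pickup  <- c("A", "A", "B", "B", "A")
dropoff <- c("A", "B", "A", "B", "B")
tips    <- c(1, 2, 3, 4, 6)

# Build one grouping key per trip, e.g. "A to B"
route <- paste(pickup, dropoff, sep = " to ")

# One mean per distinct key; "A to B" averages the tips 2 and 6
flat <- tapply(tips, route, mean)

names(flat)   # "A to A" "A to B" "B to A" "B to B"
flat          # 1, 4, 3, 4
```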
By putting both grouping columns in a `list`, we can get an `array` (a 2D `array`, or `matrix`, in this case) instead of the flat vector we got earlier.
square_array <- tapply(nyc_taxi$tip_amount,
                       list(nyc_taxi$pickup_nhood, nyc_taxi$dropoff_nhood),
                       mean, trim = 0.1, na.rm = TRUE)
square_array[1:5, 1:5]
West Village East Village Battery Park Carnegie Hill Gramercy
West Village 0.975 1.635 1.59 3.491 1.64
East Village 1.559 0.988 2.28 2.918 1.23
Battery Park 1.574 2.417 1.07 4.392 2.84
Carnegie Hill 3.724 3.281 4.49 0.922 2.64
Gramercy 1.559 1.276 2.53 2.354 1.06
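A practical difference from the flat version is that the matrix has a cell for every pickup and dropoff pair, including pairs that never occur in the data. The sketch below uses hypothetical vectors `pickup`, `dropoff`, and `tips` to show that such cells are filled with `NA`.

```r
# Hypothetical trips: no trip goes from "B" to "B"
pickup  <- c("A", "A", "B", "A")
dropoff <- c("A", "B", "A", "B")
tips    <- c(1, 2, 3, 6)

# Rows are pickup groups, columns are dropoff groups
sq <- tapply(tips, list(pickup, dropoff), mean)

dim(sq)        # 2 2
sq["A", "B"]   # mean of 2 and 6, i.e. 4
sq["B", "B"]   # NA, because this combination has no observations
```

Indexing with `sq[pickup_name, dropoff_name]` is usually more convenient than rebuilding the pasted key, which is one reason to prefer the `list` form when both groupings matter.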