Data summary with dplyr
When it comes to summarizing data, we have a lot of options. We covered just a few in the last section, but there are many more functions both in base
R and packages. We will cover dplyr
in this section, as an example of a third-party package. What makes dplyr
very popular is the simple and streight-farward notation for creating increasing complex data pipelines.
First let's review important functions in dplyr
: filter
, mutate
, transmute
, group_by
, select
, slice
, summarize
, distinct
, arrange
, rename
, inner_join
, outer_join
, left_join
. With each of the above function, we can either pass the data directly to the function or infer it from the the pipeline. Here's an example of filter
being used in both ways. In the first case we pass the data as the first argument to filter
.
library(dplyr)
head(filter(nyc_taxi, fare_amount > 500)) # pass data directly to the function
pickup_datetime dropoff_datetime passenger_count trip_distance
1 2015-01-22 14:18:00 2015-01-22 14:18:00 1 0
2 2015-01-18 13:46:00 2015-01-18 13:46:00 1 0
3 2015-01-06 16:44:08 2015-01-06 16:44:47 1 0
4 2015-01-08 11:47:20 2015-01-08 11:47:56 1 0
5 2015-01-23 14:09:00 2015-01-23 14:09:00 1 0
6 2015-02-05 14:41:00 2015-02-05 14:41:00 1 0
pickup_longitude pickup_latitude rate_code_id dropoff_longitude dropoff_latitude
1 0 NA negotiated NA NA
2 0 NA negotiated NA NA
3 -74 40.7 negotiated -74 40.7
4 -74 40.7 negotiated -74 40.7
5 0 NA negotiated NA NA
6 0 NA negotiated NA NA
payment_type fare_amount extra mta_tax tip_amount tolls_amount
1 card 563 0 0 0 0
2 card 560 0 0 0 0
3 card 900 0 0 0 0
4 card 900 0 0 0 0
5 card 570 0 0 0 0
6 card 673 0 0 0 0
improvement_surcharge total_amount pickup_hour pickup_dow dropoff_hour
1 0.0 563 12PM-4PM Thu 12PM-4PM
2 0.0 560 12PM-4PM Sun 12PM-4PM
3 0.3 900 12PM-4PM Tue 12PM-4PM
4 0.3 900 9AM-12PM Thu 9AM-12PM
5 0.0 570 12PM-4PM Fri 12PM-4PM
6 0.0 673 12PM-4PM Thu 12PM-4PM
dropoff_dow trip_duration pickup_nhood dropoff_nhood tip_percent
1 Thu 0 <NA> <NA> 0
2 Sun 0 <NA> <NA> 0
3 Tue 39 Little Italy Little Italy 0
4 Thu 36 Lower East Side Lower East Side 0
5 Fri 0 <NA> <NA> 0
6 Thu 0 <NA> <NA> 0
In the second case, we start a pipeline with the data, followed by the piping function %>%
, followed by filter
which now inherits the data from the previous step and only needs the filtering condition.
nyc_taxi %>% filter(fare_amount > 500) %>% head # infer the data from the pipeline
Piping is especially useful for longer pipelines. Here's an example of a query without piping.
summarize( # (3)
group_by( # (2)
filter(nyc_taxi, fare_amount > 500), # (1)
payment_type),
ave_duration = mean(trip_duration), ave_distance = mean(trip_distance))
# A tibble: 3 x 3
payment_type ave_duration ave_distance
<fctr> <dbl> <dbl>
1 card 14.5 0.0
2 cash 630.0 2.5
3 NA 12.0 0.0
To understand the query, we need to work from the inside out:
- First filter the data to show only fare amounts above $500
- Group the resulting data by payment type
- For each group find average trip duration and trip distance
The same query, using piping, looks like this:
nyc_taxi %>%
filter(fare_amount > 500) %>% # (1)
group_by(payment_type) %>% # (2)
summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) # (3)
# A tibble: 3 x 3
payment_type ave_duration ave_distance
<fctr> <dbl> <dbl>
1 card 14.5 0.0
2 cash 630.0 2.5
3 NA 12.0 0.0
Instead of working from the inside out, piping allows us to read the code from top to bottom. This makes it easier (1) to understand what the query does and (2) to build upon the query.
The best way to learn dplyr
is by example. So instead of covering functions one by one, we state some interesting queries and use dplyr
to implement them. There are obvious parallels between dplyr
and the SQL language, but important differences exist too. We point out some of those differences along the way.