Data summary with dplyr

When it comes to summarizing data, we have a lot of options. We covered just a few in the last section, but there are many more functions both in base R and packages. We will cover dplyr in this section, as an example of a third-party package. What makes dplyr very popular is the simple and streight-farward notation for creating increasing complex data pipelines.

First let's review important functions in dplyr: filter, mutate, transmute, group_by, select, slice, summarize, distinct, arrange, rename, inner_join, outer_join, left_join. With each of the above function, we can either pass the data directly to the function or infer it from the the pipeline. Here's an example of filter being used in both ways. In the first case we pass the data as the first argument to filter.

library(dplyr)
head(filter(nyc_taxi, fare_amount > 500)) # pass data directly to the function
      pickup_datetime    dropoff_datetime passenger_count trip_distance
1 2015-01-22 14:18:00 2015-01-22 14:18:00               1             0
2 2015-01-18 13:46:00 2015-01-18 13:46:00               1             0
3 2015-01-06 16:44:08 2015-01-06 16:44:47               1             0
4 2015-01-08 11:47:20 2015-01-08 11:47:56               1             0
5 2015-01-23 14:09:00 2015-01-23 14:09:00               1             0
6 2015-02-05 14:41:00 2015-02-05 14:41:00               1             0
  pickup_longitude pickup_latitude rate_code_id dropoff_longitude dropoff_latitude
1                0              NA   negotiated                NA               NA
2                0              NA   negotiated                NA               NA
3              -74            40.7   negotiated               -74             40.7
4              -74            40.7   negotiated               -74             40.7
5                0              NA   negotiated                NA               NA
6                0              NA   negotiated                NA               NA
  payment_type fare_amount extra mta_tax tip_amount tolls_amount
1         card         563     0       0          0            0
2         card         560     0       0          0            0
3         card         900     0       0          0            0
4         card         900     0       0          0            0
5         card         570     0       0          0            0
6         card         673     0       0          0            0
  improvement_surcharge total_amount pickup_hour pickup_dow dropoff_hour
1                   0.0          563    12PM-4PM        Thu     12PM-4PM
2                   0.0          560    12PM-4PM        Sun     12PM-4PM
3                   0.3          900    12PM-4PM        Tue     12PM-4PM
4                   0.3          900    9AM-12PM        Thu     9AM-12PM
5                   0.0          570    12PM-4PM        Fri     12PM-4PM
6                   0.0          673    12PM-4PM        Thu     12PM-4PM
  dropoff_dow trip_duration    pickup_nhood   dropoff_nhood tip_percent
1         Thu             0            <NA>            <NA>           0
2         Sun             0            <NA>            <NA>           0
3         Tue            39    Little Italy    Little Italy           0
4         Thu            36 Lower East Side Lower East Side           0
5         Fri             0            <NA>            <NA>           0
6         Thu             0            <NA>            <NA>           0

In the second case, we start a pipeline with the data, followed by the piping function %>%, followed by filter which now inherits the data from the previous step and only needs the filtering condition.

nyc_taxi %>% filter(fare_amount > 500) %>% head # infer the data from the pipeline

Piping is especially useful for longer pipelines. Here's an example of a query without piping.

summarize( # (3)
  group_by( # (2)
    filter(nyc_taxi, fare_amount > 500), # (1)
    payment_type), 
  ave_duration = mean(trip_duration), ave_distance = mean(trip_distance))
# A tibble: 3 x 3
  payment_type ave_duration ave_distance
        <fctr>        <dbl>        <dbl>
1         card         14.5          0.0
2         cash        630.0          2.5
3           NA         12.0          0.0

To understand the query, we need to work from the inside out:

  1. First filter the data to show only fare amounts above $500
  2. Group the resulting data by payment type
  3. For each group find average trip duration and trip distance

The same query, using piping, looks like this:

nyc_taxi %>%
  filter(fare_amount > 500) %>% # (1)
  group_by(payment_type) %>% # (2)
  summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) # (3)
# A tibble: 3 x 3
  payment_type ave_duration ave_distance
        <fctr>        <dbl>        <dbl>
1         card         14.5          0.0
2         cash        630.0          2.5
3           NA         12.0          0.0

Instead of working from the inside out, piping allows us to read the code from top to bottom. This makes it easier (1) to understand what the query does and (2) to build upon the query.

The best way to learn dplyr is by example. So instead of covering functions one by one, we state some interesting queries and use dplyr to implement them. There are obvious parallels between dplyr and the SQL language, but important differences exist too. We point out some of those differences along the way.

results matching ""

    No results matching ""