Solutions

(1) Without the pipeline function, we would have arrange as the outermost function:

q1 <- arrange( # (4)
  summarize( # (3)
    group_by( # (2)
      filter(nyc_taxi, fare_amount > 500), # (1)
      payment_type),
    ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)),
  desc(ave_duration))

With the pipeline function, we simply add a pipe to the end of the summarize line and add arrange as a new line at the end of the code:

q1 <- nyc_taxi %>%
  filter(fare_amount > 500) %>% # (1)
  group_by(payment_type) %>% # (2)
  summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) %>% # (3)
  arrange(desc(ave_duration)) # (4)

head(q1)
# A tibble: 3 x 3
  payment_type ave_duration ave_distance
        <fctr>        <dbl>        <dbl>
1         cash        630.0          2.5
2         card         14.5          0.0
3           NA         12.0          0.0

(2) What are the times of the day and the days of the week with the highest fare per mile of ride?

q2 <- nyc_taxi %>%
  filter(trip_distance > 0) %>%
  group_by(pickup_dow, pickup_hour) %>%
  summarize(ave_fare_per_mile = mean(fare_amount / trip_distance, na.rm = TRUE), count = n()) %>%
  group_by() %>% # we 'reset', or remove, the group by, otherwise sorting won't work
  arrange(desc(ave_fare_per_mile))


head(q2)
# A tibble: 6 x 4
  pickup_dow pickup_hour ave_fare_per_mile  count
      <fctr>      <fctr>             <dbl>  <int>
1        Tue    9AM-12PM              7.69  74344
2        Wed    12PM-4PM              7.67  97783
3        Fri    12PM-4PM              7.62 102899
4        Tue    12PM-4PM              7.47  99179
5        Thu    12PM-4PM              7.47 102712
6        Wed    9AM-12PM              7.40  75997

(3) For each pick-up neighborhood, find the number and percentage of trips that "fan out" into other neighborhoods. Sort results by pickup neighborhood and descending percentage. Limit results to top 50 percent coverage. In other words, show only the top 50 percent of destinations for each pick-up neighborhood.

q3 <- nyc_taxi %>%
  filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize(count = n()) %>%
  group_by(pickup_nhood) %>%
  mutate(proportion = prop.table(count),
         cum.prop = order_by(desc(proportion), cumsum(proportion))) %>%
  group_by() %>%
  arrange(pickup_nhood, desc(proportion)) %>%
  group_by(pickup_nhood) %>%
  filter(row_number() < 11 | cum.prop < .50)

head(q3)
Source: local data frame [6 x 5]
Groups: pickup_nhood [1]

  pickup_nhood     dropoff_nhood count proportion cum.prop
        <fctr>            <fctr> <int>      <dbl>    <dbl>
1 West Village           Chelsea 13420     0.1558    0.156
2 West Village           Midtown 10318     0.1198    0.276
3 West Village Greenwich Village  7713     0.0896    0.365
4 West Village          Gramercy  7006     0.0814    0.447
5 West Village  Garment District  4964     0.0576    0.504
...
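
To clarify the mutate step above, here's a small self-contained sketch (using made-up counts, not values from the data) of how prop.table and order_by behave:

library(dplyr)
counts <- c(5, 30, 15)       # hypothetical trip counts for three destinations
p <- prop.table(counts)      # 0.1, 0.6, 0.3 -- each count as a proportion of the total
order_by(desc(p), cumsum(p)) # cumulative proportions accumulated from largest to smallest,
                             # but returned in the original order: 1.0, 0.6, 0.9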

(4) Are any dates missing from the data?

There are many ways to answer this query, and we cover three because each one highlights an important point. The first consists of sorting the data by date and using the lag function to find the difference between each date and the date preceding it. If this difference is ever greater than 1, then we skipped one or more days.

nyc_taxi %>%
  select(pickup_datetime) %>%
  distinct(date = as.Date(pickup_datetime)) %>%
  arrange(date) %>% # this is an important step!
  mutate(diff = date - lag(date)) %>%
  filter(diff > 1)
[1] date diff
<0 rows> (or 0-length row.names)
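
Since no dates turn out to be missing, the query above returns an empty result. To see what a hit would look like, here's a toy illustration of the lag trick (the dates below are made up for illustration, not taken from nyc_taxi):

library(dplyr)
d <- as.Date(c('2015-01-01', '2015-01-02', '2015-01-05'))
d - lag(d) # NA, 1, 3 -- the difference of 3 days flags the gap after 2015-01-02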

The second solution is more involved. First we create a data.frame of all dates available in nyc_taxi.

nyc_taxi %>%
  select(pickup_datetime) %>%
  distinct(date = as.Date(pickup_datetime)) %>%
  filter(!is.na(date)) -> data_dates

Then we create a new data.frame of all dates that span the time range in the data. We can use seq to do that.

start_date <- min(data_dates$date)
end_date <- max(data_dates$date)
all_dates <- data.frame(date = seq(start_date, end_date, by = '1 day'))

Finally, we ask for the "anti-join" of the two datasets. An anti-join is the opposite of a left join: it returns any keys present in the left dataset but not in the right.

anti_join(all_dates, data_dates, by = 'date') # an anti-join is the reverse of a left join
[1] date
<0 rows> (or 0-length row.names)
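
As a small self-contained illustration of what an anti-join returns (toy data, not from nyc_taxi):

library(dplyr)
left_dates  <- data.frame(date = as.Date('2015-01-01') + 0:4)           # five consecutive dates
right_dates <- data.frame(date = as.Date('2015-01-01') + c(0, 1, 3, 4)) # 2015-01-03 is missing
anti_join(left_dates, right_dates, by = 'date') # rows of the left dataset with no match in the right
        date
1 2015-01-03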

The third solution consists of comparing the number of distinct dates in the data to the number of days spanned by the earliest and latest dates; if no days are missing, the two numbers should match.

nyc_taxi %>%
  distinct(date = as.Date(pickup_datetime)) %>%
  filter(!is.na(date)) %>%
  summarize(min_date = min(date), max_date = max(date), n = n()) %>%
  mutate(diff = max_date - min_date + 1)
    min_date   max_date   n     diff
1 2015-01-01 2015-07-01 182 182 days
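
Since the number of distinct dates (182) equals the number of days spanned by the data, no dates are missing. As a quick arithmetic check of that span (computed directly, outside of the data):

as.Date('2015-07-01') - as.Date('2015-01-01') + 1
Time difference of 182 days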

(5) Find the 3 consecutive days with the highest total number of trips.

This is a hard exercise. To answer this query, we need to compute rolling statistics (rolling sums in this case). There are functions in R that we can use for that purpose, but one of the advantages of R is that writing our own functions is not always that hard. Let's write a function called rolling_sum that takes two arguments: x, a numeric vector, and nlag, a positive integer giving the number of preceding elements included in each window. The function returns a vector of the same length as x containing the rolling sum of x over windows of nlag + 1 elements (each element together with the nlag elements before it).

For example, given x <- 1:6 and nlag <- 2 as inputs, the function returns c(NA, NA, 6, 9, 12, 15).

rolling_sum <- function(x, nlag) {
  stopifnot(nlag > 0, nlag < length(x))
  # pad the first nlag positions with NA, then sum each window of nlag + 1 consecutive elements
  c(rep(NA, nlag), sapply((nlag + 1):length(x), function(ii) sum(x[(ii - nlag):ii])))
}
# Here's an easy test to see if things seem to be working:
rolling_sum(rep(1, 100), 10) # Should return 10 NAs followed by 90 entries that are all 11

We can even go one step further. Let's rename the function to rolling and add a third argument to it called FUN, allowing us to specify any rolling function, not just sum.

rolling <- function(x, nlag, FUN) {
  stopifnot(nlag > 0, nlag < length(x))
  c(rep(NA, nlag), sapply((nlag + 1):length(x), function(ii) FUN(x[(ii - nlag):ii])))
}
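
For example, on the same toy input as before, we can now compute a rolling mean just as easily as a rolling sum:

rolling(1:6, 2, mean)
[1] NA NA  2  3  4  5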

We can now use rolling to find the 3 consecutive days with the highest total number of trips.

nlag <- 2

q5 <- nyc_taxi %>%
  filter(!is.na(pickup_datetime)) %>%
  transmute(end_date = as.Date(pickup_datetime)) %>%
  group_by(end_date) %>%
  summarize(n = n()) %>%
  group_by() %>%
  mutate(start_date = end_date - nlag, cn = rolling(n, nlag, sum)) %>%
  arrange(desc(cn)) %>%
  select(start_date, end_date, n, cn) %>%
  top_n(10, cn)

head(q5)
# A tibble: 6 x 4
  start_date   end_date     n    cn
      <date>     <date> <int> <int>
1 2015-02-13 2015-02-15 24554 74270
2 2015-01-30 2015-02-01 24264 74000
3 2015-02-27 2015-03-01 23112 72504
4 2015-02-26 2015-02-28 25569 72342
5 2015-04-17 2015-04-19 23999 72271
6 2015-02-06 2015-02-08 23506 72220

As it turns out, there's already a similar function we could have used, called rollapply, in the zoo package. Sometimes it pays off to take the time to search for the right function, especially if what we're trying to do is common enough that we should not have to "reinvent the wheel".

rollapply(1:10, 3, sum, fill = NA, align = 'right')
[1] NA NA  6  9 12 15 18 21 24 27
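
As a quick check on a toy vector, rollapply with a window of nlag + 1 and align = 'right' gives the same result as our rolling function (assuming the zoo package is installed and loaded):

library(zoo)
x <- c(3, 1, 4, 1, 5, 9)
all.equal(rolling(x, 2, sum), rollapply(x, 3, sum, fill = NA, align = 'right'))
[1] TRUE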

Here's an alternative solution: we could have run the above query without rolling by just using the lag function, but the code is more complicated and harder to adapt to different values of nlag or to functions other than sum. Here's how we can rewrite the above query with lag:

q5 <- nyc_taxi %>%
  filter(!is.na(pickup_datetime)) %>%
  transmute(end_date = as.Date(pickup_datetime)) %>%
  group_by(end_date) %>%
  summarize(n = n()) %>%
  group_by() %>%
  mutate(start_date = end_date - 2, # a 3-day window runs from end_date - 2 to end_date
         n_lag_1 = lag(n), n_lag_2 = lag(n, 2), 
         cn = n + n_lag_1 + n_lag_2) %>%
  arrange(desc(cn)) %>%
  select(start_date, end_date, n, cn) %>%
  top_n(10, cn)

head(q5)
# A tibble: 6 x 4
  start_date   end_date     n    cn
      <date>     <date> <int> <int>
1 2015-02-13 2015-02-15 24554 74270
2 2015-01-30 2015-02-01 24264 74000
3 2015-02-27 2015-03-01 23112 72504
4 2015-02-26 2015-02-28 25569 72342
5 2015-04-17 2015-04-19 23999 72271
6 2015-02-06 2015-02-08 23506 72220

(6) Get the average, standard deviation, and mean absolute deviation of trip_distance and trip_duration, as well as the ratio of trip_duration over trip_distance. Results should be broken up by pickup_nhood and dropoff_nhood.

Here's how we compute the mean absolute deviation:

mad <- function(x) mean(abs(x - median(x))) # one-liner functions don't need curly braces
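
Note that this definition masks the mad function in the stats package, which computes the median absolute deviation, a different statistic. A quick sanity check on a toy vector:

x <- c(1, 2, 3, 10)
mad(x)        # mean(abs(x - 2.5)) = mean(c(1.5, 0.5, 0.5, 7.5)) = 2.5
stats::mad(x) # the masked base R version, which gives a different answer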

This query can easily be written with the tools we learned so far.

q6 <- nyc_taxi %>%
  filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize(mean_trip_distance = mean(trip_distance, na.rm = TRUE),
            mean_trip_duration = mean(trip_duration, na.rm = TRUE),
            sd_trip_distance = sd(trip_distance, na.rm = TRUE),
            sd_trip_duration = sd(trip_duration, na.rm = TRUE),
            mad_trip_distance = mad(trip_distance),
            mad_trip_duration = mad(trip_duration))

head(q6)
Source: local data frame [6 x 8]
Groups: pickup_nhood [1]

  pickup_nhood dropoff_nhood mean_trip_distance mean_trip_duration
        <fctr>        <fctr>              <dbl>              <dbl>
1 West Village  West Village              0.668                324
2 West Village  East Village              3.149                852
3 West Village  Battery Park              1.956                602
4 West Village Carnegie Hill              5.408               1618
5 West Village      Gramercy              1.724               3579
6 West Village          Soho              1.196                586
  sd_trip_distance sd_trip_duration mad_trip_distance mad_trip_duration
             <dbl>            <dbl>             <dbl>             <dbl>
1            0.798             1850             0.346               187
2          104.084             2799             1.857               290
3            0.513             1964             0.385               191
4            0.679              535             0.555               396
5            0.523           236007             0.369              3059
6            0.409             2148             0.305               234

You may have noticed that the query we just wrote was a little tedious and repetitive. Let's now see a way of rewriting it using some "shortcut" functions available in dplyr:

  • When we apply the same summary function(s) to multiple columns of the data, we can save a lot of typing by using summarize_each instead of summarize. There is also a mutate_each function.
  • We can select trip_distance and trip_duration automatically using starts_with('trip_'), since they are the only columns that begin with that prefix. This can be a time-saver when we are selecting lots of columns at once (and have named them in a smart way). There are other helper functions called ends_with and contains.
  • Instead of defining the mad function separately, we can define it in-line. In fact, there's a shortcut whereby we just name the function and provide its body, replacing the argument x with a period.
q6 <- nyc_taxi %>%
  filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize_each(
    funs(mean, sd, mad = mean(abs(. - median(.)))), # all the functions that we apply to the data are listed here
    starts_with('trip_'), # `trip_distance` and `trip_duration` are the only columns that start with `trip_`
    wait_per_mile = trip_duration / trip_distance) # `wait_per_mile` is created on the fly

head(q6)
Source: local data frame [6 x 11]
Groups: pickup_nhood [1]

  pickup_nhood dropoff_nhood trip_distance_mean trip_duration_mean
        <fctr>        <fctr>              <dbl>              <dbl>
1 West Village  West Village              0.668                324
2 West Village  East Village              3.149                852
3 West Village  Battery Park              1.956                602
4 West Village Carnegie Hill              5.408               1618
5 West Village      Gramercy              1.724               3579
6 West Village          Soho              1.196                586
  wait_per_mile_mean trip_distance_sd trip_duration_sd wait_per_mile_sd
               <dbl>            <dbl>            <dbl>            <dbl>
1                -74            0.798             1850          0.00200
2                -74          104.084             2799          0.00215
3                -74            0.513             1964          0.00172
4                -74            0.679              535          0.00186
5                -74            0.523           236007          0.00200
6                -74            0.409             2148          0.00182
  trip_distance_mad trip_duration_mad wait_per_mile_mad
              <dbl>             <dbl>             <dbl>
1             0.346               187           0.00151
2             1.857               290           0.00174
3             0.385               191           0.00134
4             0.555               396           0.00145
5             0.369              3059           0.00156
6             0.305               234           0.00140

We can do far more with dplyr but we leave it at this for an introduction. The goal was to give the user enough dplyr to develop an appreciation and be inspired to learn more.
