Solutions
(1) Without the pipe operator, we would have arrange as the outermost function:
q1 <- arrange( # (4)
  summarize( # (3)
    group_by( # (2)
      filter(nyc_taxi, fare_amount > 500), # (1)
      payment_type),
    ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)),
  desc(ave_duration))
With the pipe operator, we simply add the pipe to the end of summarize and add arrange as a new line at the end of the code:
q1 <- nyc_taxi %>%
  filter(fare_amount > 500) %>% # (1)
  group_by(payment_type) %>% # (2)
  summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) %>% # (3)
  arrange(desc(ave_duration)) # (4)
head(q1)
# A tibble: 3 x 3
payment_type ave_duration ave_distance
<fctr> <dbl> <dbl>
1 cash 630.0 2.5
2 card 14.5 0.0
3 NA 12.0 0.0
(2) What are the times of the day and the days of the week with the highest fare per mile of ride?
q2 <- nyc_taxi %>%
  filter(trip_distance > 0) %>%
  group_by(pickup_dow, pickup_hour) %>%
  summarize(ave_fare_per_mile = mean(fare_amount / trip_distance, na.rm = TRUE), count = n()) %>%
  group_by() %>% # we 'reset', or remove, the group by, otherwise sorting won't work
  arrange(desc(ave_fare_per_mile))
head(q2)
# A tibble: 6 x 4
pickup_dow pickup_hour ave_fare_per_mile count
<fctr> <fctr> <dbl> <int>
1 Tue 9AM-12PM 7.69 74344
2 Wed 12PM-4PM 7.67 97783
3 Fri 12PM-4PM 7.62 102899
4 Tue 12PM-4PM 7.47 99179
5 Thu 12PM-4PM 7.47 102712
6 Wed 9AM-12PM 7.40 75997
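As an aside, calling group_by() with no arguments is one way to drop the grouping; dplyr also provides ungroup(), which reads more explicitly. Here is a minimal sketch of the same query with ungroup() substituted for the bare group_by():
q2 <- nyc_taxi %>%
  filter(trip_distance > 0) %>%
  group_by(pickup_dow, pickup_hour) %>%
  summarize(ave_fare_per_mile = mean(fare_amount / trip_distance, na.rm = TRUE), count = n()) %>%
  ungroup() %>% # removes the grouping, same effect as the bare group_by() above
  arrange(desc(ave_fare_per_mile))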
(3) For each pick-up neighborhood, find the number and percentage of trips that "fan out" into other neighborhoods. Sort results by pickup neighborhood and descending percentage. Limit results to top 50 percent coverage. In other words, show only the top 50 percent of destinations for each pick-up neighborhood.
q3 <- nyc_taxi %>%
  filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize(count = n()) %>%
  group_by(pickup_nhood) %>%
  mutate(proportion = prop.table(count),
         cum.prop = order_by(desc(proportion), cumsum(proportion))) %>%
  group_by() %>%
  arrange(pickup_nhood, desc(proportion)) %>%
  group_by(pickup_nhood) %>%
  filter(row_number() < 11 | cum.prop < .50)
head(q3)
Source: local data frame [6 x 5]
Groups: pickup_nhood [1]
pickup_nhood dropoff_nhood count proportion cum.prop
<fctr> <fctr> <int> <dbl> <dbl>
1 West Village Chelsea 13420 0.1558 0.156
2 West Village Midtown 10318 0.1198 0.276
3 West Village Greenwich Village 7713 0.0896 0.365
4 West Village Gramercy 7006 0.0814 0.447
5 West Village Garment District 4964 0.0576 0.504
...
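Two functions in the mutate step may be unfamiliar: prop.table(count) divides each count by the group total, and order_by(desc(proportion), cumsum(proportion)) computes the cumulative sum of proportion as if the rows were sorted in descending order of proportion, then returns the results in the original row order. Here's a small toy sketch (made-up numbers) of what order_by does:
x <- c(.1, .5, .4)
order_by(desc(x), cumsum(x))
# sorted by desc(x) the values are .5, .4, .1, whose cumulative sum is .5, .9, 1.0;
# mapped back to the original order of x this gives 1.0, 0.5, 0.9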
(4) Are any dates missing from the data?
There are many ways to answer this query, and we cover three because each one highlights an important point. The first way consists of sorting the data by date and using the lag function to find the difference between each date and the date preceding it. If this difference is greater than 1, then we skipped one or more days.
nyc_taxi %>%
  select(pickup_datetime) %>%
  distinct(date = as.Date(pickup_datetime)) %>%
  arrange(date) %>% # this is an important step!
  mutate(diff = date - lag(date)) %>%
  filter(diff > 1)
[1] date diff
<0 rows> (or 0-length row.names)
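To see how lag exposes gaps, here's a toy sketch with a few made-up dates (assuming dplyr is loaded, so dplyr::lag is used rather than stats::lag):
d <- as.Date(c('2015-01-01', '2015-01-02', '2015-01-05'))
d - lag(d) # differences in days: NA, 1, 3 -- the 3 flags the gap between Jan 2 and Jan 5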
The second solution is more involved. First we create a data.frame of all dates available in nyc_taxi.
nyc_taxi %>%
  select(pickup_datetime) %>%
  distinct(date = as.Date(pickup_datetime)) %>%
  filter(!is.na(date)) -> data_dates
Then we create a new data.frame of all dates that span the time range in the data. We can use seq to do that.
start_date <- min(data_dates$date)
end_date <- max(data_dates$date)
all_dates <- data.frame(date = seq(start_date, end_date, by = '1 day'))
Finally, we ask for the "anti-join" of the two datasets. An anti-join is, loosely speaking, the opposite of a join: it returns the rows of the left dataset whose keys are not present in the right dataset.
anti_join(all_dates, data_dates, by = 'date') # dates in all_dates with no match in data_dates
[1] date
<0 rows> (or 0-length row.names)
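A small toy sketch (made-up data) of what anti_join returns:
left  <- data.frame(date = as.Date(c('2015-01-01', '2015-01-02', '2015-01-03')))
right <- data.frame(date = as.Date(c('2015-01-01', '2015-01-03')))
anti_join(left, right, by = 'date') # returns the one row ('2015-01-02') present in left but not in right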
The third solution consists of comparing the number of days between the earliest and latest dates in the data to the number of days we expect to see if no days were missing.
nyc_taxi %>%
  distinct(date = as.Date(pickup_datetime)) %>%
  filter(!is.na(date)) %>%
  summarize(min_date = min(date), max_date = max(date), n = n()) %>%
  mutate(diff = max_date - min_date + 1)
min_date max_date n diff
1 2015-01-01 2015-07-01 182 182 days
(5) Find the 3 consecutive days with the highest total number of trips.
This is a hard exercise. For this query, we need to compute rolling statistics (rolling sums, in this case). There are functions in R that we can use for that purpose, but one of the advantages of R is that writing our own functions is not always that hard. Write a function called rolling_sum that takes two arguments: x, a numeric vector, and nlag, a positive integer for the number of days we're rolling by. The function returns a vector of the same length as x, containing the rolling sum of x over the current and the previous nlag elements.
For example, given x <- 1:6 and nlag <- 2 as inputs, the function returns c(NA, NA, 6, 9, 12, 15).
rolling_sum <- function(x, nlag) {
  stopifnot(nlag > 0, nlag < length(x))
  c(rep(NA, nlag), sapply((nlag + 1):length(x), function(ii) sum(x[(ii - nlag):ii])))
}
# Here's an easy test to see if things seem to be working:
rolling_sum(rep(1, 100), 10) # Should return 10 NAs followed by 90 entries that are all 11
We can even go one step further. Let's rename the function to rolling and add a third argument to it called FUN, allowing us to specify any rolling function, not just sum.
rolling <- function(x, nlag, FUN) {
  stopifnot(nlag > 0, nlag < length(x))
  c(rep(NA, nlag), sapply((nlag + 1):length(x), function(ii) FUN(x[(ii - nlag):ii])))
}
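For instance, passing a different function such as mean or max gives a rolling average or rolling maximum over the same windows:
rolling(1:6, 2, mean) # NA NA 2 3 4 5 -- the rolling average over windows of 3 elements
rolling(1:6, 2, max)  # NA NA 3 4 5 6 -- the rolling maximum over the same windows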
We can now use rolling to find the 3 consecutive days with the most total number of trips.
nlag <- 2
q5 <- nyc_taxi %>%
  filter(!is.na(pickup_datetime)) %>%
  transmute(end_date = as.Date(pickup_datetime)) %>%
  group_by(end_date) %>%
  summarize(n = n()) %>%
  group_by() %>%
  mutate(start_date = end_date - nlag, cn = rolling(n, nlag, sum)) %>%
  arrange(desc(cn)) %>%
  select(start_date, end_date, n, cn) %>%
  top_n(10, cn)
head(q5)
# A tibble: 6 x 4
start_date end_date n cn
<date> <date> <int> <int>
1 2015-02-13 2015-02-15 24554 74270
2 2015-01-30 2015-02-01 24264 74000
3 2015-02-27 2015-03-01 23112 72504
4 2015-02-26 2015-02-28 25569 72342
5 2015-04-17 2015-04-19 23999 72271
6 2015-02-06 2015-02-08 23506 72220
As it turns out, there's already a similar function we could have used: rollapply in the zoo package. Sometimes it pays off to take the time and search for the right function, especially if what we're trying to do is common enough that we should not have to "reinvent the wheel".
rollapply(1:10, 3, sum, fill = NA, align = 'right')
[1] NA NA 6 9 12 15 18 21 24 27
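Assuming the zoo package is installed, a quick sketch to check that our rolling function and zoo's rollapply agree on a toy vector (a window of 3 elements corresponds to nlag = 2 in our function):
library(zoo)
all.equal(rolling(1:10, 2, sum),
          rollapply(1:10, 3, sum, fill = NA, align = 'right')) # TRUE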
Here's an alternative solution: we could have run the above query without rolling by just using the lag function, but the code is more complicated and harder to automate for different values of nlag or for functions other than sum. Here's how we can rewrite the above query with lag:
q5 <- nyc_taxi %>%
  filter(!is.na(pickup_datetime)) %>%
  transmute(end_date = as.Date(pickup_datetime)) %>%
  group_by(end_date) %>%
  summarize(n = n()) %>%
  group_by() %>%
  mutate(start_date = end_date - 2, # the window covers end_date and the two prior days
         n_lag_1 = lag(n), n_lag_2 = lag(n, 2),
         cn = n + n_lag_1 + n_lag_2) %>%
  arrange(desc(cn)) %>%
  select(start_date, end_date, n, cn) %>%
  top_n(10, cn)
head(q5)
# A tibble: 6 x 4
start_date end_date n cn
<date> <date> <int> <int>
1 2015-02-13 2015-02-15 24554 74270
2 2015-01-30 2015-02-01 24264 74000
3 2015-02-27 2015-03-01 23112 72504
4 2015-02-26 2015-02-28 25569 72342
5 2015-04-17 2015-04-19 23999 72271
6 2015-02-06 2015-02-08 23506 72220
(6) Get the average, standard deviation, and mean absolute deviation of trip_distance and trip_duration, as well as the ratio of trip_duration over trip_distance. Results should be broken up by pickup_nhood and dropoff_nhood.
Here's how we compute the mean absolute deviation:
mad <- function(x) mean(abs(x - median(x))) # one-liner functions don't need curly braces
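Note that this definition masks base R's stats::mad (which by default computes the median absolute deviation scaled by a constant); we use our own version throughout. A quick sanity check on a toy vector:
mad(c(1, 2, 3, 10)) # median is 2.5, absolute deviations are 1.5, 0.5, 0.5, 7.5, so the result is 2.5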
This query can easily be written with the tools we've learned so far.
q6 <- nyc_taxi %>%
  filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize(mean_trip_distance = mean(trip_distance, na.rm = TRUE),
            mean_trip_duration = mean(trip_duration, na.rm = TRUE),
            sd_trip_distance = sd(trip_distance, na.rm = TRUE),
            sd_trip_duration = sd(trip_duration, na.rm = TRUE),
            mad_trip_distance = mad(trip_distance),
            mad_trip_duration = mad(trip_duration))
head(q6)
Source: local data frame [6 x 8]
Groups: pickup_nhood [1]
pickup_nhood dropoff_nhood mean_trip_distance mean_trip_duration
<fctr> <fctr> <dbl> <dbl>
1 West Village West Village 0.668 324
2 West Village East Village 3.149 852
3 West Village Battery Park 1.956 602
4 West Village Carnegie Hill 5.408 1618
5 West Village Gramercy 1.724 3579
6 West Village Soho 1.196 586
sd_trip_distance sd_trip_duration mad_trip_distance mad_trip_duration
<dbl> <dbl> <dbl> <dbl>
1 0.798 1850 0.346 187
2 104.084 2799 1.857 290
3 0.513 1964 0.385 191
4 0.679 535 0.555 396
5 0.523 236007 0.369 3059
6 0.409 2148 0.305 234
You may have noticed that the query we wrote in the last exercise was a little tedious and repetitive. Let's now see a way of rewriting the query using some "shortcut" functions available in dplyr:
- When we apply the same summary function(s) to the same column(s) of the data, we can save a lot of typing by using summarize_each instead of summarize. There is also a mutate_each function.
- We can select trip_distance and trip_duration automatically using starts_with('trip_'), since they are the only columns that begin with that prefix; this can be a time-saver if we are selecting lots of columns at once (and we named them in a smart way). There are other helper functions called ends_with and contains.
- Instead of defining the mad function separately, we can define it in-line. In fact, there's a shortcut whereby we just name the function and provide its body, replacing x with a period.
q6 <- nyc_taxi %>%
  filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize_each(
    funs(mean, sd, mad = mean(abs(. - median(.)))), # all the functions that we apply to the data are listed here
    starts_with('trip_'), # `trip_distance` and `trip_duration` are the only columns that start with `trip_`
    wait_per_mile = trip_duration / trip_distance) # `wait_per_mile` is created on the fly
head(q6)
Source: local data frame [6 x 11]
Groups: pickup_nhood [1]
pickup_nhood dropoff_nhood trip_distance_mean trip_duration_mean
<fctr> <fctr> <dbl> <dbl>
1 West Village West Village 0.668 324
2 West Village East Village 3.149 852
3 West Village Battery Park 1.956 602
4 West Village Carnegie Hill 5.408 1618
5 West Village Gramercy 1.724 3579
6 West Village Soho 1.196 586
wait_per_mile_mean trip_distance_sd trip_duration_sd wait_per_mile_sd
<dbl> <dbl> <dbl> <dbl>
1 -74 0.798 1850 0.00200
2 -74 104.084 2799 0.00215
3 -74 0.513 1964 0.00172
4 -74 0.679 535 0.00186
5 -74 0.523 236007 0.00200
6 -74 0.409 2148 0.00182
trip_distance_mad trip_duration_mad wait_per_mile_mad
<dbl> <dbl> <dbl>
1 0.346 187 0.00151
2 1.857 290 0.00174
3 0.385 191 0.00134
4 0.555 396 0.00145
5 0.369 3059 0.00156
6 0.305 234 0.00140
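As a final aside, in more recent versions of dplyr (1.0 and later), summarize_each and funs have been superseded by across and lambda notation. A rough sketch of the same query in the newer style (assuming a recent dplyr; column names come out as trip_distance_mean, trip_distance_sd, and so on) would look like this:
q6 <- nyc_taxi %>%
  filter(!is.na(pickup_nhood) & !is.na(dropoff_nhood)) %>%
  mutate(wait_per_mile = trip_duration / trip_distance) %>% # create the ratio column up front
  group_by(pickup_nhood, dropoff_nhood) %>%
  summarize(across(c(starts_with('trip_'), wait_per_mile),
                   list(mean = mean, sd = sd, mad = ~ mean(abs(.x - median(.x))))))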
We can do far more with dplyr, but we leave it at this for an introduction. The goal was to give the user enough dplyr to develop an appreciation for it and be inspired to learn more.