Exercises
(1) In the following query, we want to add a fourth step: sort the results by descending average trip duration. The dplyr function for sorting is arrange. For example, arrange(data, x1, desc(x2)) will sort data by increasing values of x1 and, within each value of x1, by decreasing values of x2.
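As a minimal sketch of how arrange orders rows, the following uses a small made-up data frame. The name toy and its values are hypothetical and are not part of nyc_taxi; the column names x1 and x2 simply mirror the example above.

```r
library(dplyr)

# Hypothetical toy data for illustration only (not part of nyc_taxi)
toy <- data.frame(x1 = c(2, 1, 2, 1), x2 = c(10, 30, 40, 20))

# Sort by increasing x1, then by decreasing x2 within each value of x1
sorted <- arrange(toy, x1, desc(x2))
print(sorted)
#   x1 x2
# 1  1 30
# 2  1 20
# 3  2 40
# 4  2 10
```

Rows with the same x1 stay together, and desc(x2) only reverses the order inside each of those groups.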
Add this fourth step to both versions of the code, without and with the pipeline, which are shown here:
summarize( # (3)
  group_by( # (2)
    filter(nyc_taxi, fare_amount > 500), # (1)
    payment_type),
  ave_duration = mean(trip_duration), ave_distance = mean(trip_distance))

nyc_taxi %>%
  filter(fare_amount > 500) %>% # (1)
  group_by(payment_type) %>% # (2)
  summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) # (3)
The remaining exercises are questions about the data that need to be translated into a dplyr pipeline. The goal of each exercise is twofold: learn to break a question down into multiple pieces, and learn to translate each piece into a line of dplyr code, which together make up the pipeline.
(2) What are the pick-up times of the day and the days of the week with the highest average fare per mile of ride?
# A tibble: 6 x 4
  pickup_dow pickup_hour ave_fare_per_mile  count
      <fctr>      <fctr>             <dbl>  <int>
1        Tue    9AM-12PM              7.69  74344
2        Wed    12PM-4PM              7.67  97783
3        Fri    12PM-4PM              7.62 102899
4        Tue    12PM-4PM              7.47  99179
5        Thu    12PM-4PM              7.47 102712
6        Wed    9AM-12PM              7.40  75997
(3) For each pick-up neighborhood, find the number and percentage of trips that "fan out" into other neighborhoods. Sort the results by pick-up neighborhood and descending percentage. Limit the results to the top 50 percent coverage. In other words, for each pick-up neighborhood, show only the most common drop-off neighborhoods whose cumulative share of trips reaches about 50 percent.
Source: local data frame [6 x 5]
Groups: pickup_nhood [1]

  pickup_nhood     dropoff_nhood count proportion cum.prop
        <fctr>            <fctr> <int>      <dbl>    <dbl>
1 West Village           Chelsea 13420     0.1558    0.156
2 West Village           Midtown 10318     0.1198    0.276
3 West Village Greenwich Village  7713     0.0896    0.365
4 West Village          Gramercy  7006     0.0814    0.447
5 West Village  Garment District  4964     0.0576    0.504
...
(4) Are any dates missing from the data?
(5) Find the 3 consecutive days with the highest total number of trips.
(6) Get the average, standard deviation, and mean absolute deviation of trip_distance and trip_duration, as well as the ratio of trip_duration over trip_distance. Results should be broken up by pickup_nhood and dropoff_nhood.
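One detail worth checking before starting exercise (6) is what "mean absolute deviation" should compute. The sketch below, on a made-up vector x (hypothetical values, not taken from nyc_taxi), contrasts a hand-written mean absolute deviation around the mean with base R's mad function, which computes a scaled median absolute deviation instead. Which of the two the exercise intends is an assumption left to the reader.

```r
# Hypothetical values for illustration only
x <- c(1, 2, 3, 4, 10)

# Mean absolute deviation around the mean: average of |x - mean(x)|
mean_abs_dev <- mean(abs(x - mean(x)))

# Base R's mad() is the MEDIAN absolute deviation around the median,
# multiplied by a consistency constant (1.4826 by default)
median_abs_dev <- mad(x)

print(mean_abs_dev)    # 2.4
print(median_abs_dev)  # 1.4826
```

Either expression can be used as a summary function inside summarize, in the same way as mean or sd.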