Exercises

(1) In the following query, we want to add a fourth step: sort the results by descending average trip duration. The dplyr function for sorting is arrange. For example, arrange(data, x1, desc(x2)) sorts data by increasing values of x1 and, within each value of x1, by decreasing values of x2.
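
To see how arrange and desc behave before applying them to the taxi data, here is a minimal sketch on a made-up data frame. The data frame toy and its values are invented for illustration; only the column names x1 and x2 come from the example above.

```r
library(dplyr)

# Hypothetical toy data; x1 and x2 mirror the names used in the text
toy <- data.frame(
  x1 = c(2, 1, 2, 1),
  x2 = c(10, 20, 30, 40)
)

# Sort by increasing x1, then by decreasing x2 within each value of x1
sorted <- arrange(toy, x1, desc(x2))
sorted
```

The rows with x1 equal to 1 come first, and within each x1 value the larger x2 comes first.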

Add this fourth step to both versions of the code, with and without the pipeline, which are shown here:

summarize( # (3)
  group_by( # (2)
    filter(nyc_taxi, fare_amount > 500), # (1)
    payment_type), 
  ave_duration = mean(trip_duration), ave_distance = mean(trip_distance))

nyc_taxi %>%
  filter(fare_amount > 500) %>% # (1)
  group_by(payment_type) %>% # (2)
  summarize(ave_duration = mean(trip_duration), ave_distance = mean(trip_distance)) # (3)

The remaining exercises are questions about the data that need to be translated into a dplyr pipeline. The goal is two-fold: learn to break a question down into multiple pieces, and learn to translate each piece into a line of dplyr code. Together, these lines make up the pipeline.

(2) Which combinations of pick-up time of day and day of the week have the highest average fare per mile?

# A tibble: 6 x 4
  pickup_dow pickup_hour ave_fare_per_mile  count
      <fctr>      <fctr>             <dbl>  <int>
1        Tue    9AM-12PM              7.69  74344
2        Wed    12PM-4PM              7.67  97783
3        Fri    12PM-4PM              7.62 102899
4        Tue    12PM-4PM              7.47  99179
5        Thu    12PM-4PM              7.47 102712
6        Wed    9AM-12PM              7.40  75997

(3) For each pick-up neighborhood, find the number and percentage of trips that "fan out" into other neighborhoods. Sort the results by pick-up neighborhood and descending percentage. Limit the results to the top 50 percent coverage. In other words, for each pick-up neighborhood, show only the most frequent drop-off neighborhoods, up to the point where they together account for about 50 percent of its trips.

Source: local data frame [6 x 5]
Groups: pickup_nhood [1]

  pickup_nhood     dropoff_nhood count proportion cum.prop
        <fctr>            <fctr> <int>      <dbl>    <dbl>
1 West Village           Chelsea 13420     0.1558    0.156
2 West Village           Midtown 10318     0.1198    0.276
3 West Village Greenwich Village  7713     0.0896    0.365
4 West Village          Gramercy  7006     0.0814    0.447
5 West Village  Garment District  4964     0.0576    0.504
...
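
One ingredient of this exercise is a running total within each group. The following is only a sketch of that idea on made-up data, not the solution: the data frame trips and the columns from, to, n and cum_prop are hypothetical names chosen for illustration.

```r
library(dplyr)

# Hypothetical trip counts between made-up origins and destinations
trips <- data.frame(
  from = c("A", "A", "A", "B", "B"),
  to   = c("X", "Y", "Z", "X", "Y"),
  n    = c(60, 30, 10, 50, 50)
)

shares <- trips %>%
  arrange(from, desc(n)) %>%               # largest counts first within each origin
  group_by(from) %>%
  mutate(proportion = n / sum(n),          # share of the origin's trips
         cum_prop = cumsum(proportion)) %>% # running share within the origin
  ungroup()

shares
```

Because the rows are sorted before grouping, cumsum accumulates from the most common destination downward, which is what makes a coverage cutoff meaningful.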

(4) Are any dates missing from the data?

(5) Find the 3 consecutive days with the highest total number of trips.

(6) Get the average, standard deviation, and mean absolute deviation of trip_distance and trip_duration, as well as the ratio of trip_duration over trip_distance. Results should be broken up by pickup_nhood and dropoff_nhood.
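
A note on terminology that may help here: base R's mad function computes the median absolute deviation (scaled by a constant, 1.4826 by default), which is not the same as the mean absolute deviation. The sketch below assumes that "mean absolute deviation" means the average absolute distance from the mean. The helper mean_abs_dev is a hypothetical name, not a built-in function.

```r
# Made-up values for illustration
x <- c(1, 2, 3, 4, 10)

# Hypothetical helper: average absolute distance from the mean
mean_abs_dev <- function(v) mean(abs(v - mean(v)))

mean_abs_dev(x)  # mean absolute deviation: 2.4
mad(x)           # scaled median absolute deviation: 1.4826
```

A helper like this can be called inside summarize in the same way as mean or sd.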
