Reordering neighborhoods

As our next task, we seek to find patterns between pickup and drop-off neighborhoods and other variables such as fare amount, trip distance, traffic and tipping. To estimate traffic by looking at the ratio of trip duration and trip distance, assuming that traffic is the most common reason for trips taking longer than they should.

For this analysis, we use the rxCube and rxCrossTabs are both very similar to rxSummary but they return fewer statistical summaries and therefore run faster. With y ~ u:v as the formula, rxCrossTabs returns counts and sums, and rxCube returns counts and averages for column y broken up by any combinations of columns u and v. Another important difference between the two functions is that rxCrossTabs returns an array but rxCube returns a data.frame. Depending on the application in question, we may prefer one to the other (and of course we can always convert one form to the other by "reshaping" it, but doing so would involve extra work).

Let's see what this means in action: We start by using rxCrossTabs to get sums and counts for trip_distance, broken up by pickup_nb and dropoff_nb. We can immediately divide the sums by the counts to get averages. The result is called a distance matrix and can be fed to the seriate function in the seriation library to order it so closer neighborhoods appear next to each other (right now neighborhoods are sorted alphabetically, which is what R does by default with factor levels unless otherwise specified).

rxct <- rxCrossTabs(trip_distance ~ pickup_nb:dropoff_nb, mht_xdf)
res <- rxct$sums$trip_distance / rxct$counts$trip_distance

library(seriation)
res[which(is.nan(res))] <- mean(res, na.rm = TRUE)
nb_order <- seriate(res)

We will use nb_order in a little while, but before we do so, let's use rxCube to get counts and averages for trip_distance, a new data point representing minutes spent in the taxi per mile of the trip, and tip_percent. In the above example, we used rxCrossTabs because we wanted a matrix as the return object, so we could feed it to seriate. We now use rxCube to get a data.frame instead, since we intend to use it for plotting with ggplot2, which is more easier to code using a long data.frame as input compared to a wide matirx.

rxc1 <- rxCube(trip_distance ~ pickup_nb:dropoff_nb, mht_xdf)
rxc2 <- rxCube(minutes_per_mile ~ pickup_nb:dropoff_nb, mht_xdf, 
               transforms = list(minutes_per_mile = (trip_duration/60)/trip_distance))
rxc3 <- rxCube(tip_percent ~ pickup_nb:dropoff_nb, mht_xdf)
res <- bind_cols(list(rxc1, rxc2, rxc3))
res <- res[ , c('pickup_nb', 'dropoff_nb', 'trip_distance', 'minutes_per_mile', 'tip_percent')]
head(res)
# A tibble: 6 × 5
      pickup_nb   dropoff_nb trip_distance minutes_per_mile tip_percent
         <fctr>       <fctr>         <dbl>            <dbl>       <dbl>
1  Battery Park Battery Park      1.015857        11.579629   11.394900
2 Carnegie Hill Battery Park      8.570623         3.944350   12.391030
3  Central Park Battery Park      6.277666         5.243241   10.326531
4       Chelsea Battery Park      2.995946         5.169887   11.992151
5     Chinatown Battery Park      1.771597         9.001305   10.292683
6       Clinton Battery Park      3.993806         4.839858    9.794098

We can start plotting the above results to see some interesting trends.

library(ggplot2)
ggplot(res, aes(pickup_nb, dropoff_nb)) + 
  geom_tile(aes(fill = trip_distance), colour = "white") + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  scale_fill_gradient(low = "white", high = "steelblue") + 
  coord_fixed(ratio = .9)

Neighborhoods arranged randomly

The problem with the above plot is the order of the neighborhoods (which is alphabetical), which makes the plot somewhat arbitrary and useless. But as we saw above, using the seriate function we found a more natural ordering for the neighborhoods, so we can use it to reorder the above plot in a more suitable way. To reorder the plot, all we need to do is reorder the factor levels in the order given by nb_order.

newlevs <- levels(res$pickup_nb)[unlist(nb_order)]
res$pickup_nb <- factor(res$pickup_nb, levels = unique(newlevs))
res$dropoff_nb <- factor(res$dropoff_nb, levels = unique(newlevs))

library(ggplot2)
ggplot(res, aes(pickup_nb, dropoff_nb)) + 
  geom_tile(aes(fill = trip_distance), colour = "white") + 
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  scale_fill_gradient(low = "white", high = "steelblue") + 
  coord_fixed(ratio = .9)

Neighborhoods sorted using the distance matrix

results matching ""

    No results matching ""