Total and marginal distribution trips between neighborhoods
Let's focus our attention now the following important questions:
- Between which neighborhoods do the most common trips occur?
- Assuming that a traveler leaves from a given neighborhood, which neighborhoods are they most likely to go to?
- Assuming that someone was just dropped off at a given neighborhood, which neighborhoods are they most likely to have come from?
To answer the above questions, we need to find the distribution (or proportion) of trips between any two neighborhoods, first as a percentage of total trips, then as a percentage of trips leaving from a particular neighborhood, and finally as a percentage of trips going to a particular neighborhood.
rxc <- rxCube( ~ pickup_nb:dropoff_nb, mht_xdf) rxc <- as.data.frame(rxc) library(dplyr) rxc %>% filter(Counts > 0) %>% mutate(pct_all = Counts/sum(Counts) * 100) %>% group_by(pickup_nb) %>% mutate(pct_by_pickup_nb = Counts/sum(Counts) * 100) %>% group_by(dropoff_nb) %>% mutate(pct_by_dropoff_nb = Counts/sum(Counts) * 100) %>% group_by() %>% arrange(desc(Counts)) -> rxcs head(rxcs)
# A tibble: 6 × 6 pickup_nb dropoff_nb Counts pct_all pct_by_pickup_nb <fctr> <fctr> <dbl> <dbl> <dbl> 1 Upper East Side Upper East Side 3299324 5.738650 36.88840 2 Midtown Midtown 2216184 3.854700 21.84268 3 Upper West Side Upper West Side 1924205 3.346849 35.14494 4 Midtown Upper East Side 1646843 2.864422 16.23127 5 Upper East Side Midtown 1607925 2.796730 17.97756 6 Garment District Midtown 1072732 1.865847 28.94205 pct_by_dropoff_nb <dbl> 1 38.28066 2 22.41298 3 35.15770 4 19.10762 5 16.26146 6 10.84888
Based on the first row, we can see that trips from the Upper East Side to the Upper East Side make up about 5% of all trips in Manhattan. Of all the trips that pick up in the Upper East Side, about 36% drop off in the Upper East Side. Of all the trips that drop off in the Upper East Side, 37% and tripped that also picked up in the Upper East Side.
We can take the above numbers and display them in plots that make it easier to digest it all at once. We begin with a plot showing how taxi trips between any pair of neighborhoods are distributed.
ggplot(rxcs, aes(pickup_nb, dropoff_nb)) + geom_tile(aes(fill = pct_all), colour = "white") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + scale_fill_gradient(low = "white", high = "black") + coord_fixed(ratio = .9)
The plot shows that trips to and from the Upper East Side make up the majority of trips, a somewhat unexpected result. Furthermore, the lion's share of trips are to and from the Upper East Side and the Upper West Side and the midtown neighborhoods (with most of this category having Midtown either as an origin or a destination). Another surprising fact about the above plot is its near symmetry, which suggests that perhaps most passengers use taxis for a "round trip", meaning that they take a taxi to their destination, and another taxi for the return trip. This point warrants further inquiry (perhaps by involving the time of day into the analysis) but for now we leave it at that.
Next we look at how trips leaving a particular neighborhood (a point on the x-axis in the plot below), "spill out" into other neighborhoods (shown by the vertical color gradient along the y-axis at each point on the x-axis).
ggplot(rxcs, aes(pickup_nb, dropoff_nb)) + geom_tile(aes(fill = pct_by_pickup_nb), colour = "white") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + scale_fill_gradient(low = "white", high = "steelblue") + coord_fixed(ratio = .9)
We can see how most downtown trips are to other downtown neighborhoods or to midtown neighborhoods (especially Midtown). Midtown and the Upper East Side are common destinations from any neighborhood, and the Upper West Side is a common destination for most uptown neighborhoods.
For a trip ending at a particular neighborhood (represented by a point on the y-axis) we now look at the distribution of where the trip originated from (the horizontal color-gradient along the x-axis for each point on the y-axis).
ggplot(rxcs, aes(pickup_nb, dropoff_nb)) + geom_tile(aes(fill = pct_by_dropoff_nb), colour = "white") + theme(axis.text.x = element_text(angle = 60, hjust = 1)) + scale_fill_gradient(low = "white", high = "red") + coord_fixed(ratio = .9)
As we can see, a lot of trips claim Midtown regardless of where they ended. The Upper East Side and Upper West Side are also common origins for trips that drop off in one of the uptown neighborhoods.