Solutions
(1) The easiest way to get the cumulative sum is using the built-in cumsum
function, but we need to group the data by pickup_nb
before applying it. Finally, we use filter
to get on the top destination neighborhoods. The condition pct > 5
will give us only those neighborhoods that are a destination at least 5% of the time, and the condition (cumpct <= 50 | (cumpct > 50 & lag(cumpct) <= 50))
will stop us once the destinations together account for more than 50 percent of trips.
rxcs %>%
select(pickup_nb, dropoff_nb, pct = pct_by_pickup_nb) %>%
arrange(pickup_nb, desc(pct)) %>%
group_by(pickup_nb) %>%
mutate(cumpct = cumsum(pct)) %>%
filter(pct > 5 & (cumpct <= 50 | (cumpct > 50 & lag(cumpct) <= 50))) %>%
as.data.frame -> rxcs_tops
(2) We can use subset
to extract the top drop-off neighborhoods for a given pick-up neighborhood. We use drop = TRUE
the results into a vector. In the rxDataStep
call, we can pass the two objects nb_name
and nb_drop
to the rowSelection
argument by using transformObjects
which is simply a named list. One quirk that we must be aware of here is that the objects (nb_name
and nb_drop
) must be renamed and the new names (nb
and top_drop_for_nb
respectively) go into rowSelection
.
nb_name <- "West Village"
nb_drop <- subset(rxcs_tops, pickup_nb == nb_name, select = "dropoff_nb", drop = TRUE)
pickup_df <- rxDataStep(mht_xdf,
rowSelection = pickup_nb == nb & dropoff_nb %in% top_drop_for_nb,
varsToKeep = c("dropoff_nb", "tpep_pickup_datetime"),
transformObjects = list(nb = nb_name, top_drop_for_nb = nb_drop))
(3) We simply need to change position = "stack"
to position = "fill"
. However, since the y-axis is mislabeled, we use scale_y_continuous(labels = percent_format())
and ylab("percent")
to properly format and label the y-axis.
library(scales)
pickup_df %>%
mutate(pickup_hour = hour(ymd_hms(tpep_pickup_datetime, tz = "UTC"))) %>%
ggplot(aes(x = pickup_hour, fill = dropoff_nb)) +
geom_bar(position = "fill", stat = "count") +
scale_fill_discrete(guide = guide_legend(reverse = TRUE)) +
scale_y_continuous(labels = percent_format()) +
ylab("percent")