Solutions

(1) The easiest way to get the cumulative sum is using the built-in cumsum function, but we need to group the data by pickup_nb before applying it. Finally, we use filter to get on the top destination neighborhoods. The condition pct > 5 will give us only those neighborhoods that are a destination at least 5% of the time, and the condition (cumpct <= 50 | (cumpct > 50 & lag(cumpct) <= 50)) will stop us once the destinations together account for more than 50 percent of trips.

rxcs %>%
  select(pickup_nb, dropoff_nb, pct = pct_by_pickup_nb) %>%
  arrange(pickup_nb, desc(pct)) %>%
  group_by(pickup_nb) %>%
  mutate(cumpct = cumsum(pct)) %>%
  filter(pct > 5 & (cumpct <= 50 | (cumpct > 50 & lag(cumpct) <= 50))) %>%
  as.data.frame -> rxcs_tops

(2) We can use subset to extract the top drop-off neighborhoods for a given pick-up neighborhood. We use drop = TRUE the results into a vector. In the rxDataStep call, we can pass the two objects nb_name and nb_drop to the rowSelection argument by using transformObjects which is simply a named list. One quirk that we must be aware of here is that the objects (nb_name and nb_drop) must be renamed and the new names (nb and top_drop_for_nb respectively) go into rowSelection.

nb_name <- "West Village"
nb_drop <- subset(rxcs_tops, pickup_nb == nb_name, select = "dropoff_nb", drop = TRUE)

pickup_df <- rxDataStep(mht_xdf, 
                        rowSelection = pickup_nb == nb & dropoff_nb %in% top_drop_for_nb, 
                        varsToKeep = c("dropoff_nb", "tpep_pickup_datetime"),
                        transformObjects = list(nb = nb_name, top_drop_for_nb = nb_drop))

(3) We simply need to change position = "stack" to position = "fill". However, since the y-axis is mislabeled, we use scale_y_continuous(labels = percent_format()) and ylab("percent") to properly format and label the y-axis.

library(scales)
pickup_df %>%
  mutate(pickup_hour = hour(ymd_hms(tpep_pickup_datetime, tz = "UTC"))) %>%
  ggplot(aes(x = pickup_hour, fill = dropoff_nb)) +
  geom_bar(position = "fill", stat = "count") +
  scale_fill_discrete(guide = guide_legend(reverse = TRUE)) + 
  scale_y_continuous(labels = percent_format()) +
  ylab("percent")

Hour by hour relative frequencies

results matching ""

    No results matching ""