Choosing between models

A natural question is how much the interaction between pickup_dow and pickup_hour contributes to the predictions. How much worse would the predictions be if we kept only the interaction between pickup_nb and dropoff_nb and dropped the second interaction term? To find out, we can build a simpler model with rxLinMod that includes only pickup_nb:dropoff_nb. We then predict with the simpler model and use cbind to place the new predictions next to the predictions we made earlier with the more complex model.

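# Simpler model: keep only the neighborhood interaction
form_2 <- as.formula(tip_percent ~ pickup_nb:dropoff_nb)
rxlm_2 <- rxLinMod(form_2, data = mht_xdf, dropFirst = TRUE, covCoef = TRUE)
# Score the same rows as before so both sets of predictions line up
pred_df_2 <- rxPredict(rxlm_2, data = pred_df_1, computeStdErrors = TRUE, writeModelVars = TRUE)
# Suffix the prediction and standard-error columns with _2
names(pred_df_2)[1:2] <- paste(c('tip_pred', 'tip_stderr'), 2, sep = "_")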

pred_df <- pred_df_2 %>% 
  select(starts_with('tip_')) %>%
  cbind(pred_df_1) %>%
  arrange(pickup_nb, dropoff_nb, pickup_dow, pickup_hour) %>% 
  select(pickup_dow, pickup_hour, pickup_nb, dropoff_nb, starts_with('tip_pred_'))

head(pred_df)

  pickup_dow pickup_hour pickup_nb dropoff_nb tip_pred_2 tip_pred_1
1        Sun     1AM-5AM Chinatown  Chinatown   6.782043   6.796323
2        Sun     5AM-9AM Chinatown  Chinatown   6.782043   5.880284
3        Sun    9AM-12PM Chinatown  Chinatown   6.782043   6.103625
4        Sun    12PM-4PM Chinatown  Chinatown   6.782043   5.913130
5        Sun     4PM-6PM Chinatown  Chinatown   6.782043   6.121957
6        Sun    6PM-10PM Chinatown  Chinatown   6.782043   6.642192
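
The constant tip_pred_2 column above is what we should expect from a model whose only term is pickup_nb:dropoff_nb. As a rough sketch of why, the following fits the two formulas with base R's lm on simulated data and counts the distinct fitted values within each pick-up/drop-off pair. Everything about the data is an assumption made for illustration: the data frame toy, its two neighborhoods, two days, two hour bands, and the effect sizes are invented, and lm stands in for rxLinMod.

```r
# Simulated stand-in for the taxi data; every value here is made up
set.seed(42)
n <- 2000
toy <- data.frame(
  pickup_nb   = factor(sample(c("Chinatown", "Midtown"), n, replace = TRUE)),
  dropoff_nb  = factor(sample(c("Chinatown", "Midtown"), n, replace = TRUE)),
  pickup_dow  = factor(sample(c("Sun", "Mon"), n, replace = TRUE)),
  pickup_hour = factor(sample(c("5AM-9AM", "6PM-10PM"), n, replace = TRUE))
)
toy$tip_percent <- 10 + 2 * (toy$pickup_nb == "Midtown") +
  1.5 * (toy$pickup_hour == "6PM-10PM") + rnorm(n)

# lm is used here in place of rxLinMod
simple_fit  <- lm(tip_percent ~ pickup_nb:dropoff_nb, data = toy)
complex_fit <- lm(tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour,
                  data = toy)

# Number of distinct fitted values within each pick-up/drop-off pair
n_distinct_fits <- function(fit) {
  pair <- interaction(toy$pickup_nb, toy$dropoff_nb)
  tapply(round(fitted(fit), 6), pair, function(v) length(unique(v)))
}

simple_counts  <- n_distinct_fits(simple_fit)
complex_counts <- n_distinct_fits(complex_fit)
print(simple_counts)
print(complex_counts)
```

In this sketch the simple model produces a single fitted value per neighborhood pair, while the complex model produces one per day and hour combination within each pair.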

We can see from the results above that the predictions from the simpler model are identical across all days of the week and all hours for the same pick-up and drop-off combination, whereas the predictions from the more complex model differ for every combination of all four variables. In other words, adding pickup_dow:pickup_hour to the model adds extra variation to the predictions, and what we'd like to know is whether this variation carries important signal or behaves more or less like noise. To find out, we compare the distributions of the two sets of predictions, broken up by pickup_dow and pickup_hour.

ggplot(data = pred_df) +
  geom_density(aes(x = tip_pred_1, col = "complex")) +
  geom_density(aes(x = tip_pred_2, col = "simple")) +
  facet_grid(pickup_hour ~ pickup_dow)

The simpler model shows the same distribution in every panel, because pickup_dow and pickup_hour have no effect on its predictions. The more complex model shows a slightly different distribution for each combination of pickup_dow and pickup_hour, usually a small shift of the whole distribution. That shift represents the effect of pickup_dow and pickup_hour at that combination of the two variables. Because the shift is directional rather than haphazard, it likely captures a real signal, although its practical significance is still up for debate. We can simplify the above plot if we apply some business logic to it.
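
The shift can also be read off numerically as the difference between the two predictions. The sketch below does this for the six rows printed by head(pred_df) earlier (Sunday, Chinatown to Chinatown), with the values copied from that output. The names pred_head, shift, and shift_by_cell are introduced here for illustration. On the full pred_df the same aggregate call would average the shift over all neighborhood pairs within each day and hour combination.

```r
# The six rows shown by head(pred_df), typed in by hand
pred_head <- data.frame(
  pickup_dow  = "Sun",
  pickup_hour = c("1AM-5AM", "5AM-9AM", "9AM-12PM",
                  "12PM-4PM", "4PM-6PM", "6PM-10PM"),
  tip_pred_2  = 6.782043,
  tip_pred_1  = c(6.796323, 5.880284, 6.103625, 5.913130, 6.121957, 6.642192)
)

# Shift = complex-model prediction minus simple-model prediction
pred_head$shift <- pred_head$tip_pred_1 - pred_head$tip_pred_2

# Average shift for each day and hour combination
shift_by_cell <- aggregate(shift ~ pickup_dow + pickup_hour,
                           data = pred_head, FUN = mean)
print(shift_by_cell)
```

For these rows the complex model predicts a lower tip than the simple model in five of the six time slots shown.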

Let's use cut to bin the tip predictions. To choose the cut-offs, we can use the rxQuantile function to guide us.

rxQuantile("tip_percent", data = mht_xdf, probs = seq(0, 1, by = .05))

    0%   5%  10%  15%  20%  25%  30%  35%  40%  45%  50%  55%  60%  65%  70%  75%  80% 
  -1    0    0    0    0    0    0    0    9   12   15   17   17   17   18   18   19 
 85%  90%  95% 100% 
  20   21   23   99

Based on the above results, we can bin the tip predictions into five groups: 8% or less, above 8% up to 12%, above 12% up to 15%, above 15% up to 18%, and above 18%. We can then draw a bar plot showing the same information as above in a form that is slightly easier to interpret.

pred_df %>%
  # bin both sets of predictions with the same cut-offs
  mutate_at(vars(tip_pred_1, tip_pred_2), cut, breaks = c(-Inf, 8, 12, 15, 18, Inf)) %>%
  ggplot() +
  # alpha is a fixed setting rather than a mapping, so it goes outside aes()
  geom_bar(aes(x = tip_pred_1, fill = "complex"), alpha = .5) +
  geom_bar(aes(x = tip_pred_2, fill = "simple"), alpha = .5) +
  facet_grid(pickup_hour ~ pickup_dow) +
  xlab('tip percent prediction') +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
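
One detail of the cut call is worth checking: by default cut builds intervals that are closed on the right, so a prediction of exactly 8 falls in the lowest bin and a prediction of exactly 18 falls in the fourth bin rather than the top one. The short sketch below illustrates this on a few hand-picked values (the vector x is made up for the example) and shows that right = FALSE gives bins of the form "less than 8" and "18 or higher" instead.

```r
breaks <- c(-Inf, 8, 12, 15, 18, Inf)
x <- c(7.9, 8, 8.1, 18, 18.1)   # hand-picked values around the cut-offs

# Default: intervals closed on the right, e.g. 8 belongs to the first bin
right_closed <- cut(x, breaks)

# Alternative: intervals closed on the left, e.g. 8 belongs to the second bin
left_closed <- cut(x, breaks, right = FALSE)

# Compare the bin index each value receives under the two conventions
print(data.frame(x,
                 right_closed = as.integer(right_closed),
                 left_closed  = as.integer(left_closed)))
```

For continuous predictions the choice rarely changes the picture, but it matters when describing the bins in words.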

Based on the above plot, we can see that compared to the simple model, the complex model tends to predict more high-tipping passengers and fewer average-tipping ones during certain day and time combinations (such as Monday through Thursday during the rush hours).
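
The same comparison can be made with counts instead of bars. The following is a minimal sketch on simulated predictions: the data frame preds and its values are invented for illustration, and in the tutorial the two columns would come from pred_df. It tabulates how many predictions from each model fall in each bin.

```r
# Simulated predictions; in the tutorial these would be columns of pred_df
set.seed(7)
n <- 500
preds <- data.frame(tip_pred_2 = rnorm(n, mean = 14, sd = 3))
preds$tip_pred_1 <- preds$tip_pred_2 + rnorm(n, mean = 0.5, sd = 1)

breaks <- c(-Inf, 8, 12, 15, 18, Inf)

# One column of bin counts per model, one row per bin
bin_counts <- sapply(preds[c("tip_pred_1", "tip_pred_2")],
                     function(p) table(cut(p, breaks)))
print(bin_counts)
```

Applied within each pickup_dow and pickup_hour combination, a table like this would give the numbers behind each panel of the bar plot.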
