Comparing predictions

Before applying the algorithm to the test data, let's apply it to the small dataset with all the combinations of categorical variables and visualize the predictions. This might help us develop some intuition about each algorithm.

pred_df <- expand.grid(ll)
pred_df_1 <- rxPredict(trained.models$linmod, data = pred_df, predVarNames = "pred_linmod")
pred_df_2 <- rxPredict(trained.models$dtree, data = pred_df, predVarNames = "pred_dtree")
pred_df_3 <- rxPredict(trained.models$dforest, data = pred_df, predVarNames = "pred_dforest")
pred_df <- do.call(cbind, list(pred_df, pred_df_1, pred_df_2, pred_df_3))
head(pred_df)
           pickup_nb dropoff_nb pickup_hour pickup_dow pred_linmod pred_dtree pred_dforest
1          Chinatown  Chinatown     1AM-5AM        Sun 6.869645    5.772054   9.008643
2       Little Italy  Chinatown     1AM-5AM        Sun 10.627190   9.221250   10.634590
3            Tribeca  Chinatown     1AM-5AM        Sun 9.063741    9.221250   10.099731
4               Soho  Chinatown     1AM-5AM        Sun 10.107815   8.313437   10.162946
5    Lower East Side  Chinatown     1AM-5AM        Sun 9.728399    9.221250   10.525242
6 Financial District  Chinatown     1AM-5AM        Sun 8.248997    6.937500   8.674807
observed_df <- rxSummary(tip_percent ~ pickup_nb:dropoff_nb:pickup_dow:pickup_hour, mht_xdf)
observed_df <- observed_df$categorical[[1]][ , c(2:6)]
pred_df <- inner_join(pred_df, observed_df, by = names(pred_df)[1:4])

ggplot(data = pred_df) +
  geom_density(aes(x = Means, col = "observed average")) +
  geom_density(aes(x = pred_linmod, col = "linmod")) +
  geom_density(aes(x = pred_dtree, col = "dtree")) +
  geom_density(aes(x = pred_dforest, col = "dforest")) +
  xlim(-1, 30) + 
  xlab("tip percent")

Distribution of predictions for each model

Both the linear model and the random forest give us smooth predictions. We can see that the random forest predictions are the most concentrated. The predictions for the decision tree follow a jagged distribution, probably as a result of overfitting, but we don't know that until we check performance against the test set. Overall, the predictions are more narrow than the observed averages.

results matching ""

    No results matching ""