Solutions
(1) We begin by building the three models on the training data and saving the results in an RData file.
# train a linear model, a decision tree, and a random forest on the training set
linmod <- rxLinMod(tip_percent ~ payment_type + pickup_nb:dropoff_nb + pickup_dow:pickup_hour,
                   mht_split$train, reportProgress = 0)
dtree <- rxDTree(tip_percent ~ payment_type + pickup_nb + dropoff_nb + pickup_dow + pickup_hour,
                 mht_split$train, pruneCp = "auto", reportProgress = 0)
dforest <- rxDForest(tip_percent ~ payment_type + pickup_nb + dropoff_nb + pickup_dow + pickup_hour,
                     mht_split$train, nTree = 10, importance = TRUE, useSparseCube = TRUE,
                     reportProgress = 0)
# collect the three models in a list and save it for later use
trained.models <- list(linmod = linmod, dtree = dtree, dforest = dforest)
save(trained.models, file = 'trained_models_2.Rdata')
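Since training on the full training set can take a while, it can be handy to restore the saved models in a later session rather than retraining them; a minimal sketch:
# in a fresh session, restore the saved list of trained models
load('trained_models_2.Rdata')  # recreates `trained.models` in the workspace
names(trained.models)           # "linmod" "dtree" "dforest"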
(2) We can now build a prediction dataset with all the combinations of the input variables. If any of the input variables were numeric, we would have to discretize it first so that the dataset does not blow up in size (a sketch of this is shown after the code below); in our case all of the inputs are categorical. Finally, with do.call we can combine the predictions with the original data in a single step using the cbind function.
# get the levels of each categorical input from a summary over the whole data
rxs <- rxSummary( ~ payment_type + pickup_nb + dropoff_nb + pickup_hour + pickup_dow, mht_xdf)
ll <- lapply(rxs$categorical, function(x) x[ , 1])
names(ll) <- c('payment_type', 'pickup_nb', 'dropoff_nb', 'pickup_hour', 'pickup_dow')
# build every combination of the input levels
pred_df <- expand.grid(ll)
pred_df_1 <- rxPredict(trained.models$linmod, data = pred_df, predVarNames = "pred_linmod")
pred_df_2 <- rxPredict(trained.models$dtree, data = pred_df, predVarNames = "pred_dtree")
pred_df_3 <- rxPredict(trained.models$dforest, data = pred_df, predVarNames = "pred_dforest")
# column-bind the three prediction columns onto the combinations in one step
pred_df <- do.call(cbind, list(pred_df, pred_df_1, pred_df_2, pred_df_3))
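As an aside, here is what the discretization mentioned above might look like if we did have a numeric input. The trip_distance column and the cut points are hypothetical, purely for illustration:
# hypothetical sketch: bucket a numeric input into a few bins so that
# expand.grid() works with bin labels instead of thousands of raw values
trip_distance <- runif(1000, 0, 30)  # stand-in for a numeric column
dist_bins <- cut(trip_distance, breaks = c(0, 1, 3, 7, 30), include.lowest = TRUE)
ll$trip_distance <- levels(dist_bins)  # only 4 levels enter the combinations
Note that the models would also have to be trained on the binned version of the variable for the predictions to work.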
(3) We now feed the above data to ggplot to look at the distribution of the predictions made by each model. It should come as no surprise that, with the inclusion of payment_type, the predictions have a bimodal distribution: one mode for trips paid in cash and one for trips paid by card. For trips paid in cash the actual distribution is not that important, but for trips paid by card we can see that the random forest model makes predictions that are less spread out than those of the other two models.
ggplot(data = pred_df) +
  geom_density(aes(x = pred_linmod, col = "linmod")) +
  geom_density(aes(x = pred_dtree, col = "dtree")) +
  geom_density(aes(x = pred_dforest, col = "dforest")) +
  xlab("tip percent") # + facet_grid(pickup_hour ~ pickup_dow)
(4) We now run the predictions on the test data to evaluate each model's performance. To do so, we use rxPredict without the outData argument and assign the result on the left-hand side so that the predictions land in a data.frame.
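As an aside, if the test set were too large to hold in memory, we could instead supply outData and have rxPredict append the predictions to the XDF file; a minimal sketch (assuming we are willing to modify mht_split$test in place):
# alternative for out-of-memory data: write predictions into the XDF file itself
rxPredict(trained.models$linmod, data = mht_split$test, outData = mht_split$test,
          predVarNames = "tip_pred_linmod", overwrite = TRUE)
Here we take the simpler in-memory route: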
test_df <- rxXdfToDataFrame(mht_split$test,
                            varsToKeep = c('tip_percent', 'payment_type', 'pickup_nb',
                                           'dropoff_nb', 'pickup_hour', 'pickup_dow'),
                            maxRowsByCols = 10^9)
test_df_1 <- rxPredict(trained.models$linmod, data = test_df, predVarNames = "tip_pred_linmod")
test_df_2 <- rxPredict(trained.models$dtree, data = test_df, predVarNames = "tip_pred_dtree")
test_df_3 <- rxPredict(trained.models$dforest, data = test_df, predVarNames = "tip_pred_dforest")
test_df <- do.call(cbind, list(test_df, test_df_1, test_df_2, test_df_3))
head(test_df)
tip_percent payment_type pickup_nb dropoff_nb pickup_hour pickup_dow
1 0 cash Gramercy Garment District 10PM-1AM Fri
2 0 cash Chelsea Chelsea 10PM-1AM Fri
3 24 card Upper East Side Chelsea 10PM-1AM Fri
4 0 cash Lower East Side Lower East Side 10PM-1AM Fri
5 17 card Chelsea Financial District 10PM-1AM Fri
6 12 card Lower East Side West Village 10PM-1AM Fri
tip_pred_linmod tip_pred_dtree tip_pred_dforest
1 0.3208239 0.0002443074 2.226467
2 1.2503461 0.0002443074 1.220008
3 15.6659118 15.8644298583 14.627761
4 0.7529378 0.0002443074 1.063518
5 15.7867789 14.9364630503 15.773411
6 16.2032915 16.2778606733 15.261509
Since we have the test data in a data.frame, we can also plot the distribution of the predictions on the test data and compare it with the last plot. As we can see, the random forest and the linear model both probably waste some computational effort making predictions for trips paid in cash.
ggplot(data = test_df) +
  geom_density(aes(x = tip_pred_linmod, col = "linmod")) +
  geom_density(aes(x = tip_pred_dtree, col = "dtree")) +
  geom_density(aes(x = tip_pred_dforest, col = "dforest")) +
  xlab("tip percent") # + facet_grid(pickup_hour ~ pickup_dow)
(5) Recall from the last section that the predictions made by the three models had an average SSE of about 80. With the inclusion of payment_type, we should see a considerable drop in this number.
rxSummary(~ SSE_linmod + SSE_dtree + SSE_dforest, data = test_df,
          transforms = list(SSE_linmod = (tip_percent - tip_pred_linmod)^2,
                            SSE_dtree = (tip_percent - tip_pred_dtree)^2,
                            SSE_dforest = (tip_percent - tip_pred_dforest)^2))
Summary Statistics Results for: ~SSE_linmod + SSE_dtree + SSE_dforest
Data: test_df
Number of valid observations: 43121727
Name Mean StdDev Min Max ValidObs MissingObs
SSE_linmod 21.05534 87.48650 0.0000000006167121 8699.937 42949555 172172
SSE_dtree 22.07582 89.50709 0.0000000063516358 8463.955 43121727 0
SSE_dforest 24.47603 90.66860 0.0000000005298480 8463.965 43121727 0
The average SSE has now dropped to a little over 20, which confirms that including the right features (payment_type in this case) can have a significant impact on a model's predictive power.
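Since test_df is an ordinary data.frame, we can also double-check these averages directly in base R; a minimal sketch:
# sanity check: mean squared error per model, computed without rxSummary
sapply(c("tip_pred_linmod", "tip_pred_dtree", "tip_pred_dforest"),
       function(p) mean((test_df$tip_percent - test_df[[p]])^2, na.rm = TRUE))
# na.rm = TRUE skips the rows where a model could not produce a prediction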