Using other algorithms

So far we've only looked at two models, both built with the rxLinMod algorithm. When comparing the two, we looked at the way their predictions capture the effects of the variables used to build each model. To do the comparison, we built a dataset with all combinations of those variables, and then scored that dataset with the two models using rxPredict. This shows us how the predictions are distributed, but it doesn't tell us whether the predictions are good. The true test of a model's performance is its ability to predict out of sample, which is why we split the data in two and set aside a portion of it for model testing.

To divide the data into training and testing portions, we first use rxDataStep to create a new factor column called split, in which each row is labeled "train" or "test" so that a given proportion of the data (here 75 percent) is used to train a model and the rest is used to test the model's predictive power. We then use the rxSplit function to divide the data into the two portions based on that column. The rx_split_xdf function we create here combines the two steps into one and sets some arguments to sensible defaults.

dir.create('output', showWarnings = FALSE)
rx_split_xdf <- function(xdf = mht_xdf,
                         split_perc = 0.75,
                         output_path = "output/split",
                         ...) {

  # first create a column to split by
  rxDataStep(inData = xdf,
             outFile = xdf,
             transforms = list(
               split = factor(ifelse(rbinom(.rxNumRows, size = 1, prob = splitperc), "train", "test"))),
             transformObjects = list(splitperc = split_perc),
             overwrite = TRUE, ...)

  # then split the data in two based on the column we just created
  splitDS <- rxSplit(inData = xdf,
                     outFilesBase = file.path(output_path, "train"),
                     splitByFactor = "split",
                     overwrite = TRUE)

  return(splitDS)
}

# we can now split the data in two
mht_split <- rx_split_xdf(xdf = mht_xdf, varsToKeep = c('payment_type', 'fare_amount', 'tip_amount', 'tip_percent', 'pickup_hour', 
                                                        'pickup_dow', 'pickup_nb', 'dropoff_nb'))
names(mht_split) <- c("train", "test")

We now run three different algorithms on the data:

  • rxLinMod, the linear model from earlier with the terms tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour
  • rxDTree, the decision tree algorithm with the terms tip_percent ~ pickup_nb + dropoff_nb + pickup_dow + pickup_hour (decision trees don't need interaction terms because interactions are captured by the algorithm itself)
  • rxDForest, the random forest algorithm with the same terms as decision trees

Since this is not a modeling course, we will not discuss how these algorithms are implemented. Instead, we run each algorithm and use it to predict tip percent on the test data so we can see which one performs best.

system.time(linmod <- rxLinMod(tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour, 
                               data = mht_split$train, reportProgress = 0))
system.time(dtree <- rxDTree(tip_percent ~ pickup_nb + dropoff_nb + pickup_dow + pickup_hour, 
                             data = mht_split$train, pruneCp = "auto", reportProgress = 0))
system.time(dforest <- rxDForest(tip_percent ~ pickup_nb + dropoff_nb + pickup_dow + pickup_hour, 
                                 mht_split$train, nTree = 10, importance = TRUE, useSparseCube = TRUE, reportProgress = 0))
   user  system elapsed 
   0.00    0.00    1.62 

   user  system elapsed 
   0.03    0.00  778.00 

   user  system elapsed 
   0.02    0.00  644.17
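With the three models trained, the natural next step is to score the test portion with each one and compare their out-of-sample errors. The sketch below assumes rxPredict appends a prediction column (named via predVarNames) to the test XDF; the column names pred_linmod, pred_dtree, and pred_dforest are our own choice, not part of the original.

```r
# score the test set with each model; predVarNames names the prediction
# column that rxPredict appends to the test XDF (hypothetical names)
rxPredict(linmod,  data = mht_split$test, outData = mht_split$test,
          predVarNames = "pred_linmod", overwrite = TRUE)
rxPredict(dtree,   data = mht_split$test, outData = mht_split$test,
          predVarNames = "pred_dtree", overwrite = TRUE)
rxPredict(dforest, data = mht_split$test, outData = mht_split$test,
          predVarNames = "pred_dforest", overwrite = TRUE)

# read actuals and predictions into a data frame and compute each
# model's root mean squared error on the test data
preds <- rxDataStep(mht_split$test,
                    varsToKeep = c("tip_percent", "pred_linmod",
                                   "pred_dtree", "pred_dforest"))
rmse <- function(p) sqrt(mean((preds$tip_percent - p)^2, na.rm = TRUE))
sapply(preds[, c("pred_linmod", "pred_dtree", "pred_dforest")], rmse)
```

The model with the lowest RMSE on the test data generalizes best, regardless of how well it fit the training data.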

Since running the above algorithms can take a while, it's worth saving the models that each one returns.

trained.models <- list(linmod = linmod, dtree = dtree, dforest = dforest)
save(trained.models, file = 'trained_models.Rdata')
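In a later session, load restores the trained.models list under its original name, so the models can be reused without a costly re-fit:

```r
# restore the previously saved models without re-running the algorithms
load('trained_models.Rdata')
names(trained.models)  # "linmod" "dtree" "dforest"
```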
