Using other algorithms
So far we've only looked at two models from the same rxLinMod algorithm. When comparing the two, we looked at the way their predictions capture the effects of the variables used to build each model. To do the comparison, we built a dataset with all combinations of the variables used to build the models, and then scored that dataset with the two models using rxPredict. By doing so we can see how the predictions are distributed, but we still don't know if the predictions are good. The true test of a model's performance is its ability to predict out of sample, which is why we split the data in two and set aside a portion of it for model testing.
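As a reminder of what that comparison looks like, here is a minimal base R sketch of the same idea. It is only an analogue: lm, expand.grid and predict stand in for rxLinMod, the combinations dataset and rxPredict, and the toy data frame and its values are made up for illustration.

```r
# Toy data standing in for the taxi data (hypothetical values, base R only)
set.seed(1)
toy <- data.frame(
  pickup_dow  = factor(sample(c("Mon", "Sat"), 200, replace = TRUE)),
  pickup_hour = factor(sample(c("AM", "PM"), 200, replace = TRUE))
)
toy$tip_percent <- 10 + 2 * (toy$pickup_dow == "Sat") + rnorm(200)

# Two competing models built on the same data
fit_main  <- lm(tip_percent ~ pickup_dow + pickup_hour, data = toy)
fit_inter <- lm(tip_percent ~ pickup_dow * pickup_hour, data = toy)

# One row per combination of the factor levels used by the models
grid <- expand.grid(pickup_dow  = levels(toy$pickup_dow),
                    pickup_hour = levels(toy$pickup_hour))

# Score the grid with both models to compare their predictions side by side
grid$pred_main  <- predict(fit_main,  newdata = grid)
grid$pred_inter <- predict(fit_inter, newdata = grid)
print(grid)
```

The grid shows how each model's predictions vary across the inputs, but nothing in it says how close those predictions are to tips the model has not seen.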
To divide the data into training and testing portions, we first used rxDataStep to create a new factor column called split where each row is "train" or "test", such that a given proportion of the data (here 75 percent) is used to train a model and the rest is used to test the model's predictive power. We then used the rxSplit function to divide the data into the two portions. The rx_split_xdf function we create here combines the two steps into one and sets some arguments to defaults.
dir.create('output', showWarnings = FALSE)

rx_split_xdf <- function(xdf = mht_xdf,
                         split_perc = 0.75,
                         output_path = "output/split",
                         ...) {
  # first create a column to split by; the result is written to a temporary
  # XDF file rather than back onto the input data
  outFile <- tempfile(fileext = '.xdf')
  # the factor levels are set explicitly (R would otherwise sort them as
  # "test", "train") so their order matches the names assigned to the result
  rxDataStep(inData = xdf,
             outFile = outFile,
             transforms = list(
               split = factor(ifelse(rbinom(.rxNumRows, size = 1, prob = splitperc), "train", "test"),
                              levels = c("train", "test"))),
             transformObjects = list(splitperc = split_perc),
             overwrite = TRUE, ...)
  # then split the data in two based on the column we just created
  splitDS <- rxSplit(inData = outFile,
                     outFilesBase = file.path(output_path, "train"),
                     splitByFactor = "split",
                     overwrite = TRUE)
  return(splitDS)
}
# we can now split the data in two
mht_split <- rx_split_xdf(xdf = mht_xdf,
                          varsToKeep = c('payment_type', 'fare_amount', 'tip_amount', 'tip_percent',
                                         'pickup_hour', 'pickup_dow', 'pickup_nb', 'dropoff_nb'))
names(mht_split) <- c("train", "test")
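To see what the random split does on its own, here is a small base R sketch on an in-memory data frame. It is an analogue under stated assumptions, not the RevoScaleR code: the toy data frame is hypothetical, and base R's split() plays the role that rxSplit plays for XDF files.

```r
set.seed(42)
n <- 10000
toy <- data.frame(id = seq_len(n))   # hypothetical stand-in for the taxi data

# Each row is independently assigned to "train" with probability 0.75
is_train <- rbinom(n, size = 1, prob = 0.75)
toy$split <- factor(ifelse(is_train == 1, "train", "test"),
                    levels = c("train", "test"))

# split() returns one data frame per factor level, in level order
parts <- split(toy, toy$split)

names(parts)            # follows the level order set above
nrow(parts$train) / n   # close to 0.75, though rarely exactly 0.75
```

Because each row is assigned by an independent coin flip, the training share is only approximately 75 percent. The explicit levels argument matters whenever the result is later labeled by position: without it, factor() sorts the levels alphabetically and "test" comes first.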
We now run three different algorithms on the data:
- rxLinMod, the linear model from earlier, with the terms tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour
- rxDTree, the decision tree algorithm, with the terms tip_percent ~ pickup_nb + dropoff_nb + pickup_dow + pickup_hour (decision trees don't need interaction terms because interactions are built into the algorithm itself)
- rxDForest, the random forest algorithm, with the same terms as the decision tree
Since this is not a modeling course, we will not discuss how the algorithms are implemented. Instead we run the algorithms and use them to predict tip percent on the test data so we can see which one works better.
system.time(linmod <- rxLinMod(tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour,
                               data = mht_split$train, reportProgress = 0))

system.time(dtree <- rxDTree(tip_percent ~ pickup_nb + dropoff_nb + pickup_dow + pickup_hour,
                             data = mht_split$train, pruneCp = "auto", reportProgress = 0))

system.time(dforest <- rxDForest(tip_percent ~ pickup_nb + dropoff_nb + pickup_dow + pickup_hour,
                                 data = mht_split$train, nTree = 10, importance = TRUE,
                                 useSparseCube = TRUE, reportProgress = 0))
user system elapsed
0.00 0.00 1.62
user system elapsed
0.03 0.00 778.00
user system elapsed
0.02 0.00 644.17
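Judging which model "works better" out of sample usually comes down to scoring the held-out rows and summarizing the prediction error. The sketch below shows one common way to do that in base R, assuming root mean squared error (RMSE) as the metric. The toy data, the two lm models and the rmse helper are all hypothetical stand-ins, not the course's models or its chosen metric.

```r
set.seed(7)
n <- 1000
toy <- data.frame(x = runif(n))
toy$y <- 3 + 2 * toy$x + rnorm(n, sd = 0.5)   # made-up relationship

# Hold out roughly a quarter of the rows for testing
train_rows <- rbinom(n, size = 1, prob = 0.75) == 1
train <- toy[train_rows, ]
test  <- toy[!train_rows, ]

# Two candidate models, both fitted on the training rows only
fit_null <- lm(y ~ 1, data = train)   # baseline: always predicts the mean
fit_lin  <- lm(y ~ x, data = train)

# Hypothetical helper: root mean squared error of a set of predictions
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Score the held-out rows with each model and compare the errors
scores <- c(
  null   = rmse(test$y, predict(fit_null, newdata = test)),
  linear = rmse(test$y, predict(fit_lin,  newdata = test))
)
print(scores)
```

The model with the lower error on the test rows is the one that generalizes better on this metric. Including a trivial baseline gives the other scores something to be compared against.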
Since training the three models can take a while, it is worth saving the model objects that the algorithms return.
trained.models <- list(linmod = linmod, dtree = dtree, dforest = dforest)
save(trained.models, file = 'trained_models.Rdata')
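In a later session the saved models can be restored with load instead of being retrained. Here is a minimal base R sketch of the round trip, using a small lm fit on the built-in cars data and a temporary file path, both of which are stand-ins chosen for illustration.

```r
# A small model standing in for the trained models (built-in cars data)
models <- list(small = lm(dist ~ speed, data = cars))

# Save the list to a file; a temporary path is used here for illustration
model_file <- file.path(tempdir(), "toy_models.Rdata")
save(models, file = model_file)

# Simulate a fresh session by removing the object, then restore it
rm(models)
loaded_names <- load(model_file)

loaded_names          # names of the objects that load() re-created
coef(models$small)    # the restored model is usable right away
```

Note that load re-creates objects under the names they were saved with, so it can silently overwrite an existing object of the same name in the workspace.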