Exercises
In this exercise, we will be using the nyc_jan_xdf data from prior exercises. We also add card_vs_cash, tip_percent, pickup_dow, pickup_hour, and trip_duration as new columns to the data. If you need to re-load the data, run the following code:
input_csv <- 'yellow_tripsample_2016-01.csv'
input_xdf <- 'yellow_tripsample_2016-01.xdf'
rxImport(input_csv, input_xdf, overwrite = TRUE)
nyc_jan_xdf <- RxXdfData(input_xdf)
rxDataStep(nyc_jan_xdf, nyc_jan_xdf,
  transforms = list(
    card_vs_cash = factor(payment_type, levels = 1:2, labels = c('card', 'cash')),
    tip_percent = ifelse(tip_amount < fare_amount & fare_amount > 0, tip_amount / fare_amount, NA)
  ),
  overwrite = TRUE)
# transformation function for extracting some date and time features
xforms <- function(data) {
  weekday_labels <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
  cut_levels <- c(1, 5, 9, 12, 16, 18, 22)
  hour_labels <- c('1AM-5AM', '5AM-9AM', '9AM-12PM', '12PM-4PM', '4PM-6PM', '6PM-10PM', '10PM-1AM')
  pickup_datetime <- ymd_hms(data$tpep_pickup_datetime, tz = "UTC")
  pickup_hour <- addNA(cut(hour(pickup_datetime), cut_levels))
  pickup_dow <- factor(wday(pickup_datetime), levels = 1:7, labels = weekday_labels)
  levels(pickup_hour) <- hour_labels
  dropoff_datetime <- ymd_hms(data$tpep_dropoff_datetime, tz = "UTC")
  data$pickup_hour <- pickup_hour
  data$pickup_dow <- pickup_dow
  data$trip_duration <- as.integer(as.duration(dropoff_datetime - pickup_datetime))
  data
}
rxDataStep(nyc_jan_xdf, nyc_jan_xdf,
  overwrite = TRUE,
  transformFunc = xforms,
  transformPackages = "lubridate")
(1) Build a linear model for predicting tip_percent using trip_duration and the interaction of pickup_dow and pickup_hour. Find out what your adjusted R-squared is by passing the model object to the summary function.
formula_1 <- ## your formula goes here
linmod_1 <- ## build a linear model based on the above formula
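If the formula syntax is unfamiliar, the following minimal sketch may help. It uses base R's lm on simulated data rather than rxLinMod on the taxi data, so it is only an analogue of the exercise, not its solution. The data frame toy and its columns x, f1, f2 and y are hypothetical names made up for illustration. In an R formula, f1:f2 denotes the interaction alone, while f1*f2 expands to both main effects plus the interaction.

```r
set.seed(1)
n <- 200

# Simulated data: one numeric input and two categorical inputs (all hypothetical)
toy <- data.frame(
  x  = runif(n),
  f1 = factor(sample(c("a", "b"), n, replace = TRUE)),
  f2 = factor(sample(c("u", "v", "w"), n, replace = TRUE))
)
toy$y <- 2 * toy$x + ifelse(toy$f1 == "b" & toy$f2 == "w", 1, 0) + rnorm(n, sd = 0.5)

# Terms implied by the two ways of writing an interaction
terms_colon <- attr(terms(y ~ x + f1:f2), "term.labels")  # "x" "f1:f2"
terms_star  <- attr(terms(y ~ f1 * f2), "term.labels")    # "f1" "f2" "f1:f2"

# Fit a numeric main effect plus the interaction of the two factors
fit <- lm(y ~ x + f1:f2, data = toy)

# For an lm object, the adjusted R-squared is stored in the summary
adj_r2 <- summary(fit)$adj.r.squared
print(adj_r2)
```

With rxLinMod the summary output is laid out differently, but the formula notation follows the same conventions.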
Let's now try to improve our predictions by building a better model. One option is to pick a "better" algorithm, but as we discussed, "better" is usually subjective: every algorithm has its pros and cons, and choosing between two algorithms can be a balancing act. However, one thing any model can benefit from is better features. Better features can mean features that have been pre-processed to suit a particular algorithm, or it can mean using more inputs in the model.
(2) Let's continue with linear models. Let's build a linear model very similar to the one represented by formula_1, except that we add card_vs_cash as an input (a main effect). Let's call the new formula formula_2.
formula_2 <- ## formula described above goes here
linmod_2 <- ## build a linear model based on the above formula
(3) Use rxPredict to put the predictions made by both models into the data as new columns called tip_pred_1 and tip_pred_2. Then use rxHistogram to plot the predictions.
(4) It is also possible that our predictions are good but need to be somewhat calibrated. To recalibrate the predictions, we use the rescale function in the scales library. In this case, let's rescale the predictions so that both models predict a tip between 0 and 25%. Use rxDataStep to write a transformation that does the following:
- first, replace predictions below 0 with NA and replace predictions above 25 with 25
- then, rescale the predictions to be between 0 and 25 using the rescale function (e.g. rescale(x, to = c(0, 25)))
rxDataStep(nyc_jan_xdf, nyc_jan_xdf,
transforms = list(
## transform tip_pred_1 and tip_pred_2 as described above
), overwrite = TRUE, transformPackages = "scales")
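Before writing the transformation, it can help to try the two steps on a plain vector. The sketch below is an illustration under stated assumptions, not the exercise solution: pred is a made-up vector of predictions on a percent scale, and rescale_to is a hypothetical base-R helper that mimics the default linear behaviour of scales::rescale (mapping the observed range of x onto the target range), so the example runs without the scales package.

```r
# Hypothetical predictions, including values outside the 0-25 range
pred <- c(-3, 5, 10, 15, 25, 40)

# Step 1: predictions below 0 become NA; predictions above 25 are capped at 25
clamped <- ifelse(pred < 0, NA, pmin(pred, 25))

# Step 2: linear rescale of the observed range onto `to`
# (hypothetical stand-in for scales::rescale with its default `from`)
rescale_to <- function(x, to = c(0, 1)) {
  from <- range(x, na.rm = TRUE)
  (x - from[1]) / diff(from) * diff(to) + to[1]
}

rescaled <- rescale_to(clamped, to = c(0, 25))
print(rescaled)
# NA  0.00  6.25 12.50 25.00 25.00
```

Note that the smallest surviving prediction (5) is mapped to 0 and the cap (25) stays at 25, so rescaling stretches the predictions rather than merely clipping them.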
Now plot the histograms again and comment on the distribution in each plot. What does the bimodal shape (two separate concentrations) of the distribution of tip_pred_2 say about the model linmod_2?