Exercises
In these exercises, we use the nyc_jan_xdf data from prior exercises, with card_vs_cash, tip_percent, pickup_dow, pickup_hour, and trip_duration added as new columns. If you need to re-load the data, run the following code:
input_csv <- 'yellow_tripsample_2016-01.csv'
input_xdf <- 'yellow_tripsample_2016-01.xdf'
rxImport(input_csv, input_xdf, overwrite = TRUE)
nyc_jan_xdf <- RxXdfData(input_xdf)
rxDataStep(nyc_jan_xdf, nyc_jan_xdf,
  transforms = list(
    card_vs_cash = factor(payment_type, levels = 1:2, labels = c('card', 'cash')),
    # tip as a percentage of the fare (0-100), matching the 0-25 range used in exercise (4)
    tip_percent = ifelse(tip_amount < fare_amount & fare_amount > 0, 100 * tip_amount / fare_amount, NA)
  ),
  overwrite = TRUE)
# transformation function for extracting some date and time features
xforms <- function(data) {
  weekday_labels <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
  cut_levels <- c(1, 5, 9, 12, 16, 18, 22)
  hour_labels <- c('1AM-5AM', '5AM-9AM', '9AM-12PM', '12PM-4PM', '4PM-6PM', '6PM-10PM', '10PM-1AM')
  pickup_datetime <- ymd_hms(data$tpep_pickup_datetime, tz = "UTC")
  pickup_hour <- addNA(cut(hour(pickup_datetime), cut_levels))
  pickup_dow <- factor(wday(pickup_datetime), levels = 1:7, labels = weekday_labels)
  levels(pickup_hour) <- hour_labels
  dropoff_datetime <- ymd_hms(data$tpep_dropoff_datetime, tz = "UTC")
  data$pickup_hour <- pickup_hour
  data$pickup_dow <- pickup_dow
  data$trip_duration <- as.integer(as.duration(dropoff_datetime - pickup_datetime))
  data
}
rxDataStep(nyc_jan_xdf, nyc_jan_xdf, overwrite = TRUE, transformFunc = xforms, transformPackages = "lubridate")
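The hour binning inside xforms can be hard to read at a glance. The following base R sketch applies the same cut, addNA, and relabeling steps to a handful of hand-picked hours, so you can see which label each hour receives. The vector `hours` and the data frame `hour_bins` are made up for illustration and are not part of the taxi data.

```r
# Same break points and labels as in xforms
cut_levels <- c(1, 5, 9, 12, 16, 18, 22)
hour_labels <- c('1AM-5AM', '5AM-9AM', '9AM-12PM', '12PM-4PM', '4PM-6PM', '6PM-10PM', '10PM-1AM')

hours <- c(0, 1, 2, 5, 6, 22, 23)  # illustrative values, not taken from the data

# cut() uses right-closed intervals, so hours outside (1, 22] become NA;
# addNA() turns NA into a seventh level, which the relabeling names '10PM-1AM'
binned <- addNA(cut(hours, cut_levels))
levels(binned) <- hour_labels

hour_bins <- data.frame(hour = hours, pickup_hour = as.character(binned))
print(hour_bins)
```

Because the intervals are closed on the right, hour 1 lands in '10PM-1AM', hour 5 in '1AM-5AM', and hour 22 in '6PM-10PM'.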
(1) Build a linear model for predicting tip_percent using trip_duration and the interaction of pickup_dow and pickup_hour. Find out what your adjusted R-squared is by passing the model object to the summary function.
formula_1 <- ## your formula goes here
linmod_1 <- ## build a linear model based on the above formula
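As a reminder of the formula syntax, here is a minimal sketch that uses base R's lm on the built-in mtcars data as a stand-in. All names starting with `toy` are illustrative only. In a formula, `a:b` denotes the interaction of a and b, and summary reports the adjusted R-squared. The formula for this exercise should follow the same pattern with the taxi columns, presumably fitted with RevoScaleR's rxLinMod rather than lm.

```r
# Toy stand-in data: mtcars ships with base R
toy <- mtcars
toy$cyl <- factor(toy$cyl)
toy$am  <- factor(toy$am, labels = c('automatic', 'manual'))

# One numeric main effect (wt) plus the interaction of two factors (cyl:am)
toy_formula <- mpg ~ wt + cyl:am
toy_model <- lm(toy_formula, data = toy)

# summary() returns the adjusted R-squared alongside the coefficients
toy_adj_r2 <- summary(toy_model)$adj.r.squared
print(toy_adj_r2)
```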
Let's now try to improve our predictions by creating a better model. One option is to pick a "better" algorithm, but as we discussed, "better" is usually subjective: every algorithm has its pros and cons, and choosing between two algorithms can be a balancing act. However, one thing that any model can benefit from is better features. Better features can mean features that have been pre-processed to suit a particular algorithm, or it can mean using more inputs in the model.
(2) Let's continue with linear models. Let's build a linear model very similar to the one represented by formula_1, except that we add card_vs_cash as an input (a main effect). Let's call the new formula formula_2.
formula_2 <- ## formula described above goes here
linmod_2 <- ## build a linear model based on the above formula
(3) Use rxPredict to put the predictions made by both models into the data as new columns called tip_pred_1 and tip_pred_2. Then use rxHistogram to plot the predictions.
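The general pattern (score the same data with two models, store each set of predictions in its own column, then compare the histograms) can be sketched in base R as follows. This is an analogy on the built-in mtcars data, not the RevoScaleR solution: predict and hist stand in for rxPredict and rxHistogram, and every name starting with `toy` is hypothetical. For the actual exercise, rxPredict has a predVarNames argument for naming the prediction column; see ?rxPredict to confirm the details.

```r
# Toy stand-in data: mtcars ships with base R
toy <- mtcars
toy$cyl <- factor(toy$cyl)
toy$am  <- factor(toy$am, labels = c('automatic', 'manual'))

# Two models that differ by one extra main effect (am)
toy_model_1 <- lm(mpg ~ wt + cyl, data = toy)
toy_model_2 <- lm(mpg ~ wt + cyl + am, data = toy)

# Store each model's predictions as a new column in the same data
toy$pred_1 <- predict(toy_model_1, newdata = toy)
toy$pred_2 <- predict(toy_model_2, newdata = toy)

# Plot the two prediction distributions side by side
old_par <- par(mfrow = c(1, 2))
hist(toy$pred_1, main = 'toy_model_1', xlab = 'predicted mpg')
hist(toy$pred_2, main = 'toy_model_2', xlab = 'predicted mpg')
par(old_par)
```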
(4) It is also possible that our predictions are good but need to be calibrated. To recalibrate the predictions, we use the rescale function in the scales package. In this case, let's rescale the predictions so that both models predict a tip between 0 and 25 percent. Use rxDataStep to write a transformation that does the following:
- first replace predictions below 0 with NA and replace predictions above 25 with 25
- then rescale the predictions to be between 0 and 25 using the rescale function (e.g. rescale(x, to = c(0, 25)))
rxDataStep(nyc_jan_xdf, nyc_jan_xdf,
  transforms = list(
    ## transform tip_pred_1 and tip_pred_2 as described above
  ), overwrite = TRUE, transformPackages = "scales")
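To make the two steps concrete, here is a small sketch on a made-up vector of predictions. So that it runs without the scales package, it uses a hypothetical helper, `rescale_to`, which applies the same linear map that rescale uses by default (the observed minimum goes to the lower bound and the observed maximum to the upper bound). The vector `toy_pred` is invented for illustration.

```r
# Hypothetical helper: the linear map that scales::rescale applies by default
rescale_to <- function(x, to = c(0, 1)) {
  from <- range(x, na.rm = TRUE)
  (x - from[1]) / diff(from) * diff(to) + to[1]
}

toy_pred <- c(-3, 2, 10, 18, 40)  # made-up predictions, in percent

# Step 1: values below 0 become NA, values above 25 are capped at 25
clamped <- ifelse(toy_pred < 0, NA, pmin(toy_pred, 25))

# Step 2: stretch the remaining values onto the interval [0, 25]
rescaled <- rescale_to(clamped, to = c(0, 25))
print(rescaled)
```

Note that the smallest remaining value (2) maps to 0 and the capped value maps to 25, so rescaling changes the predictions even when they already lie between 0 and 25.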
Now plot the histograms again and comment on the distribution in each plot. What does the bimodal shape (two separate concentrations) of the distribution of tip_pred_2 say about the model linmod_2?