# Exercises

In this exercise, we will be using the `nyc_jan_xdf` data from prior exercises. We also add `card_vs_cash`, `tip_percent`, `pickup_dow`, `pickup_hour`, and `trip_duration` as new columns to the data. If you need to re-load the data, run the following code:

```
input_csv <- 'yellow_tripsample_2016-01.csv'
input_xdf <- 'yellow_tripsample_2016-01.xdf'
rxImport(input_csv, input_xdf, overwrite = TRUE)
nyc_jan_xdf <- RxXdfData(input_xdf)
# tip_percent is NA unless the tip is smaller than a positive fare
rxDataStep(nyc_jan_xdf, nyc_jan_xdf,
  transforms = list(
    card_vs_cash = factor(payment_type, levels = 1:2, labels = c('card', 'cash')),
    tip_percent = ifelse(tip_amount < fare_amount & fare_amount > 0, tip_amount / fare_amount, NA)
  ),
  overwrite = TRUE)

xforms <- function(data) { # transformation function for extracting some date and time features
  weekday_labels <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
  cut_levels <- c(1, 5, 9, 12, 16, 18, 22)
  hour_labels <- c('1AM-5AM', '5AM-9AM', '9AM-12PM', '12PM-4PM', '4PM-6PM', '6PM-10PM', '10PM-1AM')

  pickup_datetime <- ymd_hms(data$tpep_pickup_datetime, tz = "UTC")
  # hours outside (1, 22] fall into the NA level, which is relabeled '10PM-1AM' below
  pickup_hour <- addNA(cut(hour(pickup_datetime), cut_levels))
  pickup_dow <- factor(wday(pickup_datetime), levels = 1:7, labels = weekday_labels)
  levels(pickup_hour) <- hour_labels

  dropoff_datetime <- ymd_hms(data$tpep_dropoff_datetime, tz = "UTC")

  data$pickup_hour <- pickup_hour
  data$pickup_dow <- pickup_dow
  data$trip_duration <- as.integer(as.duration(dropoff_datetime - pickup_datetime))
  data
}

rxDataStep(nyc_jan_xdf, nyc_jan_xdf, overwrite = TRUE, transformFunc = xforms, transformPackages = "lubridate")
```

(1) Build a linear model for predicting `tip_percent` using `trip_duration` and the interaction of `pickup_dow` and `pickup_hour`. Find out what your adjusted R-squared is by passing the model object to the `summary` function.

```
formula_1 <- ## your formula goes here
linmod_1 <- ## build a linear model based on the above formula
```
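If the formula syntax for interactions is unfamiliar, the following sketch may help. It uses base R's `lm` on made-up toy data rather than the taxi data, and every name in it (`toy`, `duration`, `dow`, `hour`, `tip`) is a placeholder chosen for illustration. It is not the solution to the exercise, only a reminder of how an interaction term is written and where the adjusted R-squared is found.

```r
# Toy data standing in for the taxi sample (all values are made up)
set.seed(1)
n <- 200
toy <- data.frame(
  duration = runif(n, 60, 3600),
  dow      = factor(sample(c("Sat", "Sun", "Mon"), n, replace = TRUE)),
  hour     = factor(sample(c("AM", "PM"), n, replace = TRUE))
)
toy$tip <- 0.15 + 1e-05 * toy$duration + rnorm(n, sd = 0.02)

# `a:b` adds only the interaction; `a*b` would expand to a + b + a:b
toy_formula <- tip ~ duration + dow:hour
toy_model   <- lm(toy_formula, data = toy)

# summary() reports the adjusted R-squared; it can also be extracted directly
summary(toy_model)$adj.r.squared
```

The same formula notation carries over to the model-building function used in this course; only the fitting function and the data source change.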

Let's now try to improve our predictions by creating a better model. We could look for a "better" algorithm, but, as we discussed, "better" is usually subjective: every algorithm has its pros and cons, and choosing between two algorithms can be a balancing act. However, one thing that any model can benefit from is better features. Better features can mean features that have been pre-processed to suit a particular algorithm, or it can mean using more inputs in the model.

(2) Let's continue with linear models. Let's build a linear model very similar to the one represented by `formula_1`, except that we add `card_vs_cash` as input (a main effect). Let's call the new formula `formula_2`.

```
formula_2 <- ## formula described above goes here
linmod_2 <- ## build a linear model based on the above formula
```
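One way to derive a formula from an existing one, instead of retyping it, is base R's `update`. The sketch below uses placeholder variable names (`y`, `x`, `a`, `b`, `z`) that are not columns in the taxi data; it only illustrates the mechanism of appending a main effect.

```r
# Hypothetical starting formula; the variable names are placeholders
base_formula <- y ~ x + a:b

# `.` on each side means "keep what was there"; `+ z` appends a main effect
extended_formula <- update(base_formula, . ~ . + z)

print(extended_formula)
```

Writing the longer formula out by hand works just as well; `update` mainly helps keep two related formulas in sync.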

(3) Use `rxPredict` to put the predictions made by both models into the data as new columns called `tip_pred_1` and `tip_pred_2`. Then use `rxHistogram` to plot the predictions.

(4) It is also possible that our predictions are good, but need to be somewhat calibrated. To recalibrate the predictions, we use the `rescale` function in the `scales` package. In this case, let's rescale predictions so that both models predict a number between 0 and 25% tip. So use `rxDataStep` to write a transformation that does the following:

- first replace predictions below 0 with `NA` and replace predictions above 25 with 25
- then rescale the predictions to be between 0 and 25 using the `rescale` function (e.g. `rescale(x, to = c(0, 25))`)

```
rxDataStep(nyc_jan_xdf, nyc_jan_xdf,
transforms = list(
## transform tip_pred_1 and tip_pred_2 as described above
), overwrite = TRUE, transformPackages = "scales")
```
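To see what the two steps do to individual values, here is a minimal sketch on a small made-up vector. It runs in base R only: `rescale_to` is a hypothetical stand-in written for this illustration, assumed to perform the same linear mapping as `rescale(x, to = c(0, 25))`, and `preds` is invented data rather than model output.

```r
# Hypothetical predictions (made-up values)
preds <- c(-3, 2, 10, 18, 30)

# Step 1: negative predictions become NA, predictions above 25 are capped at 25
clamped <- ifelse(preds < 0, NA, pmin(preds, 25))

# Step 2: base-R stand-in for scales::rescale(x, to = c(0, 25))
rescale_to <- function(x, to = c(0, 1)) {
  from <- range(x, na.rm = TRUE)              # observed min and max, ignoring NA
  (x - from[1]) / diff(from) * diff(to) + to[1]
}
rescaled <- rescale_to(clamped, to = c(0, 25))

range(rescaled, na.rm = TRUE)
```

Because the mapping is anchored on the observed minimum and maximum, the smallest surviving prediction is moved to 0 and the largest to 25, while `NA` values stay `NA`.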

Now plot the histograms again and comment on the distribution of each plot. What does the bimodal (two separate concentrations) shape of the distribution of `tip_pred_2` say about the model `linmod_2`?