# A linear model predicting tip percent

Let's begin by building a linear model involving two interactive terms: one between `pickup_nb`

and `dropoff_nb`

and another one between `pickup_dow`

and `pickup_hour`

. The idea here is that we think trip percent is not just influenced by which neighborhood the passengers was pickup up from, or which neighborhood they were dropped off to, but which neighborhood they were picked up from AND dropped off to. Similarly, we intuit that the day of the week and the hour of the day together influence tipping. For example, just because people tip high on Sundays between 9 and 12, doesn't mean that they tend to tip high any day of the week between 9 and 12 PM, or any time of the day on a Sunday. This intuition is encoded in the model formula argument that we pass to the `rxLinMod`

function: `tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour`

where we use `:`

to separate interactive terms and `+`

to separate additive terms.

```
form_1 <- as.formula(tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour)
rxlm_1 <- rxLinMod(form_1, data = mht_xdf, dropFirst = TRUE, covCoef = TRUE)
```

Examining the model coefficients individually is a daunting task because of how many there are. Moreover, when working with big datasets, a lot of coefficients come out as statistically significant by virtue of large sample size, without necessarily being practically significant. Instead for now we just look at how our predictions are looking. We start by extracting each variable's factor levels into a `list`

which we can pass to `expand.grid`

to create a dataset with all the possible combinations of the factor levels. We then use `rxPredict`

to predict `tip_percent`

using the above model.

```
rxs <- rxSummary( ~ pickup_nb + dropoff_nb + pickup_hour + pickup_dow, mht_xdf)
ll <- lapply(rxs$categorical, function(x) x[ , 1])
names(ll) <- c('pickup_nb', 'dropoff_nb', 'pickup_hour', 'pickup_dow')
pred_df_1 <- expand.grid(ll)
pred_df_1 <- rxPredict(rxlm_1, data = pred_df_1, computeStdErrors = TRUE, writeModelVars = TRUE)
names(pred_df_1)[1:2] <- paste(c('tip_pred', 'tip_stderr'), 1, sep = "_")
head(pred_df_1, 10)
```

```
tip_pred_1 tip_stderr_1 pickup_nb dropoff_nb pickup_dow pickup_hour
1 6.796323 0.16432197 Chinatown Chinatown Sun 1AM-5AM
2 10.741766 0.15853956 Little Italy Chinatown Sun 1AM-5AM
3 9.150114 0.09162002 Tribeca Chinatown Sun 1AM-5AM
4 10.174307 0.09819651 Soho Chinatown Sun 1AM-5AM
5 9.706202 0.07365164 Lower East Side Chinatown Sun 1AM-5AM
6 8.475197 0.06354026 Financial District Chinatown Sun 1AM-5AM
7 10.866035 0.07150005 Greenwich Village Chinatown Sun 1AM-5AM
8 10.997276 0.06831955 East Village Chinatown Sun 1AM-5AM
9 9.313165 0.12507373 Battery Park Chinatown Sun 1AM-5AM
10 10.613802 0.11624956 West Village Chinatown Sun 1AM-5AM
```