A linear model predicting tip percent

Let's begin by building a linear model involving two interactive terms: one between pickup_nb and dropoff_nb and another one between pickup_dow and pickup_hour. The idea here is that we think trip percent is not just influenced by which neighborhood the passengers was pickup up from, or which neighborhood they were dropped off to, but which neighborhood they were picked up from AND dropped off to. Similarly, we intuit that the day of the week and the hour of the day together influence tipping. For example, just because people tip high on Sundays between 9 and 12, doesn't mean that they tend to tip high any day of the week between 9 and 12 PM, or any time of the day on a Sunday. This intuition is encoded in the model formula argument that we pass to the rxLinMod function: tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour where we use : to separate interactive terms and + to separate additive terms.

form_1 <- as.formula(tip_percent ~ pickup_nb:dropoff_nb + pickup_dow:pickup_hour)
rxlm_1 <- rxLinMod(form_1, data = mht_xdf, dropFirst = TRUE, covCoef = TRUE)

Examining the model coefficients individually is a daunting task because of how many there are. Moreover, when working with big datasets, a lot of coefficients come out as statistically significant by virtue of large sample size, without necessarily being practically significant. Instead for now we just look at how our predictions are looking. We start by extracting each variable's factor levels into a list which we can pass to expand.grid to create a dataset with all the possible combinations of the factor levels. We then use rxPredict to predict tip_percent using the above model.

rxs <- rxSummary( ~ pickup_nb + dropoff_nb + pickup_hour + pickup_dow, mht_xdf)
ll <- lapply(rxs$categorical, function(x) x[ , 1])
names(ll) <- c('pickup_nb', 'dropoff_nb', 'pickup_hour', 'pickup_dow')
pred_df_1 <- expand.grid(ll)
pred_df_1 <- rxPredict(rxlm_1, data = pred_df_1, computeStdErrors = TRUE, writeModelVars = TRUE)
names(pred_df_1)[1:2] <- paste(c('tip_pred', 'tip_stderr'), 1, sep = "_")
head(pred_df_1, 10)

   tip_pred_1 tip_stderr_1          pickup_nb dropoff_nb pickup_dow pickup_hour
1    6.796323   0.16432197          Chinatown  Chinatown        Sun     1AM-5AM
2   10.741766   0.15853956       Little Italy  Chinatown        Sun     1AM-5AM
3    9.150114   0.09162002            Tribeca  Chinatown        Sun     1AM-5AM
4   10.174307   0.09819651               Soho  Chinatown        Sun     1AM-5AM
5    9.706202   0.07365164    Lower East Side  Chinatown        Sun     1AM-5AM
6    8.475197   0.06354026 Financial District  Chinatown        Sun     1AM-5AM
7   10.866035   0.07150005  Greenwich Village  Chinatown        Sun     1AM-5AM
8   10.997276   0.06831955       East Village  Chinatown        Sun     1AM-5AM
9    9.313165   0.12507373       Battery Park  Chinatown        Sun     1AM-5AM
10  10.613802   0.11624956       West Village  Chinatown        Sun     1AM-5AM

3.2a A linear model for tip percent

A linear model predicting tip percent

results matching ""

No results matching ""