Exercises

When processing a data.frame with R, vectorized functions show up in many places. Without them, our R code would be more verbose, and often (though not always) less efficient. Let's look at another example by examining the relationship between tipping and method of payment. Let's assume that most cash customers tip (but the amount they tip does not show in the data). We further assume that tipping behavior differs between cash and card customers in the following way:

  • card customers might tip based on a certain percentage (automatically calculated when they swipe)
  • cash customers might tip by rounding up (and thereby avoid getting small change)

For example, a card customer could tip 10 percent regardless of the fare amount, but a cash customer whose fare is $4.65 would round up to $6, and if the fare is $26.32 they would round up to $30. So the cash customer's tip is also proportional to the fare amount, but partly driven by the need to avoid getting change or doing the math. We want to find a way to simulate this behavior.
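As a quick check of the arithmetic in this example, the sketch below computes both kinds of tip for the two fares mentioned. The 10 percent rate and the amounts paid ($6 and $30) come from the text above; the variable names are made up for illustration.

```r
# Fares from the example above
fare <- c(4.65, 26.32)

# Card customer: a fixed percentage of the fare (10 percent here)
card_tip <- 0.10 * fare

# Cash customer: hands over a round amount and leaves the difference as tip
cash_paid <- c(6, 30)
cash_tip <- cash_paid - fare

# Tip as a share of the fare, to compare the two behaviors
round(card_tip / fare, 2)
round(cash_tip / fare, 2)
```

The cash tips work out to $1.35 and $3.68, so the cash customer's tip as a share of the fare changes from ride to ride, while the card customer's share stays fixed.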

In other words, we want to write a function that calculates tip by rounding up the fare amount. Writing such a function from scratch is a little tedious. Fortunately, there is already a function in base R to help us:

findInterval(3.66, c(1, 3, 4.5, 6, 10))

Take a moment to inspect and familiarize yourself with the above function:

  • What does the above function return?
  • What are some ways the function could "misbehave"? In other words, check what the function returns when odd inputs are provided, including NAs. For example:

findInterval(NA, c(1, 3, 4.5, 6, 10))
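One way to start probing the function is to feed it values at and beyond the edges of the break points. The sketch below reuses the break points from above; the name breaks is chosen only for illustration. Run it and compare each result with what you expected.

```r
breaks <- c(1, 3, 4.5, 6, 10)

findInterval(0.5, breaks)   # a value below the smallest break point
findInterval(3, breaks)     # a value exactly equal to a break point
findInterval(25, breaks)    # a value above the largest break point
findInterval(NA, breaks)    # a missing value

# The break points themselves must be sorted; tryCatch() captures the
# error message instead of stopping the script
tryCatch(findInterval(3.66, rev(breaks)),
         error = function(e) conditionMessage(e))
```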

Let's break up the above code into two parts:

upper_limits <- c(1, 3, 4.5, 6, 10)
findInterval(3.66, upper_limits)

(1) Modify the last line so that it returns the first number in upper_limits that is higher than the number we provide. In this case, the number we provide is 3.66 and the first number higher than 3.66 is 4.5, so modify the code to return 4.5 only. (HINT: think of the result of the last line as an index into another vector.)

(2) Is the function findInterval vectorized? Show by example.

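(3) Wrap the above solution into a function called round.up.fare. Give it a second argument for the upper limits that defaults to upper_limits, since exercise (4) passes a different set of limits. Test it with the following input: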

sample_of_fares <- c(.55, 2.33, 4, 6.99, 15.20, 18, 23, 44)
round.up.fare(sample_of_fares)

Here's the result we expect to get:

[1]  1.0  3.0  4.5 10.0   NA   NA   NA   NA
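The NA values at the end of the expected result follow from how R indexes vectors: asking for a position past the end of a vector returns NA rather than raising an error. A minimal illustration, using the upper_limits vector from above:

```r
upper_limits <- c(1, 3, 4.5, 6, 10)

length(upper_limits)   # the vector has 5 elements
upper_limits[5]        # last valid position
upper_limits[6]        # one past the end: NA, with no error or warning
upper_limits[c(2, 6)]  # the same holds when indexing by a vector of positions
```

So any fare above the largest upper limit (10 here) has no value to round up to.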

(4) Replace the statistical approach to simulating tip_amount for the cash customers with the rule-based approach implemented in the above function. In the data transformation above (under nyc_taxi <- mutate(...)), replace the line tip_if_heads = rnorm(...) with the transformation corresponding to the rule-based approach, as implemented by round.up.fare. Use the following fare round-up upper limits:

fare_intervals <- c(0:10, seq(12, 20, by = 2), seq(25, 50, by = 5), seq(55, 100, by = 10))
round.up.fare(23, fare_intervals)
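Before plugging fare_intervals into the transformation, it can help to look at the vector itself and to check by hand which value a fare of 23 should round up to. This sketch uses only base R and does not depend on round.up.fare:

```r
fare_intervals <- c(0:10, seq(12, 20, by = 2), seq(25, 50, by = 5), seq(55, 100, by = 10))

fare_intervals          # print all the upper limits
length(fare_intervals)  # how many limits there are
max(fare_intervals)     # the largest limit; note that it is 95, not 100

# Smallest upper limit strictly greater than a fare of 23
min(fare_intervals[fare_intervals > 23])
```

Because seq(55, 100, by = 10) stops at 95, fares above 95 have no upper limit to round up to, which is worth keeping in mind when reading the new plot.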

Run the new transformation and recreate the plot, then comment on the new distribution.
