Exercises

In these exercises, we will use the nyc_jan_xdf data from prior exercises. If you need to reload the data, run the following code:

input_csv <- 'yellow_tripsample_2016-01.csv'
input_xdf <- 'yellow_tripsample_2016-01.xdf'
rxImport(input_csv, input_xdf, overwrite = TRUE)

nyc_jan_xdf <- RxXdfData(input_xdf)

Let's re-create the histogram for trip_distance using rxHistogram:

rxHistogram( ~ trip_distance, nyc_jan_xdf, startVal = 0, endVal = 25, histType = "Percent", numBreaks = 20)

(1) Modify the formula in the line above so that we get a separate histogram for card and cash customers (based on the card_vs_cash column created in the last exercise).

We used rxHistogram to get a histogram of trip_distance, which is a numeric column. We can also feed a factor column to rxHistogram, and the result is a bar plot. If a numeric column is heavily skewed, its histogram is often hard to read because most of the information is squeezed to one side of the plot. In such cases, we can convert the numeric column into a factor column, a process that's also called binning. In R, we do that with the cut function, providing it with a set of breakpoints that mark the boundaries between one bin and the next.

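For example, let's say we wanted to know whether a taxi trip traveled zero miles (for whatever business reason), more than 0 and up to 5 miles, more than 5 and up to 10 miles, or more than 10 miles. By default, cut closes each bin on the right, so a value that sits exactly on a breakpoint falls in the lower bin. For a trip of 8.9 miles, the following code tells us which bin it belongs to.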

cut(8.9, breaks = c(-Inf, 0, 5, 10, Inf), labels = c("0", "<5", "5-10", "10+"))

[1] 5-10
Levels: 0 <5 5-10 10+
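
To see how the same breakpoints behave on more than one value, here is a minimal sketch in base R. The trip_distance vector below is made up for illustration and is not taken from the NYC taxi data. It shows that values sitting exactly on a breakpoint (5 and 10) land in the lower bin, and that tabulating the resulting factor gives the counts a bar plot would display.

```r
# Hypothetical trip distances in miles (made up for illustration,
# not taken from the NYC taxi data)
trip_distance <- c(0, 0.4, 5, 8.9, 10, 27.3)

# Same breakpoints and labels as in the example above
trip_dist_bin <- cut(trip_distance,
                     breaks = c(-Inf, 0, 5, 10, Inf),
                     labels = c("0", "<5", "5-10", "10+"))

# Show each distance next to the bin it was assigned to
print(data.frame(trip_distance, trip_dist_bin))

# Count the trips in each bin; a bar plot of the factor shows these counts
bin_counts <- table(trip_dist_bin)
print(bin_counts)
```

Because the bins are closed on the right, the label "<5" really covers distances greater than 0 and up to and including 5, and "10+" covers distances strictly greater than 10.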

(2) Modify the rxHistogram call in part (1) so that instead of plotting a histogram of trip_distance, you plot a bar plot of a column called trip_dist_bin, which bins trip_distance based on the breakpoints provided in the above code example. Compute trip_dist_bin on the fly (using the transforms argument of rxHistogram).
