# Exercises

In this exercise, we will be using the `nyc_jan_xdf` data from prior exercises. If you need to reload the data, run the following code:

```
input_csv <- 'yellow_tripsample_2016-01.csv'
input_xdf <- 'yellow_tripsample_2016-01.xdf'
rxImport(input_csv, input_xdf, overwrite = TRUE)
nyc_jan_xdf <- RxXdfData(input_xdf)
```

Let's re-create the histogram for `trip_distance` using `rxHistogram`:

```
rxHistogram( ~ trip_distance, nyc_jan_xdf, startVal = 0, endVal = 25, histType = "Percent", numBreaks = 20)
```

(1) Modify the formula in the line above so that we get a separate histogram for card and cash customers (based on the `card_vs_cash` column created in the last exercise).

We used `rxHistogram` to get a histogram of `trip_distance`, which is a `numeric` column. We can also feed a `factor` column to `rxHistogram` and the result is a bar plot. If a `numeric` column is heavily skewed, its histogram is often hard to look at because most of the information is squeezed to one side of the plot. In such cases, we can convert the `numeric` column into a `factor` column, a process that's also called **binning**. We do that in R using the `cut` function, and provide it with a set of breakpoints that are used as boundaries for moving from one bin to the next.

For example, let's say we want to know (for whatever business reason) whether a taxi trip traveled zero miles, 5 miles or less, between 5 and 10 miles, or more than 10 miles. For a trip that traveled 8.9 miles, the following code example gives the answer:

```
cut(8.9, breaks = c(-Inf, 0, 5, 10, Inf), labels = c("0", "<5", "5-10", "10+"))
```

```
[1] 5-10
Levels: 0 <5 5-10 10+
```
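Because `cut` is vectorized, the same call can bin a whole column at once. The sketch below uses a handful of made-up distances (illustrative values, not taken from the taxi data) to show where boundary values land. With the default `right = TRUE`, each interval is closed on the right, so a trip of exactly 5 miles falls in `<5` and a trip of exactly 10 miles falls in `5-10`.

```r
# Illustrative trip distances in miles (made-up values, not from the data set)
distances <- c(0, 3, 5, 8.9, 10, 12)

# Same breakpoints and labels as the single-value example above
bins <- cut(distances,
            breaks = c(-Inf, 0, 5, 10, Inf),
            labels = c("0", "<5", "5-10", "10+"))

# Pair each distance with the bin it was assigned to
data.frame(distance = distances, bin = bins)

# Count the trips in each bin; these counts are what a bar plot would display
bin_counts <- table(bins)
bin_counts
```

If the bins should instead be closed on the left (so that exactly 10 miles counts as `10+`), `cut` accepts `right = FALSE`, though the lowest breakpoint would then need adjusting so that zero-mile trips still get their own bin.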

(2) Modify the `rxHistogram` call in part (1) so that instead of plotting a histogram of `trip_distance`, you plot a bar plot of a column called `trip_dist_bin`, which bins `trip_distance` based on the breakpoints provided in the above code example. Compute `trip_dist_bin` on the fly (using the `transforms` argument of `rxHistogram`).