# Examining outliers

Let's see how we can use `RevoScaleR` to examine the data for outliers. Our approach here is rather primitive, but the intent is to show how to use the tools. We use `rxDataStep` and its `rowSelection` argument to extract all the data points that are candidate outliers. By leaving out the `outFile` argument, we output the resulting dataset into a `data.frame`, which we call `odd_trips`. Lastly, if we are too expansive in our outlier selection criteria, the resulting `data.frame` could still have too many rows (which could clog the memory and make it slow to produce plots and other summaries). So we create a new column `u`, populate it with random uniform numbers between 0 and 1, and add `u < .05` to our `rowSelection` criteria. We can adjust this threshold to end up with a smaller `data.frame` (threshold closer to 0) or a larger `data.frame` (threshold closer to 1).

```
# outFile argument missing means we output to data.frame
odd_trips <- rxDataStep(nyc_xdf, rowSelection = (
  u < .05 & ( # we can adjust this if the data gets too big
    (trip_distance > 50 | trip_distance <= 0) |
    (passenger_count > 5 | passenger_count == 0) |
    (fare_amount > 5000 | fare_amount <= 0)
  )), transforms = list(u = runif(.rxNumRows)))
print(dim(odd_trips))
```

```
[1] 93750 32
```
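
To see how the sampling threshold interacts with the outlier conditions, the same logic can be sketched in base R on a small synthetic `data.frame`. This is only an illustration: `toy_trips`, its values, and the helper `sample_candidates` are made up here, and only the column names mirror the ones used above.

```r
# Hypothetical toy data: column names mirror the taxi data, values are invented
set.seed(42)
n <- 10000
toy_trips <- data.frame(
  trip_distance   = c(runif(n - 100, min = 0.5, max = 30), rep(75, 100)), # last 100 rows exceed 50 miles
  passenger_count = sample(1:5, n, replace = TRUE),
  fare_amount     = runif(n, min = 3, max = 80)
)

# One uniform draw per row, playing the role of the column `u`
toy_trips$u <- runif(nrow(toy_trips))

# Same candidate-outlier conditions as in the rowSelection expression
is_candidate <- with(toy_trips,
  (trip_distance > 50 | trip_distance <= 0) |
  (passenger_count > 5 | passenger_count == 0) |
  (fare_amount > 5000 | fare_amount <= 0))

# Hypothetical helper: keep candidate rows whose draw falls below the threshold
sample_candidates <- function(data, candidate, threshold) {
  data[candidate & data$u < threshold, , drop = FALSE]
}

small <- sample_candidates(toy_trips, is_candidate, 0.05)
large <- sample_candidates(toy_trips, is_candidate, 0.50)
print(c(candidates = sum(is_candidate), small = nrow(small), large = nrow(large)))
```

Because each row gets its own uniform draw, a threshold of `t` keeps roughly a fraction `t` of the candidate rows, which is why moving the threshold toward 0 or 1 shrinks or grows the result.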

Since the dataset with the candidate outliers is a `data.frame`, we can use any R function to examine it. For example, we limit `odd_trips` to cases where a distance of more than 50 miles was traveled, plot a histogram of the fare amount the passenger paid, and color it based on whether the trip took more or less than 10 minutes.

```
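library(dplyr)   # filter() and %>%
library(ggplot2) # ggplot(), geom_histogram()
# keep only trips longer than 50 miles and start a plot from them
p <- odd_trips %>%
  filter(trip_distance > 50) %>%
  ggplot()
# fill color marks trips that took 10 minutes (600 seconds) or less
p + geom_histogram(aes(x = fare_amount, fill = trip_duration <= 10*60), binwidth = 10) +
  xlim(0, 500) + coord_fixed(ratio = 25)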
```

As we can see, the majority of trips that traveled over 50 miles cost nothing or next to nothing, even though most of these trips took 10 minutes or longer. It is unclear whether such trips were the result of machine error or human error, but if, for example, this analysis were targeted at the company that owns the taxis, this finding would warrant more investigation.
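
One way to put numbers behind a reading of the histogram is to cross-tabulate fare bands against the duration flag. The sketch below uses a tiny invented `long_trips` data frame as a stand-in for `odd_trips` filtered to `trip_distance > 50`; the values and the fare band cut points are assumptions chosen for illustration, not results from the taxi data.

```r
# Hypothetical stand-in for odd_trips filtered to trip_distance > 50
long_trips <- data.frame(
  fare_amount   = c(0, 0, 2.5, 3, 52, 120, 0, 1),
  trip_duration = c(900, 1200, 300, 2400, 3600, 4000, 450, 700) # seconds
)

# Bucket fares into bands; intervals are closed on the right, e.g. (-Inf, 5]
fare_band <- cut(long_trips$fare_amount,
                 breaks = c(-Inf, 5, 50, Inf),
                 labels = c("<= $5", "$5-$50", "> $50"))

# Same flag as the histogram fill: TRUE when the trip took 10 minutes or less
short_trip <- long_trips$trip_duration <= 10 * 60

# Counts per fare band and duration flag, then shares within each duration group
counts <- table(fare_band, short_trip)
print(counts)
print(round(prop.table(counts, margin = 2), 2))
```

Applied to the real subset, a table like this would show what share of the long-distance trips fall in the lowest fare band, split by whether the trip was short or long.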