Examining outliers

Let's see how we can use RevoScaleR to examine the data for outliers. Our approach here is rather primitive, but the intent is to show how to use the tools. We use rxDataStep and its rowSelection argument to extract all the data points that are candidate outliers. By leaving out the outFile argument, we output the resulting dataset to a data.frame, which we call odd_trips. Lastly, if our outlier selection criteria are too expansive, the resulting data.frame could still have too many rows (which could clog memory and make it slow to produce plots and other summaries). So we create a new column u, populate it with random uniform numbers between 0 and 1, and add u < .05 to our rowSelection criteria, which keeps roughly 5 percent of the candidate rows. We can adjust this threshold to end up with a smaller data.frame (threshold closer to 0) or a larger data.frame (threshold closer to 1).

# outFile argument missing means we output to data.frame
odd_trips <- rxDataStep(nyc_xdf, rowSelection = (
  u < .05 & ( # we can adjust this if the data gets too big
    (trip_distance > 50 | trip_distance <= 0) |
    (passenger_count > 5 | passenger_count == 0) |
    (fare_amount > 5000 | fare_amount <= 0)
)), transforms = list(u = runif(.rxNumRows)))

print(dim(odd_trips))
[1] 93750    32
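The random-threshold trick does not depend on RevoScaleR. As a minimal sketch, the same logic can be tried in base R on a small made-up data.frame. The names toy_trips, keep_fraction, is_candidate and odd_toy are hypothetical, and the values are simulated, not taken from the taxi data.

```r
# Toy stand-in for the taxi data (simulated values, NOT the NYC dataset)
set.seed(42)
n <- 10000
toy_trips <- data.frame(
  trip_distance   = rexp(n, rate = 1 / 3),
  passenger_count = sample(0:6, n, replace = TRUE),
  fare_amount     = round(rnorm(n, mean = 12, sd = 8), 2)
)

# Same idea as the transforms argument above: one uniform draw per row
toy_trips$u <- runif(nrow(toy_trips))

keep_fraction <- 0.05  # hypothetical name for the threshold applied to u

# Same candidate-outlier conditions as in the rowSelection expression
is_candidate <- with(toy_trips,
  (trip_distance > 50 | trip_distance <= 0) |
  (passenger_count > 5 | passenger_count == 0) |
  (fare_amount > 5000 | fare_amount <= 0)
)

# Keep only the candidate rows whose random draw falls below the threshold
odd_toy <- toy_trips[toy_trips$u < keep_fraction & is_candidate, ]

# Share of candidate rows that survived; this should land near keep_fraction
sampled_share <- nrow(odd_toy) / sum(is_candidate)
print(sampled_share)
```

Because u is drawn independently of the other columns, the retained rows are a random sample of the candidates, so raising or lowering keep_fraction changes the size of the result without biasing which outliers we see.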

Since the dataset with the candidate outliers is a data.frame, we can use any R function to examine it. For example, we limit odd_trips to cases where a distance of more than 50 miles was traveled, plot a histogram of the fare amount the passenger paid, and color it based on whether the trip took more or less than 10 minutes.

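library(dplyr)   # provides %>% and filter()
library(ggplot2) # provides ggplot(), geom_histogram(), xlim() and coord_fixed()

odd_trips %>%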
  filter(trip_distance > 50) %>%
  ggplot() -> p

p + geom_histogram(aes(x = fare_amount, fill = trip_duration <= 10*60), binwidth = 10) +
  xlim(0, 500) + coord_fixed(ratio = 25)

fare amount distribution for long trips

As we can see, the majority of trips that traveled over 50 miles cost nothing or next to nothing, even though most of these trips took 10 minutes or longer. It is unclear whether such trips were the result of machine error or human error, but if, for example, this analysis were targeted at the company that owns the taxis, this finding would warrant more investigation.
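A reading taken from a histogram can be backed up with a quick cross-tabulation. The sketch below uses a handful of made-up rows in place of odd_trips. The name toy_odd and the 5-dollar low_fare_cutoff are assumptions chosen only for illustration.

```r
# Hypothetical stand-in for odd_trips (made-up rows for illustration only)
toy_odd <- data.frame(
  trip_distance = c(55, 62, 80, 51, 3, 120, 75),
  fare_amount   = c(0, 2.5, 150, 0, 0, 0.01, 210),
  trip_duration = c(900, 1200, 4200, 300, 600, 2400, 5400)  # seconds
)

# Same filter as in the plot: trips longer than 50 miles
long_trips <- subset(toy_odd, trip_distance > 50)

# Assumed cutoff (in dollars) for "next to nothing"; not from the original text
low_fare_cutoff <- 5

# Cross-tabulate the low-fare flag against the 10-minute duration flag
summary_tab <- table(
  low_fare   = long_trips$fare_amount <= low_fare_cutoff,
  short_trip = long_trips$trip_duration <= 10 * 60
)
print(summary_tab)

# Proportion of long-distance trips with a low fare
low_fare_share <- mean(long_trips$fare_amount <= low_fare_cutoff)
print(low_fare_share)
```

Running the same two summaries on the real odd_trips data.frame would put numbers on the pattern that the histogram shows.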
