Exercises

Next let's look at the longitude and latitude of the pick-up and drop-off locations.

summary(nyc_taxi[ , grep('long|lat', names(nyc_taxi), value = TRUE)])

Take a look at the histogram for pickup_longitude:

ggplot(data = nyc_taxi) +
  geom_histogram(aes(x = pickup_longitude), fill = "blue", bins = 20)

We can see that most longitude values fall in the expected range, but there's a second peak around 0. There are also some other values outside of the expected range, but we can't see them in the histogram. We just know they are there because of the wide range of the histogram's x-axis.
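One way to confirm that such out-of-range values exist, without relying on the plot, is to count them directly. The following is only a sketch: toy_longitude is a made-up vector standing in for nyc_taxi$pickup_longitude, and the interval from -75 to -73 is assumed here as the "expected" range purely for illustration.

```r
# toy_longitude is a hypothetical stand-in for nyc_taxi$pickup_longitude
toy_longitude <- c(-73.99, -73.95, -74.01, 0, 0, -80.2, 12.5, -73.98)

# the full range shows how far the extreme values stretch the x-axis
longitude_range <- range(toy_longitude)

# flag values outside the assumed expected interval [-75, -73]
out_of_range <- toy_longitude < -75 | toy_longitude > -73

# count how many values were flagged
n_out_of_range <- sum(out_of_range)
```

Because sum treats TRUE as 1 and FALSE as 0, summing a logical vector counts the flagged values even when they are too few to show up as visible bars.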

(1) Plot a similar histogram for dropoff_longitude to see if it follows suit.


Let's learn about two useful R functions:

  • cut is used to turn a numeric value into a categorical value by finding the interval that it falls into. This is sometimes referred to as binning or bucketing.
  • table simply returns a count of each unique value in a vector.

For example, here we ask which of the following buckets 5.6 falls into:

  • 0 to 4 (including 4)
  • 4 to 10 (including 10)
  • higher than 10

cut(5.6, c(0, 4, 10, Inf)) # 5.6 is in the interval (4,10]
table(c(1, 1, 2, 2, 2, 3)) # provides counts of each distinct value

Take a moment to familiarize yourself with both functions by modifying the above examples. We will be using both functions a few times throughout the course.
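As one possible starting point for experimenting, the two functions can be chained: cut assigns each element of a vector to a bucket, and table then counts the elements per bucket. The values below are made up for illustration and are not taken from the taxi data.

```r
# hypothetical values, chosen only to illustrate the two functions together
values <- c(0.5, 3, 4, 5.6, 10, 12, 25)

# by default, intervals are open on the left and closed on the right,
# so 4 lands in (0,4] and 10 lands in (4,10]
buckets <- cut(values, breaks = c(0, 4, 10, Inf))

# count how many values fall into each bucket
bucket_counts <- table(buckets)
```

Note that a value falling outside all of the intervals (here, anything at or below 0) would be returned as NA by cut, so the outermost breaks should cover the full range of the data.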

(2) Use cut to "bucket" pickup_longitude into the following buckets: -75 or less, between -75 and -73, between -73 and -1, between -1 and 1, more than 1. Then use table to get counts for each bucket.
