Exercises
Next let's look at the longitude and latitude of the pick-up and drop-off locations.
summary(nyc_taxi[ , grep('long|lat', names(nyc_taxi), value = TRUE)])
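The column selection above relies on grep returning the names that match a pattern. A minimal sketch of that step, assuming a small made-up data frame (toy and its values are hypothetical stand-ins, not the real nyc_taxi data):

```r
# Hypothetical stand-in for nyc_taxi; values are made up for illustration
toy <- data.frame(
  pickup_longitude = c(-73.98, -73.95),
  pickup_latitude  = c(40.75, 40.78),
  fare_amount      = c(9.5, 12.0)
)

# value = TRUE makes grep return the matching names instead of their positions
coord_cols <- grep('long|lat', names(toy), value = TRUE)
print(coord_cols)

# Summarize only the coordinate columns
print(summary(toy[, coord_cols]))
```

Because the pattern is a regular expression, 'long|lat' matches any column name containing either "long" or "lat".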
Take a look at the histogram for pickup_longitude:
ggplot(data = nyc_taxi) +
  geom_histogram(aes(x = pickup_longitude), fill = "blue", bins = 20)
We can see that most longitude values fall in the expected range, but there's a second peak around 0. There are also some values outside the expected range that we can't see in the histogram. We know they are there only because of the wide range of the histogram's x-axis.
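To see why a handful of extreme values can stretch the x-axis without showing up as visible bars, here is a rough sketch on made-up data (the vector x and the break points are assumptions chosen for illustration, not taken from nyc_taxi):

```r
# Made-up longitudes: a large cluster near -74, a smaller one at 0,
# and two isolated extreme values
x <- c(rep(-74, 1000), rep(0, 50), -120, 40)

# The range is driven entirely by the two extreme values
print(range(x))

# Compute histogram counts without drawing the plot
h <- hist(x, breaks = seq(-120, 40, by = 20), plot = FALSE)
print(h$counts)
```

Bins holding a single observation sit next to a bin holding a thousand, so their bars are too short to see even though they determine how wide the axis is.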
(1) Plot a similar histogram for dropoff_longitude to see if it follows suit.
Let's learn about two useful R functions:
- cut is used to turn a numeric value into a categorical value by finding the interval that it falls into. This is sometimes referred to as binning or bucketing.
- table simply returns a count of each unique value in a vector.
For example, here we ask which of the following buckets 5.6 falls into:
- 0 to 4 (including 4)
- 4 to 10 (including 10)
- higher than 10
cut(5.6, c(0, 4, 10, Inf)) # 5.6 falls in the interval (4,10]
table(c(1, 1, 2, 2, 2, 3)) # provides counts of each distinct value
Take a moment to familiarize yourself with both functions by modifying the above examples. We will be using both functions a few times throughout the course.
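The two functions are often used together: cut assigns each value to a bucket, and table counts how many values landed in each one. A small sketch on a made-up vector (x and the label names are hypothetical, chosen only for illustration):

```r
# Made-up values to bucket
x <- c(0.5, 3, 4, 5.6, 9, 10, 12, 250)

# By default each interval is open on the left and closed on the right,
# so 4 falls in (0,4] and 10 falls in (4,10]
buckets <- cut(x, breaks = c(0, 4, 10, Inf))
print(table(buckets))

# The labels argument replaces the interval notation with custom names
named <- cut(x, breaks = c(0, 4, 10, Inf), labels = c("low", "mid", "high"))
print(table(named))
```

Note that a value lying outside all of the breaks (for example 0 or a negative number here) would be returned as NA, which table leaves out of the counts by default.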
(2) Use cut to "bucket" pickup_longitude into the following buckets: -75 or less, between -75 and -73, between -73 and -1, between -1 and 1, more than 1. Then use table to get counts for each bucket.