Geographical features

The next set of features we extract from the data are geographical features, for which we load the following geospatial packages:

library(rgeos)
library(sp)
library(maptools)

GIS data is commonly stored in shapefiles, which R can read. A shapefile is essentially a data object that stores geospatial information such as region names and boundaries, where a region can be anything from a continent to a city neighborhood. The shapefile we use here was provided by Zillow.com and can be found here. It is a shapefile for the state of New York, and it contains neighborhood-level information for New York City.

nyc_shapefile <- readShapePoly('ZillowNeighborhoods-NY/ZillowNeighborhoods-NY.shp')

We can see what sort of information is available by peeking at nyc_shapefile@data:

head(nyc_shapefile@data, 10)

  STATE   COUNTY                        CITY            NAME REGIONID
0    NY   Monroe                   Rochester Ellwanger-Barry   343894
1    NY New York     New York City-Manhattan    West Village   270964
2    NY    Kings      New York City-Brooklyn     Bensonhurst   193285
3    NY     Erie                     Buffalo      South Park   270935
...

The data stores neighborhood names under the column NAME. Since we have the longitude and latitude of the pick-up and drop-off locations, we can use the above data set to find the pick-up and drop-off neighborhoods for each cab ride. To keep the analysis simple, we limit the data to Manhattan only, where the great majority of cab rides take place.

nyc_shapefile <- subset(nyc_shapefile, COUNTY == 'New York') # limit the data to Manhattan only

Notice that even though nyc_shapefile is not a data.frame, subset still worked. This is because subset is a generic function: it picks the appropriate method based on the class of its input. Quite a few R functions work the same way, such as plot and predict.
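To make the idea of a function that works on more than one kind of input concrete, here is a minimal sketch of the S3 dispatch mechanism in base R. The generic `describe` and its methods are hypothetical names invented for this illustration; they are not part of any package used in this section.

```r
# a hypothetical generic: UseMethod picks a method based on the class of x
describe <- function(x, ...) UseMethod("describe")

# fallback method, used when no class-specific method exists
describe.default <- function(x, ...) "something else"

# method used when x is a data.frame
describe.data.frame <- function(x, ...) paste("a data.frame with", nrow(x), "rows")

describe(mtcars)  # dispatches to describe.data.frame
describe(1:10)    # dispatches to describe.default
```

Running methods(subset) in a session with the geospatial packages loaded lists the classes for which subset has its own method.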

With a bit of work, we can plot a map of the whole area, showing the boundaries separating each neighborhood. We won't go into great detail on how the plots are generated, as it would derail us from the main topic.

library(ggplot2)
nyc_shapefile@data$id <- as.character(nyc_shapefile@data$NAME)
nyc_points <- fortify(gBuffer(nyc_shapefile, byid = TRUE, width = 0), region = "NAME") # fortify neighborhood boundaries

As part of the code to create the plot, we use dplyr to summarize the data and get median coordinates for each neighborhood, but since we revisit dplyr in greater depth in the next section, we skip the explanation for now.

library(dplyr)
nyc_df <- inner_join(nyc_points, nyc_shapefile@data, by = "id")
nyc_centroids <- summarize(group_by(nyc_df, id), long = median(long), lat = median(lat))

library(ggrepel) # ggplot2 was already loaded above
ggplot(nyc_df) + 
  aes(long, lat, fill = id) + 
  geom_polygon() +
  geom_path(color = "white") +
  coord_equal() +
  theme(legend.position = "none") +
  geom_text_repel(aes(label = id), data = nyc_centroids, size = 3)

We now go back to the data to find the neighborhood information based on the pick-up and drop-off coordinates. We store pick-up longitude and latitude in a separate data.frame, replacing NAs with zeroes (the function we're about to use doesn't work with NAs). We then use the coordinates function to point to the columns that correspond to the geographical coordinates. Finally, we use the over function to find the region (in this case the neighborhood) that the coordinates in the data fall into, and we append the neighborhood name as a new column to the nyc_taxi dataset.

# we replace NAs with zeroes, because NAs won't work with the `over` function
data_coords <- data.frame(
  long = ifelse(is.na(nyc_taxi$pickup_longitude), 0, nyc_taxi$pickup_longitude), 
  lat = ifelse(is.na(nyc_taxi$pickup_latitude), 0, nyc_taxi$pickup_latitude)
)
coordinates(data_coords) <- c('long', 'lat') # we specify the columns that correspond to the coordinates
nhoods <- over(data_coords, nyc_shapefile) # returns the neighborhoods based on coordinates
nyc_taxi$pickup_nhood <- nhoods$NAME # we attach the neighborhoods to the original data and call it `pickup_nhood`
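To see what coordinates and over do without loading the full taxi data, the following sketch runs the same steps on toy data. Everything here is made up for illustration: two adjacent unit squares stand in for neighborhoods, and the names "West" and "East" are hypothetical. It only assumes that the sp package is installed.

```r
library(sp)

# two hypothetical square "neighborhoods", drawn clockwise and closed
sq_a <- Polygons(list(Polygon(cbind(c(0, 0, 1, 1, 0), c(0, 1, 1, 0, 0)))), ID = "a")
sq_b <- Polygons(list(Polygon(cbind(c(1, 1, 2, 2, 1), c(0, 1, 1, 0, 0)))), ID = "b")
toy_polys <- SpatialPolygonsDataFrame(
  SpatialPolygons(list(sq_a, sq_b)),
  data = data.frame(NAME = c("West", "East"), row.names = c("a", "b"))
)

# three toy points: one in each square and one outside both
toy_points <- data.frame(long = c(0.5, 1.5, 5), lat = c(0.5, 0.5, 5))
coordinates(toy_points) <- c("long", "lat") # turns the data.frame into a spatial object

toy_nhoods <- over(toy_points, toy_polys) # one row of polygon attributes per point
toy_nhoods$NAME # points outside every polygon get NA
```

This also shows why replacing missing coordinates with zeroes is harmless as long as the point (0, 0) lies outside every region of interest: such rows simply end up with a missing neighborhood.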

We can use table to get a count of pick-up neighborhoods:

head(table(nyc_taxi$pickup_nhood, useNA = "ifany"))

      19th Ward Abbott McKinley        Albright           Allen       Annandale 
              0               0               0               0               0 
     Arbor Hill 
              0

We now repeat the above process, this time using drop-off coordinates to get the drop-off neighborhood.

data_coords <- data.frame(
  long = ifelse(is.na(nyc_taxi$dropoff_longitude), 0, nyc_taxi$dropoff_longitude), 
  lat = ifelse(is.na(nyc_taxi$dropoff_latitude), 0, nyc_taxi$dropoff_latitude)
)
coordinates(data_coords) <- c('long', 'lat')
nhoods <- over(data_coords, nyc_shapefile)
nyc_taxi$dropoff_nhood <- nhoods$NAME

And since data_coords and nhoods are potentially large objects, we remove them from our session when they're no longer needed.

rm(data_coords, nhoods) # delete these objects, as they are no longer needed

Note how we had to repeat the same process in two different steps, once to get pick-up and once to get drop-off neighborhoods. Now if we need to change something about the above code, we have to change it in two different places. For example, if we want to reset the factor levels so that only Manhattan neighborhoods are showing, we need to remember to do it twice.
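As an illustration of what resetting the factor levels could look like, here is a small sketch on a toy factor. The values are made up, and droplevels is only one possible way of doing it, not necessarily the approach taken later in this text.

```r
# a toy factor whose levels include a neighborhood that never occurs in the data
nhood <- factor(
  c("Chelsea", "Harlem", "Chelsea"),
  levels = c("Arbor Hill", "Chelsea", "Harlem")
)

table(nhood)             # the unused level shows up with a count of zero
nhood <- droplevels(nhood) # keep only the levels that actually occur
table(nhood)             # only the levels present in the data remain
```

With the code as written above, this step would have to be applied separately to pickup_nhood and dropoff_nhood.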

Another downside is that we ended up with the leftover objects data_coords and nhoods. Since both objects have the same number of rows as the nyc_taxi dataset, they are relatively large, so we manually deleted them from the R session using rm after we finished using them. Carrying around by-product objects that are no longer needed can clog the memory, especially if the objects take up a lot of space. So we need to do some housecleaning every now and then to keep our session clean, which is easier said than done.

There is, however, something we can do to avoid both of the above headaches: wrap the process into an R function.
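A minimal sketch of what such a function might look like follows. The name find_nhoods and its arguments are hypothetical, chosen for this illustration only, and the toy square "Toyville" stands in for a real shapefile so that the example runs on its own (it assumes the sp package is installed).

```r
library(sp)

# hypothetical helper: returns the NAME of the region each point falls into
find_nhoods <- function(long, lat, shapefile) {
  # data_coords and nhoods exist only while the function runs,
  # so there is nothing to clean up with rm afterwards
  data_coords <- data.frame(
    long = ifelse(is.na(long), 0, long), # `over` does not accept NAs
    lat = ifelse(is.na(lat), 0, lat)
  )
  coordinates(data_coords) <- c("long", "lat")
  nhoods <- over(data_coords, shapefile)
  nhoods$NAME
}

# toy shapefile: a single made-up square region away from the origin
sq <- Polygons(list(Polygon(cbind(c(10, 10, 11, 11, 10), c(10, 11, 11, 10, 10)))), ID = "a")
toy_shape <- SpatialPolygonsDataFrame(
  SpatialPolygons(list(sq)),
  data = data.frame(NAME = "Toyville", row.names = "a")
)

# the second point has a missing longitude, so it falls outside the region
toy_result <- find_nhoods(c(10.5, NA), c(10.5, 10.5), toy_shape)
toy_result
```

With a helper along these lines, the pick-up and drop-off steps would each become a single call, for example find_nhoods(nyc_taxi$pickup_longitude, nyc_taxi$pickup_latitude, nyc_shapefile), and any later change would be made in one place only.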
