Solutions
(1) RatecodeID is an integer column, so we need to use as.factor to convert it into a factor. When converting a column from one type to another, we often cannot overwrite the existing column and must create a new column instead. This is because the data is processed in chunks, and changing a column's type partway through creates a mismatch between chunks. payment_type is already a factor, so it doesn't need to be converted to one. When specifying the levels for payment_type, we can limit them to 1 and 2 (corresponding to card and cash), and all other levels will automatically become NAs.
rxDataStep(nyc_xdf, nyc_xdf,
  transforms = list(
    rate_code_id = as.factor(RatecodeID),
    rate_code_id = factor(rate_code_id, levels = 1:6, labels = c('standard', 'JFK', 'Newark', 'Nassau or Westchester', 'negotiated', 'group ride')),
    payment_type = factor(payment_type, levels = 1:2, labels = c('card', 'cash'))
  ),
  overwrite = TRUE)
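To see why limiting the levels turns every other code into NA, here is a minimal base R sketch on a toy vector. The vector payment_codes and its values 3 and 4 are made up for illustration and are not taken from the taxi data.

```r
# Toy payment codes; 3 and 4 stand in for codes we do not want to keep
payment_codes <- c(1, 2, 2, 3, 1, 4)

# Only levels 1 and 2 are declared, so any other value has no level to map to
payment_fac <- factor(payment_codes, levels = 1:2, labels = c('card', 'cash'))

print(payment_fac)
# Count each level, including the NAs created for the undeclared codes
print(table(payment_fac, useNA = 'ifany'))
```

The same rule applies inside rxDataStep: any value not listed in levels is dropped to NA rather than raising an error, so it is worth checking the NA count after the transformation.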
(2) Because we want to apply the transformation to the large data nyc_xdf, we need to make sure that we don't add unnecessary columns and that the columns have the appropriate types.
nhoods <- over(data_coords, mht_shapefile)
str(nhoods)
'data.frame': 1000 obs. of 5 variables:
 $ STATE   : Factor w/ 1 level "NY": 1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY  : Factor w/ 9 levels "Albany","Bronx",..: 6 6 4 6 6 6 6 6 6 6 ...
 $ CITY    : Factor w/ 9 levels "Albany","Buffalo",..: 5 5 4 5 5 5 5 5 5 5 ...
 $ NAME    : Factor w/ 263 levels "19th Ward","Abbott McKinley",..: 103 74 24 135 239 99 135 99 ...
 $ REGIONID: num 195133 270829 272994 270875 270957 ...
The shapefile contains a few columns we don't need. The neighborhood column is called NAME and the city column is called CITY (which for NYC also contains the name of the borough). The appropriate column type for both is factor, which is already the case.
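The behavior of over can be checked on a toy example before touching the real shapefile. The sketch below assumes the sp package is installed. The two unit squares, their attribute values, and the names toy_shapefile, toy_coords and toy_nhoods are all hypothetical stand-ins for the real objects.

```r
library(sp)

# Two hypothetical square "neighborhoods" side by side (rings given clockwise)
sq_a <- Polygons(list(Polygon(cbind(c(0, 0, 1, 1, 0), c(0, 1, 1, 0, 0)), hole = FALSE)), ID = 'a')
sq_b <- Polygons(list(Polygon(cbind(c(1, 1, 2, 2, 1), c(0, 1, 1, 0, 0)), hole = FALSE)), ID = 'b')

# Attribute table mimicking the shapefile's columns; row names match polygon IDs
toy_attrs <- data.frame(NAME = factor(c('Nhood A', 'Nhood B')),
                        CITY = factor(c('Toy City', 'Toy City')),
                        REGIONID = c(101, 102),
                        row.names = c('a', 'b'))
toy_shapefile <- SpatialPolygonsDataFrame(SpatialPolygons(list(sq_a, sq_b)), toy_attrs)

# Three points: one in each square and one outside both
toy_coords <- data.frame(long = c(0.5, 1.5, 5), lat = c(0.5, 0.5, 5))
coordinates(toy_coords) <- c('long', 'lat')

# over returns one row of attributes per point, with NA where no polygon matches
toy_nhoods <- over(toy_coords, toy_shapefile)
str(toy_nhoods)

# Keep only the two columns we actually need
print(toy_nhoods[, c('NAME', 'CITY')])
```

The point outside both squares comes back as a row of NAs, which is how a trip that falls outside every polygon of the shapefile would appear in the result.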
(3) There is more than a single correct solution in this case, but here's one that works.
find_nhoods <- function(data) {
  # extract pick-up lat and long and find their neighborhoods
  pickup_longitude <- ifelse(is.na(data$pickup_longitude), 0, data$pickup_longitude)
  pickup_latitude <- ifelse(is.na(data$pickup_latitude), 0, data$pickup_latitude)
  data_coords <- data.frame(long = pickup_longitude, lat = pickup_latitude)
  coordinates(data_coords) <- c('long', 'lat')
  nhoods <- over(data_coords, shapefile)
  # add only the pick-up neighborhood and city columns to the data
  data$pickup_nhood <- nhoods$NAME
  data$pickup_borough <- nhoods$CITY
  # extract drop-off lat and long and find their neighborhoods
  dropoff_longitude <- ifelse(is.na(data$dropoff_longitude), 0, data$dropoff_longitude)
  dropoff_latitude <- ifelse(is.na(data$dropoff_latitude), 0, data$dropoff_latitude)
  data_coords <- data.frame(long = dropoff_longitude, lat = dropoff_latitude)
  coordinates(data_coords) <- c('long', 'lat')
  nhoods <- over(data_coords, shapefile)
  # add only the drop-off neighborhood and city columns to the data
  data$dropoff_nhood <- nhoods$NAME
  data$dropoff_borough <- nhoods$CITY
  # return the data with the new columns added in
  data
}
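The ifelse calls that replace missing coordinates with 0 are presumably there because sp refuses to build spatial points from coordinates containing NAs. The sketch below, which assumes the sp package is installed and uses a made-up two-row data.frame called toy, shows the failure and the workaround.

```r
library(sp)

# Toy pick-up coordinates; the second row is missing
toy <- data.frame(long = c(-73.99037, NA), lat = c(40.73470, NA))

# Promoting the data to spatial points fails while NAs are present
attempt <- try(coordinates(toy) <- c('long', 'lat'), silent = TRUE)
na_fails <- inherits(attempt, 'try-error')
print(na_fails)

# Same replacement as in find_nhoods: missing values become 0
toy_clean <- data.frame(long = ifelse(is.na(toy$long), 0, toy$long),
                        lat = ifelse(is.na(toy$lat), 0, toy$lat))
coordinates(toy_clean) <- c('long', 'lat')
print(coordinates(toy_clean))
```

A point at longitude 0 and latitude 0 lies far outside New York, so over finds no matching polygon for it and the neighborhood and borough for that row end up as NA, which is the desired outcome for a trip with missing coordinates.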
(4) We can test the above function on the sample data to make sure it works.
# test the function on a data.frame using rxDataStep
head(rxDataStep(nyc_sample_df, transformFunc = find_nhoods, transformPackages = c("sp", "maptools"),
                transformObjects = list(shapefile = nyc_shapefile)))
  VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
1        2  2016-01-01 00:00:00   2016-01-01 00:00:00               2          1.10
2        2  2016-01-01 00:00:00   2016-01-01 00:00:00               5          4.90
3        2  2016-01-01 00:00:00   2016-01-01 00:00:00               1         10.54
4        2  2016-01-01 00:00:00   2016-01-01 00:00:00               1          4.75
5        2  2016-01-01 00:00:00   2016-01-01 00:00:00               3          1.76
6        2  2016-01-01 00:00:00   2016-01-01 00:18:30               2          5.52
  pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude
1        -73.99037        40.73470          1                  N         -73.98184
2        -73.98078        40.72991          1                  N         -73.94447
3        -73.98455        40.67957          1                  N         -73.95027
4        -73.99347        40.71899          1                  N         -73.96224
5        -73.96062        40.78133          1                  N         -73.97726
6        -73.98012        40.74305          1                  N         -73.91349
  dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount
1         40.73241            2         7.5   0.5     0.5          0            0
2         40.71668            1        18.0   0.5     0.5          0            0
3         40.78893            1        33.0   0.5     0.5          0            0
4         40.65733            2        16.5   0.0     0.5          0            0
5         40.75851            2         8.0   0.0     0.5          0            0
6         40.76314            2        19.0   0.5     0.5          0            0
  improvement_surcharge total_amount      pickup_nhood          pickup_borough
1                   0.3          8.8 Greenwich Village New York City-Manhattan
2                   0.3         19.3      East Village New York City-Manhattan
3                   0.3         34.3       Boerum Hill  New York City-Brooklyn
4                   0.3         17.3   Lower East Side New York City-Manhattan
5                   0.3          8.8   Upper East Side New York City-Manhattan
6                   0.3         20.3          Gramercy New York City-Manhattan
             dropoff_nhood         dropoff_borough
1                 Gramercy New York City-Manhattan
2                     <NA>                    <NA>
3                Yorkville New York City-Manhattan
4                     <NA>                    <NA>
5                  Midtown New York City-Manhattan
6 Astoria-Long Island City    New York City-Queens
The last four columns in the data correspond to the neighborhood columns we wanted.