Solutions

(1) RatecodeID is an integer column, so we need to use as.factor to convert it into a factor. When converting a column from one type to another, we often cannot overwrite the existing column and must create a new column instead. This is because the data is broken up into chunks, and converting a column from one type to another can produce a mismatch between chunks. payment_type is already a factor, so it doesn't need to be converted to one. When specifying the levels for payment_type, we can limit them to 1 and 2 (corresponding to card and cash), and all other levels will automatically become NAs.

rxDataStep(nyc_xdf, nyc_xdf, 
           transforms = list(
             rate_code_id = as.factor(RatecodeID),
             rate_code_id = factor(rate_code_id, levels = 1:6, labels = c('standard', 'JFK', 'Newark', 'Nassau or Westchester', 'negotiated', 'group ride')),
             payment_type = factor(payment_type, levels = 1:2, labels = c('card', 'cash'))
             ),
           overwrite = TRUE)
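
To see the effect of limiting the levels, here is a minimal base R sketch. The vectors payment_codes, chunk_a and chunk_b are made-up values for illustration and are not taken from nyc_xdf. The second half only illustrates why explicit levels keep separately processed pieces of data consistent; it is not a description of how rxDataStep handles chunks internally.

```r
# Hypothetical payment codes; 3 and 4 are not among the levels we keep.
payment_codes <- c(1, 2, 3, 4, 2, 1)
payment_type <- factor(payment_codes, levels = 1:2, labels = c("card", "cash"))
print(payment_type)                          # codes 3 and 4 show up as <NA>
print(table(payment_type, useNA = "ifany"))

# Two hypothetical chunks of rate codes that contain different values.
chunk_a <- c(1, 2, 1)
chunk_b <- c(1, 5, 3)

# Without explicit levels, each chunk infers its own set of levels.
levels_a <- levels(as.factor(chunk_a))       # "1" "2"
levels_b <- levels(as.factor(chunk_b))       # "1" "3" "5"

# With explicit levels, both chunks share the same set of levels.
fixed_a <- factor(chunk_a, levels = 1:6)
fixed_b <- factor(chunk_b, levels = 1:6)
print(identical(levels(fixed_a), levels(fixed_b)))
```

Spelling out the levels, as the solution does with levels = 1:6 and levels = 1:2, means the result does not depend on which values happen to appear in a given piece of the data.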

(2) Because we want to apply the transformation to the large dataset nyc_xdf, we need to make sure that we don't add unnecessary columns and that the columns have the appropriate types.

nhoods <- over(data_coords, mht_shapefile)
str(nhoods)
'data.frame':    1000 obs. of  5 variables:
 $ STATE   : Factor w/ 1 level "NY": 1 1 1 1 1 1 1 1 1 1 ...
 $ COUNTY  : Factor w/ 9 levels "Albany","Bronx",..: 6 6 4 6 6 6 6 6 6 6 ...
 $ CITY    : Factor w/ 9 levels "Albany","Buffalo",..: 5 5 4 5 5 5 5 5 5 5 ...
 $ NAME    : Factor w/ 263 levels "19th Ward","Abbott McKinley",..: 103 74 24 135 239 99 135 99 ...
 $ REGIONID: num  195133 270829 272994 270875 270957 ...

The shapefile contains a few columns we don't need. The neighborhood column is called NAME and the city column is called CITY (which for NYC also contains the name of the borough). The appropriate column type for both is factor, which is already the case.
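
As a quick check of this reasoning, the sketch below keeps only the two columns of interest and confirms their types. nhoods_demo is a hypothetical stand-in for the data.frame returned by over, with made-up values, so that the example runs without the shapefile.

```r
# Stand-in for the result of over(); all values are made up for illustration.
nhoods_demo <- data.frame(
  STATE    = factor(c("NY", "NY")),
  COUNTY   = factor(c("New York", "Kings")),
  CITY     = factor(c("New York City-Manhattan", "New York City-Brooklyn")),
  NAME     = factor(c("Gramercy", "Boerum Hill")),
  REGIONID = c(1, 2)
)

# Keep only the neighborhood and city columns.
needed <- nhoods_demo[, c("NAME", "CITY")]

# Confirm that both columns are factors.
col_classes <- sapply(needed, class)
print(col_classes)
```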

(3) There is more than a single correct solution in this case, but here's one that works.

find_nhoods <- function(data) {

  # extract pick-up lat and long and find their neighborhoods
  pickup_longitude <- ifelse(is.na(data$pickup_longitude), 0, data$pickup_longitude)
  pickup_latitude <- ifelse(is.na(data$pickup_latitude), 0, data$pickup_latitude)
  data_coords <- data.frame(long = pickup_longitude, lat = pickup_latitude)
  coordinates(data_coords) <- c('long', 'lat')
  nhoods <- over(data_coords, shapefile)

  ## add only the pick-up neighborhood and city columns to the data
  data$pickup_nhood <- nhoods$NAME
  data$pickup_borough <- nhoods$CITY

  # extract drop-off lat and long and find their neighborhoods
  dropoff_longitude <- ifelse(is.na(data$dropoff_longitude), 0, data$dropoff_longitude)
  dropoff_latitude <- ifelse(is.na(data$dropoff_latitude), 0, data$dropoff_latitude)
  data_coords <- data.frame(long = dropoff_longitude, lat = dropoff_latitude)
  coordinates(data_coords) <- c('long', 'lat')
  nhoods <- over(data_coords, shapefile)

  ## add only the drop-off neighborhood and city columns to the data  
  data$dropoff_nhood <- nhoods$NAME
  data$dropoff_borough <- nhoods$CITY

  ## return the data with the new columns added in
  data
}
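
The function replaces missing coordinates with 0 before calling coordinates. As far as we can tell, this is because coordinates does not accept missing values, while a point at (0, 0) lies outside every polygon in the shapefile and so comes back from over as NA. The base R sketch below shows only the replacement step, using a made-up vector of longitudes.

```r
# Hypothetical pick-up longitudes with one missing value.
pickup_longitude <- c(-73.99037, NA, -73.98455)

# Replace the missing value with 0 and leave the other values untouched.
clean_longitude <- ifelse(is.na(pickup_longitude), 0, pickup_longitude)
print(clean_longitude)
```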

(4) We can test the above function on the sample data to make sure it works.

# test the function on a data.frame using rxDataStep
head(rxDataStep(nyc_sample_df, transformFunc = find_nhoods, transformPackages = c("sp", "maptools"), 
                transformObjects = list(shapefile = nyc_shapefile)))
  VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
1        2  2016-01-01 00:00:00   2016-01-01 00:00:00               2          1.10
2        2  2016-01-01 00:00:00   2016-01-01 00:00:00               5          4.90
3        2  2016-01-01 00:00:00   2016-01-01 00:00:00               1         10.54
4        2  2016-01-01 00:00:00   2016-01-01 00:00:00               1          4.75
5        2  2016-01-01 00:00:00   2016-01-01 00:00:00               3          1.76
6        2  2016-01-01 00:00:00   2016-01-01 00:18:30               2          5.52
  pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude
1        -73.99037        40.73470          1                  N         -73.98184
2        -73.98078        40.72991          1                  N         -73.94447
3        -73.98455        40.67957          1                  N         -73.95027
4        -73.99347        40.71899          1                  N         -73.96224
5        -73.96062        40.78133          1                  N         -73.97726
6        -73.98012        40.74305          1                  N         -73.91349
  dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount
1         40.73241            2         7.5   0.5     0.5          0            0
2         40.71668            1        18.0   0.5     0.5          0            0
3         40.78893            1        33.0   0.5     0.5          0            0
4         40.65733            2        16.5   0.0     0.5          0            0
5         40.75851            2         8.0   0.0     0.5          0            0
6         40.76314            2        19.0   0.5     0.5          0            0
  improvement_surcharge total_amount      pickup_nhood          pickup_borough
1                   0.3          8.8 Greenwich Village New York City-Manhattan
2                   0.3         19.3      East Village New York City-Manhattan
3                   0.3         34.3       Boerum Hill  New York City-Brooklyn
4                   0.3         17.3   Lower East Side New York City-Manhattan
5                   0.3          8.8   Upper East Side New York City-Manhattan
6                   0.3         20.3          Gramercy New York City-Manhattan
             dropoff_nhood         dropoff_borough
1                 Gramercy New York City-Manhattan
2                     <NA>                    <NA>
3                Yorkville New York City-Manhattan
4                     <NA>                    <NA>
5                  Midtown New York City-Manhattan
6 Astoria-Long Island City    New York City-Queens

The last four columns in the data correspond to the neighborhood columns we wanted.
