Complex transformations

In the last part rxDataStep was introduced to perform a simple one-liner transformation. We now use rxDataStep again to perform some other, this time more complicated transformations. We can sometimes perform these more complex transformations as longer one-liners using the transforms argument, following the above example. But a cleaner way to do it is to create a function that contains the logic of our transformations and pass it to the transformFunc argument. This function takes the data as input and usually returns the same data as output with one or more columns added or modified. More specifically, the input to the transformation function is a list whose elements are the columns. Otherwise, it is just like any R function. Using the transformFunc argument allows us to focus on writing a transformation function and quickly testing them on the sample data.frame before we run them on the whole data.

For the NYC Taxi data, we are interested in comparing trips based on day of week and the time of day. Those two columns do not exist yet, but we can extract them from pick-up date and time and drop-off date and time. To extract the above features, we use the lubridate package, which has useful functions for dealing with date and time columns. To perform these transformations, we use a transformation function called xforms.

xforms <- function(data) { # transformation function for extracting some date and time features

  weekday_labels <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
  cut_levels <- c(1, 5, 9, 12, 16, 18, 22)
  hour_labels <- c('1AM-5AM', '5AM-9AM', '9AM-12PM', '12PM-4PM', '4PM-6PM', '6PM-10PM', '10PM-1AM')

  pickup_datetime <- ymd_hms(data$tpep_pickup_datetime, tz = "UTC")
  pickup_hour <- addNA(cut(hour(pickup_datetime), cut_levels))
  pickup_dow <- factor(wday(pickup_datetime), levels = 1:7, labels = weekday_labels)
  levels(pickup_hour) <- hour_labels

  dropoff_datetime <- ymd_hms(data$tpep_dropoff_datetime, tz = "UTC")
  dropoff_hour <- addNA(cut(hour(dropoff_datetime), cut_levels))
  dropoff_dow <- factor(wday(dropoff_datetime), levels = 1:7, labels = weekday_labels)
  levels(dropoff_hour) <- hour_labels

  data$pickup_hour <- pickup_hour
  data$pickup_dow <- pickup_dow
  data$dropoff_hour <- dropoff_hour
  data$dropoff_dow <- dropoff_dow
  data$trip_duration <- as.integer(as.duration(dropoff_datetime - pickup_datetime))

  data
}

Before we apply the transformation to the data, it's usually a good idea to test it and make sure it's working. We set aside a sample of the data as a data.frame for this purpose. Running the transformation function on nyc_sample_df should return the original data with the new columns.

library(lubridate)
Sys.setenv(TZ = "US/Eastern") # not important for this dataset
head(xforms(nyc_sample_df)) # test the function on a data.frame
  VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
1        2  2016-01-01 00:00:00   2016-01-01 00:00:00               2          1.10
2        2  2016-01-01 00:00:00   2016-01-01 00:00:00               5          4.90
3        2  2016-01-01 00:00:00   2016-01-01 00:00:00               1         10.54
4        2  2016-01-01 00:00:00   2016-01-01 00:00:00               1          4.75
5        2  2016-01-01 00:00:00   2016-01-01 00:00:00               3          1.76
6        2  2016-01-01 00:00:00   2016-01-01 00:18:30               2          5.52
  pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude
1        -73.99037        40.73470          1                  N         -73.98184
2        -73.98078        40.72991          1                  N         -73.94447
3        -73.98455        40.67957          1                  N         -73.95027
4        -73.99347        40.71899          1                  N         -73.96224
5        -73.96062        40.78133          1                  N         -73.97726
6        -73.98012        40.74305          1                  N         -73.91349
  dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount
1         40.73241            2         7.5   0.5     0.5          0            0
2         40.71668            1        18.0   0.5     0.5          0            0
3         40.78893            1        33.0   0.5     0.5          0            0
4         40.65733            2        16.5   0.0     0.5          0            0
5         40.75851            2         8.0   0.0     0.5          0            0
6         40.76314            2        19.0   0.5     0.5          0            0
  improvement_surcharge total_amount pickup_hour pickup_dow dropoff_hour
1                   0.3          8.8    10PM-1AM        Fri     10PM-1AM
2                   0.3         19.3    10PM-1AM        Fri     10PM-1AM
3                   0.3         34.3    10PM-1AM        Fri     10PM-1AM
4                   0.3         17.3    10PM-1AM        Fri     10PM-1AM
5                   0.3          8.8    10PM-1AM        Fri     10PM-1AM
6                   0.3         20.3    10PM-1AM        Fri     10PM-1AM
  dropoff_dow trip_duration
1         Fri             0
2         Fri             0
3         Fri             0
4         Fri             0
5         Fri             0
6         Fri          1110

We run one last test before applying the transformation. Recall that rxDataStep works with a data.frame input too, and that leaving the outFile argument means we return a data.frame. So we can perform the above test with rxDataStep by passing transformation function to transformFunc and specifying the required packages in transformPackages.

head(rxDataStep(nyc_sample_df, transformFunc = xforms, transformPackages = "lubridate"))
Rows Processed: 1000 
  VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
1        2  2016-01-01 00:00:00   2016-01-01 00:00:00               2          1.10
2        2  2016-01-01 00:00:00   2016-01-01 00:00:00               5          4.90
3        2  2016-01-01 00:00:00   2016-01-01 00:00:00               1         10.54
4        2  2016-01-01 00:00:00   2016-01-01 00:00:00               1          4.75
5        2  2016-01-01 00:00:00   2016-01-01 00:00:00               3          1.76
6        2  2016-01-01 00:00:00   2016-01-01 00:18:30               2          5.52
  pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude
1        -73.99037        40.73470          1                  N         -73.98184
2        -73.98078        40.72991          1                  N         -73.94447
3        -73.98455        40.67957          1                  N         -73.95027
4        -73.99347        40.71899          1                  N         -73.96224
5        -73.96062        40.78133          1                  N         -73.97726
6        -73.98012        40.74305          1                  N         -73.91349
  dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount
1         40.73241            2         7.5   0.5     0.5          0            0
2         40.71668            1        18.0   0.5     0.5          0            0
3         40.78893            1        33.0   0.5     0.5          0            0
4         40.65733            2        16.5   0.0     0.5          0            0
5         40.75851            2         8.0   0.0     0.5          0            0
6         40.76314            2        19.0   0.5     0.5          0            0
  improvement_surcharge total_amount pickup_hour pickup_dow dropoff_hour
1                   0.3          8.8    10PM-1AM        Fri     10PM-1AM
2                   0.3         19.3    10PM-1AM        Fri     10PM-1AM
3                   0.3         34.3    10PM-1AM        Fri     10PM-1AM
4                   0.3         17.3    10PM-1AM        Fri     10PM-1AM
5                   0.3          8.8    10PM-1AM        Fri     10PM-1AM
6                   0.3         20.3    10PM-1AM        Fri     10PM-1AM
  dropoff_dow trip_duration
1         Fri             0
2         Fri             0
3         Fri             0
4         Fri             0
5         Fri             0
6         Fri          1110

Everything seems to be working well. This does not guarantee that running the transformation function on the whole dataset will succeed, but it makes it less likely to fail for the wrong reasons. If the transformation works on the sample data.frame, as it does above, but fails when we run it on the whole dataset, then it is usually because of something in the dataset that causes it to fail (such as missing values) that was not present in the sample data. We now run the transformation on the whole data set.

st <- Sys.time()
rxDataStep(nyc_xdf, nyc_xdf, overwrite = TRUE, transformFunc = xforms, transformPackages = "lubridate")
Sys.time() - st
Time difference of 11.07041 mins

results matching ""

    No results matching ""