Solutions

The purpose of this exercise is to compare runtimes for a single call of rxSummary on an XDF file versus a CSV file containing the same data. If the XDF file already exists, rxSummary will always be faster on the XDF file than on the CSV file. But for the comparison to be fair, we assume the XDF file does not exist and needs to be created, and we include the time it takes to convert the CSV file to XDF as part of the runtime for the summary on the XDF file.

(1) Both the rxImport and rxSummary calls are part of the runtime calculation.

input_csv <- 'yellow_tripdata_2016-01.csv' # needed by rxImport below
input_xdf <- 'yellow_tripdata_2016-01.xdf'

st <- Sys.time()
rxImport(input_csv, input_xdf, colClasses = col_classes, overwrite = TRUE)
jan_2016_xdf <- RxXdfData(input_xdf)
sum_xdf <- rxSummary( ~ ., jan_2016_xdf)
rt_xdf <- Sys.time() - st # runtime for XDF file

file.remove(input_xdf) # remove the file to keep folder clean

(2) We point rxSummary directly to the CSV file this time.

input_csv <- 'yellow_tripdata_2016-01.csv'

st <- Sys.time()
jan_2016_csv <- RxTextData(input_csv, colClasses = col_classes)
sum_csv <- rxSummary( ~ ., jan_2016_csv)
rt_csv <- Sys.time() - st # runtime for CSV file

(3) We can just take the difference of the runtimes.

rt_xdf - rt_csv
Time difference of -2.469199 mins

We can see that the XDF conversion and subsequent summary together were still faster than summarizing the CSV file, by about 2.5 minutes. This is because summarizing the XDF file is considerably faster, which more than makes up for the conversion time. Since the results are I/O dependent, they will vary with the storage hardware.
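The subtraction above works on difftime objects, whose display units R picks automatically. As a minimal sketch of an alternative, runtimes can be pinned to a fixed unit with base R's difftime so that they are always directly comparable. The helper name time_mins and the Sys.sleep placeholder workloads are hypothetical stand-ins, not part of the original solution.

```r
# Hypothetical helper: evaluate an expression and return elapsed minutes.
time_mins <- function(expr) {
  st <- Sys.time()
  force(expr)  # the argument is lazy, so it is evaluated here, after st is set
  as.numeric(difftime(Sys.time(), st, units = "mins"))
}

# Placeholder workloads. In the exercise these would be the
# rxImport + rxSummary calls and the rxSummary call on the CSV source.
rt_a <- time_mins(Sys.sleep(0.2))
rt_b <- time_mins(Sys.sleep(0.1))

rt_a - rt_b  # plain numeric difference, always in minutes
```

Because the helper returns plain numbers in a known unit, the results can be stored, averaged over repeated runs, or compared without checking which unit each difftime happened to use.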

(4) The sum_xdf and sum_csv objects are lists, and the counts for the factor columns are stored in an element called categorical. Here's how we can compare the counts for one factor column:

sum_xdf$categorical[[2]]
sum_csv$categorical[[2]]
  payment_type  Counts
1            2 3673651
2            1 7181476
3            3   38319
4            4   13411
5            5       1

The statistical summaries for the numeric columns are stored in an element called sDataFrame.

sum_xdf$sDataFrame[5, ]
sum_csv$sDataFrame[5, ]
           Name     Mean   StdDev Min     Max ValidObs MissingObs
5 trip_distance 4.648197 2981.095   0 8000010 10906858          0

In both cases, the results from the XDF file and the CSV file are identical.
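Rather than comparing printed output by eye, the check can be done programmatically. The sketch below uses only base R and small stand-in data frames filled with the trip_distance values printed above. The names row_xdf and row_csv and the tiny perturbation added to one value are hypothetical, introduced only to show how the two comparison functions differ.

```r
# Stand-in for sum_xdf$sDataFrame[5, ], using the values printed above.
row_xdf <- data.frame(Name = "trip_distance", Mean = 4.648197,
                      StdDev = 2981.095, Min = 0, Max = 8000010,
                      ValidObs = 10906858, MissingObs = 0)

# Stand-in for sum_csv$sDataFrame[5, ]. The 1e-12 offset is artificial:
# it mimics floating-point noise and is NOT an observed difference.
row_csv <- row_xdf
row_csv$Mean <- row_csv$Mean + 1e-12

# all.equal() compares numbers within a small tolerance;
# isTRUE() reduces its result to a single TRUE or FALSE.
same_within_tol <- isTRUE(all.equal(row_xdf, row_csv))

# identical() demands an exact match, so the noise makes it FALSE.
same_exactly <- identical(row_xdf, row_csv)
```

With the real objects, and assuming both sDataFrame elements are ordinary data frames as the printed output suggests, the same idea would be applied as all.equal(sum_xdf$sDataFrame, sum_csv$sDataFrame). The tolerant comparison is usually the safer choice for floating-point summaries.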
