Solutions
The purpose of this exercise is to compare runtimes for a single call of rxSummary on an XDF file versus a CSV file containing the same data. If the XDF file already exists, rxSummary will always be faster on the XDF file than on the CSV file. For the comparison to be fair, however, we assume the XDF file does not exist and needs to be created, so we include the time it takes to convert the CSV file to XDF as part of the runtime of the summary on the XDF file.
(1) Both the rxImport and rxSummary calls are part of the runtime calculation.
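input_csv <- 'yellow_tripdata_2016-01.csv' # source CSV; col_classes is assumed to be defined earlier in the exercise
input_xdf <- 'yellow_tripdata_2016-01.xdf'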
st <- Sys.time()
rxImport(input_csv, input_xdf, colClasses = col_classes, overwrite = TRUE)
jan_2016_xdf <- RxXdfData(input_xdf)
sum_xdf <- rxSummary( ~ ., jan_2016_xdf)
rt_xdf <- Sys.time() - st # runtime for XDF file
file.remove(input_xdf) # remove the file to keep folder clean
(2) We point rxSummary directly to the CSV file this time.
input_csv <- 'yellow_tripdata_2016-01.csv'
st <- Sys.time()
jan_2016_csv <- RxTextData(input_csv, colClasses = col_classes)
sum_csv <- rxSummary( ~ ., jan_2016_csv)
rt_csv <- Sys.time() - st # runtime for CSV file
(3) We can just take the difference of the runtimes.
rt_xdf - rt_csv
Time difference of -2.469199 mins
We can see that the XDF conversion and subsequent summary were still faster than summarizing the CSV file directly. This is because summarizing the XDF file is considerably faster, which more than makes up for the conversion time. Since these results are I/O dependent, they will vary with the hard drive and the rest of the storage setup the files are read from and written to.
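One practical point when subtracting runtimes: the difference of two Sys.time() values is a difftime object whose units (secs, mins, and so on) are chosen automatically, so it can help to convert to fixed units before reporting or comparing. The following is a minimal sketch in base R. The timestamps are made up for illustration and are not the measurements reported above.

```r
# Made-up timestamps standing in for Sys.time() calls (illustration only).
st     <- as.POSIXct("2016-01-01 00:00:00", tz = "UTC")
en_xdf <- st + 300   # pretend the XDF pipeline took 300 seconds
en_csv <- st + 450   # pretend the CSV pipeline took 450 seconds

rt_xdf <- en_xdf - st   # difftime; units are picked automatically
rt_csv <- en_csv - st

# Convert the difference to explicit units before reporting it.
diff_mins <- as.numeric(rt_xdf - rt_csv, units = "mins")
diff_secs <- as.numeric(rt_xdf - rt_csv, units = "secs")

print(diff_mins)
print(diff_secs)
```

A negative value means the first pipeline finished sooner, matching the sign convention of rt_xdf - rt_csv above.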
(4) The sum_xdf and sum_csv objects are list objects, and the counts for the factor columns are stored in an element called categorical. Here's how we can compare the counts for one factor column:
sum_xdf$categorical[[2]]
sum_csv$categorical[[2]]
  payment_type  Counts
1            2 3673651
2            1 7181476
3            3   38319
4            4   13411
5            5       1
The statistical summaries for the numeric columns are stored in an element called sDataFrame.
sum_xdf$sDataFrame[5, ]
sum_csv$sDataFrame[5, ]
           Name     Mean   StdDev Min     Max ValidObs MissingObs
5 trip_distance 4.648197 2981.095   0 8000010 10906858          0
In both cases, the results from the XDF file and the CSV file are identical.
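Rather than comparing the printed output by eye, the two summaries can also be checked programmatically. The sketch below uses base R only and builds small stand-in objects (the make_summary helper and its values are hypothetical, shaped loosely like the sDataFrame and categorical elements shown above) so that it runs without the data. It also shows why all.equal can be a safer choice than identical for numeric statistics, since tiny floating-point differences would otherwise count as a mismatch.

```r
# Hypothetical helper that builds a stand-in summary object (illustration only).
make_summary <- function(mean_shift = 0) {
  list(
    sDataFrame = data.frame(
      Name = "trip_distance",
      Mean = 4.648197 + mean_shift,
      StdDev = 2981.095,
      stringsAsFactors = FALSE
    ),
    categorical = list(
      data.frame(
        payment_type = c("2", "1", "3"),
        Counts = c(3673651, 7181476, 38319),
        stringsAsFactors = FALSE
      )
    )
  )
}

sum_a <- make_summary()
sum_b <- make_summary(mean_shift = 1e-12)  # tiny floating-point discrepancy

# Numeric statistics: compare with a tolerance.
same_numeric <- isTRUE(all.equal(sum_a$sDataFrame, sum_b$sDataFrame))

# Counts: exact comparison is appropriate here.
same_counts <- identical(sum_a$categorical[[1]], sum_b$categorical[[1]])

# Exact comparison of the numeric statistics is stricter than needed.
strict_match <- identical(sum_a$sDataFrame, sum_b$sDataFrame)

print(c(same_numeric = same_numeric,
        same_counts = same_counts,
        strict_match = strict_match))
```

With the real objects, the same pattern would apply to sum_xdf$sDataFrame and sum_csv$sDataFrame, and to each element of the categorical lists.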