Exercises

We learned how to use the rxSummary function to summarize data. If we pass the formula ~ . to rxSummary, we get a summary of all the columns in the data. This summary consists of counts for factor columns and numeric summaries for numeric and integer columns (character columns are ignored).
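
As a small illustration of what the summary object holds, the sketch below runs rxSummary on R's built-in iris data frame instead of the taxi data. It assumes RevoScaleR is installed (it ships with Microsoft R Server and R Client), and it assumes the returned list exposes the elements sDataFrame and categorical; run names(iris_sum) to confirm this in your installation.

```r
library(RevoScaleR)  # assumption: available in your R installation

# iris has four numeric columns and one factor column (Species)
iris_sum <- rxSummary(~ ., data = iris)

# list the elements of the summary object
names(iris_sum)

# numeric summaries (mean, standard deviation, min, max, observation counts)
iris_sum$sDataFrame

# counts for factor columns, stored as a list of data frames
iris_sum$categorical
```

The same two elements are the natural places to look when drilling down into a summary of a larger data set.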

Using one month of the NYC taxi data (say January 2016), perform the following analysis:

(1) Convert the CSV file for that month to XDF, then run rxSummary to get a summary of all its columns. Store the summary in an object called sum_xdf for later use. Use system.time to see how long it takes to do both the conversion and summary together.

(2) Run rxSummary directly on the CSV file for that month, storing the result in an object called sum_csv for later use. Use system.time to time how long it takes to summarize the CSV file.

(3) Compare the runtime in part (1) to part (2). What is your conclusion?

(4) Pick one or two columns (one factor and one numeric) in the data and drill down into sum_xdf and sum_csv to make sure that the summaries do in fact match.
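
The timing in parts (1) to (3) can be done in two ways in base R, sketched below with a stand-in workload rather than the actual conversion or summary. The variable names (elapsed_a, elapsed_b) and the workload are hypothetical and chosen only for illustration.

```r
# Option 1: system.time() evaluates an expression and reports timings
timing <- system.time({
  total_a <- sum(sqrt(seq_len(1e6)))  # stand-in for the real work
})
elapsed_a <- timing[["elapsed"]]      # wall-clock seconds

# Option 2: difference two Sys.time() calls around the work
st <- Sys.time()
total_b <- sum(sqrt(seq_len(1e6)))    # stand-in for the real work
rt <- Sys.time() - st                 # a difftime; its units are chosen automatically

# convert to seconds explicitly so both measurements are plain numbers
elapsed_b <- as.numeric(rt, units = "secs")

c(system_time = elapsed_a, sys_time_diff = elapsed_b)
```

Converting the difftime to seconds explicitly avoids surprises when one run is reported in seconds and another in minutes.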

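Here's some code to get started. Lines where user input is required start with ##. Insert your solution into those lines. Note that the starter code measures runtime by differencing two Sys.time calls, which serves the same purpose as system.time.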

input_xdf <- 'yellow_tripdata_2016-01.xdf'
input_csv <- 'yellow_tripdata_2016-01.csv'

st <- Sys.time()
## convert CSV to XDF here
jan_2016_xdf <- RxXdfData(input_xdf)
## summarize XDF file here
rt_xdf <- Sys.time() - st

st <- Sys.time()
# col_classes is assumed to be the column-type vector defined earlier in the course
jan_2016_csv <- RxTextData(input_csv, colClasses = col_classes)
## summarize CSV file here
rt_csv <- Sys.time() - st

file.remove(input_xdf) # remove the file to keep folder clean

## compare runtimes rt_xdf and rt_csv
## compare results sum_xdf and sum_csv
