Exercises

We learned how to use the rxSummary function to summarize the data. If we pass the formula ~ . to rxSummary, we get a summary of all the columns in the data. This summary consists of counts for factor columns and numeric summaries for numeric and integer columns (character columns don't have any summary statistics).
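
As a quick refresher, here is a minimal sketch of the ~ . formula on a small in-memory data set. It assumes the RevoScaleR package is installed and uses R's built-in iris data rather than the taxi data; the object name sum_iris is made up for illustration. The guard lets the snippet run without error where RevoScaleR is not available.

```r
# Assumption: RevoScaleR is installed; otherwise this block is skipped.
if (requireNamespace("RevoScaleR", quietly = TRUE)) {
  library(RevoScaleR)
  # ~ . asks for a summary of every column in the data
  sum_iris <- rxSummary(~ ., data = iris)
  print(sum_iris$sDataFrame)   # numeric summaries, one row per column
  print(sum_iris$categorical)  # counts for each level of the factor columns
}
```

The same call should work when data points to a CSV or XDF file instead of a data.frame, which is what the exercises below ask for.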

Using the CSV file that holds the January 2016 sample of the NYC taxi data, perform the following analysis:

(1) Convert the CSV file for that month to XDF (using rxImport), then run rxSummary to get a summary of all its columns. Store the summary in an object called sum_xdf for later use. Use system.time to see how long the conversion and summary take together, and store the timing in an object called rt_xdf.

input_csv <- 'yellow_tripsample_2016-01.csv'
input_xdf <- 'yellow_tripsample_2016-01.xdf'

## your code goes here

file.remove(input_xdf) # remove the file to keep folder clean

(2) Run rxSummary directly on the CSV file for that month, storing the result in an object called sum_csv for later use. Use system.time to time how long it takes to summarize the CSV file, and store the timing in an object called rt_csv.

(3) Compare the runtime in part (1) to the runtime in part (2), and state which one is shorter.

rt_xdf - rt_csv
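
The subtraction above works because system.time returns a named numeric vector. The following sketch shows the pattern in base R only; the Sys.sleep calls are stand-in workloads and the names rt_slow and rt_fast are hypothetical, so replace them with the timings from parts (1) and (2).

```r
# Stand-in workloads (hypothetical), used only to produce two timings
rt_slow <- system.time(Sys.sleep(0.2))
rt_fast <- system.time(Sys.sleep(0.1))

# Element-wise difference of the user, system, and elapsed times
rt_diff <- rt_slow - rt_fast
print(rt_diff)

# A positive elapsed difference means the first workload took longer
print(rt_diff[["elapsed"]] > 0)
```

The elapsed entry is usually the one to compare, since it measures wall-clock time, including time spent reading from disk.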

(4) Extract the summaries for the columns tip_amount, fare_amount, and total_amount from sum_xdf and store them in a data.frame called amt_xdf. Extract the same information from sum_csv and store it in a data.frame called amt_csv. Compare amt_xdf and amt_csv by taking their difference (keeping only the numeric columns).
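
As a hint for the last step, here is a base R sketch of taking the difference of two data.frames while keeping only their numeric columns. The data.frames amt_a and amt_b and their values are made up for illustration; they only mimic the general shape of a summary table (one label column plus numeric columns).

```r
# Toy stand-ins with made-up values: one character column, two numeric columns
amt_a <- data.frame(Name = c("x", "y"), Mean = c(1.5, 2.0), Max = c(10, 20),
                    stringsAsFactors = FALSE)
amt_b <- data.frame(Name = c("x", "y"), Mean = c(1.5, 2.5), Max = c(10, 18),
                    stringsAsFactors = FALSE)

# Logical vector marking which columns are numeric
num_cols <- sapply(amt_a, is.numeric)

# Element-wise difference over the numeric columns only
amt_diff <- amt_a[, num_cols] - amt_b[, num_cols]
print(amt_diff)
```

This assumes both data.frames list the same rows in the same order; if they might not, it is safer to sort or merge on the label column first.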
