Loading the top 1000 rows
We begin by loading the first 1000 rows of the data using the read.table
function. To avoid unnecessary factor conversions, we examined the data and decided on the proper column types ahead of time, storing them in an object called col_classes
which we then pass to read.table
.
col_classes <- c('VendorID' = "factor",
'tpep_pickup_datetime' = "character",
'tpep_dropoff_datetime' = "character",
'passenger_count' = "integer",
'trip_distance' = "numeric",
'pickup_longitude' = "numeric",
'pickup_latitude' = "numeric",
'RateCodeID' = "factor",
'store_and_fwd_flag' = "factor",
'dropoff_longitude' = "numeric",
'dropoff_latitude' = "numeric",
'payment_type' = "factor",
'fare_amount' = "numeric",
'extra' = "numeric",
'mta_tax' = "numeric",
'tip_amount' = "numeric",
'tolls_amount' = "numeric",
'improvement_surcharge' = "numeric",
'total_amount' = "numeric")
It is a good practice to load a small sample of the data as a data.frame
in R. When we want to apply a function to the XDF data, we can first apply it to the data.frame
where it's easier and faster to catch errors, before applying it to the whole data. We will later learn a method for taking a random sample from the data, but for now the sample simply consists of the first 1000 rows.
input_csv <- 'yellow_tripdata_2016-01.csv'
# we take a chunk of the data and load it as a data.frame (good for testing things)
nyc_sample_df <- read.csv(input_csv, nrows = 1000, colClasses = col_classes)
head(nyc_sample_df)
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
1 2 2016-01-01 00:00:00 2016-01-01 00:00:00 2 1.10
2 2 2016-01-01 00:00:00 2016-01-01 00:00:00 5 4.90
3 2 2016-01-01 00:00:00 2016-01-01 00:00:00 1 10.54
4 2 2016-01-01 00:00:00 2016-01-01 00:00:00 1 4.75
5 2 2016-01-01 00:00:00 2016-01-01 00:00:00 3 1.76
6 2 2016-01-01 00:00:00 2016-01-01 00:18:30 2 5.52
pickup_longitude pickup_latitude RatecodeID store_and_fwd_flag dropoff_longitude
1 -73.99037 40.73470 1 N -73.98184
2 -73.98078 40.72991 1 N -73.94447
3 -73.98455 40.67957 1 N -73.95027
4 -73.99347 40.71899 1 N -73.96224
5 -73.96062 40.78133 1 N -73.97726
6 -73.98012 40.74305 1 N -73.91349
dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount
1 40.73241 2 7.5 0.5 0.5 0 0
2 40.71668 1 18.0 0.5 0.5 0 0
3 40.78893 1 33.0 0.5 0.5 0 0
4 40.65733 2 16.5 0.0 0.5 0 0
5 40.75851 2 8.0 0.0 0.5 0 0
6 40.76314 2 19.0 0.5 0.5 0 0
improvement_surcharge total_amount
1 0.3 8.8
2 0.3 19.3
3 0.3 34.3
4 0.3 17.3
5 0.3 8.8
6 0.3 20.3