Filtering by Manhattan

We now narrow our field of vision by focusing on trips that took place inside of Manhattan only, and meet "reasonable" criteria for a trip. Since we added new features to the data, we can also drop some old columns from the data so that the data can be processed faster.

input_xdf <- 'yellow_tripdata_2016_manhattan.xdf'
mht_xdf <- RxXdfData(input_xdf)

rxDataStep(nyc_xdf, mht_xdf, 
           rowSelection = (
             passenger_count > 0 &
               trip_distance >= 0 & trip_distance < 30 &
               trip_duration > 0 & trip_duration < 60*60*24 &
               str_detect(pickup_borough, 'Manhattan') &
               str_detect(dropoff_borough, 'Manhattan') &
               !is.na(pickup_nb) &
               !is.na(dropoff_nb) &
               fare_amount > 0), 
           transformPackages = "stringr",
           varsToDrop = c('extra', 'mta_tax', 'improvement_surcharge', 'total_amount', 
                          'pickup_borough', 'dropoff_borough', 'pickup_nhood', 'dropoff_nhood'),
           overwrite = TRUE)

And since we limited the scope of the data, it might be a good idea to create a sample of the new data (as a data.frame). Our last sample, nyc_sample_df was not a good sample, since we only took the top 1000 rows of the data. This time, we use rxDataStep to create a random sample of the data, containing only 1 percent of the rows from the larger dataset.

mht_sample_df <- rxDataStep(mht_xdf, rowSelection = (u < .01), 
                            transforms = list(u = runif(.rxNumRows)))

dim(mht_sample_df)
Rows Processed: 57493035 
WARNING: The number of rows (574832) times the number of columns (24)
exceeds the 'maxRowsByCols' argument (3000000). Rows will be truncated.

> dim(mht_sample_df)
[1] 125000     24

results matching ""

    No results matching ""