Exercises

Feature extraction is the process of creating new (and interesting) columns in our data out of the existing columns. Sometimes new features can be directly extracted from one or several columns in the data. For example, we can extract the day of the week from pickup_datetime and dropoff_datetime. Sometimes new features rely on third-party data. For example, we could add a holiday_flag column to mark which dates were holidays.
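As a rough illustration of both kinds of feature, here is a minimal sketch in base R. It works on a small stand-in data frame rather than the full nyc_taxi data. The names taxi_sample, pickup_dow and holidays are assumptions made for this example, the last timestamp is invented, and the holiday list is a hand-typed placeholder for what would normally come from an external calendar.

```r
# Toy stand-in for nyc_taxi: three real pickup times plus one made-up row
# (the last one) chosen so that the holiday flag has something to catch.
taxi_sample <- data.frame(
  pickup_datetime = as.POSIXct(
    c("2015-01-15 19:05:40", "2015-01-25 00:13:06",
      "2015-01-04 13:44:52", "2015-01-19 09:00:00"),
    tz = "UTC"
  )
)

# Feature extracted from an existing column: day of the week of the pickup.
# "%u" gives the weekday as a number (1 = Monday, ..., 7 = Sunday), which
# avoids locale-dependent day names.
taxi_sample$pickup_dow <- as.integer(format(taxi_sample$pickup_datetime, "%u"))

# Feature relying on third-party data: a holiday flag.
# `holidays` is a hypothetical lookup table typed in by hand for this sketch.
holidays <- as.Date(c("2015-01-01", "2015-01-19"))
taxi_sample$holiday_flag <-
  as.Date(taxi_sample$pickup_datetime, tz = "UTC") %in% holidays

print(taxi_sample)
```

The same pattern would apply to dropoff_datetime. Whether the weekday is better stored as a number or as a labelled factor depends on how it will be summarized later.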

Let's take a look at the data as it now stands.

head(nyc_taxi)
      pickup_datetime    dropoff_datetime passenger_count trip_distance
1 2015-01-15 19:05:40 2015-01-15 19:28:18               5          8.33
2 2015-01-25 00:13:06 2015-01-25 00:24:51               1          3.37
3 2015-01-25 00:13:08 2015-01-25 00:34:57               1          3.72
4 2015-01-25 00:13:09 2015-01-25 01:02:40               1         10.20
5 2015-01-04 13:44:52 2015-01-04 13:46:38               1          0.36
6 2015-01-04 13:44:52 2015-01-04 14:04:23               1          8.98
  pickup_longitude pickup_latitude rate_code_id dropoff_longitude dropoff_latitude
1            -73.9            40.8     standard             -74.0             40.8
2            -73.9            40.8     standard             -74.0             40.8
3            -74.0            40.8     standard             -74.0             40.7
4            -74.0            40.8     standard             -73.9             40.7
5            -74.0            40.8     standard             -74.0             40.8
6            -73.9            40.8     standard             -74.0             40.8
  payment_type fare_amount extra mta_tax tip_amount tolls_amount
1         card        26.0   1.0     0.5       8.08         5.33
2         card        12.5   0.5     0.5       0.00         0.00
3         card        16.5   0.5     0.5       3.56         0.00
4         cash        39.0   0.5     0.5       0.00         0.00
5         cash         3.5   0.0     0.5       0.00         0.00
6         card        27.0   0.0     0.5       0.00         5.33
  improvement_surcharge total_amount
1                   0.3         41.2
2                   0.3         13.8
3                   0.3         21.4
4                   0.3         40.3
5                   0.3          4.3
6                   0.3         33.1

Discuss possible 'features' (columns) that we can extract from already existing columns. Recall that our goal is to tell interesting (unexpected, or not immediately obvious) stories based on the data, so think of features that would make this dataset more interesting to analyze and the story more compelling.
