Solutions

(1) We simply use the output of seq as indexes for selecting rows:

head(nyc_taxi[seq(1, nrow(nyc_taxi), 2500), ])

      VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
1            2  2015-01-15 19:05:40   2015-01-15 19:28:18               5
2501         2  2015-01-03 21:37:52   2015-01-03 21:59:17               1
5001         2  2015-01-17 07:56:02   2015-01-17 08:01:17               1
7501         2  2015-01-29 09:44:03   2015-01-29 10:01:40               1
10001        1  2015-01-03 19:22:11   2015-01-03 19:30:01               1
12501        1  2015-01-19 21:58:44   2015-01-19 22:05:32               1
      trip_distance pickup_longitude pickup_latitude rate_code_id store_and_fwd_flag
1              8.33            -73.9            40.8            1                  N
2501           4.54            -74.0            40.7            1                  N
5001           1.37            -74.0            40.8            1                  N
7501           1.38            -74.0            40.7            1                  N
10001          0.80            -74.0            40.8            1                  N
12501          1.80            -74.0            40.8            1                  N
      dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax
1                   -74             40.8            1        26.0   1.0     0.5
2501                -74             40.8            1        17.5   0.5     0.5
5001                -74             40.8            2         6.5   0.0     0.5
7501                -74             40.8            2        12.0   0.0     0.5
10001               -74             40.8            2         6.5   0.0     0.5
12501               -74             40.8            1         7.5   0.5     0.5
      tip_amount tolls_amount improvement_surcharge total_amount        u
1           8.08         5.33                   0.3         41.2 0.019130
2501        2.00         0.00                   0.3         20.8 0.017176
5001        0.00         0.00                   0.3          7.3 0.000286
7501        0.00         0.00                   0.3         12.8 0.049494
10001       0.00         0.00                   0.0          7.3 0.021326
12501       1.75         0.00                   0.3         10.6 0.006837

Another approach is to use the modulo operator (%%) in R, but this approach is less efficient, because it computes a value for every row of the data before selecting any of them.

head(nyc_taxi[1:nrow(nyc_taxi) %% 2500 == 1, ])
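To check that the two approaches pick the same rows, here is a minimal sketch on a small made-up data frame. The names toy_df and step are illustrative assumptions and are not part of the exercise data.

```r
# Toy data frame standing in for nyc_taxi (hypothetical)
toy_df <- data.frame(id = 1:10, value = (1:10)^2)

step <- 3  # keep rows 1, 4, 7, 10

# Approach 1: generate only the row numbers we need
by_seq <- toy_df[seq(1, nrow(toy_df), by = step), ]

# Approach 2: build a logical vector with one entry per row, then filter
by_mod <- toy_df[seq_len(nrow(toy_df)) %% step == 1, ]

# Both approaches select the same rows
identical(rownames(by_seq), rownames(by_mod))
```

The second approach has to evaluate the modulo and the comparison for every row, which is why the seq version is usually the better choice on large data.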

(2) In this case, we are still using the same bracket notation, but this time we return the first row 5 times.

nyc_taxi[rep(1, 5), ]

    VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
1          2  2015-01-15 19:05:40   2015-01-15 19:28:18               5          8.33
1.1        2  2015-01-15 19:05:40   2015-01-15 19:28:18               5          8.33
1.2        2  2015-01-15 19:05:40   2015-01-15 19:28:18               5          8.33
1.3        2  2015-01-15 19:05:40   2015-01-15 19:28:18               5          8.33
1.4        2  2015-01-15 19:05:40   2015-01-15 19:28:18               5          8.33
    pickup_longitude pickup_latitude rate_code_id store_and_fwd_flag
1              -73.9            40.8            1                  N
1.1            -73.9            40.8            1                  N
1.2            -73.9            40.8            1                  N
1.3            -73.9            40.8            1                  N
1.4            -73.9            40.8            1                  N
    dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax
1                 -74             40.8            1          26     1     0.5
1.1               -74             40.8            1          26     1     0.5
1.2               -74             40.8            1          26     1     0.5
1.3               -74             40.8            1          26     1     0.5
1.4               -74             40.8            1          26     1     0.5
    tip_amount tolls_amount improvement_surcharge total_amount      u
1         8.08         5.33                   0.3         41.2 0.0191
1.1       8.08         5.33                   0.3         41.2 0.0191
1.2       8.08         5.33                   0.3         41.2 0.0191
1.3       8.08         5.33                   0.3         41.2 0.0191
1.4       8.08         5.33                   0.3         41.2 0.0191

(3) This is akin to the last exercise, but this time we repeat 1:10 instead of just 1. We use head here to only show the top 6 rows.

head(nyc_taxi[rep(1:10, 5), ])

  VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance
1        2  2015-01-15 19:05:40   2015-01-15 19:28:18               5          8.33
2        2  2015-01-25 00:13:06   2015-01-25 00:24:51               1          3.37
3        2  2015-01-25 00:13:08   2015-01-25 00:34:57               1          3.72
4        2  2015-01-25 00:13:09   2015-01-25 01:02:40               1         10.20
5        2  2015-01-04 13:44:52   2015-01-04 13:46:38               1          0.36
6        2  2015-01-04 13:44:52   2015-01-04 14:04:23               1          8.98
  pickup_longitude pickup_latitude rate_code_id store_and_fwd_flag dropoff_longitude
1            -73.9            40.8            1                  N             -74.0
2            -73.9            40.8            1                  N             -74.0
3            -74.0            40.8            1                  N             -74.0
4            -74.0            40.8            1                  N             -73.9
5            -74.0            40.8            1                  N             -74.0
6            -73.9            40.8            1                  N             -74.0
  dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount
1             40.8            1        26.0   1.0     0.5       8.08         5.33
2             40.8            1        12.5   0.5     0.5       0.00         0.00
3             40.7            1        16.5   0.5     0.5       3.56         0.00
4             40.7            2        39.0   0.5     0.5       0.00         0.00
5             40.8            2         3.5   0.0     0.5       0.00         0.00
6             40.8            1        27.0   0.0     0.5       0.00         5.33
  improvement_surcharge total_amount      u
1                   0.3         41.2 0.0191
2                   0.3         13.8 0.0229
3                   0.3         21.4 0.0399
4                   0.3         40.3 0.0055
5                   0.3          4.3 0.0497
6                   0.3         33.1 0.0383

Notice the way that the row indexes appear in the results in each case. This can sometimes be an indication of how the data was sampled.
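The row-name behaviour can be reproduced on a small made-up data frame. This is only a sketch, and toy_df is a hypothetical stand-in for nyc_taxi.

```r
# Toy data frame standing in for nyc_taxi (hypothetical)
toy_df <- data.frame(x = c(10, 20, 30), y = c("a", "b", "c"))

# Repeating a row index: R adds suffixes to keep row names unique
repeated <- toy_df[rep(1, 3), ]
rownames(repeated)   # "1" "1.1" "1.2"

# Taking every other row: the original row numbers are kept as row names
skipped <- toy_df[seq(1, nrow(toy_df), by = 2), ]
rownames(skipped)    # "1" "3"
```

Row names with suffixes such as 1.1 suggest that rows were repeated, while gaps in the row names suggest that rows were skipped or filtered out.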

(4) As it turns out, the second condition makes us skip every other row of the data, but we need to be familiar with vector operations in R to guess that, and even then it is not immediately clear that this is what is happening. Here is a clearer way of doing the same thing:

nyc_small <- nyc_small[seq(2, nrow(nyc_small), by = 2), ] # take even-numbered rows
subset(nyc_small, fare_amount > 100)
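The mechanism behind this kind of row skipping is recycling of a short logical vector. The sketch below uses a made-up data frame and a length-2 logical vector as a hypothetical stand-in for the exercise's second condition, so treat it as an illustration rather than a reproduction of the exercise.

```r
# Toy data frame standing in for nyc_small (hypothetical)
toy_df <- data.frame(id = 1:6,
                     fare_amount = c(120, 150, 80, 200, 110, 90))

# Hypothetical short condition: a length-2 logical vector.
# R recycles it to FALSE, TRUE, FALSE, TRUE, ... across all rows.
short_cond <- c(FALSE, TRUE)
recycled <- toy_df[short_cond, ]

# The explicit version: select the even-numbered rows by position
explicit <- toy_df[seq(2, nrow(toy_df), by = 2), ]

# Both select rows 2, 4 and 6
identical(recycled$id, explicit$id)

# Filtering on fare_amount afterwards keeps the rows with id 2 and 4
big_fares <- subset(explicit, fare_amount > 100)
big_fares
```

Writing the row selection with seq makes the intent visible in the code, instead of relying on the reader to spot the recycling.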

(5) Here's one way we can sample from the data.

nyc_sample <- nyc_taxi[sample(1:nrow(nyc_taxi), nrow(nyc_taxi)/10) , ]

(6) Here's a second way of doing it.

nyc_sample <- subset(nyc_taxi, u < .1)
nyc_sample$u <- NULL # we can drop `u` now, since it is no longer needed
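The approaches in (5) and (6) differ in one respect that is easy to miss: the first fixes the sample size, while the second fixes the probability of keeping each row. The sketch below illustrates this on made-up data, assuming that u holds independent draws from a uniform distribution on (0, 1). The data frame toy_df and the seed are illustrative assumptions.

```r
set.seed(42)  # only so that the sketch is reproducible

# Toy data frame standing in for nyc_taxi (hypothetical)
toy_df <- data.frame(id = 1:1000)
toy_df$u <- runif(nrow(toy_df))  # assumed: one Uniform(0, 1) draw per row

# Approach from (5): the number of sampled rows is fixed in advance
fixed_size <- toy_df[sample(seq_len(nrow(toy_df)), nrow(toy_df) / 10), ]
nrow(fixed_size)    # exactly 100

# Approach from (6): each row is kept with probability 0.1
random_size <- subset(toy_df, u < 0.1)
nrow(random_size)   # close to 100, but not guaranteed to equal it
```

If an exact sample size matters, the sample-based approach is the safer choice. The u-based approach is convenient when an approximate fraction is good enough.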

(7) We can sample with replacement using the replace = TRUE argument in the sample function.

nyc_sample <- nyc_taxi[sample(1:nrow(nyc_taxi), 1000, replace = TRUE) , ]
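As a small illustration of what replace = TRUE changes, the sketch below samples more rows than the data contains, which is only possible with replacement. The data frame toy_df and the seed are made up for the example.

```r
set.seed(1)  # only so that the sketch is reproducible

# Toy data frame standing in for nyc_taxi (hypothetical)
toy_df <- data.frame(id = 1:100, value = rnorm(100))

# Draw 1000 row indexes from only 100 rows; this needs replace = TRUE
idx <- sample(seq_len(nrow(toy_df)), 1000, replace = TRUE)
with_repl <- toy_df[idx, ]

nrow(with_repl)                  # 1000
anyDuplicated(with_repl$id) > 0  # TRUE: some rows must appear more than once
```

Without replace = TRUE, the same call would fail, because sample cannot draw 1000 distinct values from 100.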
