More on querying data

To query a single column of the data, we have two options:

  • we can still use the bracket notation, namely data[ , col_name]
  • we can use a list notation, namely data$col_name
nyc_taxi[1:10, "fare_amount"]
[1] 26.0 12.5 16.5 39.0 3.5 27.0 7.0 8.0 7.5 52.0
nyc_taxi$fare_amount[1:10]
[1] 26.0 12.5 16.5 39.0 3.5 27.0 7.0 8.0 7.5 52.0

Depending on the situation, one notation may be preferable to the other, as we will see.

So far we sliced the data at particular rows using the index of the row. A more common situation is one where we query the data for rows that meet a given condition. Multiple conditions can be combined using the & (and) and | (or) operators.

head(nyc_taxi[nyc_taxi$fare_amount > 350, ]) # return the rows of the data where `fare_amount` exceeds 350
       VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count
82043         2  2015-01-22 14:18:00   2015-01-22 14:18:00               1
96617         2  2015-01-18 13:46:00   2015-01-18 13:46:00               1
237117        2  2015-01-25 04:33:14   2015-01-25 13:29:14               1
288188        1  2015-01-06 16:44:08   2015-01-06 16:44:47               1
370936        2  2015-01-28 14:59:06   2015-01-28 14:59:09               1
384437        1  2015-01-08 11:47:20   2015-01-08 11:47:56               1
       trip_distance pickup_longitude pickup_latitude rate_code_id store_and_fwd_flag
82043              0                0             0.0            5                  N
96617              0                0             0.0            5                  N
237117           182              -74            40.7            5                  N
288188             0              -74            40.7            5                  N
370936             0                0             0.0            5                  N
384437             0              -74            40.7            5                  N
       dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax
82043                0.0              0.0            1         563     0       0
96617                0.0              0.0            1         560     0       0
237117             -73.9             40.8            1         435     0       0
288188             -74.0             40.7            1         900     0       0
370936             -73.9             40.8            1         499     0       0
384437             -74.0             40.7            1         900     0       0
       tip_amount tolls_amount improvement_surcharge total_amount       u
82043           0          0.0                   0.0          563 0.00083
96617           0          0.0                   0.0          560 0.04667
237117         50         11.8                   0.3          497 0.03920
288188          0          0.0                   0.3          900 0.00279
370936          0          0.0                   0.3          499 0.00715
384437          0          0.0                   0.3          900 0.03583

We can use a function like grep to grab only columns that match a certain pattern, such as columns that have the word 'amount' in them.

amount_vars <- grep('amount', names(nyc_taxi), value = TRUE)
nyc_taxi[nyc_taxi$fare_amount > 350 & nyc_taxi$tip_amount < 10, amount_vars]
       fare_amount tip_amount tolls_amount total_amount
82043          563          0            0          563
96617          560          0            0          560
288188         900          0            0          900
...

As these conditional statements become longer, it becomes increasingly tedious to write nyc_taxi$ proir to the column name every time we refer to a column in the data. Note how leaving out nyc_taxi$ by accident can result in an error:

nyc_taxi[nyc_taxi$fare_amount > 350 & tip_amount < 10, amount_vars]
Error in `[.data.frame`(nyc_taxi, nyc_taxi$fare_amount > 350 & tip_amount <  : 
  object 'tip_amount' not found

As the error suggests, R expected to find a stand-alone object called tip_amount, which doesn't exist. Instead, we meant to point to the column called tip_amount in the nyc_taxi dataset, in other words nyc_taxi$tip_amount. This error also suggests one dangerous pitfall: if we did have an object called tip_amount in our R session, we may have failed to notice the bug in the code.

tip_amount <- 20 # this is the value that will be used to check the condition below
nyc_taxi[nyc_taxi$fare_amount > 350 & tip_amount < 10, amount_vars] # since `20 < 10` is FALSE, we return an empty data
[1] fare_amount  tip_amount   tolls_amount total_amount
<0 rows> (or 0-length row.names)

There are three ways to avoid such errors: (1) avoid having objects with the same name as column names in the data, (2) use the with function. With with we are explicitly telling R that the columns we reference are in nyc_taxi, this way we don't need to prefix the columns by nyc_taxi$ anymore. Here's the above query rewritten using with.

with(nyc_taxi, nyc_taxi[fare_amount > 350 & tip_amount < 10, amount_vars])
        fare_amount tip_amount tolls_amount total_amount
82043           563          0            0          563
96617           560          0            0          560
288188          900          0            0          900
370936          499          0            0          499

...

We can use with any time we need to reference multiple columns in the data, not just for slicing the data. In the specific case where we slice the data, there is another option: using the subset function. Just like with, subset takes in the data as its first input so we don't have to prefix column names with nyc_taxi$. We can also use the select argument to slice by columns. Let's contrast slicing the data using subset with the bracket notation:

  • bracket notation: data[rows_to_slice, columns_to_slice]
  • using subset: subset(data, rows_to_slice, select = columns_to_slice)

Here's what the above query would look like using subset:

subset(nyc_taxi, fare_amount > 350 & tip_amount < 10, select = amount_vars)
        fare_amount tip_amount tolls_amount total_amount
82043           563          0            0          563
96617           560          0            0          560
288188          900          0            0          900

...

The select argument for subset allows us to select columns in a way that is not possible with the bracket notation:

nyc_small <- subset(nyc_taxi, fare_amount > 350 & tip_amount < 10, 
                    select = fare_amount:tip_amount) # return all columns between `fare_amount` and `tip_amount`
dim(nyc_small)
[1] 42 4

Take a look at nyc_small, do you notice anything unusual?

head(nyc_small)
       fare_amount extra mta_tax tip_amount
82043          563     0       0          0
96617          560     0       0          0
288188         900     0       0          0
370936         499     0       0          0
384437         900     0       0          0
455751         570     0       0          0
rownames(nyc_small) # here's a hint
 [1] "82043"   "96617"   "288188"  "370936"  "384437"  "455751"  "667190"  "727627" 
 [9] "787235"  "900878"  "1223662" "1297950" "1447237" "1459729" "1478264" "1550198"
[17] "1557541" "1689750" "1762984" "1813668" "1817382" "1904821" "1977690" "2036488"
[25] "2048975" "2067694" "2136470" "2297622" "2360277" "2377436" "2438532" "2559946"
[33] "2695744" "2740956" "2795487" "2843909" "2878208" "3066192" "3263119" "3614580"
[41] "3706056" "3758078"

So subsetting data preserves the row names, which is sometimes useful. We can always reset the rownames by doing this:

rownames(nyc_small) <- NULL
head(nyc_small) # row names are reset
  fare_amount extra mta_tax tip_amount
1         563     0       0          0
2         560     0       0          0
3         900     0       0          0
4         499     0       0          0
5         900     0       0          0
6         570     0       0          0

results matching ""

    No results matching ""