Exercises

Here is an example of a useful new function: seq

seq(1, 10, by = 2)

(1) Once you figure out what seq does, use it to take a sample of the data consisting of every 2500th row. Such a sample is called a systematic sample.

Here is another example of a useful function: rep

rep(1, 4)

What happens if the first argument to rep is a vector?

rep(1:2, 4)

What happens if the second argument to rep is also a vector (of the same length)?

rep(c(3, 6), c(2, 5))
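A related fact that may help with the next two exercises: a data frame can be indexed by a vector of row numbers, and the same row number may appear more than once. Here is a minimal sketch on a made-up data frame called toy (a hypothetical object, not part of the course data):

```r
# Hypothetical toy data frame, used only for illustration
toy <- data.frame(id = 1:3, value = c(10, 20, 30))

# Row indices may repeat; each occurrence returns another copy of that row
picked <- toy[c(1, 1, 3), ]
print(picked)
```

The index vector here is typed by hand, but any function that returns a vector of integers could supply it.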

(2) Create a new data object consisting of 5 copies of the first row of the data.

(3) Create a new data object consisting of 5 copies of each of the first 10 rows of the data.

(4) We learned how to slice data using conditional statements. Note that in R, not all conditional statements have to involve columns in the data. Here's an example:

subset(nyc_small, fare_amount > 100 & 1:2 > 1)

See if you can describe what the above statement returns. Of course, just because we can do something in R doesn't mean that we should. Sometimes we have to sacrifice a little efficiency or conciseness for the sake of clarity. So reproduce the above subset in a way that makes the code more understandable. There is more than one way to do this, and you can break the code up into two steps instead of one if you want.
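As a hint for this exercise, it may help to recall how R combines logical vectors of different lengths: the shorter one is recycled until it matches the longer one. The sketch below uses two made-up vectors (long_cond and short_cond are hypothetical names, not objects from the course):

```r
# A length-4 logical vector and a length-2 one built the same way as in the exercise
long_cond <- c(TRUE, TRUE, TRUE, TRUE)
short_cond <- 1:2 > 1   # FALSE, TRUE

# The shorter vector is recycled to FALSE, TRUE, FALSE, TRUE before & is applied
combined <- long_cond & short_cond
print(combined)
```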

Here's another useful R function: sample. Run the example below multiple times to see the different samples being generated.

sample(1:10, 5)
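Because sample draws at random, the result changes from run to run. When a reproducible result is needed, the random number generator can be seeded with set.seed first. A minimal sketch (the seed value 42 is an arbitrary choice, and first and second are illustrative names):

```r
# Fix the state of the random number generator, then draw
set.seed(42)            # 42 is arbitrary; any integer works
first <- sample(1:10, 5)

# Resetting the same seed reproduces the same draw
set.seed(42)
second <- sample(1:10, 5)

identical(first, second)
```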

(5) Use sample to create a random sample consisting of about 10 percent of the data. Store the result in a new data object called nyc_sample.

There is another way to do what we just did (that does not involve the sample function). We start by creating a column u containing random uniform numbers between 0 and 1, which we can generate with the runif function.

nyc_taxi$u <- runif(nrow(nyc_taxi))
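To get a feel for what runif produces, here is a small sketch on a standalone vector (the vector u below and the threshold 0.25 are illustrative choices, separate from the nyc_taxi column). The values fall strictly between 0 and 1, and the share of values below a threshold p is roughly p:

```r
set.seed(1)          # arbitrary seed, only so the sketch is reproducible
u <- runif(10000)    # 10,000 uniform random numbers between 0 and 1

range(u)             # smallest and largest values drawn
mean(u < 0.25)       # proportion of values below 0.25, close to 0.25
```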

(6) Recreate the same sample we had in the last exercise but use the column u instead.

(7) You would probably argue that the second solution is easier. There is, however, an advantage to using the sample function: it can also sample with replacement. First find the argument that allows sampling with replacement. Then use it to take a sample of size 1000 with replacement from the nyc_taxi data.
