Exercises
Here is an example of a useful new function: seq
seq(1, 10, by = 2)
(1) Once you figure out what seq does, use it to take a sample of the data consisting of every 2500th row. Such a sample is called a systematic sample.
Here is another example of a useful function: rep
rep(1, 4)
What happens if the first argument to rep is a vector?
rep(1:2, 4)
What happens if the second argument to rep is also a vector (of the same length)?
rep(c(3, 6), c(2, 5))
(2) Create a new data object consisting of 5 copies of the first row of the data.
(3) Create a new data object consisting of 5 copies of each of the first 10 rows of the data.
(4) We learned how to slice data using conditional statements. Note that in R, not all conditional statements have to involve columns in the data. Here's an example:
subset(nyc_small, fare_amount > 100 & 1:2 > 1)
See if you can describe what the above statement returns. Of course, just because we can do something in R doesn't mean that we should. Sometimes we have to sacrifice a little efficiency or conciseness for the sake of clarity. So reproduce the above subset in a way that makes the code more understandable. There is more than one way to do this, and you can break the code up into two steps instead of one if you want.
Here's another useful R function: sample. Run the example below multiple times to see the different samples being generated.
sample(1:10, 5)
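Because each call to sample draws fresh random numbers, results change from run to run. If a reproducible draw is needed, one option is to fix the random seed first. The following is a minimal sketch using base R's set.seed; the seed value 42 and the variable names are arbitrary choices for illustration and do not come from the text above.

```r
# set.seed() is base R; the seed value 42 is an arbitrary, illustrative choice.
set.seed(42)
first_draw <- sample(1:10, 5)

# Without resetting the seed, the next call continues the random stream,
# so this draw will usually differ from the first one.
second_draw <- sample(1:10, 5)

# Resetting the seed to the same value reproduces the first draw exactly.
set.seed(42)
repeat_draw <- sample(1:10, 5)

identical(first_draw, repeat_draw)  # TRUE
length(unique(first_draw))          # 5: by default no value is drawn twice
```

Fixing the seed is mainly useful when results need to be shared or checked later; for the exercises here, leaving it out is fine.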
(5) Use sample to create a random sample consisting of about 10 percent of the data. Store the result in a new data object called nyc_sample.
There is another way to do what we just did (that does not involve the sample function). We start by creating a column u containing random uniform numbers between 0 and 1, which we can generate with the runif function.
nyc_taxi$u <- runif(nrow(nyc_taxi))
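To see what such a column looks like without loading the taxi data, here is a small sketch on a made-up data frame. The data frame toy and its columns are hypothetical stand-ins for nyc_taxi, and the seed is arbitrary; only runif, nrow, and the other base R calls are real.

```r
# `toy` is a hypothetical stand-in for nyc_taxi, used only for illustration.
toy <- data.frame(id = 1:1000,
                  fare_amount = rep(c(5, 10, 20, 40), 250))

set.seed(1)                   # arbitrary seed so the sketch is repeatable
toy$u <- runif(nrow(toy))     # one uniform draw per row

names(formals(runif))         # "n" "min" "max"; min and max default to 0 and 1
range(toy$u)                  # smallest and largest draw, both inside (0, 1)
mean(toy$u)                   # roughly 0.5 for uniform draws on (0, 1)
head(toy)                     # first rows, now with the extra column u
```

Since every row gets its own independent draw, the column u can be used like any other column in a conditional statement.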
(6) Recreate the kind of sample we took in the last exercise (about 10 percent of the data), but use the column u instead.
(7) You would probably argue that the second solution is easier. There is, however, an advantage to using the sample function: it can also sample with replacement. First find the argument that allows sampling with replacement. Then use it to take a sample of size 1000 with replacement from the nyc_taxi data.