Loading data into R
The process of loading data into R can change based on the kind of data or where the data is stored. The standard format for data is tabular, and a CSV file is an example of tabular data. We read flat files into R using the read.table function. For a CSV, there is a shorthand version of read.table called read.csv, which makes the following assumptions about the data (all of which can be overridden if need be):
- a comma is used to separate entries
- column headers are at the top
- rows all have an equal number of entries, with two adjacent commas representing an empty cell
- the file contains only the data, with all other metadata stored in a separate file referred to as the data dictionary
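In other words, a read.csv call is roughly a read.table call with CSV-friendly defaults spelled out. As a rough sketch (the file name NYC_sample.csv is the one used later in this chapter; the explicit arguments mirror read.csv's documented defaults):

```r
# these two calls should load the same data frame:
df1 <- read.csv("NYC_sample.csv")
df2 <- read.table("NYC_sample.csv",
                  sep = ",",          # comma separates entries
                  header = TRUE,      # first line holds column names
                  quote = "\"",       # only double quotes delimit strings
                  fill = TRUE,        # pad short rows with NA
                  comment.char = "")  # no comment character
```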
As a starting point, we can use the readLines function in R to print the first few lines of the data.
data_path <- 'NYC_sample.csv'
readLines(data_path, n = 3) # print the first 3 lines of the file
[1] "\"VendorID\",\"tpep_pickup_datetime\",\"tpep_dropoff_datetime\",\"passenger_count\",\"trip_distance\",\"pickup_longitude\",\"pickup_latitude\",\"RateCodeID\",\"store_and_fwd_flag\",\"dropoff_longitude\",\"dropoff_latitude\",\"payment_type\",\"fare_amount\",\"extra\",\"mta_tax\",\"tip_amount\",\"tolls_amount\",\"improvement_surcharge\",\"total_amount\",\"u\""
[2] "\"2\",\"2015-01-15 19:05:40\",\"2015-01-15 19:28:18\",5,8.33,-73.8630599975586,40.7695808410645,\"1\",\"N\",-73.9527130126953,40.7857818603516,\"1\",26,1,0.5,8.08,5.33,0.3,41.21,0.0191304027102888"
[3] "\"2\",\"2015-01-25 00:13:06\",\"2015-01-25 00:24:51\",1,3.37,-73.9455108642578,40.7737236022949,\"1\",\"N\",-73.987434387207,40.7557067871094,\"1\",12.5,0.5,0.5,0,0,0.3,13.8,0.0228826124221087"
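One of the assumptions listed above, that every row has the same number of entries, can be spot-checked with the base R function count.fields, which reports how many fields it finds on each line. A minimal sketch (note that count.fields scans the whole file, so on a large file it can take a while):

```r
# tally the number of fields per line; a single tally value
# suggests every row has the same number of entries
table(count.fields(data_path, sep = ","))
```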
Before we run read.csv to load the data into R, let's inspect it more closely by looking at the R help documentation. We can do so by typing ?read.csv from the R console.
?read.csv
As we can see from the help page above, read.csv is an offshoot of the more general function read.table, with some of the arguments set to default values appropriate to CSV files (such as sep = ',' or header = TRUE). There are many arguments in read.table worth knowing about, such as (just to name a few):
- nrows for limiting the number of rows we read
- na.strings for specifying what defines an NA in a character column
- skip for skipping a certain number of rows before we start reading the data
- stringsAsFactors for suppressing the automatic conversion of character columns to factor
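To make these arguments concrete, here is a hedged sketch of using them together; the argument values are illustrative (reading a small preview with nrows is a common way to check column types before committing to a full load):

```r
preview <- read.csv(data_path,
                    nrows = 100,               # read only the first 100 rows
                    na.strings = c("", "NA"),  # treat empty strings as NA too
                    skip = 0,                  # lines to drop before reading begins
                    stringsAsFactors = FALSE)  # keep character columns as character
str(preview)  # inspect the column types of the small preview
```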
Time to run read.csv. Since the dataset we read is relatively large, we time how long it takes to load it into R. Once all the data is read, we have an object called nyc_taxi loaded into the R session. This object is an R data.frame. We can run a simple query on nyc_taxi by passing it to the head function.
st <- Sys.time()
nyc_taxi <- read.csv(data_path, stringsAsFactors = FALSE)
Sys.time() - st
Time difference of 1.77 mins
print(class(nyc_taxi))
[1] "data.frame"
It is important to know that nyc_taxi is no longer linked to the original CSV file: the CSV file resides somewhere on disk, but nyc_taxi is a copy of it sitting in memory. Any modifications we make to this object will not overwrite the CSV file, or any file on disk, unless we explicitly write it out (for example by using write.table). Let's begin by comparing the size of the original CSV file with the size of its copy in the R session.
obj_size_mb <- as.integer(object.size(nyc_taxi)) / 2^20 # size of object in memory (we divide by 2^20 to convert from bytes to megabytes)
obj_size_mb
[1] 987
file_size_mb <- file.size(data_path) / 2^20 # size of the original file
file_size_mb
[1] 659
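As noted above, nothing we do to nyc_taxi touches the file on disk unless we explicitly write it out. A minimal sketch (the output file name is made up for illustration):

```r
# write the in-memory copy back out as a new CSV;
# row.names = FALSE avoids adding an extra index column to the file
write.csv(nyc_taxi, "NYC_sample_copy.csv", row.names = FALSE)
```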
As we can see, the object nyc_taxi takes up more space in memory than the CSV file does on disk. Since the amount of available memory on a computer is much smaller than the available disk space, for a long time the need to load data in its entirety into memory imposed a serious limitation on using R with large datasets. Over the years, machines have been endowed with more CPU power and more memory, but data sizes have grown even more, so fundamentally the problem is still there. As we become better R programmers, we can learn ways to load and process data more efficiently, but writing efficient R code is not always easy. Sometimes doing so is not even desirable, as the resulting code can end up hard to read and understand.
Nowadays there are R libraries that provide us with ways to handle large datasets quickly and without hogging too much memory. Microsoft R Server's RevoScaleR library is an example of such a package. RevoScaleR is covered in a different course, for which the current course can serve as a prerequisite.
With the data loaded into R, we can now set out to examine its content, which is the subject of the next chapter.