Loading data into R

How we load data into R depends on the kind of data and on where it is stored. The standard format for data is tabular, and a CSV file is one example of tabular data. We read flat files into R using the read.table function. For CSVs there is a shorthand version of read.table called read.csv, which makes the following assumptions about the data (all of which can be overridden if need be; a short sketch follows the list):

  • a comma is used to separate entries
  • column headers are at the top
  • rows all have an equal number of entries, with two adjacent commas representing an empty cell
  • the file contains only the data, with all other metadata stored in a separate file referred to as the data dictionary
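
Each of these assumptions corresponds to an argument we can override. As a minimal sketch (the file name and delimiter below are hypothetical, not part of our dataset), reading a semicolon-delimited file with no header row might look like this:

df <- read.csv('other_data.csv', sep = ';', header = FALSE) # hypothetical file: semicolon-delimited, no header row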

As a starting point, we can use the readLines function in R to print the first few lines of the data.

data_path <- 'NYC_sample.csv'
readLines(data_path, n = 3) # print the first 3 lines of the file
[1] "\"VendorID\",\"tpep_pickup_datetime\",\"tpep_dropoff_datetime\",\"passenger_count\",\"trip_distance\",\"pickup_longitude\",\"pickup_latitude\",\"RateCodeID\",\"store_and_fwd_flag\",\"dropoff_longitude\",\"dropoff_latitude\",\"payment_type\",\"fare_amount\",\"extra\",\"mta_tax\",\"tip_amount\",\"tolls_amount\",\"improvement_surcharge\",\"total_amount\",\"u\""

[2] "\"2\",\"2015-01-15 19:05:40\",\"2015-01-15 19:28:18\",5,8.33,-73.8630599975586,40.7695808410645,\"1\",\"N\",-73.9527130126953,40.7857818603516,\"1\",26,1,0.5,8.08,5.33,0.3,41.21,0.0191304027102888"

[3] "\"2\",\"2015-01-25 00:13:06\",\"2015-01-25 00:24:51\",1,3.37,-73.9455108642578,40.7737236022949,\"1\",\"N\",-73.987434387207,40.7557067871094,\"1\",12.5,0.5,0.5,0,0,0.3,13.8,0.0228826124221087"

Before we run read.csv to load the data into R, let's take a closer look at the function through the R help documentation, which we can pull up by typing ?read.csv at the R console.

?read.csv

As we can see from the help page above, read.csv is a wrapper around the more general function read.table, with some arguments set to default values appropriate to CSV files (such as sep = ',' and header = TRUE). There are many arguments in read.table worth knowing about; just to name a few (a short sketch follows the list):

  • nrows for limiting the number of rows we read,
  • na.strings for specifying what defines an NA in a character column,
  • skip for skipping a certain number of rows before we start reading the data,
  • stringsAsFactors = FALSE for suppressing the automatic conversion of character columns to factors.
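
As a quick sketch of how these arguments combine (the argument values here are illustrative choices, not requirements of this dataset), we could preview a small slice of the file before committing to a full load:

# read only the first 100 rows, treating empty strings as NA
# and leaving character columns as character (not factor)
taxi_preview <- read.csv(data_path, nrows = 100, na.strings = c('NA', ''), stringsAsFactors = FALSE)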

Time to run read.csv. Since the dataset we are reading is relatively large, we time how long it takes to load into R. Once all the data is read, we have an object called nyc_taxi in the R session. This object is an R data.frame, as we can confirm by checking its class, and we can run a simple query on nyc_taxi by passing it to the head function.

st <- Sys.time()
nyc_taxi <- read.csv(data_path, stringsAsFactors = FALSE)
Sys.time() - st
Time difference of 1.77 mins
print(class(nyc_taxi))
[1] "data.frame"

It is important to know that nyc_taxi is no longer linked to the original CSV file: the CSV file resides somewhere on disk, and nyc_taxi is a copy of its contents sitting in memory. Any modifications we make to this object will not overwrite the CSV file, or any file on disk, unless we explicitly write them out (for example, using write.table).
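
If we did want to persist a modified copy, a minimal sketch would look like the following (the output file name is hypothetical; writing to a new name leaves the original CSV untouched):

write.table(nyc_taxi, 'NYC_sample_copy.csv', sep = ',', row.names = FALSE) # write a comma-separated copy to a new (hypothetical) file

Let's begin by comparing the size of the original CSV file with the size of its copy in the R session.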

obj_size_mb <- as.numeric(object.size(nyc_taxi)) / 2^20 # object size in memory; divide by 2^20 to convert bytes to megabytes (as.numeric guards against integer overflow for very large objects)
obj_size_mb
[1] 987
file_size_mb <- file.size(data_path) / 2^20 # size of the original file
file_size_mb
[1] 659
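
As an aside, base R can report an object's size directly in a chosen unit, saving us the manual conversion:

format(object.size(nyc_taxi), units = 'Mb') # same measurement as above, formatted in megabytes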

As we can see, the object nyc_taxi takes up more space in memory than the CSV file does on disk. Since a computer has far less memory than disk space, for a long time the need to load data in its entirety into memory imposed a serious limitation on using R with large datasets. Over the years, machines have gained more CPU power and more memory, but data sizes have grown even faster, so fundamentally the problem remains. As we become better R programmers, we learn ways to load and process data more efficiently, but writing efficient R code is not always easy, and sometimes not even desirable, as the resulting code can end up hard to read and understand.

Nowadays there are R packages that let us handle large datasets quickly and without hogging too much memory. Microsoft R Server's RevoScaleR package is an example of such a package. RevoScaleR is covered in a separate course, for which the current course can serve as a prerequisite.

With the data loaded into R, we can now set out to examine its content, which is the subject of the next chapter.
