Deploying to SQL Server (demo)
A basic overview of the SQL Server R Services architecture can be found here. Let's point to a SQL table containing a copy of the NYC Taxi dataset. The first thing we need to do is set up a SQL Server connection string, which contains our SQL login credentials. Since the connection string contains sensitive information, it is usually stored in a file in a restricted location and read from R, but in this example we simply hard-code it and store it in sqlConnString. Assume the NYC Taxi dataset is stored in a table called NYCTaxiSmall inside the RDB database that the connection string points to. The last thing left to do is point to the table, which we do with the RxSqlServerData function. This is the equivalent of RxXdfData when pointing to an XDF file stored on disk.
input_xdf <- 'yellow_tripdata_2016_manhattan.xdf'
nyc_xdf <- RxXdfData(input_xdf) # pointer to the XDF file on disk
sqlConnString <- "Driver=SQL Server;Server=SERVERNAME;Database=RDB;Uid=ruser;Pwd=ruser"
sqlRowsPerRead <- 100000 # number of rows to process per chunk
sqlTable <- "NYCTaxiSmall"
nyc_sql <- RxSqlServerData(connectionString = sqlConnString, rowsPerRead = sqlRowsPerRead,
                           table = sqlTable) # pointer to the SQL table
For the sake of illustration, we now dump the content of nyc_xdf into the SQL table represented by nyc_sql (which is called NYCTaxiSmall in the SQL database). If the XDF file in question is large, this can take a while.
system.time(
  rxDataStep(nyc_xdf, nyc_sql, overwrite = TRUE)
)
That's it. We can now use nyc_sql the same way we used nyc_xdf before. There is however something missing: we did not specify what the column types are. In this case, RxSqlServerData will try as best it can to convert each SQL Server column type to an R column type. This can cause problems though. First, SQL Server has a richer variety of column types than R. Second, some SQL Server column types, such as datetime, don't always transfer successfully to their corresponding R column type. Third, the R column type factor does not really have a good equivalent in SQL Server, so for a column to be brought in as a factor we must specify it manually. Doing so, however, gives us the advantage that we can also specify its levels and labels, and as we saw, they don't always have to be the exact levels we see in the data. For example, if payment_type is represented by the integers 1 through 5 in the data, but we only care about 1 and 2 and want them labeled card and cash respectively, we can do that here without needing to do it later as a separate transformation. To deal with column types, we create an object that stores the information about the columns and pass it to the colInfo argument in RxSqlServerData. Here's the example for nyc_sql:
ccColInfo <- list(
  tpep_pickup_datetime = list(type = "character"),
  tpep_dropoff_datetime = list(type = "character"),
  passenger_count = list(type = "integer"),
  trip_distance = list(type = "numeric"),
  pickup_longitude = list(type = "numeric"),
  pickup_latitude = list(type = "numeric"),
  dropoff_longitude = list(type = "numeric"),
  dropoff_latitude = list(type = "numeric"),
  RateCodeID = list(type = "factor", levels = as.character(1:6),
                    newLevels = c("standard", "JFK", "Newark", "Nassau or Westchester", "negotiated", "group ride")),
  store_and_fwd_flag = list(type = "factor", levels = c("Y", "N")),
  payment_type = list(type = "factor", levels = as.character(1:2), newLevels = c("card", "cash")),
  fare_amount = list(type = "numeric"),
  tip_amount = list(type = "numeric"),
  total_amount = list(type = "numeric")
)
Notice that the object above does not necessarily have to specify the types for every column in the data. We can limit it to the columns of interest, and even then only the ones that need to be explicitly overwritten. However, since we tend to be more conservative in a production environment, it's best to be explicit. After all, certain numeric columns in SQL Server could be stored as something different (VARCHAR, for example), which would turn into a character column in R.
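If we're unsure how a given column will come across, a quick sanity check (using the nyc_sql pointer defined above, before any colInfo is attached) is to ask R what types it infers:
rxGetVarInfo(nyc_sql) # a column stored as VARCHAR in SQL Server shows up here as character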
In addition to the columns that were in the original data, we also need to specify the column types for the columns we added to the data throughout the analysis. Let's begin with the columns derived from the pick-up and drop-off datetimes:
weekday_labels <- c('Sun', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat')
hour_labels <- c('1AM-5AM', '5AM-9AM', '9AM-12PM', '12PM-4PM', '4PM-6PM', '6PM-10PM', '10PM-1AM')
ccColInfo$pickup_dow <- list(type = "factor", levels = weekday_labels)
ccColInfo$pickup_hour <- list(type = "factor", levels = hour_labels)
ccColInfo$dropoff_dow <- list(type = "factor", levels = weekday_labels)
ccColInfo$dropoff_hour <- list(type = "factor", levels = hour_labels)
When working with the XDF file in the previous weeks, we went back and forth quite a few times to get the data into the right format, especially where factor columns were concerned. This is because we were learning about the data as we went and gradually preparing it for analysis. When working in a production environment, however, the assumption is that we have our EDA behind us and already know quite a bit about the data. If not, then a recommended approach is to take a sample of the data first and run some EDA on it. So some of the steps we took with the XDF file in the prior weeks may have contained redundancies or inefficiencies that we never bothered to go back and revise. But when deploying code to production, it's a good idea to make a second pass at the code and simplify things wherever it's warranted. As an example, when working with the XDF file, we first wrote a function to extract the pickup_nhood and dropoff_nhood columns from the pick-up and drop-off coordinates. We then noticed that those columns contained neighborhoods outside of Manhattan (our area of interest), so we made a second pass through the data to remove the factor levels for the irrelevant neighborhoods. With nyc_sql, we could take a similar approach: read those columns as factors with the levels as is, and then use rxDataStep to perform a transformation that removes the unwanted factor levels. But doing so is inefficient. The better approach is to find all the relevant factor levels (the Manhattan neighborhoods, which we can get directly from the shapefile) and specify only those as levels for those columns in the ccColInfo object. Here's how:
library(maptools) # for readShapePoly
library(stringr)  # for str_detect
nyc_shapefile <- readShapePoly('ZillowNeighborhoods-NY/ZillowNeighborhoods-NY.shp')
mht_shapefile <- subset(nyc_shapefile, str_detect(CITY, 'New York City-Manhattan'))
manhattan_nhoods <- as.character(mht_shapefile@data$NAME)
ccColInfo$pickup_nhood <- list(type = "factor", levels = manhattan_nhoods)
ccColInfo$dropoff_nhood <- list(type = "factor", levels = manhattan_nhoods)
We are now ready to point to the SQL table a second time, but this time we specify how the columns should be treated in R using the colInfo argument.
nyc_sql <- RxSqlServerData(connectionString = sqlConnString, table = sqlTable,
                           rowsPerRead = sqlRowsPerRead, colInfo = ccColInfo)
Recall that every time we make a change to ccColInfo, we need to rerun the line above so that the change is reflected in nyc_sql. For example, later (after running the seriate function), we can reorder the factor levels so that instead of being alphabetically ordered as they are now, they follow a more natural ordering based on the neighborhoods' proximity to each other.
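As a rough sketch of what that reordering could look like (assuming nhoods_dist is a distance matrix between neighborhood centroids computed from the shapefile, which is not shown here), we might do something like the following:
library(seriation)
# seriate() finds an ordering that places similar (here: nearby) items next to each
# other; get_order() extracts that ordering as a permutation of the levels
nhoods_order <- get_order(seriate(nhoods_dist))
manhattan_nhoods <- manhattan_nhoods[nhoods_order]
ccColInfo$pickup_nhood$levels <- manhattan_nhoods
ccColInfo$dropoff_nhood$levels <- manhattan_nhoods
# rerun RxSqlServerData so that nyc_sql picks up the new level order
nyc_sql <- RxSqlServerData(connectionString = sqlConnString, table = sqlTable,
                           rowsPerRead = sqlRowsPerRead, colInfo = ccColInfo)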
At this point, the rest of the analysis is no different from what it was with the XDF file: we can swap nyc_xdf for nyc_sql and run the remaining code just like before. For example, we can start with rxGetInfo to double-check the column types.
rxGetInfo(nyc_sql, getVarInfo = TRUE, numRows = 3) # show column types and the first 3 rows
Let's now run rxSummary on all the columns:
system.time(
  rxsum_sql <- rxSummary( ~ ., nyc_sql) # provide statistical summaries for all the columns
)
We get our summary back, but something important is missing: we have not yet set the compute context to the remote SQL Server session. Because the compute context was still set to the local R session (the default), rxSummary had to download the data (over an ODBC connection) to the local R session in order to summarize it. In-database analytics, however, is about taking the computation to the data, not the other way around. So let's now set the compute context to the remote SQL Server session (using rxSetComputeContext) and rerun rxSummary.
# Set ComputeContext. Needs a temp directory path to serialize R objects back and forth
sqlShareDir <- paste("C:/AllShare/", Sys.getenv("USERNAME"), sep = "")
sqlWait <- TRUE
sqlConsoleOutput <- TRUE
sqlCC <- RxInSqlServer(connectionString = sqlConnString, shareDir = sqlShareDir,
                       wait = sqlWait, consoleOutput = sqlConsoleOutput)
rxSetComputeContext(sqlCC)
system.time(
  rxsum_sql <- rxSummary( ~ ., nyc_sql) # provide statistical summaries for all the columns
)
We can set the compute context back to local at any time by running rxSetComputeContext(RxLocalSeq()), and we can run rxGetComputeContext() to see what the current compute context is. The difference in runtime between rxSummary with the local compute context and with the SQL compute context depends mostly on the size of the data and the speed at which it can travel over the network to reach the local R session. For large enough data, this difference can be dramatic, and by avoiding that cost, in-database analytics can greatly reduce our runtime.
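Here is that switching pattern in code form (sqlCC is the compute context object we created above):
rxGetComputeContext()              # check which compute context is currently active
rxSetComputeContext(RxLocalSeq())  # switch back to the local compute context
# ... run anything that must execute locally ...
rxSetComputeContext(sqlCC)         # switch back to in-database execution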
There are, however, some limitations we need to be aware of. Certain functions, such as rxMerge, rxSort, or rxSplit, only work with XDF files on the local file system, not with data sitting in Spark or SQL Server. This is because those platforms already have their own (probably more efficient) implementations of these common data-processing operations, so instead of the above-mentioned RevoScaleR functions we can simply defer to the SQL language when we need to join, sort, or split tables.
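For instance, rather than sorting with rxSort, we can let SQL Server do the sorting and point RxSqlServerData at a query instead of a table (the query below is just an illustration):
# let SQL Server handle the sorting, then expose the result to R as a data source
sorted_query <- "SELECT TOP 1000 * FROM NYCTaxiSmall ORDER BY tip_amount DESC"
nyc_sorted <- RxSqlServerData(connectionString = sqlConnString,
                              sqlQuery = sorted_query, rowsPerRead = sqlRowsPerRead)
rxGetInfo(nyc_sorted, numRows = 5) # peek at the top rows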
As an example, let's run the same linear model we built on the XDF file, now using the SQL table. We're going to build the model on 75 percent of the data (the training set). One way to mark the training set is to create a column of random uniform numbers, u, and flag the rows where u < .75 as training rows; in the code below that flag lives in the split column, so rowSelection simply picks the rows where split == "train".
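If the split column has not already been added to the table in an earlier step, here is a rough sketch of one way to add it, writing the result to a hypothetical copy of the table called NYCTaxiSmallSplit:
# write a copy of the data, with u and split added, to a new SQL table
nyc_sql_split <- RxSqlServerData(connectionString = sqlConnString,
                                 table = "NYCTaxiSmallSplit", rowsPerRead = sqlRowsPerRead)
rxDataStep(nyc_sql, nyc_sql_split, overwrite = TRUE,
           transforms = list(u = runif(.rxNumRows),
                             split = ifelse(u < .75, "train", "test")))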
system.time(linmod <- rxLinMod(tip_percent ~ pickup_nhood:dropoff_nhood + pickup_dow:pickup_hour,
                               data = nyc_sql, reportProgress = 0,
                               rowSelection = (split == "train")))
We now point to a new SQL table called NYCTaxiScore (our pointer to it in R will be called nyc_score). We then use rxPredict to score the data.
sqlTable <- "NYCTaxiScore"
nyc_score <- RxSqlServerData(connectionString = sqlConnString, rowsPerRead = sqlRowsPerRead,
                             table = sqlTable)
rxPredict(linmod, data = nyc_sql, outData = nyc_score,
          predVarNames = "tip_percent_pred_linmod", overwrite = TRUE)
Since building models can be a tedious and time-consuming process, we usually save the models we build so that we can reuse them for subsequent scoring (with rxPredict). Saving models is important because we often score new data on a regular basis using the same model. Saving models in R is easy: we simply use the save function. However, using save on the SQL Server machine may run afoul of rules about writing external objects to the server. Fortunately, there is a way to persist R objects (not just models, but images or any other objects) by inserting them into the database itself. In the following example, we begin by creating a table called models with a single column called model whose type is varbinary(max). We also create a SQL stored procedure for inserting a model object into this table.
CREATE TABLE models
(model varbinary(max))
GO
CREATE PROCEDURE [dbo].[PersistModel]
@m nvarchar(max)
AS
BEGIN
  -- SET NOCOUNT ON added to prevent extra result sets from interfering with SELECT statements.
  SET NOCOUNT ON;
  INSERT INTO models (model) VALUES (CONVERT(varbinary(max), @m, 2))
END
From R, we can now insert a serialized model object into the above SQL table. To serialize an R object we use the serialize function. We then collapse the serialized object (a raw vector) into a single, usually very long, character string that we can insert into the table. To run an arbitrary SQL query from R we can use the sqlQuery function, as shown here. In this case, the query simply executes the stored procedure that stores the model object in the table we created in the last step.
summary(linmod)
modelbin <- serialize(linmod, NULL)
modelbinstr <- paste(modelbin, collapse = "")
# Persist model by calling a stored procedure from SQL
library(RODBC)
odbcCloseAll()
conn <- odbcDriverConnect(sqlConnString)
q <- paste("EXEC PersistModel @m='", modelbinstr,"'", sep = "")
sqlQuery(conn, q)
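Before relying on the database copy, we can sanity-check the serialization round trip locally: unserializing modelbin should give us back a model object equivalent to linmod.
# local round trip: the raw vector we just serialized can be turned back into a usable model object
linmod_restored <- unserialize(modelbin)
class(linmod_restored) # same class as linmod, ready to be passed to rxPredict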
Once the model object is stored in SQL, to use it from R we must unserialize it (using the unserialize R function). Unserializing gives us back the original R object, which can then be used for scoring. So if serializing is the SQL equivalent of the save function, unserializing is the equivalent of load. We've already seen an example of scoring a dataset using rxPredict. Here's an example of doing the same thing, but by invoking a SQL stored procedure instead of running it directly within R. The stored procedure PredictTipBatchMode will score the NYCTaxiSmall dataset.
-- Create prediction stored procedure
CREATE PROCEDURE [dbo].[PredictTipBatchMode] @inquery nvarchar(max)
AS
BEGIN
DECLARE @lmodel2 varbinary(max) = (SELECT TOP 1
model
FROM models);
EXEC sp_execute_external_script @language = N'R',
@script = N'
mod <- unserialize(as.raw(model))
print(summary(mod))
OutputDataSet <- rxPredict(modelObject = mod, data = InputDataSet, outData = NULL,
predVarNames = "Score", type = "response", writeModelVars = FALSE,
overwrite = TRUE)
str(OutputDataSet)
print(OutputDataSet)',
@input_data_1 = @inquery,
@params = N'@model varbinary(max)',
@model = @lmodel2
WITH RESULT SETS ((Score float));
END
Because the scoring logic is now wrapped in a SQL stored procedure, it can be invoked by any application that talks to SQL Server. Here's how we can run the stored procedure from T-SQL:
DECLARE @query_string nvarchar(max)
SET @query_string='SELECT *
FROM NYCTaxiSmall'
-- Call stored procedure for scoring
EXEC [dbo].[PredictTipBatchMode] @inquery = @query_string;
Of course, the same stored procedure can be invoked directly from R, like this:
# Define the input data using a SQL query; here we simply score the entire NYCTaxiSmall table
input <- "N' SELECT * FROM NYCTaxiSmall'"
# The query is sent to SQL Server to be executed with the input data; a scored data frame is returned
q <- paste("EXEC PredictTipBatchMode @inquery = ", input, sep = "")
scoredData <- sqlQuery(conn, q)
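Since scoredData comes back as a regular data frame, we can inspect the first few predictions directly in R:
head(scoredData) # first few rows of the Score column returned by the stored procedure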