Using Azure Machine Learning with an on-premises Database
With Azure Machine Learning (AzureML) you have access to a cloud based, flexible and friendly method to perform machine learning tasks on your data. One disadvantage I frequently run into is that cloud based approach of AzureML since the data you are building your machine learning models on has to be in the cloud as well. Even though AzureML offers a variety of ways to access your data, from CSVs to Azure Blob Storage and Azure SQL Database, having to store data in the cloud is one of the major drawbacks me, and some of my clients, run into.
But there is good news! The Microsoft Data Management Gateway! The Data Management Gateway acts like a bridge between AzureML and your on-premises SQL Server databases allowing you to import data directly from a local database!
Let’s take a look at how you can use the Data Management Gateway to build a machine learning experiment using on-premises data inside AzureML!
Local SQL Server Configuration
Since we plan on using data stored inside SQL Server on-premises, I used the AdventureWorksDW database as the database we are importing data from inside AzureML. You can download the AdventureWorksDW database here: https://msftdbprodsamples.codeplex.com/.
The way we will import data inside AzureML through the Data Management Gateway is based on a query. For this example I went with the query below that returns information from the AdventureWorksDW database about characteristics of people who bought bikes from the AdventureWorks company.
SELECT CustomerKey, GeographyKey, CustomerAlternateKey, MaritalStatus, Gender, CAST (YearlyIncome AS INT) AS SalaryYear, TotalChildren, NumberChildrenAtHome, EnglishEducation, EnglishOccupation, HouseOwnerFlag, NumberCarsOwned, CommuteDistance, Region, Age, BikeBuyer FROM vTargetMail
Running the query above on my local SQL Server databases returns the following results and returns a total of 18.484 rows:
Setting up the Data Management Gateway
The next step is to configure the Data Management Gateway so we can access the results of the query below inside AzureML. The Data Management Gateway configuration is stored inside AzureML Studio, underneath the “Settings” page.
To create a new Data Gateway simply click the “New Data Gateway” button at the bottom and a pop-up should appear (make sure you allow pop-ups from the AzureML Studio page else some of the configuration pages might get blocked).
Here we can specify a name for the Data Gateway and provide a description. In this case I went for “AzureML_On-Premises” as the Gateway name and clicked the next page button.
This configuration page is actually the most important one as it shows the Gateway Registration Key we need to tie the Data Gateway in AzureML together with the local Data Management Gateway. In the screenshot above I intentionally removed the registration key.
From this page we can directly download the Data Management Gateway software. We need to install the Data Management Gateway on an on-premises machine. This does not necessarily have to be the same machine we are importing our data from, but for this article I installed it on the same machine where my SQL Server Instance resides.
You can click the link in the Data Gateway page in AzureML or use the following link to download the Data Management Gateway software: http://go.microsoft.com/fwlink/?LinkId=271260.
The installation of the Data Management Gateway is a pretty straightforward next, next, finish.
After installation finishes you will be presented the Microsoft Data Management Gateway Configuration Manager page:
This is where we enter the gateway key we received when adding the new Data Gateway inside AzureML. Copy/paste the code from AzureML inside the input box and click “Register”.
If all went well the Data Management Gateway will be linked to the Data Gateway in AzureML:
If we return to the Data Gateway page inside AzureML we should see an updated page as well, indicating that the connection between the Data Gateway and the on-premises Data Management Gateway has been registered.
We can now click the “V” button to save and close the configuration of the Data Gateway and our Data Gateway should be added:
Now that we have our gateway configured and connected we can access our on-premises database directly from AzureML!
Importing on-premises data from AzureML
Let’s start by creating a new experiment in AzureML Studio. The way we access our on-premises databases is through the Import Data module. Simply drag the Import Data module into the AzureML experiment canvas, the properties of the Import Data module will appear on the right.
To make use of the gateway we created we will have to configure some setting of the Import Data module.
First of all we will have to change the Data source to “On-Premises SQL Data” as show in the image below.
The next step is to select the Data Gateway we configured in the previous steps:
After that we need to provide the name of the SQL Server our database resides on, as well as providing the name of the database:
Then we need to provide the credentials AzureML will use when connecting to your on-premises databases. Simply click the “Enter values” button and the Credential Manager application will start.
In this case I selected “SQL Server Authentication” and provided a username and password of a SQL Server user I created that has read permissions on the AdventureWorksDW databases.
Click “OK” to save this information inside the AzureML experiment.
The final step is to provide a query that will be used to retrieve the data from your on-premises database. For this example I used the query at the top of this article.
In this case I left the “Use cached results” checkbox unchecked. This means that every time I run the experiment AzureML will access my on-premises database to retrieve the data. If you do check the checkbox the results are cached inside AzureML.
Right now the properties of the Import Data module should look something like this:
Basically we are all done now with the configuration, save and run the experiment!
After the experiment finished running we are able to see what data AzureML has imported, click on the Import Data module and select the “Visualize” option.
AzureML will return a result set of the data it has imported into the AzureML workspace:
As you can see from the figure above, all of the information I showed you at the top of this article by running the query locally is there and a total of 18.484 rows is returned, just like the locally executed query!
Now that we have our on-premises data in AzureML we can continue with using the data to build a machine learning experiment! And that’s basically how easy it is to configure AzureML to use data inside an on-premises SQL Server database.
The steps above show the quick and easy method you can use to access on-premises database data inside AzureML. There are some optional settings for the Data Management Gateway you might want to consider changing if you plan on using this for your production data. For instance, by default a self-signed certificate is used to encrypt the communication between the gateways. For a more secure solution you might want to change it to a valid certificate. You can also choose to configure a proxy for the communication to AzureML if you want.
All of these settings can be configured inside the Microsoft Data Management Gateway Configuration Manager on the machine where you installed the Data Management Gateway.
Having the option to use data stored in on-premises databases inside AzureML is a very nice addition to AzureML for those people who do not have everything stored inside the cloud yet. It is easy and quick to configure and allows you to use the power of the cloud-based AzureML without having to move all the data to the cloud first!