How to Expose Azure Databricks to SAP on the Fly?

Data Platforms:

Paweł Adamczak, Dominik Maj

When dealing with a very large amount of data, like Sales history from a couple of years, it might be beneficial to offload the SAP HANA with Cloud infrastructure for the sake of savings, both in terms of storage and computing resources.

There are many ways to transfer SAP HANA data with the data archiving process. The recommended options are to either utilize SAP ILM Store or SAP IQ database to organize a data offload. Another option is to take advantage of the Azure platform and its cost-effective storage.

Besides the aspect of archiving the SAP data in the dedicated layer, there are additional benefits of having the data in the Azure environment. Azure allows us to perform a comprehensive set of transformations of the data, as well as effectively expose the results back to SAP HANA.

The sample scenario makes use of the Azure BLOB storage and Azure Databricks to calculate Forecasts on a large set of historical data and exposes it in a ready-to-consume format, back to SAP HANA via ODBC.

SAP-posts-3a-1024x302-(Compressify.io)

The input data is stored in a CSV file with over 1 billion rows. Ingesting it into Databricks and storing it as a table results in 11.2 GB of storage space. It’s worth mentioning that Databricks gives flexibility of the supported file format – except for CSV, it can be parquet or many others. However, the CSV data file on Azure BLOB is a good, universal scenario.

The Sales forecast calculations were performed in Databricks using DataFrames and SQL. The source data was a minimal set of columns of Sales history, organized by Customer, Geography, Product, and Date. We wanted to calculate a Forecast for each Customer, Geography, Product, and Day combination for the next year.

Our forecasting method was straightforward, based on the weighted sum of values from the past 5 years. The weights were defined in a way that: the weight is higher for recent years and decreases for previous year.

The weights were incorporated into the data by creating a temporary view. The last step of the process was a simple sum over the years, achieved by grouping by month and days, and the dimension keys.

Calculating the forecasts on 1 billion rows took roughly 25 minutes and resulted in about 66 million rows. The output dataset, stored as a Spark table, was used for the live connection from SAP HANA.

To access the external system from SAP HANA perspective, we needed to create the Remote Source System via Smart Data Access (SDA). The only prerequisite for such integration is to install the ODBC driver and provide the details of Databricks cluster to the SAP HDB server. Once the connection to the remote system is established, we can create a Virtual Table pointing to the Databricks table, further consumed by the Calculation View.

As a result, the SAP BW/4HANA data model can consume data coming from Databricks and all other SAP HANA tables.

SAP-posts-3b-1024x287-(Compressify.io)

Running three different aggregation queries on the dataset, using a small, one-worker cluster, took respectively 23, 89, and 11 seconds. This performance may seem unsatisfactory, as today’s users are accustomed to sub-second query responses. However, our scenario was to achieve the lowest possible infrastructure cost.

The Azure cluster used, with one DS3v2 worker costs $0.69 USD per hour. It totals to $504.80 USD per month. The cluster can be easily turned off when not needed or shared among multiple processes, further lowering the costs per process. We also verified that scaling up the Databricks cluster to some limits improves query performance. Scaling up on the fly can be achieved by using a higher number of worker nodes; but using larger and more powerful workers from the start is also a good option.