Today we want to tell you a little bit about a project in the utility industry. Our client is an electric power company that has served more than a quarter of a million retail and business customers in the southern United States. The client's Data Science Team works with large data sets such as customer and billing information, electricity meter measurements, geolocation data for the power grid, and historical weather data. The team had a decentralized set of tools to support its day-to-day work of operating and maintaining existing predictive models and developing new ones. Unfortunately, most tasks were performed on laptops, which showed performance issues and could no longer handle the growing demands for data and compute power.
Based on the company's background and discussions with stakeholders, T1A determined that the best solution to the Data Science Team's problems would be the implementation of a self-service data science platform. Data scientists must be able to analyze large amounts of data effectively without exhausting their organization's resources. This is where T1A saw data science platforms providing the most value:
The client and T1A formulated the goals of implementing a Data Science Platform as follows:
Most of the challenges the client faced with the Data Platform were related to the company's lack of experience with similar solutions, such as:
T1A and the client identified a Cloud Data Science Platform, connected to the on-premises systems, as the solution to the problem. T1A used Azure Data Factory to load data into the Azure cloud environment. Data Factory pulls data from on-premises sources through a Self-Hosted Integration Runtime, simplifying data movement to the cloud and providing high throughput while meeting security compliance and encryption requirements. The data was loaded directly into Blob storage, where it was retained for a period defined by the retention policy. The core of the data platform, Databricks, then converted the data into its Delta format, storage optimized for data processing and analytics. This was built as a standard pipeline for data integration with on-premises systems and can be safely reused in other client projects.
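To make the second stage of this pipeline concrete, here is a minimal PySpark sketch of the kind of Databricks notebook code that picks up raw extracts from Blob storage and converts them to Delta tables. The storage account, container, paths, and table names are hypothetical, not the client's actual ones.

```python
# Runs in a Databricks notebook, where `spark` is predefined.
# Storage account, container, and table names are placeholders.
raw_path = "wasbs://landing@examplestorage.blob.core.windows.net/billing/"

# Read a raw CSV extract that Data Factory landed in Blob storage.
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Persist it as a Delta table, the platform's optimized storage format.
(
    raw_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.billing")
)
```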
Diagram 1 — System Integration Architecture
As for reporting and data analytics, the client decided to start using Power BI in house. We connected Power BI to Azure Databricks through an on-premises gateway server, which allowed end users to access Databricks data without a data connection over the internet. All business users also have access to Databricks data using Power BI Desktop.
Diagram 2 — Reporting Integration Architecture
The architecture was built using the following components:
As part of the implementation of the data science platform, data sets from the sources listed below were loaded, and the corresponding ETL processes were designed and developed. All data-loading pipelines work in two steps: first, Azure Data Factory copies the raw data from the on-premises source into Blob storage; second, Databricks ingests the data from Blob storage into Delta tables.
Data Sources that we pull data from:
A compute node of Azure Data Factory that is hosted on-premises. The purpose of the node is to avoid exposing the data sources directly to cloud services; data is transferred in a secure way.
Azure Data Factory is Azure's cloud ETL service for scale-out data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. In our project, it is the main ETL tool for moving data from on-premises to Azure, and through Azure where needed. During the implementation phase, we developed 30 pipelines, through which we migrated 20 different tables from 5 sources, 2,314,937,042 rows of data in total.
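As a hedged illustration of how such pipelines can also be operated programmatically, the sketch below triggers and monitors one pipeline run with the Azure Python SDK (azure-mgmt-datafactory). The subscription, resource group, factory, and pipeline names are placeholders; in the project itself the pipelines were authored and run through the Data Factory UI.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder identifiers; the real pipelines were authored in the ADF UI.
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off one of the copy pipelines.
run = adf.pipelines.create_run(
    resource_group_name="rg-data-platform",
    factory_name="adf-data-platform",
    pipeline_name="copy_billing_to_blob",
)

# Poll the run status: "Queued", "InProgress", "Succeeded", or "Failed".
status = adf.pipeline_runs.get("rg-data-platform", "adf-data-platform", run.run_id)
print(status.status)
```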
A sample ETL process for one of the sources, as set up in Azure Data Factory:
Azure Blob storage is an object storage solution for the cloud, which can be accessed by various tools. In our project, it serves as intermediate storage between the source systems and the data science platform.
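For illustration, here is a small sketch, using the azure-storage-blob package, of inspecting the landing zone where Data Factory drops raw extracts; the connection string, container name, and prefix are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container name.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("landing")

# List the raw extracts landed for one source system.
for blob in container.list_blobs(name_starts_with="billing/"):
    print(blob.name, blob.size)
```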
Databricks is a web-based platform in Azure that can serve all data needs, such as storage, data analysis, and machine learning. It can create insights using SQL, Python, and R; provides active connections to visualization tools like Power BI, QlikView, and Tableau; and can build predictive models using Spark ML. A user-friendly environment, high performance, and low cost are why T1A proposed Databricks for this solution.
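As a sketch of the kind of predictive model the platform supports, the following Spark ML pipeline trains and evaluates a simple classifier. The table, feature columns, and label are hypothetical stand-ins, not the client's actual model.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Hypothetical Delta table and columns.
df = spark.table("bronze.billing")

# Assemble raw columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["usage_kwh", "avg_temperature", "days_past_due"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# Report area under the ROC curve on the held-out split.
evaluator = BinaryClassificationEvaluator(labelCol="label")
print(evaluator.evaluate(model.transform(test)))
```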
After T1A completed the development of the ETL processes, all the integrated data sources, which previously required five different tools to access, are now easily accessible in one place in Databricks:
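To illustrate what this unified access enables, the hypothetical Spark SQL query below joins tables that previously lived in separate systems; the table and column names are illustrative only.

```python
# Hypothetical table and column names; each table previously
# required a different tool to access.
df = spark.sql("""
    SELECT c.customer_id,
           b.billing_month,
           b.amount_due,
           m.kwh_consumed
    FROM   bronze.customers      c
    JOIN   bronze.billing        b ON b.customer_id = c.customer_id
    JOIN   bronze.meter_readings m ON m.customer_id = c.customer_id
""")
df.show(5)
```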
As the accuracy chart below shows, the ML model achieved strong results, predicting the outcome with high precision:
MS Power BI was selected as a reporting tool to work with the Data Science Platform.
Report developers can connect to the Data Science Platform in Databricks using a direct connection. Reports published to the Power BI Cloud Service have to use the Power BI Gateway server to pull data from Databricks. Once a report is published, the appropriate configuration has to be made for the published data set.
The on-premises data gateway acts as a bridge to provide quick and secure data transfer between on-premises data (data that isn’t in the cloud) and several Microsoft cloud services.
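For a quick, Power-BI-independent sanity check that the Databricks SQL endpoint behind the reports is reachable, a sketch like the one below can be used with the databricks-sql-connector package; the hostname, HTTP path, token, and table are placeholders.

```python
from databricks import sql

# Placeholder connection details; use the same hostname and HTTP path
# that the Power BI data source is configured with.
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM bronze.billing")
        print(cursor.fetchone())
```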
A visualization of the protected vegetation areas within our customer's service territory:
T1A succeeded in identifying the best tools for the client’s project, such as Azure Data Factory, Blob storage, and Databricks, which enabled the ingestion, storage, processing, and analysis of large data volumes. The implementation of this data science platform has enhanced the accuracy of the client’s machine learning models, facilitated the migration of existing machine learning projects to the new platform, and allowed for the migration of data for other business units.
Through this project, T1A has gained valuable experience that can be applied to other projects. Specifically, we can replicate this success in hybrid data pipelines, integration with on-premises systems, and migration of legacy machine learning solutions to the target Databricks platform.
Our project had its share of challenges, but thanks to the Project Team, we overcame them.
Thanks to the experience gained from this project, we have already successfully implemented another ML model for revenue forecasting.
Working closely with the client’s stakeholders and SMEs, we identified several potential areas where we could be of assistance:
Authors: Akos Monostori, Oleg Mikhov