Today we want to tell you a little bit about a project in the utility industry. Our client is an electric power company that has served more than a quarter of a million retail and business customers in the southern United States. The client's Data Science Team works with large data sets such as customer and billing information, electricity meter measurements, geolocation data for the power grid, and historical weather data. The team had a decentralized set of tools to support its day-to-day work of operating and maintaining existing predictive models and developing new ones. Unfortunately, most tasks were performed on laptops, which showed performance issues and could no longer handle the growing demands for data and compute power.
Based on the company's background and discussions with stakeholders, T1A determined that the best solution to the Data Science Team's problems would be the implementation of a self-service data science platform. Data scientists must be able to analyze large amounts of data effectively without exhausting their organization's resources. This is where T1A saw data science platforms providing the most value:
The client and T1A formulated the goals of implementing a Data Science Platform as follows:
Most of the challenges the client faced with the Data Platform were related to the company's lack of experience with similar solutions, such as:
T1A and the client identified a Cloud Data Science Platform, connected to the on-premises systems, as the solution to the problem. T1A used Azure Data Factory to load data into the Azure cloud environment. Data Factory pulls data from on-premises sources through a Self-Hosted Integration Runtime, simplifying data movement to the cloud and providing high throughput while meeting security compliance and encryption requirements. The data was loaded directly into Blob storage, where it was retained for a period defined by the retention policy. The core of the data platform, Databricks, then converted the data into its Delta format, storage optimized for data processing and analytics. This was built as a standard pipeline for data integration with on-premises systems and can be safely reused in other client projects.
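To make the second stage of this pipeline concrete, here is a minimal PySpark sketch of the kind of Databricks notebook code that picks up raw extracts from Blob storage and converts them to Delta tables. The storage account, container, paths, and table names are hypothetical, not the client's actual ones.

```python
# Runs in a Databricks notebook, where `spark` is predefined.
# Storage account, container, and table names are placeholders.
raw_path = "wasbs://landing@examplestorage.blob.core.windows.net/billing/"

# Read a raw CSV extract that Data Factory landed in Blob storage.
raw_df = (
    spark.read
    .option("header", "true")
    .option("inferSchema", "true")
    .csv(raw_path)
)

# Persist it as a Delta table, the platform's optimized storage format.
(
    raw_df.write
    .format("delta")
    .mode("append")
    .saveAsTable("bronze.billing")
)
```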
Diagram 1 — System Integration Architecture
As for reporting and data analytics, the client decided to start using Power BI in house. We connected Power BI to Azure Databricks through an on-premises gateway server, which allowed end users to access Databricks data without a data connection over the internet. All business users also have access to Databricks data using Power BI Desktop.
Diagram 2 — Reporting Integration Architecture
The architecture was built using the following components:
As part of the implementation of the data science platform, data sets from the sources listed below were loaded, and the corresponding ETL processes were designed and developed. All data-loading pipelines work in two steps: first, Azure Data Factory copies the raw data from the on-premises source into Blob storage; second, Databricks ingests the data from Blob storage into Delta tables.
Data Sources that we pull data from:
A compute node of Azure Data Factory that is hosted on-premises. The purpose of the node is to avoid exposing the data sources directly to cloud services; data is transferred in a secure way.
Azure Data Factory is Azure's cloud ETL service for scale-out data integration and data transformation. It offers a code-free UI for intuitive authoring and single-pane-of-glass monitoring and management. In our project, it is the main ETL tool for moving data from on-premises to Azure, and through Azure where needed. During the implementation phase, we developed 30 pipelines, through which we migrated 20 different tables from 5 sources, 2,314,937,042 rows of data in total.
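As a hedged illustration of how such pipelines can also be operated programmatically, the sketch below triggers and monitors one pipeline run with the Azure Python SDK (azure-mgmt-datafactory). The subscription, resource group, factory, and pipeline names are placeholders; in the project itself the pipelines were authored and run through the Data Factory UI.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Placeholder identifiers; the real pipelines were authored in the ADF UI.
adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Kick off one of the copy pipelines.
run = adf.pipelines.create_run(
    resource_group_name="rg-data-platform",
    factory_name="adf-data-platform",
    pipeline_name="copy_billing_to_blob",
)

# Poll the run status: "Queued", "InProgress", "Succeeded", or "Failed".
status = adf.pipeline_runs.get("rg-data-platform", "adf-data-platform", run.run_id)
print(status.status)
```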
A sample ETL process for one of the sources, as set up in Azure Data Factory:
Azure Blob storage is an object storage solution for the cloud, which can be accessed by various tools. In our project, it serves as intermediate storage between the source systems and the data science platform.
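For illustration, here is a small sketch, using the azure-storage-blob package, of inspecting the landing zone where Data Factory drops raw extracts; the connection string, container name, and prefix are placeholders.

```python
from azure.storage.blob import BlobServiceClient

# Placeholder connection string and container name.
service = BlobServiceClient.from_connection_string("<connection-string>")
container = service.get_container_client("landing")

# List the raw extracts landed for one source system.
for blob in container.list_blobs(name_starts_with="billing/"):
    print(blob.name, blob.size)
```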
Databricks is a web-based platform in Azure that can serve all data needs, such as storage, data analysis, and machine learning. It can create insights using SQL, Python, and R; provides active connections to visualization tools like Power BI, QlikView, and Tableau; and can build predictive models using Spark ML. A user-friendly environment, high performance, and low cost are why T1A proposed Databricks for this solution.
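As a sketch of the kind of predictive model the platform supports, the following Spark ML pipeline trains and evaluates a simple classifier. The table, feature columns, and label are hypothetical stand-ins, not the client's actual model.

```python
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Hypothetical Delta table and columns.
df = spark.table("bronze.billing")

# Assemble raw columns into a single feature vector.
assembler = VectorAssembler(
    inputCols=["usage_kwh", "avg_temperature", "days_past_due"],
    outputCol="features",
)
lr = LogisticRegression(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, lr])

train, test = df.randomSplit([0.8, 0.2], seed=42)
model = pipeline.fit(train)

# Report area under the ROC curve on the held-out split.
evaluator = BinaryClassificationEvaluator(labelCol="label")
print(evaluator.evaluate(model.transform(test)))
```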
After T1A completed the development of the ETL processes, all the integrated data sources, which previously required five different tools to access, are now easily accessible in one place in Databricks:
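To illustrate what this unified access enables, the hypothetical Spark SQL query below joins tables that previously lived in separate systems; the table and column names are illustrative only.

```python
# Hypothetical table and column names; each table previously
# required a different tool to access.
df = spark.sql("""
    SELECT c.customer_id,
           b.billing_month,
           b.amount_due,
           m.kwh_consumed
    FROM   bronze.customers      c
    JOIN   bronze.billing        b ON b.customer_id = c.customer_id
    JOIN   bronze.meter_readings m ON m.customer_id = c.customer_id
""")
df.show(5)
```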
As the accuracy chart below shows, the ML model achieved strong results, predicting the outcome with high precision:
MS Power BI was selected as a reporting tool to work with the Data Science Platform.
Report developers can connect to the Data Science Platform in Databricks using a direct connection. Reports published to the Power BI Cloud Service have to use the Power BI Gateway server to pull data from Databricks. Once a report is published, the appropriate configuration has to be made for the published data set.
The on-premises data gateway acts as a bridge to provide quick and secure data transfer between on-premises data (data that isn’t in the cloud) and several Microsoft cloud services.
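For a quick, Power-BI-independent sanity check that the Databricks SQL endpoint behind the reports is reachable, a sketch like the one below can be used with the databricks-sql-connector package; the hostname, HTTP path, token, and table are placeholders.

```python
from databricks import sql

# Placeholder connection details; use the same hostname and HTTP path
# that the Power BI data source is configured with.
with sql.connect(
    server_hostname="adb-1234567890123456.7.azuredatabricks.net",
    http_path="/sql/1.0/warehouses/<warehouse-id>",
    access_token="<personal-access-token>",
) as connection:
    with connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM bronze.billing")
        print(cursor.fetchone())
```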
A visualization of the protected vegetation areas within our customer's service territory:
T1A succeeded in identifying the best tools for the client’s project, such as Azure Data Factory, Blob storage, and Databricks, which enabled the ingestion, storage, processing, and analysis of large data volumes. The implementation of this data science platform has enhanced the accuracy of the client’s machine learning models, facilitated the migration of existing machine learning projects to the new platform, and allowed for the migration of data for other business units.
Through this project, T1A has gained valuable experience that can be applied to other projects. Specifically, we can replicate this success in hybrid data pipelines, integration with on-premises systems, and migration of legacy machine learning solutions to the target Databricks platform.
Our project had its share of challenges, but thanks to the Project Team, we overcame them.
Thanks to the experience gained from this project, we have already successfully implemented another ML model for revenue forecasting.
Working closely with the client’s stakeholders and SMEs, we identified several potential areas where we could be of assistance:
Authors: Akos Monostori, Oleg Mikhov