Data Platform: The new generation of Data Lakes
This article, written in collaboration with Albert Palau, is a follow-up to Data Platform as a Service. It describes the high-level architecture and goes into detail on the Data Lake. We will cover the remaining blocks and components in upcoming articles.
The important thing about the architecture is not the vendor or specific product but the capabilities of the components we used. In the end, the product choice depends on many factors:
- The team's existing knowledge.
- Whether the product is available on the cloud we are using.
- The ability to integrate with our existing products.
- The cost.
In this case, we have designed a solution mainly based on Azure services. At the same time, we have designed an architecture that would allow us to integrate or migrate to other cloud services in an agile way.
Having an agile cloud data platform and avoiding vendor lock-in depends on:
- Using open-source technology for the core of the platform, which allows us to move the platform to another cloud provider.
- Providing a data hub service for streaming and batch data.
- Automating data pipelines so that we can easily move our data to different data repositories.
- Decoupling the data services layer from the data persistence engine.
Of course, we can use a specific vendor product that provides added value (BigQuery, Redshift, Snowflake, ...), but we should always have a plan to replace it with another technology in an agile way.
Data Model (Domain Driven Design)
A data platform requires a global data model definition. In our opinion, it is better to have a bad model than no model at all. This is one of the first decisions (and the most important one), but we will not consider it a blocker as long as we can work on the data platform design. It is a strategic decision that should involve the whole company. On top of that, it is a long-term decision: making changes on the technological side is easy, but changing a global data model takes years and a lot of effort.
We are believers in Domain-Driven Design (DDD) and have been working with the DDD approach for many years. In our opinion, IT companies have an advantage because the whole company is involved, from business people to developers, and this is key to the success of DDD. In other types of companies, it takes more effort to get everyone involved.
Something important about data domains:
- There are two views of data domains: producer and consumer. Usually these domains differ, because consumer domains are compositions of data from several producer domains.
- Specific data can have a main domain and a secondary domain.
- The data domain organization is not static: domains change, merge, evolve, or are removed.
In terms of data domains, the best approach for us is a bottom-up design: start from the producer data domains, which form the base on which the data products will be built. These data products are built by the consumers themselves, so our data platform has to provide them with all the necessary tools, services, support, standardized processes, and integrations.
The sales domain is a very common, and very complex, example of a consumer data domain. What is a sale? In a big company with multichannel orders (eCommerce, social media, physical stores...), the concept of a sale differs slightly between channels and departments, and it is composed of data from several domains.
Is the sales data product the same for the eCommerce department as for the finance department? It depends on many factors, but they are probably different data products, because each team requires different data, data validation processes, and indicators.
This is a simple description because this topic needs a specific article. To have a global view of this approach in a new data platform, we would recommend reading about the Data Mesh paradigm.
Data Ingestion Pattern
The most valuable resource of a new Data Platform is the data, which is at the same time the most complex thing to provide. There are two approaches to uploading data:
- Pull: A core team with centralized management develops the data pipelines that ingest the data into the platform. In the beginning this works very well because there are no dependencies on other teams, but in the end, as we explained in the previous article, it becomes a bottleneck.
- Push: This is the best approach in terms of operation, architecture, and paradigm, but it depends on other teams. For example, suppose the Distribution team needs to analyze the sales data. Having the sales data requires the Sales team to push it to the data platform, and we could be waiting for that data for a long time because the Sales team has no time to do it or it is not their priority.
Following the "Push" approach is a good decision in terms of operation, architecture, and paradigm, but depending on the reality of the enterprise architecture, we must also provide "Pull" capabilities: many companies have a lot of legacy systems, or teams not ready to push their data.
In our opinion, the best way to provide a "pull" service is to develop an automated data ingestion engine service.
What is a data ingestion engine service?
This is a self-service platform that allows teams to create ETL and streaming processes without code; it only requires SQL statements and mappings.
The goal is to provide several flavors to cover all the cases:
- Allow teams to push their data to the exchange area on their own.
- Provide a core, centralized team that uploads data on behalf of non-technical teams.
- Provide a self-service platform that simplifies the data ingestion for technical teams.
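As a sketch of the idea behind such an engine: a pipeline is declared as data (a SQL statement plus column mappings), and a generic runner applies it. All names here are hypothetical and only illustrate the no-code, mapping-driven approach:

```python
# Minimal sketch of a mapping-driven ingestion step (hypothetical names).
# A pipeline is declared as data: a source query plus column mappings,
# so teams describe *what* to ingest, not *how*.

PIPELINE = {
    "source_query": "SELECT id, total, country FROM sales",  # SQL the team provides
    "mappings": {"id": "sale_id", "total": "amount_eur", "country": "country_code"},
}

def apply_mappings(rows, mappings):
    """Rename source columns to the target model defined in the pipeline config."""
    return [{mappings[k]: v for k, v in row.items() if k in mappings} for row in rows]

# In a real engine the rows would come from executing `source_query`;
# here we fake the query result to show the transformation step.
rows = [{"id": 1, "total": 9.99, "country": "ES"}]
print(apply_mappings(rows, PIPELINE["mappings"]))
# → [{'sale_id': 1, 'amount_eur': 9.99, 'country_code': 'ES'}]
```

The engine would wrap steps like this with scheduling, data quality checks, and metadata registration, so each new pipeline is just a new declaration, not new code.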
If we follow the same approach for all types of data ingestion pipelines, we will have a catalog of automated connectors that allow the teams to push their data:
- Change Data Capture.
- Images, files, etc...
The main idea here is to work from a non-desired "pull strategy" towards a "push model" by building the common components that the product owners will use to push the data in the future. This will allow us to achieve an automated ingestion layer.
We have to provide all the tools and standard processes (ingestion, data quality, etc.) to allow the producers to automatically push their data into the data platform. This self-service can be a web portal or a GitOps solution.
The following articles will describe how to develop an ingestion engine in detail.
Microservice Architecture: Push
Event-driven microservice architectures are one of the best scenarios in which to apply a streaming-based "Push" strategy. These architectures follow a publish-subscribe communication pattern, usually based on a durable messaging system such as Apache Kafka.
This pattern provides a scalable and loosely coupled architecture:
- The publisher sends a message once to the topic.
- This message is received by all the subscribers signed up to the topic: the event is produced once and consumed many times.
- Publishers and subscribers operate independently of each other; there are no dependencies between them.
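The three properties above can be sketched with a minimal in-memory broker. This is a toy stand-in for a durable system such as Kafka, which additionally provides persistence, partitions, and consumer offsets:

```python
# Minimal in-memory sketch of the publish-subscribe pattern: the publisher
# sends a message once, and every subscriber of the topic receives it.

from collections import defaultdict

class Broker:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._subscribers[topic].append(handler)

    def publish(self, topic, event):
        # Produced once, consumed by every subscriber independently.
        for handler in self._subscribers[topic]:
            handler(event)

broker = Broker()
received = []
# Two independent consumers of the same topic, e.g. the ingestion
# connector of the Data Platform and an analytics service.
broker.subscribe("sales", lambda e: received.append(("ingestion", e)))
broker.subscribe("sales", lambda e: received.append(("analytics", e)))
broker.publish("sales", {"sale_id": 1, "amount_eur": 9.99})
```

Neither consumer knows about the other, which is exactly what lets us plug a standard ingestion connector into an existing event-driven architecture without touching the producers.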
We can provide a standard ingestion connector to subscribe to these topics and ingest the events in near real-time to our Data Platform. These architectures have challenges in terms of informational scope that are not usually covered:
- These durable topics usually have restrictions based on time or sizing. In the case of error scenarios, reprocessing is more complex.
- The process to resend historical data.
- Async data quality API for massive scenarios.
Data Lake
A Data Lake is a natural choice for analytics, machine learning environments, and raw data storage. It is a repository of data, usually based on object storage, that allows us to store:
- Structured data from relational databases.
- Semi-structured data from NoSQL or other sources (CSV, XML, JSON).
- Unstructured data and binary data (documents, video, images).
Currently, cloud storage services have improved a lot and offer different qualities of service that allow us:
- To have high performance and low latencies for hot data.
- To have a large storage capacity and higher latencies at a low cost, for cold and warm data.
As cloud object storage, we've chosen Azure Data Lake Storage Gen2. This object storage provides a few interesting features:
- Volume: It can manage massive amounts of data, petabytes of information, and gigabits of throughput.
- Performance: Optimized for analytical use cases.
- Security: It allows POSIX-style permissions on directories or individual files, and it can be mounted to DBFS using a service principal and OAuth 2.0.
- Events: It provides a service to automatically generate an event for each operation performed, such as creating and deleting files. These events allow designing event-driven data processes.
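As an illustration of how such storage events enable event-driven data processes, here is a hedged sketch of a handler reacting to file operations; the event field names are illustrative assumptions, not the real Azure event schema:

```python
# Sketch of an event-driven data process: the storage layer emits one event
# per operation (file created, file deleted) and a handler reacts to it.
# Event fields and handler names are hypothetical.

def on_storage_event(event, ingest, cleanup):
    if event["type"] == "BlobCreated" and event["path"].endswith(".csv"):
        ingest(event["path"])      # e.g. trigger the ingestion engine
    elif event["type"] == "BlobDeleted":
        cleanup(event["path"])     # e.g. invalidate metadata for the file

ingested, cleaned = [], []
on_storage_event(
    {"type": "BlobCreated", "path": "/landing/sales/2021-01.csv"},
    ingested.append,
    cleaned.append,
)
```

In a real deployment the events would arrive through an eventing service and the handlers would be data pipelines, but the pattern is the same: processes start when data arrives, not on a fixed schedule.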
We have to make several decisions, aligned with users and use cases:
- Provide read-only access to the data. This allows the data lake to be the source of truth and a single repository of data for all users.
- Structured and semi-structured data are stored in a columnar format. There are many storage options, but a very good choice is Delta Lake, which we describe in a later section.
- The data is stored partitioned by business data domain and distributed across several object storage accounts.
- Provide a Hive Metastore service to give Spark-SQL access through external tables. This provides a single image of the data and abstracts users from its physical location.
Nowadays we can use an external, open-source Hive Metastore managed by ourselves instead of a vendor-managed service with integration restrictions. This gives us the freedom to integrate any Spark environment, regardless of who provides it (Databricks, Cloudera, etc.).
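Conceptually, what the Hive Metastore gives us is a catalog that maps logical table names to physical object-storage locations and schemas, so consumers never deal with paths directly. A toy sketch of that abstraction, with illustrative names and paths:

```python
# Toy sketch of the metastore abstraction: logical table names resolve to
# physical object-storage locations. Table names, paths, and schema fields
# are illustrative only.

class ToyMetastore:
    def __init__(self):
        self._tables = {}

    def register_external_table(self, name, location, schema):
        """The ingestion engine registers metadata when it lands new data."""
        self._tables[name] = {"location": location, "schema": schema}

    def resolve(self, name):
        """A query engine calls this to find the files backing a table."""
        return self._tables[name]["location"]

ms = ToyMetastore()
ms.register_external_table(
    "sales.orders",
    "abfss://datalake@account.dfs.core.windows.net/sales/orders/",
    {"sale_id": "bigint", "amount_eur": "double"},
)
print(ms.resolve("sales.orders"))
```

Because every Spark cluster resolves tables through the same catalog, we can move the underlying files, or swap the Spark vendor, without breaking consumers.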
Spark-SQL & Hive Metastore
Spark SQL provides us with a distributed query engine to consume our structured and semi-structured data in an optimized way, using the Hive Metastore as a data catalog. Using SQL, we can query data from:
- DataFrame and Dataset API.
- External tools like Databricks Notebooks. This is a user-friendly tool that facilitates data consumption by non-technical users.
Data Lake as a service
If we put all the pieces that we have described until now together, we can design and build a Data Lake platform:
- The data ingestion engine is responsible for ingesting the data and for creating and managing the metadata in the Hive Metastore.
- The core of the Data Lake is composed of the object storage layer and the Hive Metastore. These are the two main components that allow us to provide the compute layer as a service.
- The compute layer is composed of several Spark clusters integrated into the Data Lake. These clusters give access to the data through Spark jobs, SQL Analytics, or Databricks Notebooks.
In our opinion, the ability to provide this dynamic and scalable compute/service layer is what allows us to offer a Data Lake platform as a service; otherwise we would have something closer to a traditional on-premises Data Lake. We can create a permanent, 24x7 cluster, or ephemeral clusters to run our jobs. A Spark cluster is the core of this compute and service layer: it is the lowest common denominator of the service catalog that we will offer in our Data Lake platform.
For example, suppose we want to provide sandbox analytics services for our data product teams. We will create an isolated compute environment for each team, but all of them will access the same data. To provide these features as a service we need to:
- Define the components that compose an analytics sandbox, based on Spark technology.
- Provide self-service capabilities through a web service catalog or as code (GitOps).
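A hypothetical sketch of what "sandbox as code" could look like: a team commits a spec to Git, and a provisioning job validates it before creating the isolated Spark environment. All field names are illustrative assumptions, not a real provisioning API:

```python
# Hypothetical GitOps-style sandbox declaration: the team describes the
# environment it needs, and a provisioning job validates the spec before
# creating anything. All field names are illustrative.

SANDBOX_SPEC = {
    "team": "distribution-analytics",
    "cluster": {"workers": 2, "node_type": "Standard_DS3_v2", "ephemeral": True},
    "data_access": ["sales.orders", "logistics.shipments"],  # read-only tables
}

REQUIRED_KEYS = {"team", "cluster", "data_access"}

def validate_spec(spec):
    """Reject malformed specs before any infrastructure is created."""
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    if spec["cluster"]["workers"] < 1:
        raise ValueError("a sandbox needs at least one worker")
    return True

assert validate_spec(SANDBOX_SPEC)
```

The key design point is that every sandbox gets its own compute but shares the same read-only data, so experiments are isolated without duplicating the lake.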
Of course, this is a very simplified view, because we have not yet covered security, high availability, or data quality services.
What does Delta Lake provide?
It is an open-source storage layer that provides ACID capabilities and ensures that readers never see inconsistent data. Data pipelines can refresh the data without any impact on running Spark processes.
There are other important features such as:
- Schema-on-write: It enforces a schema check when writing data; the job fails when a schema mismatch is detected.
- Schema evolution: It supports compatible schema changes, such as adding a new column.
- Time travel: Data versioning is an important feature in machine learning use cases, as it allows managing data like code. As in a code repository, each change to a dataset generates a new version, so users have the full version history of the data over its entire life cycle.
- Merge: It supports merge, update, and delete operations to enable complex ingestion scenarios.
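The behaviours listed above can be illustrated with a toy in-memory table. Real Delta Lake implements them over Parquet files with a transaction log, so this is only a conceptual sketch:

```python
# Toy table illustrating schema-on-write, time travel, and merge.
# Real Delta Lake does this on Parquet files with a transaction log.

class ToyDeltaTable:
    def __init__(self, schema):
        self.schema = set(schema)
        self.versions = [[]]                 # version 0 is the empty table

    def _check_schema(self, rows):
        for row in rows:
            if set(row) != self.schema:      # schema-on-write: fail on mismatch
                raise ValueError(f"schema mismatch: {sorted(row)}")

    def append(self, rows):
        self._check_schema(rows)
        self.versions.append(self.versions[-1] + rows)   # each write = new version

    def merge(self, rows, key):
        """Simplified MERGE: update matching keys, insert the rest."""
        self._check_schema(rows)
        current = {r[key]: r for r in self.versions[-1]}
        current.update({r[key]: r for r in rows})
        self.versions.append(list(current.values()))

    def as_of(self, version):
        return self.versions[version]        # time travel: read any old version

t = ToyDeltaTable({"sale_id", "amount_eur"})
t.append([{"sale_id": 1, "amount_eur": 9.99}])
t.merge([{"sale_id": 1, "amount_eur": 12.50}], key="sale_id")
print(t.as_of(1))  # the state before the merge is still readable
```

For machine learning this matters because a training run can be pinned to a specific data version and reproduced later, even after the table has been updated.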
The evolution of the Data Lake
A few years ago the differences between a traditional Data Lake, a Data Warehouse, and a Data Hub were both conceptual and technical.
Data Lake technology, based on Hadoop, Spark, Parquet, Hive, etc., had a lot of limitations. Now Delta Lake and other options such as Apache Hudi add amazing new features to the Data Lake ecosystem. These features, combined with decoupled compute and storage layers, serverless offerings, and other novelties such as SQL Analytics or the Delta Engine from Databricks, are unfolding a new generation of Data Lake platforms.
Databricks calls this new generation the Lakehouse. In our opinion, the maturity of this generation now allows Data Lakes to take on new roles, such as:
- Data Warehouse for specific, reduced scenarios.
The warehouse capabilities are improving a lot, but at this time they still require a high level of technical knowledge to distribute the data and achieve the same performance as products like Snowflake, BigQuery, or Oracle Autonomous Data Warehouse.
In combination with an event hub such as Kafka, the new generation of Data Lakes is a good choice as the core of our data platform. It is a mature technology, mainly based on open source, with very competitive cost-performance, in constant evolution, and available from every cloud vendor.
We hope you had a good time reading! In the next article, we will talk about analytical sandboxes.