Get the most value from your data with data lakehouse architecture



This article was contributed by Gunasekaran S, Director of Data Engineering at Sigmoid.

Over the years, cloud data lake and warehousing architectures have helped enterprises scale their data management efforts while reducing costs. Traditionally, such an architecture extracts enterprise data from operational data repositories and stores it in a raw data lake. A second round of ETL then moves critical subsets of this data into a data warehouse to generate business insights for decision-making (a minimal sketch of this two-hop pattern follows the list below). However, this set-up poses several challenges, such as:

  • Lack of compatibility: Companies often find it difficult to keep their data lake and data warehouse consistent. Keeping the two in sync is not only expensive; teams must also apply continuous data engineering tactics to ETL/ELT data between the two systems. Each step can introduce failures and unwanted errors that degrade overall data quality.
  • Constantly changing datasets: Depending on the data pipeline's schedule and frequency, the data stored in the data warehouse may not be as current as the data in the data lake.
  • Vendor lock-in: Transferring large amounts of data into a centralized EDW is challenging, not only because of the time and resources such a move requires but also because the architecture creates a closed loop that makes vendor lock-in very hard to escape. In addition, the data stored in warehouses is difficult to share with all end users within the organization.
  • Poor maintainability: With both a data lake and a data warehouse, companies need to maintain multiple systems and keep them synchronized, which complicates the overall system and makes it difficult to maintain in the long run.
  • Data governance: While data in the data lake mostly sits in assorted file-based formats, data in the warehouse sits in a database format, which adds complexity in terms of data governance and lineage.
  • Advanced analytics limitations: Machine learning frameworks such as PyTorch and TensorFlow are not fully compatible with data warehouses. These frameworks instead read data from data lakes, where data quality is often not controlled.
  • Copies of data and related costs: Keeping data in both a data lake and a data warehouse leads to extensive data copying and the costs that come with it. Moreover, warehouse data held in proprietary commercial formats increases the cost of moving data out.
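
To make the traditional two-hop flow concrete, here is a minimal PySpark sketch of the pattern described above. The connection strings, lake paths and table names are hypothetical placeholders, not a prescribed setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-hop-etl").getOrCreate()

# Hop 1: extract operational data and land it, as-is, in the raw data lake.
# The JDBC URL, tables and lake paths below are hypothetical placeholders.
raw_orders = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://ops-db:5432/orders")
    .option("dbtable", "public.orders")
    .option("driver", "org.postgresql.Driver")
    .load()
)
raw_orders.write.mode("append").parquet("s3://corp-lake/raw/orders/")

# Hop 2: a second ETL pass moves a curated subset into the warehouse for BI.
# Every extra hop is a place where failures, errors and drift can creep in.
curated = (
    spark.read.parquet("s3://corp-lake/raw/orders/")
    .filter("status = 'COMPLETED'")
    .select("order_id", "customer_id", "amount", "closed_at")
)
(
    curated.write.format("jdbc")
    .option("url", "jdbc:postgresql://edw:5432/warehouse")
    .option("dbtable", "analytics.completed_orders")
    .option("driver", "org.postgresql.Driver")
    .mode("append")
    .save()
)
```

Each hop in this pattern is a separate pipeline to schedule, monitor and keep consistent, which is exactly the overhead the lakehouse aims to eliminate.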

The data lakehouse addresses these typical limitations of data lake and data warehouse architectures, combining the best components of both to deliver significant value for organizations.

Data Lakehouse: A Brief Overview

The data lakehouse is essentially the next generation of cloud data lake and warehousing architecture, combining the best of both worlds. It is an architectural approach that supports all data formats (structured, semi-structured or unstructured) as well as multiple data workloads (data warehouse, BI, AI/ML and streaming). Data lakehouses are underpinned by a new open system architecture that allows data teams to implement data structures and warehouse-style data management features directly on the kind of low-cost storage platforms used for data lakes.

The data lakehouse architecture allows data teams to gain insights quickly, as they can use data without having to access multiple systems. A data lakehouse architecture can also help companies ensure that data teams have the most accurate and up-to-date data at their disposal for mission-critical machine learning, enterprise analytics initiatives and reporting purposes.

Benefits of a data lakehouse

There are many reasons to look at modern data lakehouse architecture to drive sustainable data management practices. The following are some of the key factors that make the data lakehouse an ideal choice for enterprise data storage initiatives:

  • Data quality delivered by simplified schema: The data lakehouse comes with a dual-layered architecture in which a warehouse layer sits atop the data lake and enforces schema, providing data quality and control and enabling fast BI and reporting (see the sketch after this list).
  • Decrease in data drift: The data lakehouse architecture reduces the need for multiple data copies and significantly reduces challenges related to data drift.
  • Fast queries: Fast interactive queries, combined with true data democratization, facilitate more informed decision-making. The architecture allows data scientists, engineers and analysts to quickly access the data they need, resulting in a faster time-to-insight cycle.
  • Effective administration: By implementing a data lakehouse architecture, companies can help their data teams save significant time and effort, because less time and fewer resources are needed to store and process data and deliver business insights. Consolidating data management on a single lakehouse platform can also significantly reduce the administrative burden.
  • Seamless data governance: The data lakehouse serves as a single source, allowing data teams to embed advanced governance features such as audit logging and access control.
  • Effective data access and data security: Data lakehouses give data teams the option of maintaining proper access controls and encryption across the entire pipeline for data integrity. Furthermore, in the lakehouse model, data teams do not need to manage security for every copy of the data, which makes security administration much easier and more cost-effective.
  • Less chance of data redundancy: The lakehouse architecture removes the need for the multiple data copies required when implementing separate data lakes and data warehouses, reducing redundancy and the storage costs that come with it.
  • High scalability: The data lakehouse offers high scalability for both data and metadata, allowing companies to run complex analytics projects with a fast time-to-insight cycle.
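
As an illustration of the schema-enforcement benefit above, here is a minimal sketch using open source Delta Lake with PySpark. It assumes the delta-spark package is installed; the events table and its data are hypothetical:

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("lakehouse-schema-demo")
    # Register Delta Lake with Spark (requires the delta-spark package).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# The "warehouse layer" here is simply a Delta table with an explicit schema,
# stored on the same low-cost storage as the rest of the lake.
spark.sql("""
    CREATE TABLE IF NOT EXISTS events (
        event_id BIGINT,
        event_type STRING,
        ts TIMESTAMP
    ) USING DELTA
""")

# A write that matches the schema is accepted...
spark.sql("INSERT INTO events VALUES (1, 'click', current_timestamp())")

# ...while a write with an incompatible schema is rejected outright,
# instead of silently polluting downstream BI and reports.
bad_rows = spark.createDataFrame([(2, 42)], ["event_id", "event_type"])
try:
    bad_rows.write.format("delta").mode("append").saveAsTable("events")
except Exception as err:
    print("Rejected by schema enforcement:", type(err).__name__)
```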

Emerging data lakehouse patterns

Azure Databricks Lakehouse and Snowflake are two leading lakehouse platforms that companies can leverage for their data management initiatives; the decision to choose one should be based on the needs of the company. Several companies use both, leveraging Databricks for data processing and Snowflake for data warehousing capabilities. Over time, the two platforms have slowly begun to build out the capabilities the other offers, each in a quest to emerge as the platform of choice for multiple workloads.

Now, let’s take a look at these different lakehouse patterns and how they have evolved over time.

Databricks: A data processing engine on data lakes, adding lakehouse capabilities

Databricks is essentially an Apache Spark-powered data processing tool that provides data teams with an agile programming environment and auto-scaling compute capabilities; companies pay only for the computational resources they use. The Databricks platform is best suited for the early stages of a data pipeline, where data needs to be ingested and prepared. Companies can also take advantage of it to transform and enrich data, but it falls short when it comes to serving data for reporting.

Over the past few years, Databricks has focused on building out capabilities associated with traditional data warehouses. The platform comes with a built-in SQL query interface and intuitive visualization features. In addition, Databricks provides database-like table structures, typically stored in the Delta file format, which is used to add database capabilities to the data lake. The format enables ACID transactions, schema enforcement and data versioning.
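
Here is a brief sketch of what that versioning looks like in practice, continuing the hypothetical events table from the earlier example. The time-travel SQL syntax shown requires a recent Delta release (it works out of the box on Databricks):

```python
# Every write to a Delta table is an ACID commit recorded in the
# transaction log, so earlier snapshots remain queryable ("time travel").
spark.sql("UPDATE events SET event_type = 'page_view' WHERE event_type = 'click'")

# Inspect the commit history: one row per version, with the operation
# and timestamp for each.
spark.sql("DESCRIBE HISTORY events").show(truncate=False)

# Query the table as it was at version 0 (the initial commit).
spark.sql("SELECT * FROM events VERSION AS OF 0").show()
```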

The main differentiators of Azure Databricks Lakehouse

  • Comes with a ready-to-use Spark environment that requires no configuration
  • Embeds open source Delta Lake technology as an additional storage layer
  • Delta improves performance by compacting small files within tables (see the sketch after this list)
  • ACID functionality on Delta tables helps ensure data consistency and reliability
  • Offers many language options, including Scala, Python, R, Java and SQL
  • Supports interactive data analysis with notebook-style coding
  • Provides seamless integration with other cloud platform services such as Blob Storage, Azure Data Factory and Azure DevOps
  • Provides open source library support
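
The small-file compaction mentioned in the list above amounts to a single command on a Delta table. A sketch, again assuming the hypothetical events table (OPTIMIZE is available on Databricks and in recent open source Delta releases):

```python
# Compact many small files into fewer, larger ones to speed up reads.
spark.sql("OPTIMIZE events")

# Optionally co-locate related rows so selective queries can skip more files.
spark.sql("OPTIMIZE events ZORDER BY (event_type)")
```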

Snowflake: Cloud Data Warehouse Expands to Add Data Lake Capabilities

Unlike Databricks, Snowflake transformed the data warehousing space a few years ago by offering compute capabilities that are highly scalable and distributed. The platform achieved this by separating storage from processing within the data warehouse ecosystem, and this separation is one of the approaches Snowflake is now using to expand into the data lake space.
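
A minimal sketch of that separation using the Snowflake Python connector: storage lives in Snowflake's managed layer, while compute is provisioned as virtual warehouses that are created and resized independently. All connection parameters and names here are hypothetical:

```python
import snowflake.connector

# Hypothetical credentials; in practice these would come from a secrets manager.
conn = snowflake.connector.connect(
    account="my_account",
    user="etl_user",
    password="***",
    database="ANALYTICS",
    schema="PUBLIC",
)
cur = conn.cursor()

# Compute is a virtual warehouse, sized and scaled independently of the
# data it queries.
cur.execute("""
    CREATE WAREHOUSE IF NOT EXISTS BI_WH
        WAREHOUSE_SIZE = 'XSMALL'
        AUTO_SUSPEND = 60
        AUTO_RESUME = TRUE
""")
# Scale the same warehouse up for a heavy job without touching storage.
cur.execute("ALTER WAREHOUSE BI_WH SET WAREHOUSE_SIZE = 'LARGE'")
```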

Over the years, Snowflake has gradually expanded its ELT capabilities, allowing companies to run their ELT processes within the platform itself. For example, some companies use Snowflake streams and tasks to schedule SQL transformations inside Snowflake (as in the sketch below), while others use “dbt” with Snowflake.
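
Here is a sketch of that streams-and-tasks pattern, reusing the cursor from the previous example; the raw_orders and curated_orders tables are hypothetical:

```python
# A stream records row-level changes on the raw table...
cur.execute("CREATE STREAM IF NOT EXISTS raw_orders_stream ON TABLE raw_orders")

# ...and a task periodically consumes those changes, keeping the whole
# ELT loop inside Snowflake itself.
cur.execute("""
    CREATE TASK IF NOT EXISTS load_curated_orders
        WAREHOUSE = BI_WH
        SCHEDULE = '5 MINUTE'
        WHEN SYSTEM$STREAM_HAS_DATA('RAW_ORDERS_STREAM')
    AS
        INSERT INTO curated_orders (order_id, customer_id, amount)
        SELECT order_id, customer_id, amount
        FROM raw_orders_stream
""")

# Tasks are created in a suspended state; resume to start the schedule.
cur.execute("ALTER TASK load_curated_orders RESUME")
```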

The main differentiators of the Snowflake data lakehouse

  • Comes with built-in export and query tools
  • Integrates seamlessly with BI tools such as Metabase, Tableau, Power BI and more
  • Supports JSON for querying and outputting data (see the sketch after this list)
  • Provides secure and compressed storage options for semi-structured data
  • Connects easily to object storage such as Amazon S3
  • Comes with granular security controls to deliver maximum data integrity
  • Imposes no significant limit on query size
  • Offers a standard SQL dialect and a strong function library
  • Comes with virtual warehouses that allow data teams to segregate and classify workloads as needed
  • Promotes secure data sharing and easy integration with other cloud technologies
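
As an illustration of the JSON support noted in the list above, a short sketch reusing the connection from the earlier examples; the table and payload are hypothetical. JSON lands in a VARIANT column and is queried with path expressions plus an explicit cast:

```python
cur.execute("CREATE TABLE IF NOT EXISTS events_json (payload VARIANT)")

# PARSE_JSON must appear in a SELECT rather than a VALUES clause when
# inserting into a VARIANT column.
cur.execute("""
    INSERT INTO events_json
    SELECT PARSE_JSON('{"actor": {"id": 42}, "type": "click"}')
""")

# Path expressions drill into the document; '::' casts to a SQL type.
cur.execute("""
    SELECT payload:actor.id::INT  AS actor_id,
           payload:type::STRING   AS event_type
    FROM events_json
""")
print(cur.fetchall())  # e.g. [(42, 'click')]
```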

Dremio and Firebolt – SQL lakehouse engines on the data lake

In addition to Snowflake and Databricks, lakehouse tools like Dremio and Firebolt are also arriving with advanced query capabilities. For example, Dremio’s SQL lakehouse platform can deliver high-performance dashboards and interactive analytics directly on any data lake storage, eliminating the need for a data warehouse. Likewise, Firebolt comes with advanced indexing capabilities that allow data teams to narrow data access down to ranges even smaller than partitions.

An evolution of the cloud data lake and warehouse

The data lakehouse is an evolution of cloud data lake and warehousing architectures that lets data teams take advantage of the best of both worlds while overcoming their historical data management weaknesses. Done correctly, a data lakehouse initiative can unlock data and enable a company to use it the way it wants, at the speed it wants.

Going forward, as cloud data warehouse and data lake architectures converge, companies may soon find vendors that combine the capabilities of all of these data lakehouse tools, opening up endless opportunities for building and managing data pipelines.

Gunasekaran S is the Director of Data Engineering at Sigmoid.

