Hadoop Data Lakes and Traditional EDW Go Head to Head

The traditional warehouse approach to data management uses standard extract, transform, load (ETL) tools to ingest data from different sources and stage it in traditional relational databases. After staging, analysts use ETL to massage and prepare the data for reports to business users.

The Enterprise Data Warehouse and ETL

This traditional data warehouse approach involves writing and maintaining ETL jobs to ingest every source of data. Any metadata change in a source, such as a renamed or added column, can break the ETL workflow and require manual updates.

Data is stored in traditional relational staging databases, where storage is expensive and slow. Business rules are built into the ETL workflows, so any change to the logic means the workflows must be updated manually again.

This traditional approach establishes an inflexible schema that is enforced on end users. Updates to data stores are manual and require repeated, rather than incremental, effort.
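
To make the fragility concrete, here is a minimal Python sketch of a fixed-schema load step. The column names and the `load_row_strict` function are hypothetical and do not come from any particular ETL tool. The sketch only illustrates how a renamed source column stops a rigid workflow.

```python
# Hypothetical fixed layout of a staging table (an assumption for illustration).
EXPECTED_COLUMNS = ["customer_id", "name", "signup_date"]


def load_row_strict(row: dict) -> tuple:
    """Map one source row onto the fixed staging layout, or fail loudly."""
    missing = sorted(set(EXPECTED_COLUMNS) - set(row))
    unexpected = sorted(set(row) - set(EXPECTED_COLUMNS))
    if missing or unexpected:
        # Any metadata change in the source lands here and needs a manual fix.
        raise ValueError(f"schema drift: missing={missing}, unexpected={unexpected}")
    return tuple(row[col] for col in EXPECTED_COLUMNS)


old_row = {"customer_id": 1, "name": "Ada", "signup_date": "2016-01-15"}
# The source system renamed "name" to "full_name".
new_row = {"customer_id": 2, "full_name": "Grace", "signup_date": "2016-02-01"}

print(load_row_strict(old_row))
try:
    load_row_strict(new_row)
except ValueError as err:
    print("ETL job failed:", err)
```

Until someone edits `EXPECTED_COLUMNS` by hand, every row from the changed source is rejected.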

[Figure: data lake architecture]

The traditional database approach involves gathering data from all sources, transforming and restructuring it, and loading it into dimensional models in the data store. Data stores are specific to a subject area and contain conformed dimensions across subject areas. Integration of these data stores leads to the EDW.
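
As a rough illustration of the restructuring step, the Python sketch below splits flat source records into a customer dimension and a fact table. The record fields and the `build_star_schema` helper are assumptions made for this example, and a real dimensional model would carry many more attributes.

```python
def build_star_schema(records):
    """Split flat records into a customer dimension and a sales fact table."""
    customer_dim = {}  # natural key -> surrogate key
    facts = []
    for rec in records:
        # Assign the next surrogate key the first time a customer is seen.
        key = customer_dim.setdefault(rec["customer"], len(customer_dim) + 1)
        facts.append({"customer_key": key, "amount": rec["amount"]})
    return customer_dim, facts


# Hypothetical flat records gathered from a source system.
source_records = [
    {"customer": "acme", "amount": 100.0},
    {"customer": "globex", "amount": 50.0},
    {"customer": "acme", "amount": 25.0},
]

customer_dim, sales_facts = build_star_schema(source_records)
print(customer_dim)
print(sales_facts)
```

Every new source has to be mapped into this structure before anyone can query it, which is where much of the up-front effort goes.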

Hadoop Data Lake Approach

In simple terms, data lakes are data repositories. The term "lake" conveys that data simply sits in the repository. It is well described but not held in an enforced structure. The term is almost always used in the context of Hadoop.

Hadoop data lakes offer four important advantages over traditional data warehouse methods:

  • High-speed, high-volume processing. The ability to drop arbitrary data into a lake lets analysts handle large volumes of structured and unstructured data from any source. Because data structures are not enforced, data sets can be added to the lake easily, without transformations at load time. This can make ingesting and processing data several orders of magnitude faster than with traditional data processing systems.
  • Lower costs, faster implementation. Another significant difference is that Hadoop technology uses standard, off-the-shelf hardware. This reduces the cost of ownership and the time needed to order equipment. The ability to use commodity hardware relies on the concept of MapReduce. In this approach, work is spread across several nodes, and the software orchestrates load distribution and re-maps tasks. Because data is loaded without enforcing a structure, it’s much easier to explore and analyze the data, and users can directly access the data elements they need for analysis.
  • Lower compute and memory requirements. A data lake is akin to the staging area in a traditional data warehouse. However, data in a staging area is transient. Large volumes of data cannot be staged for extended periods of time without significant increases in compute and memory resources.

In a lake, these limitations do not apply. Commodity hardware is much cheaper, and Hadoop data lake architecture enables distributed computing. As a result, the staged data persists and isn’t volatile.

  • Flexible data governance. Master data definitions and management can be applied to data that’s used for analysis. This approach eliminates an over-engineered metadata layer and updates only the data that’s relevant to an analysis.
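
The MapReduce concept mentioned in the list above can be sketched in a few lines of plain Python. This is a single-process illustration only, not Hadoop's API: the function names are made up for the example, and in a real cluster each chunk would be mapped on a different node while the framework handles the shuffle and any task re-mapping.

```python
from collections import defaultdict


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of input."""
    for word in chunk.split():
        yield word, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return sum(values)


def run_job(chunks):
    pairs = [pair for chunk in chunks for pair in map_phase(chunk)]
    grouped = shuffle(pairs)
    return {key: reduce_phase(key, values) for key, values in grouped.items()}


# Two chunks stand in for blocks of a file stored on two different nodes.
word_counts = run_job(["lake data lake", "data warehouse"])
print(word_counts)  # {'lake': 2, 'data': 2, 'warehouse': 1}
```

Because each map call looks at only its own chunk, adding more inexpensive nodes lets more chunks be processed at once.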

In traditional data warehouse methods, data ingestion is time-consuming. It can take months just to review the results of data analysis. With data lakes, business users can respond to monitored events in near real time. More agile analysis enables faster provisioning of the right resources to the right users at the right time.

This speed is possible because Hadoop can accumulate data without practical disk space limitations, which removes the need to decide up front which data elements should be brought over.
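
One way to picture this is to store everything raw and choose the fields only when reading, as in the small Python sketch below. The sample events and the `read_with_schema` helper are hypothetical. In a real lake, the raw lines would live in files on distributed storage rather than in a list.

```python
import json

# Raw events land in the lake as-is; nothing is dropped or reshaped at load time.
RAW_EVENTS = [
    '{"user": "u1", "action": "click", "page": "/home"}',
    '{"user": "u2", "action": "purchase", "amount": 19.99}',
    '{"sensor": "t-7", "temp_c": 21.5}',
]


def read_with_schema(raw_lines, fields):
    """Apply a schema at read time: keep records that have all requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        if all(field in record for field in fields):
            yield {field: record[field] for field in fields}


# Two analyses pick different fields from the same raw data.
user_actions = list(read_with_schema(RAW_EVENTS, ["user", "action"]))
sensor_readings = list(read_with_schema(RAW_EVENTS, ["sensor", "temp_c"]))
print(user_actions)
print(sensor_readings)
```

Neither analysis had to be anticipated when the data was loaded.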

Data Lake Advantages

Data lakes provide important advantages:

  1. You can collect and use any type of data. A data lake contains all types of data (raw and processed) from any source. You can also store and process data over extended periods of time.
  2. You can refine and use data wherever you are. Data lakes enable business users across multiple business units to refine, explore and enrich data however they wish.
  3. There are many ways to access data. A data lake gives users access via batch, interactive, online, search, in-memory and other processing methods.
  4. They provide a modern data infrastructure. As data stores continue to grow, enterprise Hadoop investments can provide a framework for an efficient, modern enterprise data lake architecture.

Data lake benefits include:

  • Standardization. The global open source community promotes rapid growth in the use of Hadoop data lakes. Data lake implementations bring a high level of definition and best practices that make data management easier and less expensive.
  • More efficient data processing. Data lakes store all data types from any source. Users can push or pull data directly from its source into the lake without time-consuming manipulation or transformations. In addition, commodity hardware enables higher-volume data processing and limits storage costs.
  • Faster analysis and insight. When analysts use data lakes, they can source data quickly. Because there’s no need to apply a schema up front, they can get accurate results without knowing the data’s original structure. As new data sources are added to the environment, they can be loaded into the data lake, and data can be updated as users’ business requirements dictate.
  • No need for up-front data models. Traditional data modeling, which is done up front, fails in a big data environment. These failures are caused by the nature of the incoming data and the limitations that BI tools put on the analysis.

Data lakes overcome these limitations by providing a loosely coupled architecture, which enables flexible analysis. Business users with different requirements can view the same data from different points of view.

  • Frequently used data structures can be stored for use. Data that’s used frequently (for standardized reports, for example) can be loaded into the data warehouse with its structure intact. Other data can reside in the lake and be analyzed as needed.
  • One lake, many views. Business users can use data in the lake to meet local as well as enterprise-wide business requirements from the same data store. The enterprise view of the data can be considered as another local view.
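
The "one lake, many views" idea can be sketched as two functions reading the same records. The record layout and the view functions below are assumptions made for illustration, not part of any Hadoop tooling.

```python
# Hypothetical sales records held once in the lake.
LAKE = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "west", "product": "a", "amount": 5},
    {"region": "east", "product": "b", "amount": 7},
]


def regional_view(records, region):
    """Local view: only the records one business unit cares about."""
    return [rec for rec in records if rec["region"] == region]


def enterprise_view(records):
    """Enterprise view: totals per product across every region."""
    totals = {}
    for rec in records:
        totals[rec["product"]] = totals.get(rec["product"], 0) + rec["amount"]
    return totals


east_sales = regional_view(LAKE, "east")
product_totals = enterprise_view(LAKE)
print(east_sales)
print(product_totals)  # {'a': 15, 'b': 7}
```

Both views are computed from the same stored data, so the enterprise view is simply one more way of reading the lake.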

 


 


About the Author

Shikha is a tech leader with deep expertise in emerging technologies such as Big Data analytics using Actian, Hortonworks, Tableau, and Spotfire. Her experience includes working with Fortune 500 companies on solution design, architecture, and project management. Shikha leads Technology for Syntelli and is passionate about non-profit causes and giving back to the community.

Shikha Kashyap

Chief Technology Officer