sitegplus.blogg.se - Data lakehouse vs data lake

DATA LAKEHOUSE VS DATA LAKE HOW TO

Data Mesh is not a Data Lake

A Data Mesh can serve many more use cases than a Data Lake can (e.g. in the operational data domains). Many historical Data Lake designs do not incorporate any Data Mesh principles (e.g. they lack cohesion with data producers or any focus on data product thinking), or, for organizational reasons, the Data Lake teams remain isolated from the business data producers. Our objective is to align data consumers with the data while requiring minimal data processing from IT as a ‘middleman’ in the process. We can use the same tech stack to reduce the ‘impedance’ of data processing that occurs between the data producers and the data consumers. In this way, we bring together Systems of Record, Systems of Analysis and Systems of Engagement and reduce the friction of the data that flows among them.

Returning to the real-world example of actual lakes: within larger lakes there is an entire ecosystem of ‘zones and currents’. The Data Lake is one part of the Data Mesh solution, and not even the most important part. As a discipline, the Data Lake technical concepts are still vast and important (with or without a Data Mesh).

In what I consider a great example of a Data Mesh, the folks at Intuit specifically include their key stakeholders, pipelines, and consumption APIs as part of the Data Product definition.

You can stream data within a lake (e.g. with Apache Spark Streaming), but that does not make it a Data Mesh. Without an explicit tie-in to the operational data domains (e.g. the domain-oriented source teams), the overall Data Lake solution remains siloed: data is merely tossed over the wall from one team to the next. It takes organizational and technical commitment to join up the data producers with the data consumers, with IT providing the overarching tech stack. In fact, most data lakes still operate more or less in isolation from the producers of the data. Streaming within a Data Lake is not a Data Mesh, but a Data Mesh should be able to stream within a Data Lake!

Even the use of 'data product thinking' does not in and of itself make a Data Mesh, because the concepts, methodology and best practices of Data Products can be applied to any kind of data architecture (centralized or distributed).

As discussed in the reference Data Mesh stories at the top of this post, some folks talk about a Data Mesh as a kind of Data Lake with (1) well-defined data ‘zones’, (2) a catalog of metadata with strong schema typing on the data, (3) a bit of streaming between data inside the lake, and (4) SQL federation tools that query the data directly within the lake (e.g. for reporting). But this is not really a Data Mesh; it is a particular style of using a Data Lake.

The lakehouse concept takes the usual Data Lake concept and adds a few things, such as ACID transaction support, schema enforcement, stronger SQL support for analytics, and stream processing. Not everyone agrees this is a particularly innovative concept, since it also sounds a lot like a modern data warehouse, but that debate is not the purpose of this post.
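To make two of the lakehouse additions mentioned above a little more concrete, here is a minimal sketch of schema enforcement combined with an all-or-nothing batch append (the atomicity part of ACID only). Everything in it is an assumption made for illustration: `ToyTable`, `SchemaError` and `append_batch` are hypothetical names, the table lives in memory, and none of this is the API of any real lakehouse engine.

```python
from dataclasses import dataclass, field


class SchemaError(ValueError):
    """Raised when a record does not match the table schema."""


@dataclass
class ToyTable:
    """Hypothetical in-memory table, used only to illustrate the idea."""

    schema: dict[str, type]
    rows: list[dict] = field(default_factory=list)

    def _validate(self, record: dict) -> None:
        # Schema enforcement: column names and value types must match exactly.
        if set(record) != set(self.schema):
            raise SchemaError(
                f"columns {sorted(record)} do not match schema {sorted(self.schema)}"
            )
        for column, expected in self.schema.items():
            if not isinstance(record[column], expected):
                raise SchemaError(
                    f"column {column!r} expects {expected.__name__}, "
                    f"got {type(record[column]).__name__}"
                )

    def append_batch(self, batch: list[dict]) -> None:
        # Validate the whole batch before touching the table, so a bad
        # record rejects the entire batch instead of leaving partial data.
        for record in batch:
            self._validate(record)
        self.rows.extend(batch)


orders = ToyTable(schema={"order_id": int, "amount": float})
orders.append_batch([{"order_id": 1, "amount": 9.5}])

try:
    orders.append_batch(
        [
            {"order_id": 2, "amount": 20.0},
            {"order_id": 3, "amount": "oops"},  # wrong type: rejects the batch
        ]
    )
except SchemaError as err:
    print(f"rejected: {err}")

print(len(orders.rows))  # only the first, valid batch was committed
```

A plain Data Lake would typically accept both files as written and leave it to readers to discover the bad record later; the point of the sketch is only that a lakehouse-style table checks on write and commits a batch as a unit.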
If we look at lakes in northern California in isolation, they are large bodies of water separated by great distances. In the work that we are doing at Oracle, we explicitly aim to make Data Mesh a solution that is useful for both the domain-ownership side (e.g. operations) and the domain-consumer side (e.g. analytics and data lake). If you read the paper and digest the ideas, one of the key failure modes that Zhamak discusses is siloed and hyper-specialized ownership: the domain-oriented source teams (e.g. apps and LoB operations) are disconnected from the data and ML platform engineers, who are in turn disconnected from the domain-oriented consumers. Thus, any data lake whose data domains (or overall integration design) are disconnected from the domain owners (e.g. the operational applications) is not a great example of a Data Mesh! I'm personally always a bit suspicious when I see a story about Data Mesh where the technical architecture starts with, 'and so the raw data is here in our Data Lake.'

Perhaps some of the confusion goes back to the title of Zhamak’s very popular 2019 paper, “How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh”, which includes the term Data Lake right in the title.

This may seem obvious, but some writers conflate Data Mesh with a certain ‘style’ of Data Lake. A Data Mesh is not a Data Lake, nor is it a Data Lakehouse or a Data Warehouse.