The concept of a data lake is fairly recent: What exactly is it? What are its advantages? And is there some kind of Loch Ness monster swimming at the bottom of this data lake?
What is a Data Lake?
The term "Data Lake" has in fact been defined progressively as it developed. To put it simply, a Data Lake stores raw data (that is, in its native format) for later use.
The term was coined by James Dixon, CTO of Pentaho (a Business Intelligence software vendor), and the practice has since grown in popularity and is increasingly used in Big Data initiatives today.
As a tool, the Data Lake disrupts the data integration market and helps redefine how companies handle their data. To give a deeper definition, a Data Lake stores disparate information while largely ignoring context. Unlike a traditional Data Center, the lake pays no attention to how its data will be used, governed, defined, or secured, nor to when or for what purpose it will be used: some of the lake's data may never even be used.
Big Data initiatives have only recently begun to use these data lakes because a data lake contains all its data in an unstructured and unorganized format. Because the data are not structured, they can be manipulated in various ways, and in many cases Big Data works better that way.
Indeed, in the past Data Centers were adequate storage areas because they were better organized. That is still true, but it becomes difficult for data scientists to discover new insights when the data is already pre-organized.
One advantage of the data lake is that all the information stored there is available at any time, in its original format. Obviously, it takes longer to go from point A to point B, but in an increasingly competitive world, where every byte of data counts, the Data Lake can be very appealing. If we consider that the Internet of Things is the next big trend in data integration, then now is probably the time to pay attention to the Data Lake, which is why its popularity should continue to grow.
We will look at its advantages and disadvantages.

What are the advantages of a Data Lake?
Storing a large volume of data
The first advantage of a Data Lake is that it can store considerable volumes of highly diverse data: whether structured or not, coming from any type of database or not, the Data Lake is by nature completely neutral regarding the type of information it contains. It is precisely because it does not impose a strict schema that the Data Lake is a valuable tool. And for good reason: none of the data it contains is ever altered, degraded, or distorted.
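To make the idea of storing data "as is" concrete, here is a minimal sketch in Python. It assumes the lake is nothing more than a directory tree; the function names (`ingest_raw`, `demo_ingest`), the source names, and the file names are all hypothetical and chosen only for illustration. A real lake would more likely sit on a distributed file system or object store.

```python
import tempfile
from pathlib import Path


def ingest_raw(lake_root: Path, source: str, name: str, payload: bytes) -> Path:
    """Store the payload byte-for-byte under lake_root/source/name, without parsing it."""
    target_dir = lake_root / source
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / name
    target.write_bytes(payload)  # no schema check, no conversion, no cleaning
    return target


def demo_ingest() -> dict:
    """Ingest three unrelated formats and return what ended up in the lake."""
    with tempfile.TemporaryDirectory() as tmp:
        lake = Path(tmp)
        ingest_raw(lake, "crm", "customers.csv", b"id,name\n1,Ada\n")
        ingest_raw(lake, "web", "clicks.jsonl", b'{"user": 1, "page": "/home"}\n')
        ingest_raw(lake, "sensors", "reading.bin", bytes([0, 255, 16]))
        # Map each stored file (relative path) to its exact content.
        return {
            p.relative_to(lake).as_posix(): p.read_bytes()
            for p in sorted(lake.rglob("*"))
            if p.is_file()
        }
```

The point of the sketch is what it does not do: the CSV, the JSON lines, and the binary reading all go through the same function, and what comes back out is exactly what went in.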
Easier data analysis
But this is not the only advantage of a Data Lake. Indeed, because the data is raw, it can be analyzed on an ad hoc basis, on demand, with the goal of detecting trends and generating reports according to the company's needs, without this becoming a massive project involving another platform or another data repository.
The data available in the Data Lake is easily exploitable in real time and allows your company to adopt a "data-centered" approach so that your decisions, choices, and strategies are never disconnected from market reality or your activities.
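The ad hoc, on-demand analysis described above can be sketched as follows. This is only an illustration under stated assumptions: the click events, their field names (`user`, `page`, `ms`), and the helper functions are invented for the example. The structure is decided when the data is read, not when it is stored, so two different questions can be asked of the same raw lines.

```python
import json
from collections import Counter

# Hypothetical raw click events, kept exactly as they arrived (one JSON object per line).
# Records do not all carry the same fields.
RAW_EVENTS = """\
{"user": 1, "page": "/home", "ms": 120}
{"user": 2, "page": "/pricing"}
{"user": 1, "page": "/pricing", "ms": 340, "referrer": "ad"}
"""


def read_events(raw_text: str):
    """Parse the raw lines at read time, skipping blank lines."""
    for line in raw_text.splitlines():
        if line.strip():
            yield json.loads(line)


def views_per_page(raw_text: str) -> Counter:
    """First question: how many views did each page get?"""
    return Counter(event["page"] for event in read_events(raw_text))


def average_latency_ms(raw_text: str) -> float:
    """Second question: average latency, using only records that report one."""
    values = [event["ms"] for event in read_events(raw_text) if "ms" in event]
    return sum(values) / len(values)
```

Neither question had to be anticipated when the events were stored, which is the flexibility the raw format buys.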
A low-cost technology
The data lake is designed to run on basic, low-cost hardware and can be used with open-source software: from a hardware and software perspective, it is fair to say that setting up a Data Lake represents a negligible cost.
A single place for your data
A Data Lake does not care about the nature of the data: it stores and processes everything, whether already structured, semi-structured, or unstructured. This practice drastically reduces cost and time compared with traditional systems.
There is a lot to gain from having all your data in the same place, mixing all these datasets of different types.
A new world of discoveries
In practice, a data lake allows its operator to discover previously unseen data: unlike a traditional Data Center, where the user is limited in the questions they can ask and the answers they can obtain, in a data lake the possibilities are limited only by the total amount of data.
Of course, they can dive into their data lake with the same series of questions they used to ask their traditional Data Center and obtain the same answers (or better answers).
But they can also ask different questions that were previously impossible to pose, thereby obtaining even more answers and sometimes extracting better insights.
Advanced analysis capabilities
Many software suites include descriptive analytics that show the user an interpretation of what happened, often accompanied by extensive visuals and various diagrams.
This data-processing capability has existed for decades. But with the rise of big data, companies need more — such as prescriptive, predictive, and diagnostic analytics — to stay ahead of the market and their competitors. A data lake provides that capability.

What are the disadvantages of a Data Lake?
Unfiltered data
However, not everything is perfect with the lake. As we now know, it enables more advanced searches based on much larger volumes of data. But there are no unique identifiers: since there is no metadata, whoever extracts the data has to start from scratch for each new analysis.
It's much more tedious to start researching within an unfiltered dataset when nothing is sorted into a category or class. In a word, it's hard to make use of a data lake, where everything is “in bulk,” rather than within an environment where everything has its place. This leads to other problems that follow naturally (lake, source, you got it?).
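To show what is missing when nothing is sorted into a category, here is a hedged sketch of the kind of minimal metadata record that would spare an analyst from starting from scratch. The `Catalog` and `CatalogEntry` classes, their fields, and the example paths are hypothetical; they are not part of any particular data lake product.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """A few facts about one file in the lake (all fields are illustrative)."""
    path: str
    source: str
    fmt: str
    tags: set = field(default_factory=set)


class Catalog:
    """An in-memory list of entries that can be filtered by source or tag."""

    def __init__(self):
        self._entries = []

    def register(self, path, source, fmt, tags=()):
        entry = CatalogEntry(path, source, fmt, set(tags))
        self._entries.append(entry)
        return entry

    def find(self, *, source=None, tag=None):
        """Return the paths matching every filter that was given."""
        return [
            entry.path
            for entry in self._entries
            if (source is None or entry.source == source)
            and (tag is None or tag in entry.tags)
        ]


def demo_catalog():
    catalog = Catalog()
    catalog.register("crm/customers.csv", "crm", "csv", tags={"customer"})
    catalog.register("web/clicks.jsonl", "web", "jsonl", tags={"customer", "behavior"})
    catalog.register("sensors/reading.bin", "sensors", "binary")
    return catalog
```

With such records, `demo_catalog().find(tag="customer")` narrows the search to two files. Without them, every file in the lake is a candidate for every analysis.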
The usefulness of retaining certain data
Because the stored data are not defined a priori, there is no control over what is dumped into the lake.
Are the data useful? No one can know until they are analyzed. In a Data Center, at least, the data can be qualified and organized, and their relevance is recognized. The lake is a jumble.
Data security
Easy to anticipate, isn't it? Setting up a Data Lake also poses security problems: since no one has mastery or knowledge of what is in the lake, it is likely that some data will be corrupted and that this will only be discovered too late.
This drawback is significant, because a great many companies began using this technology without really worrying about data security, which must not be compromised.
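One common way to notice corrupted data before it is "too late" is to record a checksum when a file enters the lake and compare it later. The sketch below assumes the stored files and the recorded digests are available as plain dictionaries; the function names are hypothetical, and the approach only detects changes, it does not prevent them or address access control.

```python
import hashlib


def fingerprint(payload: bytes) -> str:
    """SHA-256 digest of the payload, as a hexadecimal string."""
    return hashlib.sha256(payload).hexdigest()


def find_corrupted(stored: dict, recorded: dict) -> list:
    """Return the names whose current bytes no longer match the digest recorded at ingest.

    A file with no recorded digest is also reported, since it cannot be verified.
    """
    return sorted(
        name
        for name, payload in stored.items()
        if fingerprint(payload) != recorded.get(name)
    )


def demo_integrity_check() -> list:
    original = {"a.csv": b"id,name\n1,Ada\n", "b.bin": bytes([1, 2, 3])}
    recorded = {name: fingerprint(data) for name, data in original.items()}
    # Simulate silent corruption of one file after ingest.
    current = dict(original)
    current["b.bin"] = bytes([1, 2, 4])
    return find_corrupted(current, recorded)
```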
And the law?
The problems with the Data Lake do not stop there, since now that various privacy protection regulations (such as the GDPR) are fully in force, companies in the European Union are obliged to honor any request to delete the personal information they hold about an individual.
Data Lakes are a potential legal quagmire.
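A small sketch helps show why deletion requests are awkward in a lake. It assumes records are stored as JSON lines and that the individual can be identified by a single field; both are assumptions made for the example, and the function is an illustration, not a compliance procedure.

```python
import json


def erase_subject(raw_lines, subject_key, subject_value):
    """Drop JSON records that belong to one individual.

    Returns (kept_lines, removed_count). Lines that are not valid JSON cannot
    be inspected and are kept untouched, which is exactly where the risk lies.
    """
    kept, removed = [], 0
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            kept.append(line)
            continue
        if isinstance(record, dict) and record.get(subject_key) == subject_value:
            removed += 1
        else:
            kept.append(line)
    return kept, removed


# Hypothetical lake content: two structured records and one free-text note.
LINES = [
    '{"email": "a@example.org", "order": 1}',
    "support note mentioning a@example.org",
    '{"email": "b@example.org", "order": 2}',
]
```

Calling `erase_subject(LINES, "email", "a@example.org")` removes the structured record but leaves the free-text note, which still mentions the same person. Unstructured data that nobody has catalogued is what turns a simple request into a search through the whole lake.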
Few tools to probe the depths
Traditional strategic monitoring tools have a hard time with everything that's in the lake. Business Intelligence solutions are mostly designed to analyze structured data, and simply do not work satisfactorily when asked to process unstructured information.
Although Data Centers contain far less data, they are more accurate.
Lack of qualified staff
Exploiting a data lake requires specific skills, and since the practice is recent, sufficiently qualified integrators are rare for now.
Conclusion
Data Lakes have many strengths and offer opportunities that will need to be seized in the near future. A forward-looking company can equip itself with the means to implement a data lake in order to make the most of it, with a low initial investment.
However, there are a number of weaknesses that cannot be ignored given how harmful they can prove to be.