Data lakes by

Data lake

A data lake:

* A single platform combining storage, data governance, and analytics.

* Designed to address the challenges of storing, managing, and analyzing large and varied volumes of data.

* Centralized, secure, and durable cloud-based storage platform that allows you to ingest and store structured and unstructured data, and transform these raw data assets as needed.

* You don’t need an innovation-limiting pre-defined schema.

* You can use a complete portfolio of data exploration, reporting, analytics, machine learning, and visualization tools on the data.

* Makes data and the optimal analytics tools available to more users, across more lines of business.

* Enables users to get all of the business insights they need, whenever they need them.
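The point above about not needing a pre-defined schema is commonly called schema-on-read: data is stored raw, and each consumer applies its own structure when reading. The following is a minimal sketch of that idea, not any particular product's API. The `raw_zone` list, the `read_with_schema` helper, and the sample records are all hypothetical names chosen for illustration.

```python
import json

# Raw records are stored exactly as they arrived; no schema was enforced on write.
# (Hypothetical sample data standing in for files in the lake.)
raw_zone = [
    '{"user": "ana", "amount": "12.50", "country": "PT"}',
    '{"user": "bo", "amount": "7", "coupon": "SPRING"}',
]


def read_with_schema(raw_records, schema):
    """Project each raw record onto `schema` (field name -> converter) at read time.

    Fields missing from a record come back as None; extra fields are ignored.
    """
    for line in raw_records:
        record = json.loads(line)
        yield {
            name: (convert(record[name]) if name in record else None)
            for name, convert in schema.items()
        }


# Two consumers read the same raw data with different schemas.
billing_view = list(read_with_schema(raw_zone, {"user": str, "amount": float}))
marketing_view = list(read_with_schema(raw_zone, {"user": str, "coupon": str}))
```

Because the schema is an argument to the reader rather than a property of the store, a new consumer can define a new view without any change to the stored data.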

With a data lake, you can do the following:

* Ingest and store data from a wide variety of sources into a centralized platform.

* Build a comprehensive data catalog to find and use data assets stored in the data lake.

* Secure, protect, and manage all of the data stored in the data lake.

* Use tools and policies to monitor, analyze, and optimize infrastructure and data.

* Transform raw data assets in place into optimized usable formats.

* Query data assets in place.

* Use a broad and deep portfolio of data analytics, data science, machine learning, and visualization tools.

The Data Lake can be an effective data management solution for advanced analytics experts and business users alike. A Data Lake allows users to analyze a large variety and volume of data when and how they want. Following a Data and Analytics as a Service (DAaaS) model provides users with on-demand, self-service data.

However, to be successful, a Data Lake needs to leverage a multitude of products while being tailored to the industry and providing users with extensive, scalable customization.

Characteristics of a Successful Data Lake Implementation

A Data Lake enables users to analyze the full variety and volume of data stored in the lake. This necessitates features and functionalities to secure and curate the data, and then to run analytics, visualization, and reporting on it. The characteristics of a successful Data Lake include:

Use of multiple tools and products.

Extracting maximum value out of the Data Lake requires customized management and integration that are currently unavailable from any single open-source platform or commercial product vendor. The cross-engine integration necessary for a successful Data Lake requires multiple technology stacks that natively support structured, semi-structured, and unstructured data types.

Domain specification.

The Data Lake must be tailored to the specific industry. A Data Lake customized for biomedical research would be significantly different from one tailored to financial services. The Data Lake requires a business-aware data-locating capability that enables business users to find, explore, understand, and trust the data. This search capability needs to provide an intuitive means of navigation, including keyword, faceted, and graphical search. Under the covers, such a capability requires sophisticated business ontologies, within which business terminology can be mapped to the physical data. The tools used should enable independence from IT so that business users can obtain the data they need when they need it and can analyze it as necessary, without IT intervention.
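As a rough illustration of mapping business terminology to physical data, the sketch below keeps a small glossary and supports keyword and faceted search over it. It is a toy model under stated assumptions, not a real ontology engine: the `DataAsset` and `BusinessGlossary` classes, the file paths, and the facet names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """A physical data set in the lake, described for business users."""
    path: str                      # hypothetical physical location
    description: str
    facets: dict = field(default_factory=dict)   # e.g. {"domain": "finance"}


class BusinessGlossary:
    """Maps business terms to the physical assets that hold the data."""

    def __init__(self):
        self._terms = {}           # lower-cased business term -> list of assets

    def map_term(self, term, asset):
        self._terms.setdefault(term.lower(), []).append(asset)

    def keyword_search(self, keyword):
        """Match the keyword against business terms and asset descriptions."""
        keyword = keyword.lower()
        hits = []
        for term, assets in self._terms.items():
            for asset in assets:
                matches = keyword in term or keyword in asset.description.lower()
                if matches and asset not in hits:
                    hits.append(asset)
        return hits

    def faceted_search(self, **facets):
        """Return assets whose facets equal every requested facet value."""
        hits = []
        for assets in self._terms.values():
            for asset in assets:
                matches = all(asset.facets.get(k) == v for k, v in facets.items())
                if matches and asset not in hits:
                    hits.append(asset)
        return hits


glossary = BusinessGlossary()
claims = DataAsset("lake/raw/claims/2023.parquet",
                   "Insurance claims submitted by customers",
                   {"domain": "finance", "format": "parquet"})
trials = DataAsset("lake/raw/trials.csv",
                   "Clinical trial enrollment records",
                   {"domain": "biomedical", "format": "csv"})
glossary.map_term("Customer Claim", claims)
glossary.map_term("Trial Enrollment", trials)

claim_hits = glossary.keyword_search("claim")
biomedical_hits = glossary.faceted_search(domain="biomedical")
```

A business user searches by term or facet and gets back physical locations, without needing to know how the lake is laid out.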

Automated metadata management.

The Data Lake concept relies on capturing a robust set of attributes for every piece of content within the lake. Attributes like data lineage, data quality, and usage history are vital to usability. Maintaining this metadata requires a highly automated metadata extraction, capture, and tracking facility. Without a high degree of automated and mandatory metadata management, a Data Lake will rapidly become a Data Swamp.
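One way to picture automated, mandatory metadata capture is a store that only accepts data through a registration call, which computes the metadata itself. The sketch below tracks the three attributes named above (lineage, quality, usage history) in memory. The `MetadataStore` class, its method names, and the use of field completeness as the quality measure are illustrative assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AssetMetadata:
    name: str
    source: str
    parents: list                  # direct upstream assets (lineage)
    completeness: float            # share of non-null values (a simple quality proxy)
    row_count: int
    captured_at: datetime
    usage_history: list = field(default_factory=list)   # (user, timestamp) pairs


class MetadataStore:
    """Every asset enters through register(), so metadata is never optional."""

    def __init__(self):
        self._records = {}

    def register(self, name, rows, source, derived_from=()):
        missing = [p for p in derived_from if p not in self._records]
        if missing:
            raise KeyError(f"unregistered parent assets: {missing}")
        total = sum(len(row) for row in rows)
        filled = sum(1 for row in rows for v in row.values() if v is not None)
        record = AssetMetadata(
            name=name,
            source=source,
            parents=list(derived_from),
            completeness=filled / total if total else 0.0,
            row_count=len(rows),
            captured_at=datetime.now(timezone.utc),
        )
        self._records[name] = record
        return record

    def get(self, name):
        return self._records[name]

    def record_access(self, name, user):
        self._records[name].usage_history.append((user, datetime.now(timezone.utc)))

    def lineage(self, name):
        """All upstream assets, nearest parents first."""
        result, queue = [], list(self._records[name].parents)
        while queue:
            parent = queue.pop(0)
            if parent not in result:
                result.append(parent)
                queue.extend(self._records[parent].parents)
        return result


store = MetadataStore()
raw = [{"id": 1, "amount": 10.0}, {"id": 2, "amount": None}]
store.register("raw_orders", raw, source="crm_export")
clean = [row for row in raw if row["amount"] is not None]
store.register("clean_orders", clean, source="cleaning_job", derived_from=["raw_orders"])
store.record_access("clean_orders", user="analyst_1")
```

Rejecting assets whose parents are unregistered is the "mandatory" part: a data set cannot enter the lake without traceable lineage.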

Configurable ingestion workflows.

In a thriving Data Lake, new sources of external information will be continually discovered by business users. These new sources need to be rapidly on-boarded to avoid frustration and to realize immediate opportunities. A configuration-driven ingestion workflow mechanism can provide a high level of reuse, enabling easy, secure, and trackable content ingestion from new sources.
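A configuration-driven workflow can be sketched as a single generic `ingest` function whose behavior is controlled entirely by a per-source config entry, so on-boarding a new source means adding configuration rather than code. The config keys (`name`, `format`, `target`, `required_fields`), the in-memory `lake` dictionary, and the sample payloads below are hypothetical. Security controls are out of scope for this sketch.

```python
import csv
import io
import json


def parse_csv(text):
    return list(csv.DictReader(io.StringIO(text)))


def parse_json_lines(text):
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Reusable parsers, selected by the "format" key of each source config.
PARSERS = {"csv": parse_csv, "jsonl": parse_json_lines}


def ingest(config, payload, lake, audit_log):
    """Parse, validate, and store one payload as described by `config`."""
    records = PARSERS[config["format"]](payload)
    required = config.get("required_fields", [])
    accepted = [
        r for r in records
        if all(r.get(name) not in (None, "") for name in required)
    ]
    lake.setdefault(config["target"], []).extend(accepted)
    # The audit entry is what makes each ingestion trackable.
    audit_log.append({
        "source": config["name"],
        "target": config["target"],
        "accepted": len(accepted),
        "rejected": len(records) - len(accepted),
    })
    return accepted


lake, audit_log = {}, []
sources = [
    {"name": "vendor_prices", "format": "csv", "target": "raw/prices",
     "required_fields": ["sku"]},
    {"name": "web_events", "format": "jsonl", "target": "raw/events"},
]
payloads = {
    "vendor_prices": "sku,price\nA1,9.99\n,4.50\n",
    "web_events": '{"event": "click"}\n{"event": "view"}\n',
}
for config in sources:
    ingest(config, payloads[config["name"]], lake, audit_log)
```

Adding a third source here would only require another entry in `sources`, plus a new parser in `PARSERS` if its format is not yet supported.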

Integration with the existing environment.

The Data Lake needs to meld into and support the existing enterprise data management paradigms, tools, and methods. It needs a supervisor that integrates and manages, when required, existing data management tools, such as data profiling, data mastering and cleansing, and data masking technologies.

Keeping all of these elements in mind is critical for the design of a successful Data Lake.

https://knowledgent.com/whitepaper/design-successful-data-lake/

https://docs.aws.amazon.com/aws-technical-content/latest/building-data-lakes/building-data-lakes-on-aws.pdf
