
What is a data lake and why does it matter?

By: Jim Harris, Blogger-in-Chief at Obsessive-Compulsive Data Quality (OCDQ)

A data lake is a storage repository that can rapidly ingest large amounts of raw data in its native format. As a result, business users can quickly access it whenever needed and data scientists can apply analytics to get insights. Unlike its older cousin – the data warehouse – a data lake is ideal for storing unstructured big data like tweets, images, voice and streaming data. But it can be used to store all types of data – any source, any size, any speed, any structure.

Some of the formats for information stored in data lakes are:

Structured data, such as rows and columns from relational database tables.
Semistructured data, like delimited flat text files and schema-embedded files.
Unstructured data – including social media content and data from the Internet of Things (IoT) – as well as documents, images, voice and video.

History of data lakes: Where it all started

The early versions of what we now call data lakes were pioneered by the watering holes of the yellow elephant – Hadoop. When it was first released, Hadoop was a collection of open source software used for the distributed storage, processing and analytics of big data. It was especially useful for emerging sources of semistructured and unstructured data that were becoming more prevalent at the time. Data lakes were also used to scale up for structured data whose volumes were rising rapidly.

Unfortunately, the early hype around Hadoop implied that you could arbitrarily throw any amount of data into a lake, then let users fend for themselves. Multiple publicized failures proved that approach was wrong. Some early adopters saw their data lakes quickly devolve into poorly managed and ungoverned dumping grounds akin to data swamps. This resulted in:

Redundancy, which skewed analytical results.
Non-auditable data, which no one would trust.
Poor query performance, which killed the primary early purposes of the data lake – high-performance exploration and discovery.

These undocumented and disorganized early data lakes were nearly impossible to navigate. Metadata tagging evolved to become one of the most essential data lake management practices because it made the data in the lake easier to find. Data lake governance improved its auditability and trustworthiness, verifying its usability for more enterprise applications.

The technologies and methodologies used to implement a data lake have matured over time. Now they include not only Hadoop but also other traditional and big data technologies.


Moving past the early hype

As the hype of early data lakes faded away, a data lake stopped being mistaken for a data platform. Instead, it was recognized as a container for multiple collections of varied data coexisting in one convenient location.

Today, data lakes are formally included in enterprise data and analytics strategies. Organizations recognize that the term data lake refers to just one part of the enterprise ecosystem, which includes:

Source systems.
Ingestion pipelines.
Integration and data processing technologies.
Databases.
Metadata.
Analytics engines.
Data access layers.

To be a comprehensive business intelligence platform that generates high business value, a data lake requires integration, cleansing, metadata management and governance. Leading organizations are now taking this holistic approach to data lake management. As a result, they can use analytics to correlate diverse data from diverse sources in diverse structures. This means more comprehensive insights for the business to call upon when making decisions.

Why are data lakes important?

Because a data lake can rapidly ingest all types of new data – while providing self-service access, exploration and visualization – businesses can see and respond to new information faster. Plus, they have access to data they couldn’t get in the past.

These new data types and sources are available for data discovery, proofs of concept, visualizations and advanced analytics. For example, a data lake is a common data source for machine learning – a technique that’s often applied to log files, clickstream data from websites, social media content, streaming sensors and data emanating from other internet-connected devices.

Many businesses have long wished for the ability to perform discovery-oriented exploration, advanced analytics and reporting. A data lake quickly provides the necessary scale and diversity of data to do so. It can also be a consolidation point for both big data and traditional data, enabling analytical correlations across all data.

Although it’s typically used to store raw data, a lake can also store some of the intermediate or fully transformed, restructured or aggregated data produced by a data warehouse and its downstream processes. This is often done to reduce the time data scientists must spend on common data preparation tasks.

The same approach is sometimes used to obscure or anonymize personally identifiable information (PII) or other sensitive data that’s not needed for analytics. This helps businesses comply with data security and privacy policies. Access controls are another method businesses can use to maintain security.
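
As a rough illustration of obscuring PII before data is made available for analytics, the sketch below drops one sensitive field and replaces another with a keyed hash. The field names, the `SECRET_KEY` value and the `mask_record` helper are all hypothetical. Keyed hashing is strictly pseudonymization rather than full anonymization, so treat this as one possible approach under those assumptions, not a compliance recipe.

```python
import hashlib
import hmac

# Hypothetical key; in practice this would come from a managed secret store.
SECRET_KEY = b"replace-with-a-managed-secret"
PII_FIELDS = {"email"}    # assumed fields to replace with a keyed hash
DROP_FIELDS = {"phone"}   # assumed fields that analytics does not need at all


def mask_record(record):
    """Return a copy of the record with sensitive fields dropped or hashed."""
    masked = {}
    for field, value in record.items():
        if field in DROP_FIELDS:
            continue  # not needed for analytics, so it never leaves the source
        if field in PII_FIELDS:
            digest = hmac.new(SECRET_KEY, str(value).encode("utf-8"), hashlib.sha256)
            masked[field] = digest.hexdigest()
        else:
            masked[field] = value
    return masked


customer = {"email": "pat@example.com", "phone": "555-0100", "basket_total": 42.5}
masked_customer = mask_record(customer)
```

Because the hash is deterministic for a given key, records for the same customer can still be joined on the masked value without exposing the original email address.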


Data lake versus data warehouse

Wondering what you need to consider when comparing a data lake and data warehouse? One of the top considerations relates to the design, or schema, of the data store.

Relational databases and other structured data stores use a schema-driven design. This means any data added to them must conform to, or be transformed into, the structure predefined by their schema. The schema is aligned with associated business requirements for specific uses. The most familiar example of this type of design is a data warehouse.

A data lake, on the other hand, uses a data-driven design. This allows for rapid ingestion of new data before data structures and business requirements are defined for its use. Sometimes data lakes and data warehouses are differentiated by the terms schema on write (data warehouse) versus schema on read (data lake).

Schema on write (data warehouse) limits or slows ingestion of new data. It is designed with a specific purpose in mind for the data, as well as specific associated metadata. However, most data can serve multiple purposes.
Schema on read (data lake) retains the raw data, enabling it to be easily repurposed. It also allows multiple metadata tags to be assigned to the same data.
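
The difference between the two approaches can be sketched in a few lines of Python. This is a minimal illustration, not how any particular warehouse or lake product works: `SALES_SCHEMA`, `write_with_schema` and `read_with_schema` are hypothetical names, and the raw data is assumed to arrive as JSON text.

```python
import json

# Hypothetical warehouse schema: column name -> required Python type.
SALES_SCHEMA = {"order_id": int, "amount": float}


def write_with_schema(table, record):
    """Schema on write: reject records that do not match the predefined schema."""
    if set(record) != set(SALES_SCHEMA):
        raise ValueError(f"unexpected columns: {sorted(record)}")
    for column, expected_type in SALES_SCHEMA.items():
        if not isinstance(record[column], expected_type):
            raise ValueError(f"{column} must be {expected_type.__name__}")
    table.append(record)


def read_with_schema(raw_lines, fields):
    """Schema on read: parse raw JSON lines and project only the requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        yield {field: record.get(field) for field in fields}


warehouse_table = []
write_with_schema(warehouse_table, {"order_id": 1, "amount": 19.99})

# The lake keeps the raw lines untouched, whatever their shape.
lake = [
    '{"order_id": 1, "amount": 19.99, "channel": "web"}',
    '{"order_id": 2, "comment": "arrived late"}',
]
finance_view = list(read_with_schema(lake, ["order_id", "amount"]))
support_view = list(read_with_schema(lake, ["order_id", "comment"]))
```

The same raw lines serve two different purposes here because structure is applied only when the data is read. The cost is that missing fields surface at read time (as `None` in this sketch) instead of being rejected at load time.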

Since it’s not restricted to a single structure, a data lake can accommodate multistructured data for the same subject area. For example, data lakes can blend structured sales transactions with unstructured customer sentiment. And since it’s focused on storage, a data lake requires less processing power than a data warehouse. Data lakes are also much easier, faster and less expensive to scale over time.

One disadvantage of a lake is that its data is not standardized, deduplicated, quality-checked or transformed. In response, some organizations use data lakes differently: as a new, improved zone for landing and staging data before it’s prepared, integrated and transformed for loading into the data warehouse.

These examples illustrate why a data lake does not replace a data warehouse – it complements the data warehouse. Besides being used as staging areas, lakes can also serve as archives. In this scenario, outdated data is archived but kept readily accessible for auditing and historical analysis.

Data lakes in the early days: A deeper dive

Early data lakes used the open source Hadoop Distributed File System (HDFS) as a framework for storing data across many different storage devices as if they were a single file system. HDFS worked in tandem with MapReduce, the data processing and resource management framework that split up large computational tasks – such as analytical aggregations – into smaller tasks. These smaller tasks ran in parallel on computing clusters of commodity hardware.
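
The idea of splitting an aggregation into smaller parallel tasks can be sketched as follows. This is not Hadoop's API: a local thread pool stands in for cluster nodes, the partitions and the `map_partition` and `reduce_partials` helpers are hypothetical, and the map step is simplified so that it also pre-aggregates its own partition.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def map_partition(records):
    """Map step: compute partial sums per region for one partition of the data."""
    partial = defaultdict(float)
    for region, amount in records:
        partial[region] += amount
    return dict(partial)


def reduce_partials(partials):
    """Reduce step: merge the partial sums from every partition into one total."""
    total = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            total[region] += amount
    return dict(total)


# Each inner list stands in for a block of data stored on a different node.
partitions = [
    [("east", 10.0), ("west", 5.0)],
    [("east", 2.5), ("north", 4.0)],
    [("west", 1.5)],
]

# Threads stand in for worker nodes that process their partitions in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(map_partition, partitions))

totals = reduce_partials(partials)
```

Each worker only sees its own partition, so the work scales by adding partitions and workers; only the small partial results need to be brought together at the end.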

In its second release, Hadoop made an improvement that decoupled the resource management framework from MapReduce and replaced it with Yet Another Resource Negotiator (YARN). This essentially became Hadoop’s operating system. Most important, YARN supported alternatives to MapReduce as the processing framework. This greatly expanded the applications (and data management and governance functions) that could be executed natively in Hadoop.


How data lakes work today

Rapid ingestion and the ability to store raw data in its native format have always been key benefits of data lakes. But what exactly does that mean? And how does it work?

Raw data means that the data has not been processed or prepared for a particular use. Some data sources, however, have previously applied some amount of processing or preparation to their data. So, a data lake stores raw data in the sense that it does not process or prepare the data before storing it. One notable exception relates to formatting.
Native format means data remains in the format of the source system or application that created it. However, this is not always the best option for data lake storage. In fact, rarely does rapid ingestion simply mean copying data as-is into a file system directory used by the lake.

For example, a Microsoft Excel spreadsheet is, by default, in its native XLS or XLSX format. But most data lakes would store it as a delimited flat text file in comma-separated values (CSV) format. Transactional data from relational databases is also often converted to CSV files for storage in the lake.
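
A minimal sketch of converting relational data to CSV during ingestion, using only the Python standard library. The in-memory SQLite database, the `orders` table and the `export_table_to_csv` helper are assumptions made for illustration; a real pipeline would read from the source system and write to the lake's storage.

```python
import csv
import io
import sqlite3

# An in-memory SQLite database stands in for the transactional source system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "Acme, Inc.", 19.99), (2, "Globex", 5.0)],
)


def export_table_to_csv(connection, table_name, out_file):
    """Write every row of a table to out_file as CSV, with a header row."""
    # table_name is interpolated directly, so it must come from trusted code.
    cursor = connection.execute(f"SELECT * FROM {table_name}")
    writer = csv.writer(out_file)
    writer.writerow([column[0] for column in cursor.description])
    writer.writerows(cursor)


# A StringIO buffer stands in for a file in the lake's storage directory.
buffer = io.StringIO()
export_table_to_csv(conn, "orders", buffer)
csv_text = buffer.getvalue()
```

The `csv` module quotes values that contain the delimiter, such as the customer name with a comma, which hand-built string joins tend to get wrong.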

Embedded schema and granular data

Another common alternative is to use a file format with embedded schema information, such as JavaScript Object Notation (JSON). For example, clickstream data, social media content and sensor data from the IoT are usually converted into JSON files for data lake storage. JSON files are also a good example of how data lake ingestion often involves converting data from its native format into a more granular format.

Granular data, especially with embedded schema information such as key-value pairs (KVP), enables faster read and write operations. This means that it:

Does not waste storage on placeholders for optional, default or missing keys or values.
Can be aggregated and disaggregated to meet the needs of different situations.
Makes it easier to retrieve only the data applicable to a specific use.
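
These properties can be seen in a small sketch that stores events as one JSON record per line. The event fields and the `select` helper are hypothetical; the point is only that each record carries its own keys, so absent values take no space and a reader can pick out just what it needs.

```python
import json

# Hypothetical sensor events; each record carries only the keys it actually has.
events = [
    {"device": "thermostat-1", "temp_c": 21.5},
    {"device": "camera-7", "motion": True, "zone": "lobby"},
    {"device": "thermostat-1", "temp_c": 22.0, "battery": 0.8},
]

# One self-describing JSON record per line; no placeholders for missing keys.
json_lines = "\n".join(json.dumps(event) for event in events)


def select(lines, key):
    """Yield (device, value) only for records that contain the requested key."""
    for line in lines.splitlines():
        record = json.loads(line)
        if key in record:
            yield record["device"], record[key]


temperatures = list(select(json_lines, "temp_c"))
```

A fixed-width table would need `motion`, `zone` and `battery` columns on every row; here the camera record simply omits `temp_c` and is skipped by the temperature query.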

Additional, and more optimized, data lake storage formats are also available. Some of these storage formats enable better scalability and parallel processing, while also embedding schema information. Data can be converted into:

Column stores (e.g., Redshift, Vertica, Snowflake).
Compressed column-oriented formats (e.g., Parquet for Spark or ORC for Hive).
Conventional row-oriented formats (e.g., PostgreSQL, MySQL or other relational databases).
Compressed row-oriented formats (e.g., Avro for Kafka).
In-memory stores (e.g., SingleStore, Redis, VoltDB).
NoSQL stores (e.g., MongoDB, Elasticsearch, Cassandra).

When to use different data storage options

Most data lakes use a variety of storage options, depending on the data sources and business cases. This is especially true for business cases related to access and analytics. For example:

Column-oriented storage works best when rapid retrieval and accelerated aggregation are most important.
Row-oriented storage works best when there is a lot of schema variability, as is often the case with streaming applications. It’s also ideal when the data lake is used as a staging area for a data warehouse.
In-memory storage works best for real-time analytical use cases.
NoSQL storage works best for analytics scenarios that require rapid generation of metrics across large data sets.
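
The trade-off between the first two options can be illustrated with plain Python data structures. This is a toy model under simple assumptions, not a description of any product's storage engine: `to_columns` is a hypothetical helper and it assumes every input record has the same keys.

```python
# Row-oriented layout: each record is stored together.
rows = [
    {"order_id": 1, "region": "east", "amount": 10.0},
    {"order_id": 2, "region": "west", "amount": 5.0},
    {"order_id": 3, "region": "east", "amount": 2.5},
]


def to_columns(records):
    """Pivot records into a column-oriented layout (assumes identical keys)."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Column-oriented: an aggregate reads one contiguous list and ignores the rest.
total_amount = sum(columns["amount"])

# Row-oriented: a record with a new field can be appended without touching
# the records already stored, which suits variable schemas.
rows.append({"order_id": 4, "region": "north", "amount": 4.0, "coupon": "SPRING"})
```

Adding the `coupon` field to the column layout would instead require a new column with placeholder entries for every earlier record, which is why schema variability tends to favor row-oriented storage.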

Data lakes and the importance of architecture

The bottom line is that a data lake is not just a massive repository – it requires a well-designed data architecture. It’s possible to use a broad range of tools for implementing rapid ingestion of raw data into the data lake. Such tools include the data integration and extract-transform-load (ETL) tools your enterprise likely already has. Certain new big data technologies (including some of the examples above) provide this functionality as well.

Regardless of how you choose to implement ingestion and storage, data lakes can involve deploying back-end technologies you’re less familiar with – especially nonrelational database management systems (non-RDBMSs). Luckily, many of these technologies include user-friendly front-end interfaces. For example, some provide SQL-like query functionality that many users expect and already know how to use.

Get to know SAS® Viya®

Data is constantly changing around us – and our decisions need to adapt quickly. To get started with turning raw data into meaningful decisions, organizations have to first access, explore, transform and prepare data for analysis.

The data lake gives business users and data scientists alike access to data they couldn’t get in the past – allowing them to explore and visualize the data from one convenient location. In turn, they can see and respond to new information faster.

Data lakes complement SAS Viya, which is an artificial intelligence, analytics and data management platform that’s used to transform raw data into operational insights. SAS Viya supports every type of decision an organization needs to make.


Data lake and cloud

A data lake can be used as a centralized repository through which all enterprise data flows. As such, it becomes an easily accessible staging area where all enterprise data can be sourced. This includes data consumed by on-site applications as well as cloud-based applications that can accommodate big data’s size, speed and complexity. All of which leads to the question of data lake versus cloud: Where should the data lake be located?

Cloud data lake

For some enterprises, the cloud may be the best option for data lake storage. That’s because it provides complementary benefits – elastic scalability, faster service delivery and IT efficiency – along with a subscription-based accounting model.

On-site data lake

Enterprises may opt for grounding their data lake within their own walls for reasons similar to arguments made for managing a private cloud on-site. This approach provides the utmost security and control while protecting intellectual property and business-critical applications. It can also safeguard sensitive data in compliance with government regulations.

But the disadvantages of managing a private cloud on-site also apply to a data lake. Both can lead to increased in-house maintenance of the data lake architecture, hardware infrastructure, and related software and services.

Hybrid data lake

Sometimes businesses choose a hybrid data lake, which splits their data lake between on-site and cloud. In these architectures, the cloud data lake typically does not store data that is business critical. And if it contains personally identifiable information (PII) or other sensitive data, that data is obscured or anonymized. This helps the business comply with data security and privacy policies. To minimize cloud storage costs, the data stored in the cloud can be purged periodically or after pilot projects are completed.

About Jim Harris

Jim Harris is a recognized data quality thought leader with 25 years of enterprise data management industry experience. He is an independent consultant, speaker and freelance writer. Harris is the Blogger-in-Chief at Obsessive-Compulsive Data Quality, an independent blog offering a vendor-neutral perspective on data quality and its related disciplines, including data governance, master data management and business intelligence.

© 2023 SAS Institute Inc. All Rights Reserved.
