SAS Insights

Data lake and data warehouse – know the difference

By: Phil Simon, author, speaker and noted technology expert

Over the past few years, you may have heard someone somewhere drop the term “data lake.” The concept has increasingly gained traction as data volumes have increased exponentially, streaming data has taken off, and unstructured data has continued to dwarf its structured counterpart.

But what is a data lake anyway? Is it just marketing hype? And, generally speaking, how does it differ from the traditional data warehouse?

Understanding the traditional data warehouse

Odds are that at some point in your career you’ve come across a data warehouse, a tool that’s become synonymous with extract, transform and load (ETL) processes. At a high level, data warehouses store vast amounts of structured data in highly regimented ways. They require that a rigid, predefined schema exists before loading the data. (It’s almost always a star or snowflake schema.) Put differently, the schema in a data warehouse is defined “on write.” ETL processes dutifully kick out error reports, generate logs, and send errant records to exception files and tables to be addressed at later dates.
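To make "on write" concrete, here is a minimal sketch of a load step that checks every record against a predefined schema before accepting it and sets errant records aside for later review. The `SCHEMA` columns, the `validate` and `load` helpers, and the sample batch are all hypothetical and exist only for illustration; a real ETL tool would do far more.

```python
from datetime import date

# Hypothetical target schema for a sales table; column names are illustrative only.
SCHEMA = {"order_id": int, "sale_date": date, "amount": float}


def validate(record: dict) -> list[str]:
    """Return a list of schema violations for one record (empty means valid)."""
    errors = []
    for column, expected_type in SCHEMA.items():
        if column not in record:
            errors.append(f"missing column: {column}")
        elif not isinstance(record[column], expected_type):
            errors.append(f"{column}: expected {expected_type.__name__}")
    for column in record:
        if column not in SCHEMA:
            errors.append(f"unexpected column: {column}")
    return errors


def load(records: list[dict]):
    """Split a batch into rows accepted by the schema and rejected exceptions."""
    warehouse, exceptions = [], []
    for record in records:
        errors = validate(record)
        if errors:
            exceptions.append((record, errors))  # set aside to be addressed later
        else:
            warehouse.append(record)
    return warehouse, exceptions


batch = [
    {"order_id": 1, "sale_date": date(2024, 1, 5), "amount": 19.99},
    {"order_id": "2", "sale_date": date(2024, 1, 5), "amount": 5.0},  # wrong type
    {"order_id": 3, "amount": 7.5},  # missing sale_date
]
warehouse, exceptions = load(batch)
```

Only the first record reaches `warehouse`; the other two land in `exceptions` together with the reason each was rejected.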

Because of this rigidity and the ways in which they work, data warehouses support partial or incremental ETL. In other words (and depending on the severity of the issue), an organization can load or reload portions of its data warehouse when something goes wrong.
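As a rough illustration of a partial reload, the sketch below uses Python's built-in `sqlite3` module as a stand-in for a warehouse. The `sales` table, its columns, and the `reload_partition` helper are assumptions made for this example; the point is only that one day's rows can be replaced without touching the rest.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales ("
    "load_date TEXT NOT NULL, order_id INTEGER NOT NULL, amount_cents INTEGER NOT NULL)"
)


def reload_partition(conn, load_date, rows):
    """Replace only the rows for one load_date, leaving other dates untouched."""
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute("DELETE FROM sales WHERE load_date = ?", (load_date,))
        conn.executemany(
            "INSERT INTO sales (load_date, order_id, amount_cents) VALUES (?, ?, ?)",
            [(load_date, order_id, cents) for order_id, cents in rows],
        )


# Two nightly loads.
reload_partition(conn, "2024-01-05", [(1, 1999), (2, 500)])
reload_partition(conn, "2024-01-06", [(3, 750)])

# A bad amount in the Jan 5 load is discovered later: reload just that day.
reload_partition(conn, "2024-01-05", [(1, 1999), (2, 1500)])

totals = dict(
    conn.execute("SELECT load_date, SUM(amount_cents) FROM sales GROUP BY load_date")
)
```

After the correction, `totals` reflects the fixed figures for January 5 while the January 6 rows are exactly as first loaded.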

Organizations typically populate data warehouses periodically. Generally speaking, data refreshes via regular cycles – say every morning at 3 a.m. when employees aren’t likely to be accessing the data and downstream systems. Employees arrive at work the next day with freshly squeezed data.

To be sure, the data stored in traditional data warehouses remains valuable today. Still, organizations and their leaders need to begin rethinking contemporary data integration. Consider the Internet of Things (IoT) and the analytics it makes possible. Sensors on vehicles, farm equipment, wearables, thermostats and even crops result in massive amounts of data that stream continuously. It’s a good bet that even an industrial-strength data warehouse will struggle with these new streams of data.

The rise of the data lake

Against this backdrop, we’ve seen the rise in popularity of the data lake. Make no mistake: It’s not a synonym for data warehouses or data marts. Yes, all these entities store data, but the data lake is fundamentally different in the following regard. As David Loshin writes, “The idea of the data lake is to provide a resting place for raw data in its native format until it’s needed.” Data lies dormant unless and until someone or something needs it.

When accessing data lakes, users determine:

The specific data types and sources they need.
How much they need.
When they need it.
The types of analytics that they need to derive.

Are all of these possible in a data warehouse? Probably not. And even if they were possible, achieving them in a period of time that business users would find acceptable is unlikely – especially in today’s rapidly changing environments. Beyond that, one particular schema almost certainly will not fit every business need. To wit, the data may ultimately arrive in a way that renders it virtually useless for the employee’s evolving purposes.

A different kind of schema

For this very reason, a data lake schema is defined “on read.” Put differently, a data lake still requires a schema. However, that schema is not predefined. It’s ad hoc. A schema is applied to the data as users pull it out of a stored location – not as it goes in. Data lakes keep data in its unaltered (natural) state; they don’t define requirements unless and until users query the data.
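One way to picture "on read" is the sketch below: raw JSON lines sit untouched, and each consumer supplies its own schema at query time. The device names, field names, and the `read_with_schema` helper are invented for illustration and are not taken from any particular data lake product.

```python
import json

# Raw events stored exactly as they arrived, one JSON document per line.
# The field names are made up for this example.
RAW_LINES = [
    '{"device": "thermostat-7", "temp_c": "21.5", "ts": "2024-01-05T03:00:00"}',
    '{"device": "tractor-2", "fuel_pct": 64, "ts": "2024-01-05T03:00:02"}',
    '{"device": "thermostat-7", "temp_c": "22.0", "ts": "2024-01-05T03:05:00", "fw": "1.2"}',
]


def read_with_schema(lines, schema):
    """Project raw records onto a schema chosen by the reader.

    `schema` maps field name -> converter. Records lacking any requested
    field are skipped; fields outside the schema are ignored.
    """
    for line in lines:
        record = json.loads(line)
        if all(name in record for name in schema):
            yield {name: convert(record[name]) for name, convert in schema.items()}


# One consumer only cares about temperature readings...
temperatures = list(read_with_schema(RAW_LINES, {"device": str, "temp_c": float}))

# ...while another applies a different schema to the same raw data.
fuel = list(read_with_schema(RAW_LINES, {"device": str, "fuel_pct": int}))
```

Nothing about the stored lines changes between the two reads; only the schema passed in by the reader differs.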

When used correctly, data lakes offer business and technical users the ability to query smaller, more relevant and more flexible data sets. As a result, query times can drop to a fraction of what they would have been in a data mart, data warehouse or relational database.

The increased flexibility of the data lake

The data lake emphasizes the flexibility and availability of data. As such, it can provide users and downstream applications with schema-free data; that is, data that resembles its “natural” or raw format regardless of origin.

While the jury is still out, many if not most data lake applications do not support partial or incremental loading. (In this way, the data lake differs from the data warehouse.) With those applications, an organization cannot load or reload only portions of its data into a data lake. It tends to be all or nothing.

A data lake analogy

If you’re still struggling with the notion of a data lake, then maybe the following analogy will clarify matters. Think of a data mart or data warehouse as a storage facility rife with cases of bottled water. Those cases didn’t just magically appear overnight. People and machines gathered and purified the water. After packaging it, only then was it ready for people to buy and drink.

By comparison, think of a data lake as a large body of natural water that you would only drink if you were dying of thirst. If you need 50 gallons of water to put out a fire, you don’t need to buy cases of bottled water and empty them out one by one. It’s all there, ready to go.

In keeping with this analogy, the “water” in a data lake flows from many places: rivers, tributaries and waterfalls. In other words, the data lake doesn’t hold only one type of water (that is, data). Data lakes can house all types of data: structured, semistructured and unstructured. Note, however, that filling a data lake with structured data means that it will lose at least some of its structure and – you guessed it – some of its value. To this end, if you’re only interested in structured data, a data warehouse may still be your best bet.

Two schools of thought on data lakes

Because we’re still in the early stages, today’s opinion on data lakes is anything but universal. At a high level, there are two schools of thought. One group views the data lake as not only important, but also imperative for data-driven companies. This group understands the limitations of contemporary data warehouses – principally that they were not built to handle vast streams of unstructured data. What’s more, the difference between “on write” and “on read” isn’t simply a matter of semantics. On the contrary, the latter lends itself to vastly faster response times and, by extension, analytics.

That’s one viewpoint and I happen to agree with it. To be fair, we have not reached industrywide consensus here – far from it. Skeptics of data lakes aren’t shy about their opinions. The cynics view the data lake as a buzzword or the hype of software vendors with a serious stake in the game. Moreover, some consider the data lake a new name for an old concept with limited applicability for their enterprises.

Adding to the legitimate confusion around the topic, few folks use the term “data lake” in a consistent manner. Some folks call any data preparation, storage or discovery environment a data lake.

Parallels with Hadoop and relational databases

When conceptualizing the need for data lakes, perhaps it’s best to think of Hadoop – the open-source framework for distributed storage and processing that many organizations adopted over the years. Hadoop grew for many reasons, not the least of which is that it fulfilled a genuine need that relational database management systems (RDBMSs) could not address. To be fair, its open-source nature, fault tolerance and parallel processing place high on the list as well.

RDBMSs simply weren’t designed to handle terabytes or petabytes of unstructured data. Try loading thousands of photos, videos, tweets, articles and emails into your traditional SQL Server or Oracle database and running reports or writing SQL statements. Good luck with that.

For decades, data warehouses have handled even large volumes of structured data exceptionally well: lists of employees, sales, transactions and the like. They feed countless business intelligence and enterprise reporting applications. It’s unreasonable, however, to expect those same data warehouses to efficiently process fundamentally different data volumes, speeds and types.

A note on metadata

Data lakes rely upon ontologies and metadata to make sense of the data loaded into them. Again, methodologies vary. But generally speaking, each data element in a lake is assigned a unique identifier and tagged with extensive metadata. Conclusion: The data lake is here to stay.
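The identifier-plus-tags idea can be sketched as a tiny in-memory catalog. Everything here (the `CatalogEntry` and `Catalog` classes, the paths, and the tag names) is hypothetical; as noted, methodologies vary, and real lakes use dedicated catalog services rather than a Python dictionary.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Metadata describing one raw object in the lake (hypothetical layout)."""
    path: str
    tags: dict
    object_id: str = field(default_factory=lambda: str(uuid.uuid4()))


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, path, **tags):
        """Assign a unique identifier to a raw object and record its tags."""
        entry = CatalogEntry(path=path, tags=tags)
        self._entries[entry.object_id] = entry
        return entry.object_id

    def find(self, **tags):
        """Return entries whose tags include every requested key/value pair."""
        return [
            entry for entry in self._entries.values()
            if all(entry.tags.get(key) == value for key, value in tags.items())
        ]


catalog = Catalog()
sensor_id = catalog.register("raw/iot/2024-01-05.jsonl", source="iot", format="json")
photo_id = catalog.register("raw/media/field-01.jpg", source="drone", format="jpeg")
report_id = catalog.register("raw/iot/2024-01-06.jsonl", source="iot", format="json")

iot_paths = sorted(entry.path for entry in catalog.find(source="iot"))
```

The raw files themselves are never inspected here; the tags alone are enough to narrow a lake down to the objects a user actually needs.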

The bright future of the data lake

There’s little doubt in my mind that the data lake will occupy an increasingly key place in the future of data management. Organizations will continue to integrate “small” data with its big counterpart, and foolish is the soul who believes that one application – no matter how expensive or robust – can handle everything.

When a business question arises, users will increasingly need answers faster than traditional data storage and reporting stalwarts can provide. When used properly, data lakes allow users to analyze smaller data sets and quickly answer critical questions.

About the author

Phil Simon is a keynote speaker and recognized technology expert. He is the award-winning author of eight management books, most recently Analytics: The Agile Way. He advises organizations on matters related to strategy, data, analytics, and technology. His contributions have been featured by Harvard Business Review, CNN, Wired, The New York Times, and many other outlets. In the fall of 2016, he joined the faculty at Arizona State University's W. P. Carey School of Business (Department of Information Systems).

©2025 SAS Institute Inc. All rights reserved.

Kontakt
Udostępnij
Subskrybuj

Udostępnij
Udostępnij tę stronę innym.

Back to Top

About cookies on this site

This site uses cookies and related technologies for site operation, analytics and third-party advertising purposes, as described in our SAS Privacy Statement. You may consent to our use of these technologies, reject non-essential technologies or further manage your preferences. To opt out of SAS making information relating to cookies and similar technologies available to third parties for advertising purposes, select "Required only." To exercise other rights you may have related to cookies, select "Manage cookies."

SAS Privacy Statement | Powered by:

| Truste