Cover: Data Lakes For Dummies by Alan Simon

Logo: Wiley

Data Lakes For Dummies®

To view this book's Cheat Sheet, simply go to www.dummies.com and search for “Data Lakes For Dummies Cheat Sheet” in the Search box.

Introduction

In December 1995, I wrote an article for Database Programming & Design magazine entitled “I Want a Data Warehouse, So What Is It Again?” A few months later, I began writing Data Warehousing For Dummies (Wiley), building on the article’s content to help readers make sense of first-generation data warehousing.

Fast-forward a quarter of a century, and I could very easily write an article entitled “I Want a Data Lake, So What Is It Again?” This time, I’m cutting right to the chase with Data Lakes For Dummies. To quote a famous former baseball player named Yogi Berra, it’s déjà vu all over again!

Nearly every large and upper-midsize company and governmental agency is building a data lake or at least has an initiative on the drawing board. That’s the good news.

The not-so-good news, though, is that you’ll find a disturbing lack of agreement about data lake architecture, best practices for data lake development, data lake internal data flows, even what a data lake actually is! In fact, many first-generation data lakes have fallen short of original expectations and need to be rearchitected and rebuilt.

As with data warehousing in the mid-’90s, the data lake concept today is still a relatively new one. Consequently, almost everything about data lakes — from their very definition to alternatives for integration with or migration from existing data warehouses — is still very much a moving target. Software product vendors, cloud service providers, consulting firms, industry analysts, and academics often have varying — and sometimes conflicting — perspectives on data lakes. So, how do you navigate your way across a data lake when the waters are especially choppy and you’re being tossed from side to side?

That’s where Data Lakes For Dummies comes in.

About This Book

Data Lakes For Dummies helps you make sense of the ABCs — acronym anarchy, buzzword bingo, and consulting confusion — of today’s and tomorrow’s data lakes.

This book is not only a tutorial about data lakes; it also serves as a reference that you may find yourself consulting on a regular basis. So, you don’t need to memorize large blocks of content (there’s no final exam!) because you can always go back to take a second or third or fourth look at any particular point during your own data lake efforts.

Right from the start, you’ll find out what your organization should expect from all the time, effort, and money you’ll put into your data lake initiative, and you’ll see what challenges are lurking. You’ll dig deep into data lake architecture and leading cloud platforms and get your arms around the big picture of how all the pieces fit together.

One of the disadvantages of being an early adopter of any new technology is that you sometimes make mistakes or at least have a few false starts. Plenty of early data lake efforts have turned into more of a data dump, with tons of data that just isn’t very accessible or well organized. If you find yourself in this situation, fear not: You’ll see how to turn that data dump into the data lake you originally envisioned.

I don’t use many special conventions in this book, but you should be aware that sidebars (the gray boxes you see throughout the book) and anything marked with the Technical Stuff icon are all skippable. So, if you’re short on time, you can pass over these pieces without losing anything essential. On the other hand, if you have the time, you’re sure to find fascinating information here!

Within this book, you may note that some web addresses break across two lines of text. If you’re reading this book in print and want to visit one of these web pages, simply key in the web address exactly as it’s noted in the text, pretending as though the line break doesn’t exist. If you’re reading this as an e-book, you’ve got it easy — just click the web address to be taken directly to the web page.

Foolish Assumptions

The most relevant assumption I’ve made is that if you’re reading this book, you either are or will soon be working on a data lake initiative.

Maybe you’re a data strategist and architect, and what’s most important to you is sifting through mountains of sometimes conflicting — and often incomplete — information about data lakes. Your organization already makes use of earlier-generation data warehouses and data marts, and now it’s time to take that all-important next step to a data lake. If that’s the case, you’re definitely in the right place.

If you’re a developer or data architect who is working on a small subset of the overall data lake, your primary focus is how a particular software package or service works. Still, you’re curious about where your daily work fits into your organization’s overall data lake efforts. That’s where this book comes in: to provide context and that “aha!” factor to the big picture that surrounds your day-to-day tasks.

Or maybe you’re on the business and operational side of a company or governmental agency, working side by side with the technology team as they build an enterprise-scale data environment that will finally support the entire spectrum of your organization’s analytical needs. You don’t necessarily need to know too much about the techie side of data lakes, but you absolutely care about building an environment that meets today’s and tomorrow’s needs for data-driven insights.

The common thread is that data lakes are part of your organization’s present and future, and you’re seeking an unvarnished, hype-free, grounded-in-reality view of data lakes today and where they’re headed.

In any event, you don’t need to be a technical whiz with databases, programming languages such as Python, or specific cloud platforms such as Amazon Web Services (AWS) or Microsoft Azure. I cover many different technical topics in this book, but you’ll find clear explanations and diagrams that don’t presume any prerequisite knowledge on your part.

Icons Used in This Book

As you read this book, you encounter icons in the margins that indicate material of particular interest. Here’s what the icons mean:

Tip: These are the tricks of the data lake trade. You can save yourself a great deal of time and avoid more than a few false starts by following specific tips collected from the best practices (and learned from painful experiences) of those who preceded you on the path to the data lake.

Warning: Data lakes are often filled with dangerous icebergs. (Okay, bad analogy, but you hopefully get the idea.) When you’re working on your organization’s data lake efforts, pay particular attention to situations that are called out with this icon.

Technical Stuff: If you’re more interested in the conceptual and architectural aspects of data lakes than the nitty-gritty implementation details, you can skim or even skip material that is accompanied by this icon.

Remember: Some points are so critically important that you’ll be well served by committing them to memory. You’ll even see some of these points repeated later in the book because they tie in with other material. This icon calls out this crucial content.

Beyond the Book

In addition to the material in the print or e-book you’re reading right now, this product comes with a free Cheat Sheet for the three types of data for your data lake, four zones inside your data lake, five phases to building your data lake, and more. To access the Cheat Sheet, go to www.dummies.com and type Data Lakes For Dummies Cheat Sheet in the Search box.

Where to Go from Here

Now it’s time to head off to the lake — the data lake, that is! If you’re totally new to the subject, you don’t want to skip the chapters in Part 1 because they’ll provide the foundation for the rest of the book. If you already have some exposure to data lakes, I still recommend that you at least skim Part 1 to get a sense of how to get beyond all the hype, buzzwords, and generalities related to data lakes.

You can then read the book sequentially from front to back or jump around as needed. Whatever path works best for you is the one you should take.

Part 1

Getting Started with Data Lakes

IN THIS PART …

  • Separate the data lake reality from the hype.
  • Steer your data lake efforts in the right direction.
  • Diagnose and avoid common pitfalls that can dry up your data lake.