TY  - GEN
AU  - Serra J.
TI  - Deciphering Data Architectures: Choosing Between A Modern Data Warehouse, Data Fabric, Data Lakehouse And Data Mesh
SN  - 9789355425928
PY  - 2024///
PB  - SPD
N2  - Table of Contents
Foreword
Preface
Part I. Foundation
1. Big Data: What Is Big Data, and How Can It Help You?; Data Maturity; Stage 1: Reactive; Stage 2: Informative; Stage 3: Predictive; Stage 4: Transformative; Self-Service Business Intelligence; Summary
2. Types of Data Architectures: Evolution of Data Architectures; Relational Data Warehouse; Data Lake; Modern Data Warehouse; Data Fabric; Data Lakehouse; Data Mesh; Summary
3. The Architecture Design Session: What Is an ADS?; Why Hold an ADS?; Before the ADS; Preparing; Inviting Participants; Conducting the ADS; Introductions; Discovery; Whiteboarding; After the ADS; Tips for Conducting an ADS; Summary
Part II. Common Data Architecture Concepts
4. The Relational Data Warehouse: What Is a Relational Data Warehouse?; What a Data Warehouse Is Not; The Top-Down Approach; Why Use a Relational Data Warehouse?; Drawbacks to Using a Relational Data Warehouse; Populating a Data Warehouse; How Often to Extract the Data; Extraction Methods; How to Determine What Data Has Changed Since the Last Extraction; The Death of the Relational Data Warehouse Has Been Greatly Exaggerated; Summary
5. Data Lake: What Is a Data Lake?; Why Use a Data Lake?; Bottom-Up Approach; Best Practices for Data Lake Design; Multiple Data Lakes; Advantages; Disadvantages; Summary
6. Data Storage Solutions and Processes: Data Storage Solutions; Data Marts; Operational Data Stores; Data Hubs; Data Processes; Master Data Management; Data Virtualization and Data Federation; Data Catalogs; Data Marketplaces; Summary
7. Approaches to Design: Online Transaction Processing Versus Online Analytical Processing; Operational and Analytical Data; Symmetric Multiprocessing and Massively Parallel Processing; Lambda Architecture; Kappa Architecture; Polyglot Persistence and Polyglot Data Stores; Summary
8. Approaches to Data Modeling: Relational Modeling; Keys; Entity–Relationship Diagrams; Normalization Rules and Forms; Tracking Changes; Dimensional Modeling; Facts, Dimensions, and Keys; Tracking Changes; Denormalization; Common Data Model; Data Vault; The Kimball and Inmon Data Warehousing Methodologies; Inmon’s Top-Down Methodology; Kimball’s Bottom-Up Methodology; Choosing a Methodology; Hybrid Models; Methodology Myths; Summary
9. Approaches to Data Ingestion: ETL Versus ELT; Reverse ETL; Batch Processing Versus Real-Time Processing; Batch Processing Pros and Cons; Real-Time Processing Pros and Cons; Data Governance; Summary
Part III. Data Architectures
10. The Modern Data Warehouse: The MDW Architecture; Pros and Cons of the MDW Architecture; Combining the RDW and Data Lake; Data Lake; Relational Data Warehouse; Stepping Stones to the MDW; EDW Augmentation; Temporary Data Lake Plus EDW; All-in-One; Case Study: Wilson & Gunkerk’s Strategic Shift to an MDW; Challenge; Solution; Outcome; Summary
11. Data Fabric: The Data Fabric Architecture; Data Access Policies; Metadata Catalog; Master Data Management; Data Virtualization; Real-Time Processing; APIs; Services; Products; Why Transition from an MDW to a Data Fabric Architecture?; Potential Drawbacks; Summary
12. Data Lakehouse: Delta Lake Features; Performance Improvements; The Data Lakehouse Architecture; What If You Skip the Relational Data Warehouse?; Relational Serving Layer; Summary
13. Data Mesh Foundation: A Decentralized Data Architecture; Data Mesh Hype; Dehghani’s Four Principles of Data Mesh; Principle #1: Domain Ownership; Principle #2: Data as a Product; Principle #3: Self-Serve Data Infrastructure as a Platform; Principle #4: Federated Computational Governance; The “Pure” Data Mesh; Data Domains; Data Mesh Logical Architecture; Different Topologies; Data Mesh Versus Data Fabric; Use Cases; Summary
14. Should You Adopt Data Mesh? Myths, Concerns, and the Future: Myths; Myth: Using Data Mesh Is a Silver Bullet That Solves All Data Challenges Quickly; Myth: A Data Mesh Will Replace Your Data Lake and Data Warehouse; Myth: Data Warehouse Projects Are All Failing, and a Data Mesh Will Solve That Problem; Myth: Building a Data Mesh Means Decentralizing Absolutely Everything; Myth: You Can Use Data Virtualization to Create a Data Mesh; Concerns; Philosophical and Conceptual Matters; Combining Data in a Decentralized Environment; Other Issues of Decentralization; Complexity; Duplication; Feasibility; People; Domain-Level Barriers; Organizational Assessment: Should You Adopt a Data Mesh?; Recommendations for Implementing a Successful Data Mesh; The Future of Data Mesh; Zooming Out: Understanding Data Architectures and Their Applications; Summary
Part IV. People, Processes, and Technology
15. People and Processes: Team Organization: Roles and Responsibilities; Roles for MDW, Data Fabric, or Data Lakehouse; Roles for Data Mesh; Why Projects Fail: Pitfalls and Prevention; Pitfall: Allowing Executives to Think That BI Is “Easy”; Pitfall: Using the Wrong Technologies; Pitfall: Gathering Too Many Business Requirements; Pitfall: Gathering Too Few Business Requirements; Pitfall: Presenting Reports Without Validating Their Contents First; Pitfall: Hiring an Inexperienced Consulting Company; Pitfall: Hiring a Consulting Company That Outsources Development to Offshore Workers; Pitfall: Passing Project Ownership Off to Consultants; Pitfall: Neglecting the Need to Transfer Knowledge Back into the Organization; Pitfall: Slashing the Budget Midway Through the Project; Pitfall: Starting with an End Date and Working Backward; Pitfall: Structuring the Data Warehouse to Reflect the Source Data Rather Than the Business’s Needs; Pitfall: Presenting End Users with a Solution with Slow Response Times or Other Performance Issues; Pitfall: Overdesigning (or Underdesigning) Your Data Architecture; Pitfall: Poor Communication Between IT and the Business Domains; Tips for Success; Don’t Skimp on Your Investment; Involve Users, Show Them Results, and Get Them Excited; Add Value to New Reports and Dashboards; Ask End Users to Build a Prototype; Find a Project Champion/Sponsor; Make a Project Plan That Aims for 80% Efficiency; Summary
16. Technologies: Choosing a Platform; Open Source Solutions; On-Premises Solutions; Cloud Provider Solutions; Cloud Service Models; Major Cloud Providers; Multi-Cloud Solutions; Software Frameworks; Hadoop; Databricks; Snowflake; Summary
Index
ER  - 