TY - GEN
AU - Tranquillin M.
AU - Lakshmanan V.
TI - Architecting Data And Machine Learning Platforms: Enable Analytics And AI-Driven Innovation In The Cloud
SN - 9789355428158
PY - 2023///
PB - SPD
N2 - Table of Contents Preface xi 1. Modernizing Your Data Platform: An Introductory Overview 1 The Data Lifecycle 2 The Journey to Wisdom 2 Water Pipes Analogy 3 Collect 4 Store 5 Process/Transform 7 Analyze/Visualize 8 Activate 9 Limitations of Traditional Approaches 10 Antipattern: Breaking Down Silos Through ETL 10 Antipattern: Centralization of Control 13 Antipattern: Data Marts and Hadoop 15 Creating a Unified Analytics Platform 16 Cloud Instead of On-Premises 17 Drawbacks of Data Marts and Data Lakes 18 Convergence of DWHs and Data Lakes 19 Hybrid Cloud 23 Reasons Why Hybrid Is Necessary 24 Challenges of Hybrid Cloud 25 Why Hybrid Can Work 26 Edge Computing 27 Applying AI 29 Machine Learning 29 Uses of ML 30 Why Cloud for AI? 31 Cloud Infrastructure 31 Democratization 32 Real Time 34 MLOps 35 Core Principles 36 Summary 38 2. Strategic Steps to Innovate with Data 41 Step 1: Strategy and Planning 42 Strategic Goals 43 Identify Stakeholders 45 Change Management 45 Step 2: Reduce Total Cost of Ownership by Adopting a Cloud Approach 47 Why Cloud Costs Less 47 How Much Are the Savings? 49 When Does Cloud Help?
50 Step 3: Break Down Silos 50 Unifying Data Access 51 Choosing Storage 52 Semantic Layer 53 Step 4: Make Decisions in Context Faster 55 Batch to Stream 55 Contextual Information 56 Cost Management 56 Step 5: Leapfrog with Packaged AI Solutions 57 Predictive Analytics 58 Understanding and Generating Unstructured Data 59 Personalization 60 Packaged Solutions 60 Step 6: Operationalize AI-Driven Workflows 61 Identifying the Right Balance of Automation and Assistance 61 Building a Data Culture 62 Populating Your Data Science Team 62 Step 7: Product Management for Data 64 Applying Product Management Principles to Data 64 1. Understand and Maintain a Map of Data Flows in the Enterprise 65 2. Identify Key Metrics 65 3. Agreed Criteria, Committed Roadmap, and Visionary Backlog 66 4. Build for the Customers You Have 67 5. Don’t Shift the Burden of Change Management 67 6. Interview Customers to Discover Their Data Needs 68 7. Whiteboard and Prototype Extensively 68 8. Build Only What Will Be Used Immediately 69 9. Standardize Common Entities and KPIs 69 10. Provide Self-Service Capabilities in Your Data Platform 70 Summary 70 3. Designing Your Data Team 73 Classifying Data Processing Organizations 73 Data Analysis–Driven Organization 76 The Vision 77 The Personas 78 The Technological Framework 80 Data Engineering–Driven Organization 82 The Vision 82 The Personas 84 The Technological Framework 86 Data Science–Driven Organization 89 The Vision 89 The Personas 91 The Technological Framework 92 Summary 94 4. A Migration Framework
95 Modernize Data Workflows 95 Holistic View 95 Modernize Workflows 96 Transform the Workflow Itself 98 A Four-Step Migration Framework 98 Prepare and Discover 99 Assess and Plan 100 Execute 103 Optimize 104 Estimating the Overall Cost of the Solution 105 Audit of the Existing Infrastructure 105 Request for Information/Proposal and Quotation 106 Proof of Concept/Minimum Viable Product 107 Setting Up Security and Data Governance 108 Framework 108 Artifacts 110 Governance over the Life of the Data 111 Schema, Pipeline, and Data Migration 113 Schema Migration 113 Pipeline Migration 113 Data Migration 116 Migration Stages 121 Summary 122 5. Architecting a Data Lake 125 Data Lake and the Cloud—A Perfect Marriage 125 Challenges with On-Premises Data Lakes 125 Benefits of Cloud Data Lakes 126 Design and Implementation 127 Batch and Stream 127 Data Catalog 129 Hadoop Landscape 130 Cloud Data Lake Reference Architecture 131 Integrating the Data Lake: The Real Superpower 136 APIs to Extend the Lake 136 The Evolution of Data Lake with Apache Iceberg, Apache Hudi, and Delta Lake 136 Interactive Analytics with Notebooks 138 Democratizing Data Processing and Reporting 140 Build Trust in the Data 141 Data Ingestion Is Still an IT Matter 143 ML in the Data Lake 145 Training on Raw Data 145 Predicting in the Data Lake 146 Summary 146 6. Innovating with an Enterprise Data Warehouse 149 A Modern Data Platform 149 Organizational Goals 150 Technological Challenges 151 Technology Trends and Tools 152 Hub-and-Spoke Architecture 154 Data Ingest 157 Business Intelligence 161 Transformations 164 Organizational Structure 169 DWH to Enable Data Scientists 171 Query Interface 171 Storage API 172 ML Without Moving Your Data 173 Summary 177 7. Converging to a Lakehouse
179 The Need for a Unique Architecture 179 User Personas 179 Antipattern: Disconnected Systems 180 Antipattern: Duplicated Data 180 Converged Architecture 182 Two Forms 183 Lakehouse on Cloud Storage 184 SQL-First Lakehouse 189 The Benefits of Convergence 193 Summary 195 8. Architectures for Streaming 197 The Value of Streaming 197 Industry Use Cases 198 Streaming Use Cases 199 Streaming Ingest 200 Streaming ETL 200 Streaming ELT 202 Streaming Insert 203 Streaming from Edge Devices (IoT) 204 Streaming Sinks 205 Real-Time Dashboards 205 Live Querying 206 Materialize Some Views 206 Stream Analytics 207 Time-Series Analytics 207 Clickstream Analytics 208 Anomaly Detection 210 Resilient Streaming 211 Continuous Intelligence Through ML 212 Training Model on Streaming Data 212 Streaming ML Inference 215 Automated Actions 215 Summary 216 9. Extending a Data Platform Using Hybrid and Edge 219 Why Multicloud? 219 A Single Cloud Is Simpler and Cost-Effective 220 Multicloud Is Inevitable 220 Multicloud Could Be Strategic 221 Multicloud Architectural Patterns 223 Single Pane of Glass 223 Write Once, Run Anywhere 224 Bursting from On Premises to Cloud 225 Pass-Through from On Premises to Cloud 226 Data Integration Through Streaming 227 Adopting Multicloud 229 Framework 229 Time Scale 231 Define a Target Multicloud Architecture 231 Why Edge Computing? 233 Bandwidth, Latency, and Patchy Connectivity 233 Use Cases 235 Benefits 236 Challenges 237 Edge Computing Architectural Patterns 237 Smart Devices 238 Smart Gateways 238 ML Activation 239 Adopting Edge Computing 241 The Initial Context 241 The Project 241 The Final Outcomes and Next Steps 244 Summary 245 10. AI Application Architecture
247 Is This an AI/ML Problem? 248 Subfields of AI 248 Generative AI 249 Problems Fit for ML 253 Buy, Adapt, or Build? 254 Data Considerations 254 When to Buy 255 What Can You Buy? 256 How Adapting Works 258 AI Architectures 260 Understanding Unstructured Data 261 Generating Unstructured Data 263 Predicting Outcomes 265 Forecasting Values 266 Anomaly Detection 268 Personalization 269 Automation 271 Responsible AI 272 AI Principles 273 ML Fairness 274 Explainability 275 Summary 276 11. Architecting an ML Platform 279 ML Activities 279 Developing ML Models 280 Labeling Environment 281 Development Environment 281 User Environment 282 Preparing Data 283 Training ML Models 284 Deploying ML Models 286 Deploying to an Endpoint 287 Evaluate Model 288 Hybrid and Multicloud 288 Training-Serving Skew 288 Automation 293 Automate Training and Deployment 293 Orchestration with Pipelines 294 Continuous Evaluation and Training 296 Choosing the ML Framework 298 Team Skills 298 Task Considerations 299 User-Centric 299 Summary 300 12. Data Platform Modernization: A Model Case 303 New Technology for a New Era 303 The Need for Change 304 It Is Not Only a Matter of Technology 305 The Beginning of the Journey 307 The Current Environment 307 The Target Environment 309 The PoC Use Case 311 The RFP Responses Proposed by Cloud Vendors 312 The Target Environment 312 The Approach on Migration 316 The RFP Evaluation Process 323 The Scope of the PoC 323 The Execution of the PoC 324 The Final Decision 325 Peroration 326 Summary 326 Index 329
ER -