000 11530 a2200157 4500
005 20241018125254.0
008 241018b |||||||| |||| 00| 0 eng d
020 _a9789355421920
041 _aEnglish
100 _aSarkis A.
_9208597
245 _aTraining Data For Machine Learning
_bHuman Supervision From Annotation To Data Science
260 _bSPD
_c2023
300 _a306
520 _aTable of Contents Preface xv 1. Training Data Introduction 1 Training Data Intents 2 What Can You Do With Training Data? 3 What Is Training Data Most Concerned With? 4 Training Data Opportunities 11 Business Transformation 11 Training Data Efficiency 12 Tooling Proficiency 13 Process Improvement Opportunities 13 Why Training Data Matters 13 ML Applications Are Becoming Mainstream 14 The Foundation of Successful AI 15 Training Data Is Here to Stay 16 Training Data Controls the ML Program 16 New Types of Users 17 Training Data in the Wild 18 What Makes Training Data Difficult? 18 The Art of Supervising Machines 20 A New Thing for Data Science 20 ML Program Ecosystem 21 Data-Centric Machine Learning 22 Failures 23 History of Development Affects Training Data Too 24 What Training Data Is Not 25 Generative AI 25 Human Alignment Is Human Supervision 27 Summary 28 2. Getting Up and Running
31 Introduction 31 Getting Up and Running 32 Installation 33 Tasks Setup 34 Annotator Setup 35 Data Setup 35 Workflow Setup 35 Data Catalog Setup 36 Initial Usage 36 Optimization 36 Tools Overview 37 Training Data for Machine Learning 38 Growing Selection of Tools 38 People, Process, and Data 38 Embedded Supervision 39 Human Computer Supervision 39 Separation of End Concerns 40 Standards 40 Many Personas 40 A Paradigm to Deliver Machine Learning Software 41 Trade-Offs 41 Costs 41 Installed Versus Software as a Service 42 Development System 43 Scale 44 Installation Options 48 Annotation Interfaces 50 Modeling Integration 50 Multi-User versus Single-User Systems 50 Integrations 51 Scope 51 Hidden Assumptions 56 Security 57 Open Source and Closed Source 60 History 63 Open Source Standards 63 Realizing the Need for Dedicated Tooling 63 Summary 66 3. Schema 67 Schema Deep Dive Introduction 67 Labels and Attributes—What Is It? 68 What Do We Care About? 68 Introduction to Labels 68 Attributes Introduction 69 Attribute Complexity Exceeds Spatial Complexity 73 Technical Overview 76 Spatial Representation—Where Is It? 78 Using Spatial Types to Prevent Social Bias 78 Trade-Offs with Types 82 Computer Vision Spatial Type Examples 83 Relationships, Sequences, Time Series: When Is It? 87 Sequences and Relationships 87 When 87 Guides and Instructions 88 Judgment Calls 89 Relation of Machine Learning Tasks to Training Data 89 Semantic Segmentation 90 Image Classification (Tags) 92 Object Detection 92 Pose Estimation 92 Relationship of Tasks to Training Data Types 93 General Concepts 93 Instance Concept Refresher 93 Upgrading Data Over Time 94 The Boundary Between Modeling and Training Data 95 Raw Data Concepts 96 Summary 97 4. Data Engineering
99 Introduction 99 Who Wants the Data? 100 A Game of Telephone 101 Planning a Great System 103 Naive and Training Data–Centric Approaches 104 Raw Data Storage 109 By Reference or by Value 110 Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware 111 Data Storage: Where Does the Data Rest? 111 External Reference Connection 112 Raw Media (BLOB)–Type Specific 112 Formatting and Mapping 114 User-Defined Types (Compound Files) 114 Defining DataMaps 114 Ingest Wizards 114 Organizing Data and Useful Storage 115 Remote Storage 116 Versioning 116 Data Access 118 Disambiguating Storage, Ingestion, Export, and Access 119 File-Based Exports 119 Streaming Data 119 Queries Introduction 120 Integrations with the Ecosystem 121 Security 121 Access Control 121 Identity and Authorization 121 Example of Setting Permissions 122 Signed URLs 122 Personally Identifiable Information 124 Pre-Labeling 124 Updating Data 125 Summary 127 5. Workflow 129 Introduction 129 Glue Between Tech and People 130 Why Are Human Tasks Needed? 132 Partnering with Non-Software Users in New Ways 132 Getting Started with Human Tasks 132 Basics 133 Schemas’ Staying Power 134 User Roles 135 Training 135 Gold Standard Training 136 Task Assignment Concepts 136 Do You Need to Customize the Interface? 137 How Long Will the Average Annotator Be Using It?
137 Tasks and Project Structure 137 Quality Assurance 138 Annotator Trust 139 Annotators Are Partners 139 Common Causes of Training Data Errors 141 Task Review Loops 141 Analytics 143 Annotation Metrics Examples 143 Data Exploration 144 Models 146 Using the Model to Debug the Humans 146 Distinctions Between a Dataset, Model, and Model Run 147 Getting Data to Models 148 Dataflow 148 Overview of Streaming 149 Data Organization 149 Pipelines and Processes 150 Direct Annotation 153 Business Process Integration 154 Attributes 154 Depth of Labeling 154 Supervising Existing Data 155 Interactive Automations 155 Example: Semantic Segmentation Auto Bordering 156 Video 157 Summary 162 6. Theories, Concepts, and Maintenance 165 Introduction 165 Theories 166 A System Is Only as Useful as Its Schema 166 Who Supervises the Data Matters 167 Intentionally Chosen Data Is Best 168 Working with Historical Data 169 Training Data Is Like Code 170 Surface Assumptions Around Usage of Your Training Data 171 Human Supervision Is Different from Classic Datasets 173 General Concepts 176 Data Relevancy 176 Need for Both Qualitative and Quantitative Evaluations 177 Iterations 178 Prioritization: What to Label 178 Transfer Learning’s Relation to Datasets (Fine-Tuning) 178 Per-Sample Judgment Calls 180 Ethical and Privacy Considerations 181 Bias 181 Bias Is Hard to Escape 183 Metadata 183 Preventing Lost Metadata 184 Train/Val/Test Is the Cherry on Top 185 Sample Creation 185 Simple Schema for a Strawberry Picking System 186 Geometric Representations 187 Binary Classification 188 Let’s Manually Create Our First Set 189 Upgraded Classification 192 Where Is the Traffic Light?
193 Maintenance 193 Actions 193 Net Lift 195 Levels of System Maturity of Training Data Operations 196 Applied Versus Research Sets 197 Training Data Management 198 Quality 199 Completed Tasks 199 Freshness 201 Maintaining Set Metadata 201 Task Management 201 Summary 202 7. AI Transformation and Use Cases 203 Introduction 203 AI Transformation 204 Seeing Your Day-to-Day Work as Annotation 205 The Creative Revolution of Data-centric AI 207 You Can Create New Data 207 You Can Change What Data You Collect 208 You Can Change the Meaning of the Data 209 You Can Create! 209 Think Step Function Improvement for Major Projects 209 Build Your AI Data to Secure Your AI Present and Future 210 Appoint a Leader: The Director of AI Data 210 New Expectations People Have for the Future of AI 211 Sometimes Proposals and Corrections, Sometimes Replacement 212 Upstream Producers and Downstream Consumers 212 Spectrum of Training Data Team Engagement 217 Dedicated Producers and Other Teams 218 Organizing Producers from Other Teams 218 Use Case Discovery 221 Rubric for Good Use Cases 222 Evaluating a Use Case Against the Rubric 225 Conceptual Effects of Use Cases 227 The New “Crowd Sourcing”: Your Own Experts 229 Key Levers on Training Data ROI 230 What the Annotated Data Represents 230 Trade-Offs of Controlling Your Own Training Data 230 The Need for Hardware 231 Common Project Mistakes 231 Modern Training Data Tools 232 Think Learning Curve, Not Perfection 232 New Training and Knowledge Are Required 233 How Companies Produce and Consume Data 234 Trap to Avoid: Premature Optimization in Training Data 234 No Silver Bullets 236 Culture of Training Data 236 New Engineering Principles 237 Summary 238 8. Automation
239 Introduction 239 Getting Started 240 Motivation: When to Use These Methods? 240 Check What Part of the Schema a Method Is Designed to Work On 241 What Do People Actually Use? 241 What Kind of Results Can I Expect? 242 Common Confusions 243 User Interface Optimizations 244 Risks 244 Trade-Offs 245 Nature of Automations 246 Setup Costs 246 How to Benchmark Well 246 How to Scope the Automation Relative to the Problem 247 Correction Time 248 Subject Matter Experts 248 Consider How the Automations Stack 249 Pre-Labeling 249 Standard Pre-Labeling 249 Pre-Labeling a Portion of the Data Only 252 Interactive Annotation Automation 254 Creating Your Own 255 Technical Setup Notes 255 What Is a Watcher? (Observer Pattern) 256 How to Use a Watcher 256 Interactive Capturing of a Region of Interest 257 Interactive Drawing Box to Polygon Using GrabCut 257 Full Image Model Prediction Example 258 Example: Person Detection for Different Attribute 258 Quality Assurance Automation 259 Using the Model to Debug the Humans 259 Automated Checklist Example 259 Domain-Specific Reasonableness Checks 260 Data Discovery: What to Label 260 Human Exploration 260 Raw Data Exploration 261 Metadata Exploration 261 Adding Pre-Labeling-Based Metadata 262 Augmentation 262 Better Models Are Better than Better Augmentation 263 To Augment or Not to Augment 263 Simulation and Synthetic Data 265 Simulations Still Need Human Review 265 Media Specific 267 What Methods Work with Which Media? 268 Considerations 269 Media-Specific Research 269 Domain Specific 270 Geometry-Based Labeling 270 Heuristics-Based Labeling 271 Summary 271 9. Case Studies and Stories
273 Introduction 273 Industry 274 A Security Startup Adopts Training Data Tools 274 Quality Assurance at a Large-Scale Self-Driving Project 275 Big-Tech Challenges 281 Insurance Tech Startup Lessons 288 Stories 289 An Academic Approach to Training Data 292 Kaggle TSA Competition 292 Summary 295 Index 297
942 _cBK
999 _c359834
_d359834