Training Data for Machine Learning: Human Supervision from Annotation to Data Science
Language: English
Publication details: SPD 2023
Description: 306
ISBN: 9789355421920
| Item type | Current library | Call number | Status | Barcode |
|---|---|---|---|---|
| Books | Cummins College of Engineering for Women Pune | 006.31 SAR | Available (not for issue) | CCEP-BK-67498 |
Table of Contents
Preface xv
1. Training Data Introduction 1
Training Data Intents 2
What Can You Do With Training Data? 3
What Is Training Data Most Concerned With? 4
Training Data Opportunities 11
Business Transformation 11
Training Data Efficiency 12
Tooling Proficiency 13
Process Improvement Opportunities 13
Why Training Data Matters 13
ML Applications Are Becoming Mainstream 14
The Foundation of Successful AI 15
Training Data Is Here to Stay 16
Training Data Controls the ML Program 16
New Types of Users 17
Training Data in the Wild 18
What Makes Training Data Difficult? 18
The Art of Supervising Machines 20
A New Thing for Data Science 20
ML Program Ecosystem 21
Data-Centric Machine Learning 22
Failures 23
History of Development Affects Training Data Too 24
What Training Data Is Not 25
Generative AI 25
Human Alignment Is Human Supervision 27
Summary 28
2. Getting Up and Running 31
Introduction 31
Getting Up and Running 32
Installation 33
Tasks Setup 34
Annotator Setup 35
Data Setup 35
Workflow Setup 35
Data Catalog Setup 36
Initial Usage 36
Optimization 36
Tools Overview 37
Training Data for Machine Learning 38
Growing Selection of Tools 38
People, Process, and Data 38
Embedded Supervision 39
Human Computer Supervision 39
Separation of End Concerns 40
Standards 40
Many Personas 40
A Paradigm to Deliver Machine Learning Software 41
Trade-Offs 41
Costs 41
Installed Versus Software as a Service 42
Development System 43
Scale 44
Installation Options 48
Annotation Interfaces 50
Modeling Integration 50
Multi-User versus Single-User Systems 50
Integrations 51
Scope 51
Hidden Assumptions 56
Security 57
Open Source and Closed Source 60
History 63
Open Source Standards 63
Realizing the Need for Dedicated Tooling 63
Summary 66
3. Schema 67
Schema Deep Dive Introduction 67
Labels and Attributes—What Is It? 68
What Do We Care About? 68
Introduction to Labels 68
Attributes Introduction 69
Attribute Complexity Exceeds Spatial Complexity 73
Technical Overview 76
Spatial Representation—Where Is It? 78
Using Spatial Types to Prevent Social Bias 78
Trade-Offs with Types 82
Computer Vision Spatial Type Examples 83
Relationships, Sequences, Time Series: When Is It? 87
Sequences and Relationships 87
When 87
Guides and Instructions 88
Judgment Calls 89
Relation of Machine Learning Tasks to Training Data 89
Semantic Segmentation 90
Image Classification (Tags) 92
Object Detection 92
Pose Estimation 92
Relationship of Tasks to Training Data Types 93
General Concepts 93
Instance Concept Refresher 93
Upgrading Data Over Time 94
The Boundary Between Modeling and Training Data 95
Raw Data Concepts 96
Summary 97
4. Data Engineering 99
Introduction 99
Who Wants the Data? 100
A Game of Telephone 101
Planning a Great System 103
Naive and Training Data–Centric Approaches 104
Raw Data Storage 109
By Reference or by Value 110
Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware 111
Data Storage: Where Does the Data Rest? 111
External Reference Connection 112
Raw Media (BLOB)–Type Specific 112
Formatting and Mapping 114
User-Defined Types (Compound Files) 114
Defining DataMaps 114
Ingest Wizards 114
Organizing Data and Useful Storage 115
Remote Storage 116
Versioning 116
Data Access 118
Disambiguating Storage, Ingestion, Export, and Access 119
File-Based Exports 119
Streaming Data 119
Queries Introduction 120
Integrations with the Ecosystem 121
Security 121
Access Control 121
Identity and Authorization 121
Example of Setting Permissions 122
Signed URLs 122
Personally Identifiable Information 124
Pre-Labeling 124
Updating Data 125
Summary 127
5. Workflow 129
Introduction 129
Glue Between Tech and People 130
Why Are Human Tasks Needed? 132
Partnering with Non-Software Users in New Ways 132
Getting Started with Human Tasks 132
Basics 133
Schemas’ Staying Power 134
User Roles 135
Training 135
Gold Standard Training 136
Task Assignment Concepts 136
Do You Need to Customize the Interface? 137
How Long Will the Average Annotator Be Using It? 137
Tasks and Project Structure 137
Quality Assurance 138
Annotator Trust 139
Annotators Are Partners 139
Common Causes of Training Data Errors 141
Task Review Loops 141
Analytics 143
Annotation Metrics Examples 143
Data Exploration 144
Models 146
Using the Model to Debug the Humans 146
Distinctions Between a Dataset, Model, and Model Run 147
Getting Data to Models 148
Dataflow 148
Overview of Streaming 149
Data Organization 149
Pipelines and Processes 150
Direct Annotation 153
Business Process Integration 154
Attributes 154
Depth of Labeling 154
Supervising Existing Data 155
Interactive Automations 155
Example: Semantic Segmentation Auto Bordering 156
Video 157
Summary 162
6. Theories, Concepts, and Maintenance 165
Introduction 165
Theories 166
A System Is Only as Useful as Its Schema 166
Who Supervises the Data Matters 167
Intentionally Chosen Data Is Best 168
Working with Historical Data 169
Training Data Is Like Code 170
Surface Assumptions Around Usage of Your Training Data 171
Human Supervision Is Different from Classic Datasets 173
General Concepts 176
Data Relevancy 176
Need for Both Qualitative and Quantitative Evaluations 177
Iterations 178
Prioritization: What to Label 178
Transfer Learning’s Relation to Datasets (Fine-Tuning) 178
Per-Sample Judgment Calls 180
Ethical and Privacy Considerations 181
Bias 181
Bias Is Hard to Escape 183
Metadata 183
Preventing Lost Metadata 184
Train/Val/Test Is the Cherry on Top 185
Sample Creation 185
Simple Schema for a Strawberry Picking System 186
Geometric Representations 187
Binary Classification 188
Let’s Manually Create Our First Set 189
Upgraded Classification 192
Where Is the Traffic Light? 193
Maintenance 193
Actions 193
Net Lift 195
Levels of System Maturity of Training Data Operations 196
Applied Versus Research Sets 197
Training Data Management 198
Quality 199
Completed Tasks 199
Freshness 201
Maintaining Set Metadata 201
Task Management 201
Summary 202
7. AI Transformation and Use Cases 203
Introduction 203
AI Transformation 204
Seeing Your Day-to-Day Work as Annotation 205
The Creative Revolution of Data-centric AI 207
You Can Create New Data 207
You Can Change What Data You Collect 208
You Can Change the Meaning of the Data 209
You Can Create! 209
Think Step Function Improvement for Major Projects 209
Build Your AI Data to Secure Your AI Present and Future 210
Appoint a Leader: The Director of AI Data 210
New Expectations People Have for the Future of AI 211
Sometimes Proposals and Corrections, Sometimes Replacement 212
Upstream Producers and Downstream Consumers 212
Spectrum of Training Data Team Engagement 217
Dedicated Producers and Other Teams 218
Organizing Producers from Other Teams 218
Use Case Discovery 221
Rubric for Good Use Cases 222
Evaluating a Use Case Against the Rubric 225
Conceptual Effects of Use Cases 227
The New “Crowd Sourcing”: Your Own Experts 229
Key Levers on Training Data ROI 230
What the Annotated Data Represents 230
Trade-Offs of Controlling Your Own Training Data 230
The Need for Hardware 231
Common Project Mistakes 231
Modern Training Data Tools 232
Think Learning Curve, Not Perfection 232
New Training and Knowledge Are Required 233
How Companies Produce and Consume Data 234
Trap to Avoid: Premature Optimization in Training Data 234
No Silver Bullets 236
Culture of Training Data 236
New Engineering Principles 237
Summary 238
8. Automation 239
Introduction 239
Getting Started 240
Motivation: When to Use These Methods? 240
Check What Part of the Schema a Method Is Designed to Work On 241
What Do People Actually Use? 241
What Kind of Results Can I Expect? 242
Common Confusions 243
User Interface Optimizations 244
Risks 244
Trade-Offs 245
Nature of Automations 246
Setup Costs 246
How to Benchmark Well 246
How to Scope the Automation Relative to the Problem 247
Correction Time 248
Subject Matter Experts 248
Consider How the Automations Stack 249
Pre-Labeling 249
Standard Pre-Labeling 249
Pre-Labeling a Portion of the Data Only 252
Interactive Annotation Automation 254
Creating Your Own 255
Technical Setup Notes 255
What Is a Watcher? (Observer Pattern) 256
How to Use a Watcher 256
Interactive Capturing of a Region of Interest 257
Interactive Drawing Box to Polygon Using GrabCut 257
Full Image Model Prediction Example 258
Example: Person Detection for Different Attribute 258
Quality Assurance Automation 259
Using the Model to Debug the Humans 259
Automated Checklist Example 259
Domain-Specific Reasonableness Checks 260
Data Discovery: What to Label 260
Human Exploration 260
Raw Data Exploration 261
Metadata Exploration 261
Adding Pre-Labeling-Based Metadata 262
Augmentation 262
Better Models Are Better than Better Augmentation 263
To Augment or Not to Augment 263
Simulation and Synthetic Data 265
Simulations Still Need Human Review 265
Media Specific 267
What Methods Work with Which Media? 268
Considerations 269
Media-Specific Research 269
Domain Specific 270
Geometry-Based Labeling 270
Heuristics-Based Labeling 271
Summary 271
9. Case Studies and Stories 273
Introduction 273
Industry 274
A Security Startup Adopts Training Data Tools 274
Quality Assurance at a Large-Scale Self-Driving Project 275
Big-Tech Challenges 281
Insurance Tech Startup Lessons 288
Stories 289
An Academic Approach to Training Data 292
Kaggle TSA Competition 292
Summary 295
Index 297