Training Data for Machine Learning: Human Supervision from Annotation to Data Science

By: Sarkis A

Language: English
Publication details: SPD 2023
Description: 306
ISBN: 9789355421920



Holdings
Item type: Books
Current library: Cummins College of Engineering for Women Pune
Call number: 006.31 SAR
Status: Available (not for issue)
Barcode: CCEP-BK-67498

Table of Contents
Preface
1. Training Data Introduction
Training Data Intents
What Can You Do With Training Data?
What Is Training Data Most Concerned With?
Training Data Opportunities
Business Transformation
Training Data Efficiency
Tooling Proficiency
Process Improvement Opportunities
Why Training Data Matters
ML Applications Are Becoming Mainstream
The Foundation of Successful AI
Training Data Is Here to Stay
Training Data Controls the ML Program
New Types of Users
Training Data in the Wild
What Makes Training Data Difficult?
The Art of Supervising Machines
A New Thing for Data Science
ML Program Ecosystem
Data-Centric Machine Learning
Failures
History of Development Affects Training Data Too
What Training Data Is Not
Generative AI
Human Alignment Is Human Supervision
Summary
2. Getting Up and Running
Introduction
Getting Up and Running
Installation
Tasks Setup
Annotator Setup
Data Setup
Workflow Setup
Data Catalog Setup
Initial Usage
Optimization
Tools Overview
Training Data for Machine Learning
Growing Selection of Tools
People, Process, and Data
Embedded Supervision
Human Computer Supervision
Separation of End Concerns
Standards
Many Personas
A Paradigm to Deliver Machine Learning Software
Trade-Offs
Costs
Installed Versus Software as a Service
Development System
Scale
Installation Options
Annotation Interfaces
Modeling Integration
Multi-User versus Single-User Systems
Integrations
Scope
Hidden Assumptions
Security
Open Source and Closed Source
History
Open Source Standards
Realizing the Need for Dedicated Tooling
Summary
3. Schema
Schema Deep Dive Introduction
Labels and Attributes—What Is It?
What Do We Care About?
Introduction to Labels
Attributes Introduction
Attribute Complexity Exceeds Spatial Complexity
Technical Overview
Spatial Representation—Where Is It?
Using Spatial Types to Prevent Social Bias
Trade-Offs with Types
Computer Vision Spatial Type Examples
Relationships, Sequences, Time Series: When Is It?
Sequences and Relationships
When
Guides and Instructions
Judgment Calls
Relation of Machine Learning Tasks to Training Data
Semantic Segmentation
Image Classification (Tags)
Object Detection
Pose Estimation
Relationship of Tasks to Training Data Types
General Concepts
Instance Concept Refresher
Upgrading Data Over Time
The Boundary Between Modeling and Training Data
Raw Data Concepts
Summary
4. Data Engineering
Introduction
Who Wants the Data?
A Game of Telephone
Planning a Great System
Naive and Training Data–Centric Approaches
Raw Data Storage
By Reference or by Value
Off-the-Shelf Dedicated Training Data Tooling on Your Own Hardware
Data Storage: Where Does the Data Rest?
External Reference Connection
Raw Media (BLOB)–Type Specific
Formatting and Mapping
User-Defined Types (Compound Files)
Defining DataMaps
Ingest Wizards
Organizing Data and Useful Storage
Remote Storage
Versioning
Data Access
Disambiguating Storage, Ingestion, Export, and Access
File-Based Exports
Streaming Data
Queries Introduction
Integrations with the Ecosystem
Security
Access Control
Identity and Authorization
Example of Setting Permissions
Signed URLs
Personally Identifiable Information
Pre-Labeling
Updating Data
Summary
5. Workflow
Introduction
Glue Between Tech and People
Why Are Human Tasks Needed?
Partnering with Non-Software Users in New Ways
Getting Started with Human Tasks
Basics
Schemas’ Staying Power
User Roles
Training
Gold Standard Training
Task Assignment Concepts
Do You Need to Customize the Interface?
How Long Will the Average Annotator Be Using It?
Tasks and Project Structure
Quality Assurance
Annotator Trust
Annotators Are Partners
Common Causes of Training Data Errors
Task Review Loops
Analytics
Annotation Metrics Examples
Data Exploration
Models
Using the Model to Debug the Humans
Distinctions Between a Dataset, Model, and Model Run
Getting Data to Models
Dataflow
Overview of Streaming
Data Organization
Pipelines and Processes
Direct Annotation
Business Process Integration
Attributes
Depth of Labeling
Supervising Existing Data
Interactive Automations
Example: Semantic Segmentation Auto Bordering
Video
Summary
6. Theories, Concepts, and Maintenance
Introduction
Theories
A System Is Only as Useful as Its Schema
Who Supervises the Data Matters
Intentionally Chosen Data Is Best
Working with Historical Data
Training Data Is Like Code
Surface Assumptions Around Usage of Your Training Data
Human Supervision Is Different from Classic Datasets
General Concepts
Data Relevancy
Need for Both Qualitative and Quantitative Evaluations
Iterations
Prioritization: What to Label
Transfer Learning’s Relation to Datasets (Fine-Tuning)
Per-Sample Judgment Calls
Ethical and Privacy Considerations
Bias
Bias Is Hard to Escape
Metadata
Preventing Lost Metadata
Train/Val/Test Is the Cherry on Top
Sample Creation
Simple Schema for a Strawberry Picking System
Geometric Representations
Binary Classification
Let’s Manually Create Our First Set
Upgraded Classification
Where Is the Traffic Light?
Maintenance
Actions
Net Lift
Levels of System Maturity of Training Data Operations
Applied Versus Research Sets
Training Data Management
Quality
Completed Tasks
Freshness
Maintaining Set Metadata
Task Management
Summary
7. AI Transformation and Use Cases
Introduction
AI Transformation
Seeing Your Day-to-Day Work as Annotation
The Creative Revolution of Data-centric AI
You Can Create New Data
You Can Change What Data You Collect
You Can Change the Meaning of the Data
You Can Create!
Think Step Function Improvement for Major Projects
Build Your AI Data to Secure Your AI Present and Future
Appoint a Leader: The Director of AI Data
New Expectations People Have for the Future of AI
Sometimes Proposals and Corrections, Sometimes Replacement
Upstream Producers and Downstream Consumers
Spectrum of Training Data Team Engagement
Dedicated Producers and Other Teams
Organizing Producers from Other Teams
Use Case Discovery
Rubric for Good Use Cases
Evaluating a Use Case Against the Rubric
Conceptual Effects of Use Cases
The New “Crowd Sourcing”: Your Own Experts
Key Levers on Training Data ROI
What the Annotated Data Represents
Trade-Offs of Controlling Your Own Training Data
The Need for Hardware
Common Project Mistakes
Modern Training Data Tools
Think Learning Curve, Not Perfection
New Training and Knowledge Are Required
How Companies Produce and Consume Data
Trap to Avoid: Premature Optimization in Training Data
No Silver Bullets
Culture of Training Data
New Engineering Principles
Summary
8. Automation
Introduction
Getting Started
Motivation: When to Use These Methods?
Check What Part of the Schema a Method Is Designed to Work On
What Do People Actually Use?
What Kind of Results Can I Expect?
Common Confusions
User Interface Optimizations
Risks
Trade-Offs
Nature of Automations
Setup Costs
How to Benchmark Well
How to Scope the Automation Relative to the Problem
Correction Time
Subject Matter Experts
Consider How the Automations Stack
Pre-Labeling
Standard Pre-Labeling
Pre-Labeling a Portion of the Data Only
Interactive Annotation Automation
Creating Your Own
Technical Setup Notes
What Is a Watcher? (Observer Pattern)
How to Use a Watcher
Interactive Capturing of a Region of Interest
Interactive Drawing Box to Polygon Using GrabCut
Full Image Model Prediction Example
Example: Person Detection for Different Attribute
Quality Assurance Automation
Using the Model to Debug the Humans
Automated Checklist Example
Domain-Specific Reasonableness Checks
Data Discovery: What to Label
Human Exploration
Raw Data Exploration
Metadata Exploration
Adding Pre-Labeling-Based Metadata
Augmentation
Better Models Are Better than Better Augmentation
To Augment or Not to Augment
Simulation and Synthetic Data
Simulations Still Need Human Review
Media Specific
What Methods Work with Which Media?
Considerations
Media-Specific Research
Domain Specific
Geometry-Based Labeling
Heuristics-Based Labeling
Summary
9. Case Studies and Stories
Introduction
Industry
A Security Startup Adopts Training Data Tools
Quality Assurance at a Large-Scale Self-Driving Project
Big-Tech Challenges
Insurance Tech Startup Lessons
Stories
An Academic Approach to Training Data
Kaggle TSA Competition
Summary
Index
