Python Data Science Handbook: Essential Tools for Working with Data
Language: English
Publication details: SPD/O'Reilly, 2024
Edition: 2nd
Description: 563
ISBN: 9789355422552
| Item type | Current library | Call number | Status | Barcode |
|---|---|---|---|---|
| Books | Cummins College of Engineering for Women Pune | 005.13'3 VAN | Available (not for issue) | CCEP-BK-67406 |
Table of Contents
Preface xix
Part I. Jupyter: Beyond Normal Python
1. Getting Started in IPython and Jupyter 3
Launching the IPython Shell 3
Launching the Jupyter Notebook 4
Help and Documentation in IPython 4
Accessing Documentation with ? 5
Accessing Source Code with ?? 6
Exploring Modules with Tab Completion 7
Keyboard Shortcuts in the IPython Shell 9
Navigation Shortcuts 10
Text Entry Shortcuts 10
Command History Shortcuts 10
Miscellaneous Shortcuts 12
2. Enhanced Interactive Features 13
IPython Magic Commands 13
Running External Code: %run 13
Timing Code Execution: %timeit 14
Help on Magic Functions: ?, %magic, and %lsmagic 15
Input and Output History 15
IPython’s In and Out Objects 15
Underscore Shortcuts and Previous Outputs 16
Suppressing Output 17
Related Magic Commands 17
IPython and Shell Commands 18
Quick Introduction to the Shell 18
Shell Commands in IPython 19
Passing Values to and from the Shell 20
Shell-Related Magic Commands 20
3. Debugging and Profiling 22
Errors and Debugging 22
Controlling Exceptions: %xmode 22
Debugging: When Reading Tracebacks Is Not Enough 24
Profiling and Timing Code 26
Timing Code Snippets: %timeit and %time 27
Profiling Full Scripts: %prun 28
Line-by-Line Profiling with %lprun 29
Profiling Memory Use: %memit and %mprun 30
More IPython Resources 31
Web Resources 31
Books 32
Part II. Introduction to NumPy
4. Understanding Data Types in Python 35
A Python Integer Is More Than Just an Integer 36
A Python List Is More Than Just a List 37
Fixed-Type Arrays in Python 39
Creating Arrays from Python Lists 39
Creating Arrays from Scratch 40
NumPy Standard Data Types 41
5. The Basics of NumPy Arrays 43
NumPy Array Attributes 44
Array Indexing: Accessing Single Elements 44
Array Slicing: Accessing Subarrays 45
One-Dimensional Subarrays 45
Multidimensional Subarrays 46
Subarrays as No-Copy Views 47
Creating Copies of Arrays 47
Reshaping of Arrays 48
Array Concatenation and Splitting 49
Concatenation of Arrays 49
Splitting of Arrays 50
6. Computation on NumPy Arrays: Universal Functions 51
The Slowness of Loops 51
Introducing Ufuncs 52
Exploring NumPy’s Ufuncs 53
Array Arithmetic 53
Absolute Value 55
Trigonometric Functions 55
Exponents and Logarithms 56
Specialized Ufuncs 56
Advanced Ufunc Features 57
Specifying Output 57
Aggregations 58
Outer Products 59
Ufuncs: Learning More 59
7. Aggregations: min, max, and Everything in Between 60
Summing the Values in an Array 60
Minimum and Maximum 61
Multidimensional Aggregates 61
Other Aggregation Functions 62
Example: What Is the Average Height of US Presidents? 63
8. Computation on Arrays: Broadcasting 65
Introducing Broadcasting 65
Rules of Broadcasting 67
Broadcasting Example 1 68
Broadcasting Example 2 68
Broadcasting Example 3 69
Broadcasting in Practice 70
Centering an Array 70
Plotting a Two-Dimensional Function 71
9. Comparisons, Masks, and Boolean Logic 72
Example: Counting Rainy Days 72
Comparison Operators as Ufuncs 73
Working with Boolean Arrays 75
Counting Entries 75
Boolean Operators 76
Boolean Arrays as Masks 77
Using the Keywords and/or Versus the Operators &/| 78
10. Fancy Indexing 80
Exploring Fancy Indexing 80
Combined Indexing 81
Example: Selecting Random Points 82
Modifying Values with Fancy Indexing 84
Example: Binning Data 85
11. Sorting Arrays 88
Fast Sorting in NumPy: np.sort and np.argsort 89
Sorting Along Rows or Columns 89
Partial Sorts: Partitioning 90
Example: k-Nearest Neighbors 90
12. Structured Data: NumPy’s Structured Arrays 94
Exploring Structured Array Creation 96
More Advanced Compound Types 97
Record Arrays: Structured Arrays with a Twist 97
On to Pandas 98
Part III. Data Manipulation with Pandas
13. Introducing Pandas Objects 101
The Pandas Series Object 101
Series as Generalized NumPy Array 102
Series as Specialized Dictionary 103
Constructing Series Objects 104
The Pandas DataFrame Object 104
DataFrame as Generalized NumPy Array 105
DataFrame as Specialized Dictionary 106
Constructing DataFrame Objects 106
The Pandas Index Object 108
Index as Immutable Array 108
Index as Ordered Set 108
14. Data Indexing and Selection 110
Data Selection in Series 110
Series as Dictionary 110
Series as One-Dimensional Array 111
Indexers: loc and iloc 112
Data Selection in DataFrames 113
DataFrame as Dictionary 113
DataFrame as Two-Dimensional Array 115
Additional Indexing Conventions 116
15. Operating on Data in Pandas 118
Ufuncs: Index Preservation 118
Ufuncs: Index Alignment 119
Index Alignment in Series 119
Index Alignment in DataFrames 120
Ufuncs: Operations Between DataFrames and Series 121
16. Handling Missing Data 123
Trade-offs in Missing Data Conventions 123
Missing Data in Pandas 124
None as a Sentinel Value 125
NaN: Missing Numerical Data 125
NaN and None in Pandas 126
Pandas Nullable Dtypes 127
Operating on Null Values 128
Detecting Null Values 128
Dropping Null Values 129
Filling Null Values 130
17. Hierarchical Indexing 132
A Multiply Indexed Series 132
The Bad Way 133
The Better Way: The Pandas MultiIndex 133
MultiIndex as Extra Dimension 134
Methods of MultiIndex Creation 136
Explicit MultiIndex Constructors 136
MultiIndex Level Names 137
MultiIndex for Columns 138
Indexing and Slicing a MultiIndex 138
Multiply Indexed Series 139
Multiply Indexed DataFrames 140
Rearranging Multi-Indexes 141
Sorted and Unsorted Indices 141
Stacking and Unstacking Indices 143
Index Setting and Resetting 143
18. Combining Datasets: concat and append 145
Recall: Concatenation of NumPy Arrays 146
Simple Concatenation with pd.concat 147
Duplicate Indices 148
Concatenation with Joins 149
The append Method 150
19. Combining Datasets: merge and join 151
Relational Algebra 151
Categories of Joins 152
One-to-One Joins 152
Many-to-One Joins 153
Many-to-Many Joins 153
Specification of the Merge Key 154
The on Keyword 154
The left_on and right_on Keywords 155
The left_index and right_index Keywords 155
Specifying Set Arithmetic for Joins 157
Overlapping Column Names: The suffixes Keyword 158
Example: US States Data 159
20. Aggregation and Grouping 164
Planets Data 165
Simple Aggregation in Pandas 165
groupby: Split, Apply, Combine 167
Split, Apply, Combine 167
The GroupBy Object 169
Aggregate, Filter, Transform, Apply 171
Specifying the Split Key 174
Grouping Example 175
21. Pivot Tables 176
Motivating Pivot Tables 176
Pivot Tables by Hand 177
Pivot Table Syntax 178
Multilevel Pivot Tables 178
Additional Pivot Table Options 179
Example: Birthrate Data 180
22. Vectorized String Operations 185
Introducing Pandas String Operations 185
Tables of Pandas String Methods 186
Methods Similar to Python String Methods 186
Methods Using Regular Expressions 187
Miscellaneous Methods 188
Example: Recipe Database 190
A Simple Recipe Recommender 192
Going Further with Recipes 193
23. Working with Time Series 194
Dates and Times in Python 195
Native Python Dates and Times: datetime and dateutil 195
Typed Arrays of Times: NumPy’s datetime64 196
Dates and Times in Pandas: The Best of Both Worlds 197
Pandas Time Series: Indexing by Time 198
Pandas Time Series Data Structures 199
Regular Sequences: pd.date_range 200
Frequencies and Offsets 201
Resampling, Shifting, and Windowing 202
Resampling and Converting Frequencies 203
Time Shifts 205
Rolling Windows 206
Example: Visualizing Seattle Bicycle Counts 208
Visualizing the Data 209
Digging into the Data 211
24. High-Performance Pandas: eval and query 215
Motivating query and eval: Compound Expressions 215
pandas.eval for Efficient Operations 216
DataFrame.eval for Column-Wise Operations 218
Assignment in DataFrame.eval 219
Local Variables in DataFrame.eval 219
The DataFrame.query Method 220
Performance: When to Use These Functions 220
Further Resources 221
Part IV. Visualization with Matplotlib
25. General Matplotlib Tips 225
Importing Matplotlib 225
Setting Styles 225
show or No show? How to Display Your Plots 226
Plotting from a Script 226
Plotting from an IPython Shell 227
Plotting from a Jupyter Notebook 227
Saving Figures to File 228
Two Interfaces for the Price of One 230
26. Simple Line Plots 232
Adjusting the Plot: Line Colors and Styles 235
Adjusting the Plot: Axes Limits 238
Labeling Plots 240
Matplotlib Gotchas 242
27. Simple Scatter Plots 244
Scatter Plots with plt.plot 244
Scatter Plots with plt.scatter 247
plot Versus scatter: A Note on Efficiency 250
Visualizing Uncertainties 251
Basic Errorbars 251
Continuous Errors 253
28. Density and Contour Plots 255
Visualizing a Three-Dimensional Function 255
Histograms, Binnings, and Density 260
Two-Dimensional Histograms and Binnings 263
plt.hist2d: Two-Dimensional Histogram 263
plt.hexbin: Hexagonal Binnings 264
Kernel Density Estimation 264
29. Customizing Plot Legends 267
Choosing Elements for the Legend 270
Legend for Size of Points 272
Multiple Legends 274
30. Customizing Colorbars 276
Customizing Colorbars 277
Choosing the Colormap 278
Color Limits and Extensions 280
Discrete Colorbars 281
Example: Handwritten Digits 282
31. Multiple Subplots 285
plt.axes: Subplots by Hand 285
plt.subplot: Simple Grids of Subplots 287
plt.subplots: The Whole Grid in One Go 289
plt.GridSpec: More Complicated Arrangements 291
32. Text and Annotation 294
Example: Effect of Holidays on US Births 294
Transforms and Text Position 296
Arrows and Annotation 298
33. Customizing Ticks 302
Major and Minor Ticks 302
Hiding Ticks or Labels 304
Reducing or Increasing the Number of Ticks 306
Fancy Tick Formats 307
Summary of Formatters and Locators 310
34. Customizing Matplotlib: Configurations and Stylesheets 312
Plot Customization by Hand 312
Changing the Defaults: rcParams 314
Stylesheets 316
Default Style 317
FiveThirtyEight Style 317
ggplot Style 318
Bayesian Methods for Hackers Style 318
Dark Background Style 319
Grayscale Style 319
Seaborn Style 320
35. Three-Dimensional Plotting in Matplotlib 321
Three-Dimensional Points and Lines 322
Three-Dimensional Contour Plots 323
Wireframes and Surface Plots 325
Surface Triangulations 328
Example: Visualizing a Möbius Strip 330
36. Visualization with Seaborn 332
Exploring Seaborn Plots 333
Histograms, KDE, and Densities 333
Pair Plots 335
Faceted Histograms 336
Categorical Plots 338
Joint Distributions 339
Bar Plots 340
Example: Exploring Marathon Finishing Times 342
Further Resources 350
Other Python Visualization Libraries 351
Part V. Machine Learning
37. What Is Machine Learning? 355
Categories of Machine Learning 355
Qualitative Examples of Machine Learning Applications 356
Classification: Predicting Discrete Labels 356
Regression: Predicting Continuous Labels 359
Clustering: Inferring Labels on Unlabeled Data 363
Dimensionality Reduction: Inferring Structure of Unlabeled Data 364
Summary 366
38. Introducing Scikit-Learn 367
Data Representation in Scikit-Learn 367
The Features Matrix 368
The Target Array 368
The Estimator API 370
Basics of the API 371
Supervised Learning Example: Simple Linear Regression 372
Supervised Learning Example: Iris Classification 375
Unsupervised Learning Example: Iris Dimensionality 376
Unsupervised Learning Example: Iris Clustering 377
Application: Exploring Handwritten Digits 378
Loading and Visualizing the Digits Data 378
Unsupervised Learning Example: Dimensionality Reduction 380
Classification on Digits 381
Summary 383
39. Hyperparameters and Model Validation 384
Thinking About Model Validation 384
Model Validation the Wrong Way 385
Model Validation the Right Way: Holdout Sets 385
Model Validation via Cross-Validation 386
Selecting the Best Model 388
The Bias-Variance Trade-off 389
Validation Curves in Scikit-Learn 391
Learning Curves 395
Validation in Practice: Grid Search 400
Summary 401
40. Feature Engineering 402
Categorical Features 402
Text Features 404
Image Features 405
Derived Features 405
Imputation of Missing Data 408
Feature Pipelines 409
41. In Depth: Naive Bayes Classification 410
Bayesian Classification 410
Gaussian Naive Bayes 411
Multinomial Naive Bayes 414
Example: Classifying Text 414
When to Use Naive Bayes 417
42. In Depth: Linear Regression 419
Simple Linear Regression 419
Basis Function Regression 422
Polynomial Basis Functions 422
Gaussian Basis Functions 424
Regularization 425
Ridge Regression (L2 Regularization) 427
Lasso Regression (L1 Regularization) 428
Example: Predicting Bicycle Traffic 429
43. In Depth: Support Vector Machines 435
Motivating Support Vector Machines 435
Support Vector Machines: Maximizing the Margin 437
Fitting a Support Vector Machine 438
Beyond Linear Boundaries: Kernel SVM 441
Tuning the SVM: Softening Margins 444
Example: Face Recognition 445
Summary 450
44. In Depth: Decision Trees and Random Forests 451
Motivating Random Forests: Decision Trees 451
Creating a Decision Tree 452
Decision Trees and Overfitting 455
Ensembles of Estimators: Random Forests 456
Random Forest Regression 458
Example: Random Forest for Classifying Digits 459
Summary 462
45. In Depth: Principal Component Analysis 463
Introducing Principal Component Analysis 463
PCA as Dimensionality Reduction 466
PCA for Visualization: Handwritten Digits 467
What Do the Components Mean? 469
Choosing the Number of Components 470
PCA as Noise Filtering 471
Example: Eigenfaces 473
Summary 476
46. In Depth: Manifold Learning 477
Manifold Learning: “HELLO” 478
Multidimensional Scaling 479
MDS as Manifold Learning 482
Nonlinear Embeddings: Where MDS Fails 484
Nonlinear Manifolds: Locally Linear Embedding 486
Some Thoughts on Manifold Methods 488
Example: Isomap on Faces 489
Example: Visualizing Structure in Digits 493
47. In Depth: k-Means Clustering 496
Introducing k-Means 496
Expectation–Maximization 498
Examples 504
Example 1: k-Means on Digits 504
Example 2: k-Means for Color Compression 507
48. In Depth: Gaussian Mixture Models 512
Motivating Gaussian Mixtures: Weaknesses of k-Means 512
Generalizing E–M: Gaussian Mixture Models 516
Choosing the Covariance Type 520
Gaussian Mixture Models as Density Estimation 520
Example: GMMs for Generating New Data 524
49. In Depth: Kernel Density Estimation 528
Motivating Kernel Density Estimation: Histograms 528
Kernel Density Estimation in Practice 533
Selecting the Bandwidth via Cross-Validation 535
Example: Not-so-Naive Bayes 535
Anatomy of a Custom Estimator 537
Using Our Custom Estimator 539
50. Application: A Face Detection Pipeline 541
HOG Features 542
HOG in Action: A Simple Face Detector 543
1. Obtain a Set of Positive Training Samples 543
2. Obtain a Set of Negative Training Samples 543
3. Combine Sets and Extract HOG Features 545
4. Train a Support Vector Machine 546
5. Find Faces in a New Image 546
Caveats and Improvements 548
Further Machine Learning Resources 550
Index 551