pandas: powerful Python data analysis toolkit
Release 1.2.0
1 Getting started 3
1.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Intro to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.3 Coming from. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4 Tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.1 Installation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
1.4.2 Package overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
1.4.3 Getting started tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
1.4.4 Comparison with other tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
1.4.5 Community tutorials . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
2.3.11 Sorting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
2.3.12 Copying . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
2.3.13 dtypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
2.3.14 Selecting columns based on dtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227
2.4 IO tools (text, CSV, HDF5, . . . ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
2.4.1 CSV & text files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
2.4.2 JSON . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 265
2.4.3 HTML . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280
2.4.4 Excel files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 289
2.4.5 OpenDocument Spreadsheets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
2.4.6 Binary Excel (.xlsb) files . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 296
2.4.7 Clipboard . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
2.4.8 Pickling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297
2.4.9 msgpack . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
2.4.10 HDF5 (PyTables) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 301
2.4.11 Feather . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 329
2.4.12 Parquet . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 330
2.4.13 ORC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
2.4.14 SQL queries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 333
2.4.15 Google BigQuery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
2.4.16 Stata format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
2.4.17 SAS formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
2.4.18 SPSS formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
2.4.19 Other file formats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
2.4.20 Performance considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
2.5 Indexing and selecting data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
2.5.1 Different choices for indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
2.5.2 Basics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
2.5.3 Attribute access . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
2.5.4 Slicing ranges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
2.5.5 Selection by label . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
2.5.6 Selection by position . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
2.5.7 Selection by callable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 363
2.5.8 Combining positional and label-based indexing . . . . . . . . . . . . . . . . . . . . . . . . 365
2.5.9 Indexing with list with missing labels is deprecated . . . . . . . . . . . . . . . . . . . . . . 366
2.5.10 Selecting random samples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
2.5.11 Setting with enlargement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 370
2.5.12 Fast scalar value getting and setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 371
2.5.13 Boolean indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 372
2.5.14 Indexing with isin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 374
2.5.15 The where() Method and Masking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 376
2.5.16 Setting with enlargement conditionally using numpy() . . . . . . . . . . . . . . . . . . . 380
2.5.17 The query() Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 381
2.5.18 Duplicate data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
2.5.19 Dictionary-like get() method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
2.5.20 Looking up values by index/column labels . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
2.5.21 Index objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
2.5.22 Set / reset index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
2.5.23 Returning a view versus a copy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400
2.6 MultiIndex / advanced indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
2.6.1 Hierarchical indexing (MultiIndex) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 404
2.6.2 Advanced indexing with hierarchical index . . . . . . . . . . . . . . . . . . . . . . . . . . 410
2.6.3 Sorting a MultiIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 421
2.6.4 Take methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 424
2.6.5 Index types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 426
2.6.6 Miscellaneous indexing FAQ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 436
2.7 Merge, join, concatenate and compare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
2.7.1 Concatenating objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
2.7.2 Database-style DataFrame or named Series joining/merging . . . . . . . . . . . . . . . . . 454
2.7.3 Timeseries friendly merging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 474
2.7.4 Comparing objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 476
2.8 Reshaping and pivot tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
2.8.1 Reshaping by pivoting DataFrame objects . . . . . . . . . . . . . . . . . . . . . . . . . . . 478
2.8.2 Reshaping by stacking and unstacking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 480
2.8.3 Reshaping by melt . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 488
2.8.4 Combining with stats and GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 491
2.8.5 Pivot tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 492
2.8.6 Cross tabulations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 495
2.8.7 Tiling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
2.8.8 Computing indicator / dummy variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . 498
2.8.9 Factorizing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 501
2.8.10 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 502
2.8.11 Exploding a list-like column . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 505
2.9 Working with text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
2.9.1 Text data types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 507
2.9.2 String methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 510
2.9.3 Splitting and replacing strings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 512
2.9.4 Concatenation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 516
2.9.5 Indexing with .str . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 521
2.9.6 Extracting substrings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 522
2.9.7 Testing for strings that match or contain a pattern . . . . . . . . . . . . . . . . . . . . . . . 526
2.9.8 Creating indicator variables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 527
2.9.9 Method summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 528
2.10 Working with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
2.10.1 Values considered “missing” . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 529
2.10.2 Inserting missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 532
2.10.3 Calculations with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 533
2.10.4 Sum/prod of empties/nans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 534
2.10.5 NA values in GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
2.10.6 Filling missing values: fillna . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 535
2.10.7 Filling with a PandasObject . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 537
2.10.8 Dropping axis labels with missing data: dropna . . . . . . . . . . . . . . . . . . . . . . . . 538
2.10.9 Interpolation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 539
2.10.10 Replacing generic values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 548
2.10.11 String/regular expression replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 549
2.10.12 Numeric replacement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 552
2.10.13 Experimental NA scalar to denote missing values . . . . . . . . . . . . . . . . . . . . . . . 555
2.11 Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 558
2.11.1 Consequences of Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 559
2.11.2 Duplicate Label Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
2.11.3 Disallowing Duplicate Labels . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 561
2.12 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
2.12.1 Object creation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 566
2.12.2 CategoricalDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 572
2.12.3 Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
2.12.4 Working with categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 573
2.12.5 Sorting and order . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 577
2.12.6 Comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 580
2.12.7 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 583
2.12.8 Data munging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 584
2.12.9 Getting data in/out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 592
2.12.10 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 593
2.12.11 Differences to R’s factor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
2.12.12 Gotchas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 594
2.13 Nullable integer data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
2.13.1 Construction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 598
2.13.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 600
2.13.3 Scalar NA Value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
2.14 Nullable Boolean data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
2.14.1 Indexing with NA values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
2.14.2 Kleene logical operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 602
2.15 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
2.15.1 Basic plotting: plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 604
2.15.2 Other plots . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 607
2.15.3 Plotting with missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 638
2.15.4 Plotting tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 639
2.15.5 Plot formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 647
2.15.6 Plotting directly with matplotlib . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 672
2.15.7 Plotting backends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 673
2.16 Computational tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
2.16.1 Statistical functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 674
2.17 Group by: split-apply-combine . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 679
2.17.1 Splitting an object into groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 680
2.17.2 Iterating through groups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 688
2.17.3 Selecting a group . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 689
2.17.4 Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 689
2.17.5 Transformation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 696
2.17.6 Filtration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 703
2.17.7 Dispatching to instance methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 704
2.17.8 Flexible apply . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 706
2.17.9 Numba Accelerated Routines . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 707
2.17.10 Other useful features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 708
2.17.11 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 719
2.18 Windowing Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 721
2.18.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 722
2.18.2 Rolling window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 724
2.18.3 Weighted window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 731
2.18.4 Expanding window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
2.18.5 Exponentially Weighted window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 732
2.19 Time series / date functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 735
2.19.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 736
2.19.2 Timestamps vs. time spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 738
2.19.3 Converting to timestamps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 739
2.19.4 Generating ranges of timestamps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 743
2.19.5 Timestamp limitations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 746
2.19.6 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 747
2.19.7 Time/date components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 755
2.19.8 DateOffset objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 757
2.19.9 Time series-related instance methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 772
2.19.10 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 774
2.19.11 Time span representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 783
2.19.12 Converting between representations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 790
2.19.13 Representing out-of-bounds spans . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 791
2.19.14 Time zone handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 792
2.20 Time deltas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 800
2.20.1 Parsing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 801
2.20.2 Operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 803
2.20.3 Reductions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807
2.20.4 Frequency conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 807
2.20.5 Attributes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 809
2.20.6 TimedeltaIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 811
2.20.7 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
2.21 Styling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
2.21.1 Building styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 815
2.21.2 Finer control: slicing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 818
2.21.3 Finer Control: Display Values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
2.21.4 Builtin styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 819
2.21.5 Sharing styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 821
2.21.6 Other Options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 822
2.21.7 Fun stuff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 825
2.21.8 Export to Excel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 826
2.21.9 Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 827
2.22 Options and settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 828
2.22.1 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 828
2.22.2 Getting and setting options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 829
2.22.3 Setting startup options in Python/IPython environment . . . . . . . . . . . . . . . . . . . . 830
2.22.4 Frequently used options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 831
2.22.5 Available options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 838
2.22.6 Number formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 842
2.22.7 Unicode formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 843
2.22.8 Table schema display . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
2.23 Enhancing performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 844
2.23.1 Cython (writing C extensions for pandas) . . . . . . . . . . . . . . . . . . . . . . . . . . . 845
2.23.2 Using Numba . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 849
2.23.3 Expression evaluation via eval() . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 852
2.24 Scaling to large datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 860
2.24.1 Load less data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 860
2.24.2 Use efficient datatypes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 862
2.24.3 Use chunking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 863
2.24.4 Use other libraries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 865
2.25 Sparse data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 869
2.25.1 SparseArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
2.25.2 SparseDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 871
2.25.3 Sparse accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
2.25.4 Sparse calculation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 872
2.25.5 Migrating . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 873
2.25.6 Interaction with scipy.sparse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 875
2.26 Frequently Asked Questions (FAQ) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
2.26.1 DataFrame memory usage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 878
2.26.2 Using if/truth statements with pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 880
2.26.3 NaN, Integer NA values and NA type promotions . . . . . . . . . . . . . . . . . . . . . . . . 882
2.26.4 Differences with NumPy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
2.26.5 Thread-safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
2.26.6 Byte-ordering issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 884
2.27 Cookbook . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885
2.27.1 Idioms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 885
2.27.2 Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 889
2.27.3 Multiindexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 893
2.27.4 Missing data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 897
2.27.5 Grouping . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 898
2.27.6 Timeseries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 910
2.27.7 Merge . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 910
2.27.8 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 912
2.27.9 Data in/out . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 913
2.27.10 Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 918
2.27.11 Timedeltas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 919
2.27.12 Creating example data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 921
3.4 DataFrame . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1433
3.4.1 Constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1433
3.4.2 Attributes and underlying data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1774
3.4.3 Conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1774
3.4.4 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1774
3.4.5 Binary operator functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1775
3.4.6 Function application, GroupBy & window . . . . . . . . . . . . . . . . . . . . . . . . . . . 1776
3.4.7 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1777
3.4.8 Reindexing / selection / label manipulation . . . . . . . . . . . . . . . . . . . . . . . . . . 1778
3.4.9 Missing data handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1779
3.4.10 Reshaping, sorting, transposing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1779
3.4.11 Combining / comparing / joining / merging . . . . . . . . . . . . . . . . . . . . . . . . . . 1780
3.4.12 Time Series-related . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1780
3.4.13 Flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1780
3.4.14 Metadata . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1781
3.4.15 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1781
3.4.16 Sparse accessor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1832
3.4.17 Serialization / IO / conversion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1836
3.5 pandas arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1837
3.5.1 pandas.array . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1837
3.5.2 Datetime data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1840
3.5.3 Timedelta data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1861
3.5.4 Timespan data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1870
3.5.5 Period . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1870
3.5.6 Interval data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1887
3.5.7 Nullable integer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1901
3.5.8 Categorical data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1906
3.5.9 Sparse data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1912
3.5.10 Text data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1914
3.5.11 Boolean data with missing values . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1916
3.6 Index objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1918
3.6.1 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1918
3.6.2 Numeric Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1983
3.6.3 CategoricalIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1987
3.6.4 IntervalIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1997
3.6.5 MultiIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2008
3.6.6 DatetimeIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2028
3.6.7 TimedeltaIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2059
3.6.8 PeriodIndex . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2068
3.7 Date offsets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075
3.7.1 DateOffset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2075
3.7.2 BusinessDay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2080
3.7.3 BusinessHour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2085
3.7.4 CustomBusinessDay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2090
3.7.5 CustomBusinessHour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2095
3.7.6 MonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2100
3.7.7 MonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2103
3.7.8 BusinessMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2107
3.7.9 BusinessMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2111
3.7.10 CustomBusinessMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2115
3.7.11 CustomBusinessMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2120
3.7.12 SemiMonthEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2125
3.7.13 SemiMonthBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2129
3.7.14 Week . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2133
3.7.15 WeekOfMonth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2137
3.7.16 LastWeekOfMonth . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2141
3.7.17 BQuarterEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2146
3.7.18 BQuarterBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2149
3.7.19 QuarterEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2153
3.7.20 QuarterBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2157
3.7.21 BYearEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2161
3.7.22 BYearBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2165
3.7.23 YearEnd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2169
3.7.24 YearBegin . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2173
3.7.25 FY5253 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2177
3.7.26 FY5253Quarter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2182
3.7.27 Easter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2188
3.7.28 Tick . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2191
3.7.29 Day . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2195
3.7.30 Hour . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2199
3.7.31 Minute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2203
3.7.32 Second . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2207
3.7.33 Milli . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2211
3.7.34 Micro . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2215
3.7.35 Nano . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2219
3.8 Frequencies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2223
3.8.1 pandas.tseries.frequencies.to_offset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2223
3.9 Window . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2224
3.9.1 Rolling window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2224
3.9.2 Weighted window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2239
3.9.3 Expanding window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2243
3.9.4 Exponentially-weighted window functions . . . . . . . . . . . . . . . . . . . . . . . . . . . 2257
3.9.5 Window indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2259
3.10 GroupBy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2262
3.10.1 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2262
3.10.2 Function application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2267
3.10.3 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2277
3.11 Resampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2328
3.11.1 Indexing, iteration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2328
3.11.2 Function application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2329
3.11.3 Upsampling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2334
3.11.4 Computations / descriptive stats . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2345
3.12 Style . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2351
3.12.1 Styler constructor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2351
3.12.2 Styler properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2367
3.12.3 Style application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2368
3.12.4 Builtin styles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2368
3.12.5 Style export and import . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2369
3.13 Plotting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2369
3.13.1 pandas.plotting.andrews_curves . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2369
3.13.2 pandas.plotting.autocorrelation_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2371
3.13.3 pandas.plotting.bootstrap_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2372
3.13.4 pandas.plotting.boxplot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2372
3.13.5 pandas.plotting.deregister_matplotlib_converters . . . . . . . . . . . . . . . . . . . . . . . 2380
3.13.6 pandas.plotting.lag_plot . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2380
3.13.7 pandas.plotting.parallel_coordinates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2380
3.13.8 pandas.plotting.plot_params . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2383
3.13.9 pandas.plotting.radviz . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2383
3.13.10 pandas.plotting.register_matplotlib_converters . . . . . . . . . . . . . . . . . . . . . . . . . 2385
3.13.11 pandas.plotting.scatter_matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2387
3.13.12 pandas.plotting.table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2387
3.14 General utility functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2389
3.14.1 Working with options . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2389
3.14.2 Testing functions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2403
3.14.3 Exceptions and warnings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2408
3.14.4 Data types related functionality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2412
3.14.5 Bug report function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2438
3.15 Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2439
3.15.1 pandas.api.extensions.register_extension_dtype . . . . . . . . . . . . . . . . . . . . . . . . 2439
3.15.2 pandas.api.extensions.register_dataframe_accessor . . . . . . . . . . . . . . . . . . . . . . 2439
3.15.3 pandas.api.extensions.register_series_accessor . . . . . . . . . . . . . . . . . . . . . . . . . 2441
3.15.4 pandas.api.extensions.register_index_accessor . . . . . . . . . . . . . . . . . . . . . . . . . 2442
3.15.5 pandas.api.extensions.ExtensionDtype . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2443
3.15.6 pandas.api.extensions.ExtensionArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2447
3.15.7 pandas.arrays.PandasArray . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2459
3.15.8 pandas.api.indexers.check_array_indexer . . . . . . . . . . . . . . . . . . . . . . . . . . . 2460
4 Development 2463
4.1 Contributing to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2463
4.1.1 Where to start? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2464
4.1.2 Bug reports and enhancement requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2465
4.1.3 Working with the code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2465
4.1.4 Contributing to the documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2471
4.1.5 Contributing to the code base . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2489
4.1.6 Contributing your changes to pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2502
4.1.7 Tips for a successful pull request . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2505
4.2 pandas code style guide . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2505
4.2.1 Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2506
4.2.2 String formatting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2506
4.2.3 Imports (aim for absolute) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2508
4.2.4 Miscellaneous . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2508
4.3 pandas maintenance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2508
4.3.1 Roles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2508
4.3.2 Tasks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2509
4.3.3 Issue triage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2509
4.3.4 Closing issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2510
4.3.5 Reviewing pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2510
4.3.6 Cleaning up old issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2511
4.3.7 Cleaning up old pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2511
4.3.8 Becoming a pandas maintainer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2511
4.3.9 Merging pull requests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2511
4.4 Internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2512
4.4.1 Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2512
4.4.2 Subclassing pandas data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2513
4.5 Test organization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2514
4.6 Debugging C extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2516
4.6.1 Using a debugger . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2516
4.6.2 Checking memory leaks with valgrind . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2517
4.7 Extending pandas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2518
4.7.1 Registering custom accessors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2518
4.7.2 Extension types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2519
4.7.3 Subclassing pandas data structures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2522
4.7.4 Plotting backends . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2525
4.8 Developer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2525
4.8.1 Storing pandas DataFrame objects in Apache Parquet format . . . . . . . . . . . . . . . . . 2525
4.9 Policies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2528
4.9.1 Version policy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2528
4.9.2 Python support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2529
4.10 Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2529
4.10.1 Extensibility . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2529
4.10.2 String data type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2529
4.10.3 Consistent missing value handling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2530
4.10.4 Apache Arrow interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2530
4.10.5 Block manager rewrite . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2530
4.10.6 Decoupling of indexing and internals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2531
4.10.7 Numba-accelerated operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2531
4.10.8 Performance monitoring . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2531
4.10.9 Roadmap evolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2531
4.10.10 Completed items . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2532
4.11 Developer meetings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2532
4.11.1 Minutes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2532
4.11.2 Calendar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2532
5.7.1 Version 0.22.0 (December 29, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2824
5.8 Version 0.21 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2829
5.8.1 Version 0.21.1 (December 12, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2829
5.8.2 Version 0.21.0 (October 27, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2834
5.9 Version 0.20 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2866
5.9.1 Version 0.20.3 (July 7, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2866
5.9.2 Version 0.20.2 (June 4, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2868
5.9.3 Version 0.20.1 (May 5, 2017) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2872
5.10 Version 0.19 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2921
5.10.1 Version 0.19.2 (December 24, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2921
5.10.2 Version 0.19.1 (November 3, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2924
5.10.3 Version 0.19.0 (October 2, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2927
5.11 Version 0.18 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2972
5.11.1 Version 0.18.1 (May 3, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2972
5.11.2 Version 0.18.0 (March 13, 2016) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2992
5.12 Version 0.17 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3026
5.12.1 Version 0.17.1 (November 21, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3026
5.12.2 Version 0.17.0 (October 9, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3033
5.13 Version 0.16 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3064
5.13.1 Version 0.16.2 (June 12, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3064
5.13.2 Version 0.16.1 (May 11, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3069
5.13.3 Version 0.16.0 (March 22, 2015) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3082
5.14 Version 0.15 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3099
5.14.1 Version 0.15.2 (December 12, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3099
5.14.2 Version 0.15.1 (November 9, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3107
5.14.3 Version 0.15.0 (October 18, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3114
5.15 Version 0.14 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3145
5.15.1 Version 0.14.1 (July 11, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3145
5.15.2 Version 0.14.0 (May 31, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3152
5.16 Version 0.13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3182
5.16.1 Version 0.13.1 (February 3, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3182
5.16.2 Version 0.13.0 (January 3, 2014) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3193
5.17 Version 0.12 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3223
5.17.1 Version 0.12.0 (July 24, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3223
5.18 Version 0.11 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3235
5.18.1 Version 0.11.0 (April 22, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3235
5.19 Version 0.10 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3246
5.19.1 Version 0.10.1 (January 22, 2013) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3246
5.19.2 Version 0.10.0 (December 17, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3252
5.20 Version 0.9 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3264
5.20.1 Version 0.9.1 (November 14, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3264
5.20.2 Version 0.9.0 (October 7, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3269
5.21 Version 0.8 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3271
5.21.1 Version 0.8.1 (July 22, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3271
5.21.2 Version 0.8.0 (June 29, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3272
5.22 Version 0.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3278
5.22.1 Version 0.7.3 (April 12, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3278
5.22.2 Version 0.7.2 (March 16, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3281
5.22.3 Version 0.7.1 (February 29, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3282
5.22.4 Version 0.7.0 (February 9, 2012) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3283
5.23 Version 0.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3289
5.23.1 Version 0.6.1 (December 13, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3289
5.23.2 Version 0.6.0 (November 25, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3290
5.24 Version 0.5 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3292
5.24.1 Version 0.5.0 (October 24, 2011) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3292
5.25 Version 0.4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3293
5.25.1 Versions 0.4.1 through 0.4.3 (September 25 - October 9, 2011) . . . . . . . . . . . . . . . . 3293
Bibliography 3295
CHAPTER ONE
GETTING STARTED
1.1 Installation
pandas is part of the Anaconda distribution and can be installed with Anaconda or Miniconda:
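A typical invocation, assuming conda is already available on your PATH, is:

conda install pandas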
1.2 Intro to pandas
When working with tabular data, such as data stored in spreadsheets or databases, pandas is the right tool for you.
pandas will help you to explore, clean and process your data. In pandas, a data table is called a DataFrame.
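As a minimal sketch, with made-up column names and values, a DataFrame can be created from a dictionary whose keys become the columns:

import pandas as pd

# Each key of the dictionary becomes a column; each list holds that column's values.
df = pd.DataFrame(
    {"name": ["Ann", "Bob", "Carla"], "age": [22, 35, 58]}
)
print(df.head())   # show the first rows of the table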
pandas supports integration with many file formats and data sources out of the box (csv, excel, sql, json, parquet, . . . ). Importing data from each of these sources is provided by functions with the prefix read_*. Similarly, the to_* methods are used to store data.
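For example, with a hypothetical file titanic.csv, reading and writing look like this (writing Excel files additionally requires the openpyxl package):

import pandas as pd

titanic = pd.read_csv("titanic.csv")                       # load a CSV into a DataFrame
titanic.to_excel("titanic.xlsx", sheet_name="passengers")  # store the same data as an Excel file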
Selecting or filtering specific rows and/or columns? Filtering the data on a condition? Methods for slicing, selecting,
and extracting the data you need are available in pandas.
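A short sketch of the common idioms, using a hypothetical DataFrame df with columns "name" and "age":

ages = df["age"]                                  # select a single column as a Series
adults = df[df["age"] > 18]                       # filter rows on a condition
subset = df.loc[df["age"] > 18, ["name", "age"]]  # rows by condition, columns by label
first = df.iloc[0:2, 0:1]                         # rows and columns by position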
pandas provides plotting of your data out of the box, using the power of Matplotlib. You can pick the plot type (scatter, bar, boxplot, . . . ) that corresponds to your data.
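For instance, assuming a DataFrame df with hypothetical numeric columns "age" and "fare" and Matplotlib installed:

import matplotlib.pyplot as plt

df["age"].plot(kind="hist")                    # histogram of a single column
df.plot.scatter(x="age", y="fare", alpha=0.5)  # scatter plot of two columns
plt.show()                                     # display the figures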
There is no need to loop over all rows of your data table to do calculations. Data manipulations on a column work
elementwise. Adding a column to a DataFrame based on existing data in other columns is straightforward.
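A sketch of adding derived columns, with hypothetical column names:

df["fare_eur"] = df["fare_usd"] * 0.85        # elementwise arithmetic, no loop needed
df["ratio"] = df["fare_usd"] / df["age"]      # combine two existing columns
df["name_upper"] = df["name"].str.upper()     # elementwise string operation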
Basic statistics (mean, median, min, max, counts, . . . ) are easily calculable. These or custom aggregations can be applied to the entire data set, to a sliding window of the data, or grouped by categories. The latter is also known as the split-apply-combine approach.
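For example, with hypothetical columns "age", "fare" and "sex":

df["age"].mean()                  # a single aggregate over one column
df[["age", "fare"]].describe()    # several summary statistics at once
df.groupby("sex")["age"].mean()   # split-apply-combine: mean age per group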
Change the structure of your data table in multiple ways. You can melt() your data table from wide to long/tidy form
or pivot() from long to wide format. With aggregations built-in, a pivot table is created with a single command.
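A sketch, assuming a long-format DataFrame df with hypothetical columns "region", "year" and "sales":

# Long to wide with a built-in aggregation, then back to long (tidy) form.
wide = df.pivot_table(values="sales", index="region",
                      columns="year", aggfunc="mean")
tidy = wide.reset_index().melt(id_vars="region",
                               var_name="year", value_name="sales")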
Multiple tables can be concatenated both column-wise and row-wise, and database-like join/merge operations are provided to combine multiple tables of data.
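A sketch, where df1, df2, orders and customers are hypothetical DataFrames:

import pandas as pd

stacked = pd.concat([df1, df2], axis=0)          # append tables row wise
side_by_side = pd.concat([df1, df2], axis=1)     # combine tables column wise
merged = pd.merge(orders, customers,
                  on="customer_id", how="left")  # database-style join on a key column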
pandas has great support for time series and has an extensive set of tools for working with dates, times, and time-
indexed data.
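For example, assuming a column "datetime" stored as strings and a numeric column "value":

df["datetime"] = pd.to_datetime(df["datetime"])   # parse strings into Timestamps
df["weekday"] = df["datetime"].dt.day_name()      # datetime components via the .dt accessor
monthly = df.set_index("datetime")["value"].resample("M").mean()   # monthly averages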
Data sets do not only contain numerical data. pandas provides a wide range of functions to clean textual data and
extract useful information from it.
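A sketch, assuming a string column "name":

df["name"] = df["name"].str.strip().str.lower()        # clean whitespace and case
df["surname"] = df["name"].str.split(" ").str.get(-1)  # take the last word of each name
mask = df["name"].str.contains("countess")             # boolean mask for a pattern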
1.3 Coming from...
Are you familiar with other software for manipulating tabular data? Learn the pandas-equivalent operations compared to the software you already know.
1.4 Tutorials
1.4.1 Installation
The easiest way to install pandas is to install it as part of the Anaconda distribution, a cross-platform distribution for data analysis and scientific computing. This is the recommended installation method for most users.
Instructions for installing from source, PyPI, ActivePython, various Linux distributions, or a development version are
also provided.
Installing pandas
Installing pandas and the rest of the NumPy and SciPy stack can be a little difficult for inexperienced users.
The simplest way to install not only pandas, but Python and the most popular packages that make up the SciPy stack
(IPython, NumPy, Matplotlib, . . . ) is with Anaconda, a cross-platform (Linux, macOS, Windows) Python distribution
for data analytics and scientific computing.
After running the installer, the user will have access to pandas and the rest of the SciPy stack without needing to install
anything else, and without needing to wait for any software to be compiled.
The previous section outlined how to get pandas installed as part of the Anaconda distribution. However, this approach means you will install well over one hundred packages and involves downloading an installer which is a few hundred megabytes in size.
If you want more control over which packages are installed, or have limited internet bandwidth, then installing pandas with Miniconda may be a better solution.
Conda is the package manager that the Anaconda distribution is built upon. It is both cross-platform and language-agnostic (it can play a similar role to a pip and virtualenv combination).
Miniconda allows you to create a minimal self contained Python installation, and then use the Conda command to
install additional packages.
First, you will need Conda to be installed; downloading and running the Miniconda installer will do this for you. The installer can be found here.
The next step is to create a new conda environment. A conda environment is like a virtualenv that allows you to specify
a specific version of Python and set of libraries. Run the following commands from a terminal window:
This will create a minimal environment with only Python installed in it. To put yourself inside this environment, run:
activate name_of_my_env
The final step required is to install pandas. This can be done with the following command:
If you need packages that are available to pip but not conda, then install pip, and then use pip to install those packages:
Installation instructions for ActivePython can be found here. Versions 2.7, 3.5 and 3.6 include pandas.
The commands in this table will install pandas for Python 3 from your distribution.
However, the packages in the Linux package managers are often a few versions behind, so to get the newest version of pandas, it’s recommended to install using the pip or conda methods described above.
Handling ImportErrors
If you encounter an ImportError, it usually means that Python couldn’t find pandas in the list of available libraries. Python internally has a list of directories it searches through to find packages. You can obtain these directories with:
import sys
sys.path
One way you could be encountering this error is if you have multiple Python installations on your system and you don’t have pandas installed in the Python installation you’re currently using. On Linux/Mac you can run which python in your terminal and it will tell you which Python installation you’re using. If it’s something like “/usr/bin/python”, you’re using the Python from the system, which is not recommended.
It is highly recommended to use conda for quick installation and for package and dependency updates. You can find simple installation instructions for pandas in this document: installation instructions.
See the contributing guide for complete instructions on building from the git source tree. Further, see creating a
development environment if you wish to create a pandas development environment.
pandas is equipped with an exhaustive set of unit tests, covering about 97% of the code base as of this writing. To
run it on your machine to verify that everything is working (and that you have all of the dependencies, soft and hard,
installed), make sure you have pytest >= 5.0.1 and Hypothesis >= 3.58, then run:
>>> pd.test()
running: pytest --skip-slow --skip-network C:\Users\TP\Anaconda3\envs\py36\lib\site-packages\pandas
..................................................................S......
........S................................................................
.........................................................................
Dependencies
Recommended dependencies
• numexpr: for accelerating certain numerical operations. numexpr uses multiple cores as well as smart chunk-
ing and caching to achieve large speedups. If installed, must be Version 2.6.8 or higher.
• bottleneck: for accelerating certain types of nan evaluations. bottleneck uses specialized cython routines
to achieve large speedups. If installed, must be Version 1.2.1 or higher.
Note: You are highly encouraged to install these libraries, as they provide speed improvements, especially when
working with large data sets.
Optional dependencies
pandas has many optional dependencies that are only used for specific methods. For example, pandas.read_hdf() requires the pytables package, while DataFrame.to_markdown() requires the tabulate package. If the optional dependency is not installed, pandas will raise an ImportError when the method requiring that dependency is called.
One of the following combinations of libraries is needed to use the top-level read_html() function:
• BeautifulSoup4 and html5lib
• BeautifulSoup4 and lxml
• BeautifulSoup4 and html5lib and lxml
• Only lxml, although see HTML Table Parsing for reasons as to why you should probably not take this approach.
Warning:
• if you install BeautifulSoup4 you must install either lxml or html5lib or both. read_html() will not work
with only BeautifulSoup4 installed.
• You are highly encouraged to read HTML Table Parsing gotchas. It explains issues surrounding the installation and usage of the above three libraries.
pandas is a Python package providing fast, flexible, and expressive data structures designed to make working with
“relational” or “labeled” data both easy and intuitive. It aims to be the fundamental high-level building block for doing
practical, real-world data analysis in Python. Additionally, it has the broader goal of becoming the most powerful
and flexible open source data analysis/manipulation tool available in any language. It is already well on its way
toward this goal.
pandas is well suited for many different kinds of data:
• Tabular data with heterogeneously-typed columns, as in an SQL table or Excel spreadsheet
• Ordered and unordered (not necessarily fixed-frequency) time series data.
• Arbitrary matrix data (homogeneously typed or heterogeneous) with row and column labels
• Any other form of observational / statistical data sets. The data need not be labeled at all to be placed into a
pandas data structure
The two primary data structures of pandas, Series (1-dimensional) and DataFrame (2-dimensional), handle the
vast majority of typical use cases in finance, statistics, social science, and many areas of engineering. For R users,
DataFrame provides everything that R’s data.frame provides and much more. pandas is built on top of NumPy
and is intended to integrate well within a scientific computing environment with many other 3rd party libraries.
Here are just a few of the things that pandas does well:
• Easy handling of missing data (represented as NaN) in floating point as well as non-floating point data
• Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects
• Automatic and explicit data alignment: objects can be explicitly aligned to a set of labels, or the user can simply
ignore the labels and let Series, DataFrame, etc. automatically align the data for you in computations
• Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both ag-
gregating and transforming data
• Make it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into
DataFrame objects
• Intelligent label-based slicing, fancy indexing, and subsetting of large data sets
• Intuitive merging and joining data sets
• Flexible reshaping and pivoting of data sets
• Hierarchical labeling of axes (possible to have multiple labels per tick)
• Robust IO tools for loading data from flat files (CSV and delimited), Excel files, databases, and saving / loading
data from the ultrafast HDF5 format
• Time series-specific functionality: date range generation and frequency conversion, moving window statistics,
date shifting, and lagging.
Many of these principles are here to address the shortcomings frequently experienced using other languages / scientific
research environments. For data scientists, working with data is typically divided into multiple stages: munging and
cleaning data, analyzing / modeling it, then organizing the results of the analysis into a form suitable for plotting or
tabular display. pandas is the ideal tool for all of these tasks.
Some other notes
• pandas is fast. Many of the low-level algorithmic bits have been extensively tweaked in Cython code. However,
as with anything else generalization usually sacrifices performance. So if you focus on one feature for your
application you may be able to create a faster specialized tool.
• pandas is a dependency of statsmodels, making it an important part of the statistical computing ecosystem in
Python.
• pandas has been used extensively in production in financial applications.
Data structures
The best way to think about the pandas data structures is as flexible containers for lower dimensional data. For
example, DataFrame is a container for Series, and Series is a container for scalars. We would like to be able to insert
and remove objects from these containers in a dictionary-like fashion.
Also, we would like sensible default behaviors for the common API functions which take into account the typical
orientation of time series and cross-sectional data sets. When using the N-dimensional array (ndarrays) to store 2- and
3-dimensional data, a burden is placed on the user to consider the orientation of the data set when writing functions;
axes are considered more or less equivalent (except when C- or Fortran-contiguousness matters for performance). In
pandas, the axes are intended to lend more semantic meaning to the data; i.e., for a particular data set, there is likely
to be a “right” way to orient the data. The goal, then, is to reduce the amount of mental effort required to code up data
transformations in downstream functions.
For example, with tabular data (DataFrame) it is more semantically helpful to think of the index (the rows) and the
columns rather than axis 0 and axis 1. Iterating through the columns of the DataFrame thus results in more readable
code:
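As a small, hedged sketch of such label-based iteration (the DataFrame used here is purely illustrative):

import pandas as pd

df = pd.DataFrame({"height": [1.5, 1.8], "width": [0.7, 0.9]})  # illustrative data

# iterate by column label rather than by axis number
for col in df.columns:
    series = df[col]          # each column is a Series
    print(col, series.dtype)  # do something with the column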
All pandas data structures are value-mutable (the values they contain can be altered) but not always size-mutable. The
length of a Series cannot be changed, but, for example, columns can be inserted into a DataFrame. However, the vast
majority of methods produce new objects and leave the input data untouched. In general we like to favor immutability
where sensible.
Getting support
The first stop for pandas issues and ideas is the Github Issue Tracker. If you have a general question, pandas community
experts can answer through Stack Overflow.
Community
pandas is actively supported today by a community of like-minded individuals around the world who contribute their
valuable time and energy to help make open source pandas possible. Thanks to all of our contributors.
If you’re interested in contributing, please visit the contributing guide.
pandas is a NumFOCUS sponsored project. This will help ensure the success of the development of pandas as a
world-class open-source project and makes it possible to donate to the project.
Project governance
The governance process that pandas project has used informally since its inception in 2008 is formalized in Project
Governance documents. The documents clarify how decisions are made and how the various elements of our commu-
nity interact, including the relationship between open source collaborative development and work that may be funded
by for-profit or non-profit entities.
Wes McKinney is the Benevolent Dictator for Life (BDFL).
Development team
The list of the Core Team members and more detailed information can be found on the people’s page of the governance
repo.
Institutional partners
The information about current institutional partners can be found on the pandas website.
License
Copyright (c) 2008-2011, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData
˓→Development Team
* Redistributions of source code must retain the above copyright notice, this
list of conditions and the following disclaimer.
* Neither the name of the copyright holder nor the names of its
contributors may be used to endorse or promote products derived from
this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR
SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER
CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
To load the pandas package and start working with it, import the package. The community-agreed alias for pandas is pd, so loading pandas as pd is assumed standard practice for all of the pandas documentation.
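The import used throughout these tutorials is simply:

import pandas as pd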
I want to store passenger data of the Titanic. For a number of passengers, I know the name (characters), age (integers)
and sex (male/female) data.
In [2]: df = pd.DataFrame(
...: {
...: "Name": [
...: "Braund, Mr. Owen Harris",
...: "Allen, Mr. William Henry",
...: "Bonnell, Miss. Elizabeth",
...: ],
...: "Age": [22, 35, 58],
...: "Sex": ["male", "male", "female"],
...: }
...: )
In [3]: df
Out[3]:
Name Age Sex
0 Braund, Mr. Owen Harris 22 male
1 Allen, Mr. William Henry 35 male
2 Bonnell, Miss. Elizabeth 58 female
To manually store data in a table, create a DataFrame. When using a Python dictionary of lists, the dictionary keys
will be used as column headers and the values in each list as columns of the DataFrame.
A DataFrame is a 2-dimensional data structure that can store data of different types (including characters, integers,
floating point values, categorical data and more) in columns. It is similar to a spreadsheet, a SQL table or the data.
frame in R.
• The table has 3 columns, each of them with a column label. The column labels are respectively Name, Age and
Sex.
• The column Name consists of textual data with each value a string, the column Age consists of numbers, and the column Sex is textual data.
In spreadsheet software, the table representation of our data would look very similar:
I’m just interested in working with the data in the column Age
In [4]: df["Age"]
Out[4]:
0 22
1 35
2 58
Name: Age, dtype: int64
When selecting a single column of a pandas DataFrame, the result is a pandas Series. To select the column, use
the column label in between square brackets [].
Note: If you are familiar with Python dictionaries, the selection of a single column is very similar to the selection of dictionary values based on the key.
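The ages object shown below was presumably created by selecting the Age column; a sketch of the omitted assignment:

ages = df["Age"]  # select the Age column of the small DataFrame above into a Series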
In [6]: ages
Out[6]:
0 22
1 35
2 58
Name: Age, dtype: int64
A pandas Series has no column labels, as it is just a single column of a DataFrame. A Series does have row
labels.
In [7]: df["Age"].max()
Out[7]: 58
Or to the Series:
In [8]: ages.max()
Out[8]: 58
As illustrated by the max() method, you can do things with a DataFrame or Series. pandas provides a lot of
functionalities, each of them a method you can apply to a DataFrame or Series. As methods are functions, do not
forget to use parentheses ().
I’m interested in some basic statistics of the numerical data of my data table
In [9]: df.describe()
Out[9]:
The describe() method provides a quick overview of the numerical data in a DataFrame. As the Name and Sex
columns are textual data, these are by default not taken into account by the describe() method.
Many pandas operations return a DataFrame or a Series. The describe() method is an example of a pandas
operation returning a pandas Series.
Check more options on describe in the user guide section about aggregations with describe
Note: This is just a starting point. Similar to spreadsheet software, pandas represents data as a table with columns
and rows. Apart from the representation, also the data manipulations and calculations you would do in spreadsheet
software are supported by pandas. Continue reading the next tutorials to get started!
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: This feature has value 0 or 1: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of passenger.
• Sex: Gender of passenger.
• Age: Age of passenger.
• SibSp: Indication that the passenger has siblings and a spouse.
• Parch: Whether a passenger is alone or has family.
• Ticket: Ticket number of passenger.
• Fare: Indicating the fare.
• Cabin: The cabin of passenger.
• Embarked: The embarked category.
pandas provides the read_csv() function to read data stored as a csv file into a pandas DataFrame. pandas
supports many different file formats or data sources out of the box (csv, excel, sql, json, parquet, . . . ), each of them
with the prefix read_*.
Make sure to always have a check on the data after reading it in. When displaying a DataFrame, the first and last 5 rows will be shown by default:
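The read step itself is not shown above; based on the later tutorials it is along the lines of:

titanic = pd.read_csv("data/titanic.csv")  # read the CSV file into a DataFrame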
In [3]: titanic
Out[3]:
PassengerId Survived Pclass Name
˓→ ... Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris
˓→ ... A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th...
˓→ ... PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina
˓→ ... STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel)
˓→ ... 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry
˓→ ... 373450 8.0500 NaN S
.. ... ... ... ...
˓→ ... ... ... ... ...
886 887 0 2 Montvila, Rev. Juozas
˓→ ... 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith
˓→ ... 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie"
˓→ ... W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell
˓→ ... 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick
˓→ ... 370376 7.7500 NaN Q
In [4]: titanic.head(8)
Out[4]:
PassengerId Survived Pclass Name .
˓→.. Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris .
˓→.. A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... .
˓→.. PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina .
˓→.. STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) .
˓→.. 113803 53.1000 C123 S
[8 rows x 12 columns]
To see the first N rows of a DataFrame, use the head() method with the required number of rows (in this case 8)
as argument.
Note: Interested in the last N rows instead? pandas also provides a tail() method. For example, titanic.tail(10) will return the last 10 rows of the DataFrame.
A check on how pandas interpreted each of the column data types can be done by requesting the pandas dtypes
attribute:
In [5]: titanic.dtypes
Out[5]:
PassengerId int64
Survived int64
Pclass int64
Name object
Sex object
Age float64
SibSp int64
Parch int64
Ticket object
Fare float64
Cabin object
Embarked object
dtype: object
For each of the columns, the data type in use is listed. The data types in this DataFrame are integers (int64), floats (float64) and strings (object).
Note: When asking for the dtypes, no brackets are used! dtypes is an attribute of a DataFrame and Series. Attributes of a DataFrame or Series do not need brackets. Attributes represent a characteristic of a DataFrame/Series, whereas a method (which requires brackets) does something with the DataFrame/Series, as introduced in the first tutorial.
Whereas read_* functions are used to read data into pandas, the to_* methods are used to store data. The to_excel() method stores the data as an Excel file. In the example here, the sheet_name is named passengers instead of the default Sheet1. By setting index=False the row index labels are not saved in the spreadsheet. The equivalent read function read_excel() will reload the data to a DataFrame:
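A sketch of the write/read round trip described here (the file name titanic.xlsx is an assumption):

titanic.to_excel("titanic.xlsx", sheet_name="passengers", index=False)  # write without the row index
titanic = pd.read_excel("titanic.xlsx", sheet_name="passengers")        # reload the same sheet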
In [8]: titanic.head()
Out[8]:
PassengerId Survived Pclass Name .
˓→.. Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris .
˓→.. A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... .
˓→.. PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina .
˓→.. STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) .
˓→.. 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry .
˓→.. 373450 8.0500 NaN S
[5 rows x 12 columns]
In [9]: titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
The method info() provides technical information about a DataFrame, so let’s explain the output in more detail:
• It is indeed a DataFrame.
• There are 891 entries, i.e. 891 rows.
• Each row has a row label (aka the index) with values ranging from 0 to 890.
• The table has 12 columns. Most columns have a value for each of the rows (all 891 values are non-null). Some columns do have missing values and fewer than 891 non-null values.
• The columns Name, Sex, Cabin and Embarked consist of textual data (strings, aka object). The other columns are numerical data, some of them whole numbers (aka integer) and others real numbers (aka float).
• The kind of data (characters, integers,. . . ) in the different columns are summarized by listing the dtypes.
• The approximate amount of RAM used to hold the DataFrame is provided as well.
• Getting data in to pandas from many different file formats or data sources is supported by read_* functions.
• Exporting data out of pandas is provided by different to_* methods.
• The head/tail/info methods and the dtypes attribute are convenient for a first check.
For a complete overview of the input and output possibilities from and to pandas, see the user guide section about
reader and writer functions.
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: This feature has value 0 and 1. 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of passenger.
• Sex: Gender of passenger.
• Age: Age of passenger.
• SibSp: Indication that passengers have siblings and spouses.
• Parch: Whether a passenger is alone or has a family.
• Ticket: Ticket number of passenger.
• Fare: Indicating the fare.
• Cabin: The cabin of passenger.
• Embarked: The embarked category.
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass Name .
˓→.. Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris .
˓→.. A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... .
˓→.. PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina .
˓→.. STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) .
˓→.. 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry .
˓→.. 373450 8.0500 NaN S
[5 rows x 12 columns]
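The ages object used below is presumably the single Age column of the Titanic table; a sketch of the omitted selection:

ages = titanic["Age"]  # selecting a single column returns a Series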
In [5]: ages.head()
Out[5]:
0 22.0
1 38.0
2 26.0
3 35.0
4 35.0
Name: Age, dtype: float64
To select a single column, use square brackets [] with the column name of the column of interest.
Each column in a DataFrame is a Series. As a single column is selected, the returned object is a pandas Series.
We can verify this by checking the type of the output:
In [6]: type(titanic["Age"])
Out[6]: pandas.core.series.Series
In [7]: titanic["Age"].shape
Out[7]: (891,)
DataFrame.shape is an attribute (remember the tutorial on reading and writing; do not use parentheses for attributes) of a pandas Series and DataFrame containing the number of rows and columns: (nrows, ncolumns). A pandas Series is 1-dimensional and only the number of rows is returned.
I’m interested in the age and sex of the Titanic passengers.
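A sketch of the selection that presumably produces the age_sex subset shown below:

age_sex = titanic[["Age", "Sex"]]  # a list of column names returns a DataFrame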
In [9]: age_sex.head()
Out[9]:
Age Sex
0 22.0 male
1 38.0 female
2 26.0 female
3 35.0 female
4 35.0 male
To select multiple columns, use a list of column names within the selection brackets [].
Note: The inner square brackets define a Python list with column names, whereas the outer brackets are used to select
the data from a pandas DataFrame as seen in the previous example.
The selection returned a DataFrame with 891 rows and 2 columns. Remember, a DataFrame is 2-dimensional
with both a row and column dimension.
For basic information on indexing, see the user guide section on indexing and selecting data.
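The above_35 table shown below keeps only the passengers older than 35 years; a hedged sketch of the omitted selection:

above_35 = titanic[titanic["Age"] > 35]  # keep rows where the condition is True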
In [13]: above_35.head()
Out[13]:
PassengerId Survived Pclass Name
˓→ Sex ... Parch Ticket Fare Cabin Embarked
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th...
˓→female ... 0 PC 17599 71.2833 C85 C
6 7 0 1 McCarthy, Mr. Timothy J
˓→ male ... 0 17463 51.8625 E46 S
11 12 1 1 Bonnell, Miss. Elizabeth
˓→female ... 0 113783 26.5500 C103 S
13 14 0 3 Andersson, Mr. Anders Johan
˓→ male ... 5 347082 31.2750 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome)
˓→female ... 0 248706 16.0000 NaN S
[5 rows x 12 columns]
To select rows based on a conditional expression, use a condition inside the selection brackets [].
The condition inside the selection brackets titanic["Age"] > 35 checks for which rows the Age column has a
value larger than 35:
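A sketch of that expression on its own:

titanic["Age"] > 35  # a boolean Series: True where the passenger is older than 35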
The output of the conditional expression (>, but also ==, !=, <, <=,. . . would work) is actually a pandas Series of
boolean values (either True or False) with the same number of rows as the original DataFrame. Such a Series
of boolean values can be used to filter the DataFrame by putting it in between the selection brackets []. Only rows
for which the value is True will be selected.
We know from before that the original Titanic DataFrame consists of 891 rows. Let’s have a look at the number of
rows which satisfy the condition by checking the shape attribute of the resulting DataFrame above_35:
In [15]: above_35.shape
Out[15]: (217, 12)
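The class_23 subset shown below is presumably created with isin(); a sketch:

class_23 = titanic[titanic["Pclass"].isin([2, 3])]  # rows where Pclass is either 2 or 3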
In [17]: class_23.head()
Out[17]:
PassengerId Survived Pclass Name Sex Age SibSp
˓→ Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1
˓→ 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0
˓→ 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0
˓→ 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0
˓→ 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3
˓→ 1 349909 21.0750 NaN S
Similar to the conditional expression, the isin() conditional function returns True for each row where the values are in the provided list. To filter the rows based on such a function, use the conditional function inside the selection brackets []. In this case, the condition inside the selection brackets titanic["Pclass"].isin([2, 3]) checks for which rows the Pclass column is either 2 or 3.
The above is equivalent to filtering by rows for which the class is either 2 or 3 and combining the two statements with
an | (or) operator:
In [18]: class_23 = titanic[(titanic["Pclass"] == 2) | (titanic["Pclass"] == 3)]
In [19]: class_23.head()
Out[19]:
PassengerId Survived Pclass Name Sex Age SibSp
˓→ Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1
˓→ 0 A/5 21171 7.2500 NaN S
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0
˓→ 0 STON/O2. 3101282 7.9250 NaN S
4 5 0 3 Allen, Mr. William Henry male 35.0 0
˓→ 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0
˓→ 0 330877 8.4583 NaN Q
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3
˓→ 1 349909 21.0750 NaN S
Note: When combining multiple conditional statements, each condition must be surrounded by parentheses (). Moreover, you cannot use or/and; you need to use the or operator | and the and operator &.
See the dedicated section in the user guide about boolean indexing or about the isin function.
I want to work with passenger data for which the age is known.
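The age_no_na subset shown below is presumably created with notna(); a sketch:

age_no_na = titanic[titanic["Age"].notna()]  # keep rows where Age is not missing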
In [21]: age_no_na.head()
Out[21]:
PassengerId Survived Pclass Name .
˓→.. Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris .
˓→.. A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... .
˓→.. PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina .
˓→.. STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) .
˓→.. 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry .
˓→.. 373450 8.0500 NaN S
[5 rows x 12 columns]
The notna() conditional function returns True for each row where the value is not a null value. As such, this can be combined with the selection brackets [] to filter the data table.
You might wonder what actually changed, as the first 5 lines are still the same values. One way to verify is to check if
the shape has changed:
In [22]: age_no_na.shape
Out[22]: (714, 12)
For more dedicated functions on missing values, see the user guide section about handling missing data.
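The adult_names object shown below holds the names of the passengers older than 35 years; a hedged sketch of the omitted loc selection:

adult_names = titanic.loc[titanic["Age"] > 35, "Name"]  # row condition before the comma, column label after it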
In [24]: adult_names.head()
Out[24]:
1 Cumings, Mrs. John Bradley (Florence Briggs Th...
6 McCarthy, Mr. Timothy J
11 Bonnell, Miss. Elizabeth
13 Andersson, Mr. Anders Johan
15 Hewlett, Mrs. (Mary D Kingcome)
Name: Name, dtype: object
In this case, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient
anymore. The loc/iloc operators are required in front of the selection brackets []. When using loc/iloc, the
part before the comma is the rows you want, and the part after the comma is the columns you want to select.
When using the column names, row labels or a condition expression, use the loc operator in front of the selection
brackets []. For both the part before and after the comma, you can use a single label, a list of labels, a slice of labels,
a conditional expression or a colon. Using a colon specifies you want to select all rows or columns.
Again, a subset of both rows and columns is made in one go and just using selection brackets [] is not sufficient
anymore. When specifically interested in certain rows and/or columns based on their position in the table, use the
iloc operator in front of the selection brackets [].
When selecting specific rows and/or columns with loc or iloc, new values can be assigned to the selected data. For
example, to assign the name anonymous to the first 3 elements of the third column:
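A sketch of that assignment (Name is the fourth column, i.e. position 3):

titanic.iloc[0:3, 3] = "anonymous"  # overwrite the first 3 values of the Name column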
In [27]: titanic.head()
Out[27]:
PassengerId Survived Pclass Name ...
˓→ Ticket Fare Cabin Embarked
0 1 0 3 anonymous ...
˓→ A/5 21171 7.2500 NaN S
1 2 1 1 anonymous ...
˓→ PC 17599 71.2833 C85 C
2 3 1 3 anonymous ...
˓→STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) ...
˓→ 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry ...
˓→ 373450 8.0500 NaN S
[5 rows x 12 columns]
See the user guide section on different choices for indexing to get more insight into the usage of loc and iloc.
• When selecting subsets of data, square brackets [] are used.
• Inside these brackets, you can use a single column/row label, a list of column/row labels, a slice of labels, a
conditional expression or a colon.
• Select specific rows and/or columns using loc when using the row and column names
• Select specific rows and/or columns using iloc when using the positions in the table
• You can assign new values to a selection based on loc/iloc.
A full overview of indexing is provided in the user guide pages on indexing and selecting data.
For this tutorial, air quality data about NO2 is used, made available by openaq and downloaded using the py-openaq package. The air_quality_no2.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.
In [4]: air_quality.head()
Out[4]:
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN
Note: The index_col and parse_dates parameters of the read_csv function are used to define the first (0th) column as the index of the resulting DataFrame and to convert the dates in that column to Timestamp objects, respectively.
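A hedged sketch of such a read call, assuming the file name mentioned above:

air_quality = pd.read_csv("data/air_quality_no2.csv", index_col=0, parse_dates=True)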
In [5]: air_quality.plot()
Out[5]: <AxesSubplot:xlabel='datetime'>
With a DataFrame, pandas creates by default one line plot for each of the columns with numeric data.
I want to plot only the columns of the data table with the data from Paris.
In [6]: air_quality["station_paris"].plot()
Out[6]: <AxesSubplot:xlabel='datetime'>
To plot a specific column, use the selection method of the subset data tutorial in combination with the plot()
method. Hence, the plot() method works on both Series and DataFrame.
I want to visually compare the NO2 values measured in London versus Paris.
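One hedged way to make such a comparison is a scatter plot of the two station columns:

air_quality.plot.scatter(x="station_london", y="station_paris", alpha=0.5)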
Apart from the default line plot when using the plot function, a number of alternatives are available to plot data.
Let’s use some standard Python to get an overview of the available plot methods:
In [8]: [
...: method_name
...: for method_name in dir(air_quality.plot)
...: if not method_name.startswith("_")
...: ]
...:
Out[8]:
['area',
'bar',
'barh',
'box',
'density',
'hexbin',
'hist',
'kde',
'line',
'pie',
'scatter']
Note: In many development environments as well as IPython and Jupyter Notebook, use the TAB button to get an
overview of the available methods, for example air_quality.plot. + TAB.
One of the options is DataFrame.plot.box(), which refers to a boxplot. The box method is applicable on the
air quality example data:
In [9]: air_quality.plot.box()
Out[9]: <AxesSubplot:>
For an introduction to plots other than the default line plot, see the user guide section about supported plot styles.
I want each of the columns in a separate subplot.
Separate subplots for each of the data columns are supported by the subplots argument of the plot functions. It is worthwhile to have a look at the built-in options available in each of the pandas plot functions.
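A sketch of the subplots usage (the figure size is an illustrative choice):

air_quality.plot.area(figsize=(12, 4), subplots=True)  # one subplot per data column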
Some more formatting options are explained in the user guide section on plot formatting.
I want to further customize, extend or save the resulting plot.
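The axs object used below comes from Matplotlib; a sketch of the setup, assuming the usual pyplot import:

import matplotlib.pyplot as plt

fig, axs = plt.subplots(figsize=(12, 4))  # create an empty Matplotlib Figure and Axes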
In [12]: air_quality.plot.area(ax=axs)
Out[12]: <AxesSubplot:xlabel='datetime'>
In [14]: fig.savefig("no2_concentrations.png")
Each of the plot objects created by pandas is a Matplotlib object. As Matplotlib provides plenty of options to customize plots, making the link between pandas and Matplotlib explicit enables all the power of Matplotlib to be applied to the plot. This strategy is applied in the example above.
For this tutorial, air quality data about NO2 is used, made available by openaq and downloaded using the py-openaq package. The air_quality_no2.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.
In [3]: air_quality.head()
Out[3]:
station_antwerp station_paris station_london
datetime
2019-05-07 02:00:00 NaN NaN 23.0
2019-05-07 03:00:00 50.5 25.0 19.0
2019-05-07 04:00:00 45.0 27.7 19.0
2019-05-07 05:00:00 NaN 50.4 16.0
2019-05-07 06:00:00 NaN 61.9 NaN
In [5]: air_quality.head()
Out[5]:
station_antwerp station_paris station_london london_mg_per_cubic
datetime
To create a new column, use the [] brackets with the new column name at the left side of the assignment.
Note: The calculation of the values is done element-wise. This means all values in the given column are multiplied by the value 1.882 at once. You do not need to use a loop to iterate over each of the rows!
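A sketch of the column creation discussed here, using the conversion factor 1.882 stated above:

air_quality["london_mg_per_cubic"] = air_quality["station_london"] * 1.882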
I want to check the ratio of the values in Paris versus Antwerp and save the result in a new column
In [6]: air_quality["ratio_paris_antwerp"] = (
...: air_quality["station_paris"] / air_quality["station_antwerp"]
...: )
...:
In [7]: air_quality.head()
The calculation is again element-wise, so the / is applied for the values in each row.
Other mathematical operators (+, -, *, /) and logical operators (<, >, ==, . . . ) also work element-wise. The latter was already used in the subset data tutorial to filter rows of a table using a conditional expression.
I want to rename the data columns to the corresponding station identifiers used by openAQ
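A sketch of the rename, mapping the original column names to the station identifiers visible in the output below:

air_quality_renamed = air_quality.rename(
    columns={
        "station_antwerp": "BETR801",
        "station_paris": "FR04014",
        "station_london": "London Westminster",
    }
)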
In [9]: air_quality_renamed.head()
Out[9]:
BETR801 FR04014 London Westminster london_mg_per_cubic ratio_paris_antwerp
datetime
The rename() function can be used for both row labels and column labels. Provide a dictionary in which the keys are the current names and the values are the new names to update the corresponding names.
The mapping should not be restricted to fixed names only, but can be a mapping function as well. For example,
converting the column names to lowercase letters can be done using a function as well:
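A sketch, passing a function instead of a dictionary:

air_quality_renamed = air_quality_renamed.rename(columns=str.lower)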
Details about column or row label renaming are provided in the user guide section on renaming labels.
• Create a new column by assigning the output to the DataFrame with a new column name in between the [].
• Operations are element-wise, no need to loop over rows.
• Use rename with a dictionary or function to rename row labels or column names.
The user guide contains a separate section on column addition and deletion.
In [1]: import pandas as pd
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: This feature has value 0 or 1: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of passenger.
• Sex: Gender of passenger.
• Age: Age of passenger.
• SibSp: Indication that the passenger has siblings and a spouse.
• Parch: Whether a passenger is alone or has family.
• Ticket: Ticket number of passenger.
• Fare: Indicating the fare.
• Cabin: The cabin of passenger.
• Embarked: The embarked category.
In [2]: titanic = pd.read_csv("data/titanic.csv")
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass Name .
˓→.. Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris .
˓→.. A/5 21171 7.2500 NaN S
[5 rows x 12 columns]
Aggregating statistics
In [4]: titanic["Age"].mean()
Out[4]: 29.69911764705882
Different statistics are available and can be applied to columns with numerical data. Operations in general exclude
missing data and operate across rows by default.
What is the median age and ticket fare price of the Titanic passengers?
The statistic applied to multiple columns of a DataFrame (the selection of two columns returns a DataFrame; see the subset data tutorial) is calculated for each numeric column.
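A sketch of such a calculation on two columns at once:

titanic[["Age", "Fare"]].median()  # the median of each selected numeric column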
The aggregating statistic can be calculated for multiple columns at the same time. Remember the describe function from the first tutorial?
Instead of the predefined statistics, specific combinations of aggregating statistics for given columns can be defined
using the DataFrame.agg() method:
In [7]: titanic.agg(
...: {
...: "Age": ["min", "max", "median", "skew"],
...: "Fare": ["min", "max", "median", "mean"],
...: }
...: )
...:
Out[7]:
Age Fare
min 0.420000 0.000000
max 80.000000 512.329200
median 28.000000 14.454200
skew 0.389108 NaN
mean NaN 32.204208
Details about descriptive statistics are provided in the user guide section on descriptive statistics.
What is the average age for male versus female Titanic passengers?
As our interest is the average age for each gender, a subselection on these two columns is made first: titanic[[
"Sex", "Age"]]. Next, the groupby() method is applied on the Sex column to make a group per category.
The average age for each gender is calculated and returned.
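A sketch of that groupby operation:

titanic[["Sex", "Age"]].groupby("Sex").mean()  # average age per gender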
Calculating a given statistic (e.g. mean age) for each category in a column (e.g. male/female in the Sex column) is a common pattern. The groupby method is used to support this type of operation. More generally, this fits in the split-apply-combine pattern:
• Split the data into groups
• Apply a function to each group independently
• Combine the results into a data structure
The apply and combine steps are typically done together in pandas.
In the previous example, we explicitly selected the 2 columns first. If not, the mean method is applied to each column containing numerical data:
In [9]: titanic.groupby("Sex").mean()
Out[9]:
PassengerId Survived Pclass Age SibSp Parch Fare
Sex
female 431.028662 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818
male 454.147314 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893
It does not make much sense to get the average value of the Pclass. If we are only interested in the average age for each gender, the selection of columns (rectangular brackets [] as usual) is supported on the grouped data as well:
In [10]: titanic.groupby("Sex")["Age"].mean()
Out[10]:
Sex
female 27.915709
male 30.726645
Name: Age, dtype: float64
Note: The Pclass column contains numerical data but actually represents 3 categories (or factors) with respectively
the labels ‘1’, ‘2’ and ‘3’. Calculating statistics on these does not make much sense. Therefore, pandas provides a
Categorical data type to handle this type of data. More information is provided in the user guide Categorical data
section.
What is the mean ticket fare price for each of the sex and cabin class combinations?
Grouping can be done by multiple columns at the same time. Provide the column names as a list to the groupby()
method.
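A sketch of grouping by both columns and aggregating the fare:

titanic.groupby(["Sex", "Pclass"])["Fare"].mean()  # mean fare per sex/class combination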
A full description on the split-apply-combine approach is provided in the user guide section on groupby operations.
In [12]: titanic["Pclass"].value_counts()
Out[12]:
3 491
1 216
2 184
Name: Pclass, dtype: int64
The value_counts() method counts the number of records for each category in a column.
The function is a shortcut, as it is actually a groupby operation in combination with counting of the number of records
within each group:
In [13]: titanic.groupby("Pclass")["Pclass"].count()
Out[13]:
Pclass
1 216
2 184
3 491
Name: Pclass, dtype: int64
Note: Both size and count can be used in combination with groupby. Whereas size includes NaN values and
just provides the number of rows (size of the table), count excludes the missing values. In the value_counts
method, use the dropna argument to include or exclude the NaN values.
The user guide has a dedicated section on value_counts; see the page on discretization.
• Aggregation statistics can be calculated on entire columns or rows
• groupby provides the power of the split-apply-combine pattern
• value_counts is a convenient shortcut to count the number of entries in each category of a variable
A full description on the split-apply-combine approach is provided in the user guide pages about groupby operations.
In [1]: import pandas as pd
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: This feature has value 0 or 1: 0 for not survived and 1 for survived.
• Pclass: There are 3 classes: Class 1, Class 2 and Class 3.
• Name: Name of passenger.
• Sex: Gender of passenger.
• Age: Age of passenger.
• SibSp: Indication that the passenger has siblings and a spouse.
• Parch: Whether a passenger is alone or has family.
• Ticket: Ticket number of passenger.
• Fare: Indicating the fare.
• Cabin: The cabin of passenger.
• Embarked: The embarked category.
In [2]: titanic = pd.read_csv("data/titanic.csv")
In [3]: titanic.head()
Out[3]:
PassengerId Survived Pclass Name .
˓→.. Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris .
˓→.. A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... .
˓→.. PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina .
˓→.. STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) .
˓→.. 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry .
˓→.. 373450 8.0500 NaN S
[5 rows x 12 columns]
This tutorial uses air quality data about NO2 and Particulate matter less than 2.5 micrometers, made available by openaq and downloaded using the py-openaq package. The air_quality_long.csv data set provides NO2 and PM25 values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.
The air-quality data set has the following columns:
• city: city where the sensor is used, either Paris, Antwerp or London
• country: country where the sensor is used, either FR, BE or GB
• location: the id of the sensor, either FR04014, BETR801 or London Westminster
• parameter: the parameter measured by the sensor, either NO2 or Particulate matter
• value: the measured value
• unit: the unit of the measured parameter, in this case ‘µg/m3 ’
and the index of the DataFrame is datetime, the datetime of the measurement.
Note: The air-quality data is provided in a so-called long format data representation with each observation on a
separate row and each variable a separate column of the data table. The long/narrow format is also known as the tidy
data format.
In [5]: air_quality.head()
Out[5]:
city country location parameter value unit
date.utc
2019-06-18 06:00:00+00:00 Antwerpen BE BETR801 pm25 18.0 µg/m3
2019-06-17 08:00:00+00:00 Antwerpen BE BETR801 pm25 6.5 µg/m3
2019-06-17 07:00:00+00:00 Antwerpen BE BETR801 pm25 18.5 µg/m3
2019-06-17 06:00:00+00:00 Antwerpen BE BETR801 pm25 16.0 µg/m3
2019-06-17 05:00:00+00:00 Antwerpen BE BETR801 pm25 7.5 µg/m3
I want to sort the Titanic data according to the age of the passengers.
In [6]: titanic.sort_values(by="Age").head()
Out[6]:
PassengerId Survived Pclass Name Sex Age
˓→SibSp Parch Ticket Fare Cabin Embarked
803 804 1 3 Thomas, Master. Assad Alexander male 0.42
˓→ 0 1 2625 8.5167 NaN C
755 756 1 2 Hamalainen, Master. Viljo male 0.67
˓→ 1 1 250649 14.5000 NaN S
I want to sort the Titanic data according to the cabin class and age in descending order.
In [7]: titanic.sort_values(by=['Pclass', 'Age'], ascending=False).head()
Out[7]:
PassengerId Survived Pclass Name Sex Age SibSp
˓→Parch Ticket Fare Cabin Embarked
851 852 0 3 Svensson, Mr. Johan male 74.0 0
˓→ 0 347060 7.7750 NaN S
116 117 0 3 Connors, Mr. Patrick male 70.5 0
˓→ 0 370369 7.7500 NaN Q
280 281 0 3 Duane, Mr. Frank male 65.0 0
˓→ 0 336439 7.7500 NaN Q
483 484 1 3 Turkula, Mrs. (Hedwig) female 63.0 0
˓→ 0 4134 9.5875 NaN S
326 327 0 3 Nysveen, Mr. Johan Hansen male 61.0 0
˓→ 0 345364 6.2375 NaN S
With DataFrame.sort_values(), the rows in the table are sorted according to the defined column(s). The index will follow the row order. More details about sorting of tables are provided in the user guide section on sorting data.
Let’s use a small subset of the air quality data set. We focus on NO2 data and only use the first two measurements of each location (i.e. the head of each group). The subset of data will be called no2_subset.
# filter for no2 data only
In [8]: no2 = air_quality[air_quality["parameter"] == "no2"]
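A hedged sketch of how no2_subset is presumably built, taking the first two rows per location group:

no2_subset = no2.sort_index().groupby(["location"]).head(2)  # first 2 measurements per station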
In [10]: no2_subset
Out[10]:
city country location parameter value unit
date.utc
I want the values for the three stations as separate columns next to each other
The pivot() function is purely reshaping of the data: a single value for each index/column combination is required.
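A sketch of that reshaping step on the subset:

no2_subset.pivot(columns="location", values="value")  # one column per measurement station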
As pandas supports plotting of multiple columns (see the plotting tutorial) out of the box, the conversion from long to wide table format enables the plotting of the different time series at the same time:
In [12]: no2.head()
Out[12]:
city country location parameter value unit
date.utc
2019-06-21 00:00:00+00:00 Paris FR FR04014 no2 20.0 µg/m3
2019-06-20 23:00:00+00:00 Paris FR FR04014 no2 21.8 µg/m3
2019-06-20 22:00:00+00:00 Paris FR FR04014 no2 26.5 µg/m3
2019-06-20 21:00:00+00:00 Paris FR FR04014 no2 24.9 µg/m3
2019-06-20 20:00:00+00:00 Paris FR FR04014 no2 21.4 µg/m3
Note: When the index parameter is not defined, the existing index (row labels) is used.
For more information about pivot(), see the user guide section on pivoting DataFrame objects.
Pivot table
I want the mean concentrations for NO2 and PM2.5 in each of the stations in table form
In [14]: air_quality.pivot_table(
....: values="value", index="location", columns="parameter", aggfunc="mean"
....: )
....:
Out[14]:
parameter no2 pm25
location
BETR801 26.950920 23.169492
FR04014 29.374284 NaN
London Westminster 29.740050 13.443568
In the case of pivot(), the data is only rearranged. When multiple values need to be aggregated (in this specific case, the values on different time steps), pivot_table() can be used, providing an aggregation function (e.g. mean).
In [15]: air_quality.pivot_table(
....: values="value",
....: index="location",
....: columns="parameter",
....: aggfunc="mean",
....: margins=True,
....: )
....:
Out[15]:
parameter no2 pm25 All
location
BETR801 26.950920 23.169492 24.982353
FR04014 29.374284 NaN 29.374284
London Westminster 29.740050 13.443568 21.491708
All 29.430316 14.386849 24.222743
For more information about pivot_table(), see the user guide section on pivot tables.
Note: In case you are wondering, pivot_table() is indeed directly linked to groupby(). The same result can
be derived by grouping on both parameter and location:
air_quality.groupby(["parameter", "location"]).mean()
Have a look at groupby() in combination with unstack() at the user guide section on combining stats and
groupby.
Starting again from the wide format table created in the previous section:
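That wide-format table was presumably created along these lines (reset_index turns the datetime index back into a regular column, matching the output below):

no2_pivoted = no2.pivot(columns="location", values="value").reset_index()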
In [17]: no2_pivoted.head()
Out[17]:
location date.utc BETR801 FR04014 London Westminster
0 2019-04-09 01:00:00+00:00 22.5 24.4 NaN
1 2019-04-09 02:00:00+00:00 53.5 27.4 67.0
2 2019-04-09 03:00:00+00:00 54.5 34.2 67.0
3 2019-04-09 04:00:00+00:00 34.5 48.5 41.0
4 2019-04-09 05:00:00+00:00 46.5 59.5 41.0
I want to collect all air quality NO2 measurements in a single column (long format)
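A hedged sketch of the melt that produces the long table shown below:

no_2 = no2_pivoted.melt(id_vars="date.utc")  # all columns except date.utc are stacked into one value column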
In [19]: no_2.head()
Out[19]:
date.utc location value
The pandas.melt() method on a DataFrame converts the data table from wide format to long format. The column headers become the variable names in a newly created column.
The solution is the short version of how to apply pandas.melt(). The method will melt all columns NOT mentioned in id_vars together into two columns: a column with the column header names and a column with the values themselves. The latter column gets by default the name value.
The pandas.melt() method can be defined in more detail:
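A sketch of a more explicit call, matching the column names in the output below:

no_2 = no2_pivoted.melt(
    id_vars="date.utc",
    value_vars=["BETR801", "FR04014", "London Westminster"],
    value_name="NO_2",
    var_name="id_location",
)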
In [21]: no_2.head()
Out[21]:
date.utc id_location NO_2
0 2019-04-09 01:00:00+00:00 BETR801 22.5
1 2019-04-09 02:00:00+00:00 BETR801 53.5
2 2019-04-09 03:00:00+00:00 BETR801 54.5
3 2019-04-09 04:00:00+00:00 BETR801 34.5
4 2019-04-09 05:00:00+00:00 BETR801 46.5
For this tutorial, air quality data about NO2 is used, made available by openaq and downloaded using the py-openaq package.
The air_quality_no2_long.csv data set provides NO2 values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.
In [4]: air_quality_no2.head()
Out[4]:
date.utc location parameter value
0 2019-06-21 00:00:00+00:00 FR04014 no2 20.0
1 2019-06-20 23:00:00+00:00 FR04014 no2 21.8
2 2019-06-20 22:00:00+00:00 FR04014 no2 26.5
3 2019-06-20 21:00:00+00:00 FR04014 no2 24.9
4 2019-06-20 20:00:00+00:00 FR04014 no2 21.4
For this tutorial, air quality data about Particulate matter less than 2.5 micrometers is used, made available by openaq and downloaded using the py-openaq package.
The air_quality_pm25_long.csv data set provides PM25 values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp and London.
In [7]: air_quality_pm25.head()
Out[7]:
date.utc location parameter value
0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
1 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5
2 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5
3 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0
4 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5
Concatenating objects
I want to combine the measurements of NO2 and PM25, two tables with a similar structure, in a single table
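A sketch of the row-wise concatenation of the two tables:

air_quality = pd.concat([air_quality_pm25, air_quality_no2], axis=0)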
In [9]: air_quality.head()
Out[9]:
date.utc location parameter value
The concat() function performs concatenation operations of multiple tables along one of the axes (row-wise or column-wise).
By default concatenation is along axis 0, so the resulting table combines the rows of the input tables. Let’s check the
shape of the original and the concatenated tables to verify the operation:
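A sketch of such a check:

print("pm25 table shape:", air_quality_pm25.shape)
print("no2 table shape:", air_quality_no2.shape)
print("concatenated shape:", air_quality.shape)  # the row counts of both tables add up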
Note: The axis argument occurs in a number of pandas methods that can be applied along an axis. A DataFrame has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1). Most operations like concatenation or summary statistics are by default across rows (axis 0), but can be applied across columns as well.
Sorting the table on the datetime information also illustrates the combination of both tables, with the parameter column defining the origin of the table (either no2 from table air_quality_no2 or pm25 from table air_quality_pm25):
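A sketch of the sort on the datetime column:

air_quality = air_quality.sort_values("date.utc")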
In [14]: air_quality.head()
Out[14]:
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
1098 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
1109 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0
In this specific example, the parameter column provided by the data ensures that each of the original tables can be identified. This is not always the case. The concat function provides a convenient solution with the keys argument, adding an additional (hierarchical) row index. For example:
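A sketch of concat with the keys argument, matching the PM25 label visible in the output below:

air_quality_ = pd.concat([air_quality_pm25, air_quality_no2], keys=["PM25", "NO2"])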
In [16]: air_quality_.head()
Out[16]:
date.utc location parameter value
PM25 0 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0
Note: The existence of multiple row/column indices at the same time has not been mentioned within these tutorials.
Hierarchical indexing or MultiIndex is an advanced and powerful pandas feature to analyze higher dimensional data.
Multi-indexing is out of scope for this pandas introduction. For the moment, remember that the function reset_index can be used to convert any level of an index to a column, e.g. air_quality.reset_index(level=0).
Feel free to dive into the world of multi-indexing at the user guide section on advanced indexing.
More options on table concatenation (row and column wise) and how concat can be used to define the logic (union
or intersection) of the indexes on the other axes is provided at the section on object concatenation.
Add the station coordinates, provided by the stations metadata table, to the corresponding rows in the measurements
table.
Warning: The air quality measurement station coordinates are stored in a data file air_quality_stations.csv, downloaded using the py-openaq package.
In [18]: stations_coord.head()
Out[18]:
location coordinates.latitude coordinates.longitude
0 BELAL01 51.23619 4.38522
1 BELHB23 51.17030 4.34100
2 BELLD01 51.10998 5.00486
3 BELLD02 51.12038 5.02155
4 BELR833 51.32766 4.36226
Note: The stations used in this example (FR04014, BETR801 and London Westminster) are just three entries listed in the metadata table. We only want to add the coordinates of these three to the measurements table, each on the corresponding rows of the air_quality table.
In [19]: air_quality.head()
Out[19]:
date.utc location parameter value
2067 2019-05-07 01:00:00+00:00 London Westminster no2 23.0
1003 2019-05-07 01:00:00+00:00 FR04014 no2 25.0
100 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5
1098 2019-05-07 01:00:00+00:00 BETR801 no2 50.5
1109 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0
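The merged table shown next presumably results from a left join on the common location column; a sketch:

air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")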
In [21]: air_quality.head()
Out[21]:
                    date.utc            location parameter  value  coordinates.latitude  coordinates.longitude
0  2019-05-07 01:00:00+00:00  London Westminster       no2   23.0              51.49467               -0.13193
1  2019-05-07 01:00:00+00:00             FR04014       no2   25.0              48.83724                2.39390
2  2019-05-07 01:00:00+00:00             FR04014       no2   25.0              48.83722                2.39390
3  2019-05-07 01:00:00+00:00             BETR801      pm25   12.5              51.20966                4.43182
4  2019-05-07 01:00:00+00:00             BETR801       no2   50.5              51.20966                4.43182
Using the merge() function, for each of the rows in the air_quality table, the corresponding coordinates are
added from the air_quality_stations_coord table. Both tables have the column location in common
which is used as a key to combine the information. By choosing the left join, only the locations available in the
air_quality (left) table, i.e. FR04014, BETR801 and London Westminster, end up in the resulting table. The
merge function supports multiple join options similar to database-style operations.
Add the parameter full description and name, provided by the parameters metadata table, to the measurements table.
Warning: The air quality parameters metadata are stored in a data file air_quality_parameters.csv,
downloaded using the py-openaq package.
In [23]: air_quality_parameters.head()
Out[23]:
id description name
0 bc Black Carbon BC
1 co Carbon Monoxide CO
2 no2 Nitrogen Dioxide NO2
3 o3 Ozone O3
4 pm10 Particulate matter less than 10 micrometers in... PM10
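The merge that produced the table below was elided; a plausible sketch, linking parameter to id as explained after the output, is:
air_quality = pd.merge(air_quality, air_quality_parameters, how="left", left_on="parameter", right_on="id")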
In [25]: air_quality.head()
Out[25]:
                    date.utc            location parameter  ...    id                                        description   name
0  2019-05-07 01:00:00+00:00  London Westminster       no2  ...   no2                                   Nitrogen Dioxide    NO2
1  2019-05-07 01:00:00+00:00             FR04014       no2  ...   no2                                   Nitrogen Dioxide    NO2
2  2019-05-07 01:00:00+00:00             FR04014       no2  ...   no2                                   Nitrogen Dioxide    NO2
3  2019-05-07 01:00:00+00:00             BETR801      pm25  ...  pm25  Particulate matter less than 2.5 micrometers i...  PM2.5
4  2019-05-07 01:00:00+00:00             BETR801       no2  ...   no2                                   Nitrogen Dioxide    NO2
[5 rows x 9 columns]
Compared to the previous example, there is no common column name. However, the parameter column in the
air_quality table and the id column in the air_quality_parameters table both provide the measured
variable in a common format. The left_on and right_on arguments are used here (instead of just on) to make
the link between the two tables.
pandas also supports inner, outer, and right joins. More information on joining/merging of tables is provided in the user
guide section on database-style merging of tables. Or have a look at the comparison with SQL page.
• Multiple tables can be concatenated both column-wise and row-wise using the concat function.
• For database-like merging/joining of tables, use the merge function.
See the user guide for a full description of the various facilities to combine data tables.
In [1]: import pandas as pd
For this tutorial, air quality data about NO2 and particulate matter less than 2.5 micrometers is used, made available
by openaq and downloaded using the py-openaq package. The air_quality_no2_long.csv data set provides
NO2 values for the measurement stations FR04014, BETR801 and London Westminster in respectively Paris, Antwerp
and London.
In [3]: air_quality = pd.read_csv("data/air_quality_no2_long.csv")
In [5]: air_quality.head()
Out[5]:
city country datetime location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m3
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m3
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m3
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m3
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m3
In [6]: air_quality.city.unique()
Out[6]: array(['Paris', 'Antwerpen', 'London'], dtype=object)
I want to work with the dates in the column datetime as datetime objects instead of plain text
In [7]: air_quality["datetime"] = pd.to_datetime(air_quality["datetime"])
In [8]: air_quality["datetime"]
Out[8]:
0 2019-06-21 00:00:00+00:00
Initially, the values in datetime are character strings and do not provide any datetime operations (e.g. extracting the
year or the day of the week). By applying the to_datetime function, pandas interprets the strings and converts these to
datetime (i.e. datetime64[ns, UTC]) objects. In pandas we call these datetime objects, which are similar to
datetime.datetime from the standard library, pandas.Timestamp.
Note: As many data sets do contain datetime information in one of the columns, pandas input functions like pandas.read_csv()
and pandas.read_json() can do the transformation to dates when reading the data, using the
parse_dates parameter with a list of the columns to read as Timestamp:
pd.read_csv("../data/air_quality_no2_long.csv", parse_dates=["datetime"])
Why are these pandas.Timestamp objects useful? Let’s illustrate the added value with some example cases.
What are the start and end dates of the time series data set we are working with?
Using pandas.Timestamp for datetimes enables us to calculate with date information and makes them comparable.
Hence, we can use this to get the length of our time series:
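A minimal sketch of the elided calculation (subtracting the earliest from the latest timestamp):
air_quality["datetime"].max() - air_quality["datetime"].min()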
The result is a pandas.Timedelta object, similar to datetime.timedelta from the standard Python library
and defining a time duration.
The various time concepts supported by pandas are explained in the user guide section on time related concepts.
I want to add a new column to the DataFrame containing only the month of the measurement
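A minimal sketch of the elided assignment (using the month property of the dt accessor, as explained below):
air_quality["month"] = air_quality["datetime"].dt.month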
In [12]: air_quality.head()
Out[12]:
city country datetime location parameter value unit month
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m3 6
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m3 6
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m3 6
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m3 6
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m3 6
By using Timestamp objects for dates, a lot of time-related properties are provided by pandas. For example the
month, but also year, weekofyear, quarter, etc. All of these properties are accessible by the dt accessor.
An overview of the existing date properties is given in the time and date components overview table. More details
about the dt accessor to return datetime like properties are explained in a dedicated section on the dt accessor.
What is the average NO2 concentration for each day of the week for each of the measurement locations?
In [13]: air_quality.groupby(
....: [air_quality["datetime"].dt.weekday, "location"])["value"].mean()
....:
Out[13]:
datetime location
0 BETR801 27.875000
FR04014 24.856250
London Westminster 23.969697
1 BETR801 22.214286
FR04014 30.999359
...
5 FR04014 25.266154
London Westminster 24.977612
6 BETR801 21.896552
FR04014 23.274306
London Westminster 24.859155
Name: value, Length: 21, dtype: float64
Remember the split-apply-combine pattern provided by groupby from the tutorial on statistics calculation? Here,
we want to calculate a given statistic (e.g. mean NO2) for each weekday and for each measurement location. To
group on weekdays, we use the datetime property weekday (with Monday=0 and Sunday=6) of pandas Timestamp,
which is also accessible by the dt accessor. Grouping on both locations and weekdays splits the calculation of the
mean over each of these combinations.
Danger: As we are working with a very short time series in these examples, the analysis does not provide a
long-term representative result!
Plot the typical NO2 pattern during the day of our time series of all stations together. In other words, what is the
average value for each hour of the day?
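The plot below draws on a Matplotlib axes object named axs whose creation was elided; a minimal setup sketch (the figure size is an assumption) would be:
import matplotlib.pyplot as plt
fig, axs = plt.subplots(figsize=(12, 4))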
In [15]: air_quality.groupby(air_quality["datetime"].dt.hour)["value"].mean().plot(
....: kind='bar', rot=0, ax=axs
....: )
....:
Out[15]: <AxesSubplot:xlabel='datetime'>
Similar to the previous case, we want to calculate a given statistic (e.g. mean NO2) for each hour of the day, and
we can use the split-apply-combine approach again. For this case, we use the datetime property hour of pandas
Timestamp, which is also accessible by the dt accessor.
Datetime as index
In the tutorial on reshaping, pivot() was introduced to reshape the data table with each of the measurement
locations as a separate column:
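A minimal sketch of the elided pivot call (column and index names follow the long table used above):
no_2 = air_quality.pivot(index="datetime", columns="location", values="value")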
In [19]: no_2.head()
Out[19]:
location BETR801 FR04014 London Westminster
datetime
2019-05-07 01:00:00+00:00 50.5 25.0 23.0
2019-05-07 02:00:00+00:00 45.0 27.7 19.0
2019-05-07 03:00:00+00:00 NaN 50.4 19.0
2019-05-07 04:00:00+00:00 NaN 61.9 16.0
2019-05-07 05:00:00+00:00 NaN 72.4 NaN
Note: By pivoting the data, the datetime information became the index of the table. In general, setting a column as
an index can be achieved by the set_index function.
Working with a datetime index (i.e. DatetimeIndex) provides powerful functionalities. For example, we do not
need the dt accessor to get the time series properties, but have these properties available on the index directly:
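For example, a short sketch of such index-level properties:
no_2.index.year, no_2.index.weekday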
Some other advantages are the convenient subsetting of time periods and the adapted time scale on plots. Let's apply this
to our data.
Create a plot of the NO2 values in the different stations from the 20th of May till the end of the 21st of May.
In [21]: no_2["2019-05-20":"2019-05-21"].plot();
By providing a string that parses to a datetime, a specific subset of the data can be selected on a DatetimeIndex.
More information on the DatetimeIndex and the slicing by using strings is provided in the section on time series
indexing.
Aggregate the current hourly time series values to the monthly maximum value in each of the stations.
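A minimal sketch of the elided resampling step (monthly frequency, maximum as aggregation):
monthly_max = no_2.resample("M").max()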
In [23]: monthly_max
Out[23]:
location BETR801 FR04014 London Westminster
datetime
2019-05-31 00:00:00+00:00 74.5 97.0 97.0
2019-06-30 00:00:00+00:00 52.5 84.7 52.0
A very powerful method on time series data with a datetime index is the ability to resample() time series to
another frequency (e.g., converting per-second data into 5-minute data).
The resample() method is similar to a groupby operation:
• it provides a time-based grouping, by using a string (e.g. M, 5H,. . . ) that defines the target frequency
• it requires an aggregation function such as mean, max,. . .
An overview of the aliases used to define time series frequencies is given in the offset aliases overview table.
When defined, the frequency of the time series is provided by the freq attribute:
In [24]: monthly_max.index.freq
Out[24]: <MonthEnd>
More details on the power of time series resampling are provided in the user guide section on resampling.
• Valid date strings can be converted to datetime objects using the to_datetime function or as part of read functions.
• Datetime objects in pandas support calculations, logical operations and convenient date-related properties using
the dt accessor.
• A DatetimeIndex contains these date-related properties and supports convenient slicing.
• Resample is a powerful method to change the frequency of a time series.
A full overview on time series is given on the pages on time series and date functionality.
This tutorial uses the Titanic data set, stored as CSV. The data consists of the following data columns:
• PassengerId: Id of every passenger.
• Survived: Indication whether the passenger survived: 0 for no, 1 for yes.
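The read step was elided here; assuming the CSV lives in a local data folder as in earlier tutorials, a minimal sketch is:
titanic = pd.read_csv("data/titanic.csv")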
In [3]: titanic.head()
Out[3]:
   PassengerId  Survived  Pclass                                               Name  ...            Ticket     Fare Cabin Embarked
0            1         0       3                            Braund, Mr. Owen Harris  ...         A/5 21171   7.2500   NaN        S
1            2         1       1  Cumings, Mrs. John Bradley (Florence Briggs Th...  ...          PC 17599  71.2833   C85        C
2            3         1       3                             Heikkinen, Miss. Laina  ...  STON/O2. 3101282   7.9250   NaN        S
3            4         1       1       Futrelle, Mrs. Jacques Heath (Lily May Peel)  ...            113803  53.1000  C123        S
4            5         0       3                           Allen, Mr. William Henry  ...            373450   8.0500   NaN        S
[5 rows x 12 columns]
In [4]: titanic["Name"].str.lower()
Out[4]:
0 braund, mr. owen harris
1 cumings, mrs. john bradley (florence briggs th...
2 heikkinen, miss. laina
3 futrelle, mrs. jacques heath (lily may peel)
4 allen, mr. william henry
...
886 montvila, rev. juozas
887 graham, miss. margaret edith
888 johnston, miss. catherine helen "carrie"
889 behr, mr. karl howell
890 dooley, mr. patrick
Name: Name, Length: 891, dtype: object
To make each of the strings in the Name column lowercase, select the Name column (see the tutorial on selection of
data), add the str accessor and apply the lower method. As such, each of the strings is converted element-wise.
Similar to datetime objects in the time series tutorial having a dt accessor, a number of specialized string methods are
available when using the str accessor. These methods have in general matching names with the equivalent built-in
string methods for single elements, but are applied element-wise (remember element-wise calculations?) on each of
the values of the columns.
Create a new column Surname that contains the surname of the passengers by extracting the part before the comma.
In [5]: titanic["Name"].str.split(",")
Out[5]:
0 [Braund, Mr. Owen Harris]
1 [Cumings, Mrs. John Bradley (Florence Briggs ...
2 [Heikkinen, Miss. Laina]
3 [Futrelle, Mrs. Jacques Heath (Lily May Peel)]
4 [Allen, Mr. William Henry]
...
886 [Montvila, Rev. Juozas]
887 [Graham, Miss. Margaret Edith]
888 [Johnston, Miss. Catherine Helen "Carrie"]
889 [Behr, Mr. Karl Howell]
890 [Dooley, Mr. Patrick]
Name: Name, Length: 891, dtype: object
Using the Series.str.split() method, each of the values is returned as a list of 2 elements. The first element
is the part before the comma and the second element is the part after the comma.
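A minimal sketch of the elided assignment (split on the comma, then keep element 0):
titanic["Surname"] = titanic["Name"].str.split(",").str.get(0)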
In [7]: titanic["Surname"]
Out[7]:
0 Braund
1 Cumings
2 Heikkinen
3 Futrelle
4 Allen
...
886 Montvila
887 Graham
888 Johnston
889 Behr
890 Dooley
Name: Surname, Length: 891, dtype: object
As we are only interested in the first part representing the surname (element 0), we can again use the str accessor
and apply Series.str.get() to extract the relevant part. Indeed, these string functions can be chained to
combine multiple operations at once!
More information on extracting parts of strings is available in the user guide section on splitting and replacing strings.
Extract the passenger data about the countesses on board of the Titanic.
In [8]: titanic["Name"].str.contains("Countess")
Out[8]:
0 False
1 False
2 False
3 False
4 False
...
(continues on next page)
In [9]: titanic[titanic["Name"].str.contains("Countess")]
Out[9]:
     PassengerId  Survived  Pclass                                               Name     Sex  ...  Ticket  Fare Cabin Embarked Surname
759          760         1       1  Rothes, the Countess. of (Lucy Noel Martha Dye...  female  ...  110152  86.5   B77        S  Rothes
[1 rows x 13 columns]
Note: More powerful extractions on strings are supported, as the Series.str.contains() and
Series.str.extract() methods accept regular expressions, but these are out of scope for this tutorial.
More information on extracting parts of strings is available in the user guide section on string matching and extracting.
Which passenger of the Titanic has the longest name?
In [10]: titanic["Name"].str.len()
Out[10]:
0 23
1 51
2 22
3 44
4 24
..
886 21
887 28
888 40
889 21
890 19
Name: Name, Length: 891, dtype: int64
To get the longest name we first have to get the lengths of each of the names in the Name column. By using pandas
string methods, the Series.str.len() function is applied to each of the names individually (element-wise).
In [11]: titanic["Name"].str.len().idxmax()
Out[11]: 307
Next, we need to get the corresponding location, preferably the index label, in the table for which the name length is
the largest. The idxmax() method does exactly that. It is not a string method and is applied to integers, so no str
is used.
Based on the index name of the row (307) and the column (Name), we can do a selection using the loc operator,
introduced in the tutorial on subsetting.
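A minimal sketch of that selection:
titanic.loc[titanic["Name"].str.len().idxmax(), "Name"]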
In the “Sex” column, replace values of “male” by “M” and values of “female” by “F”.
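A minimal sketch of the elided mapping step (using a {from: to} dictionary, as explained below):
titanic["Sex_short"] = titanic["Sex"].replace({"male": "M", "female": "F"})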
In [14]: titanic["Sex_short"]
Out[14]:
0 M
1 F
2 F
3 F
4 M
..
886 M
887 F
888 F
889 M
890 M
Name: Sex_short, Length: 891, dtype: object
Whereas replace() is not a string method, it provides a convenient way to use mappings or vocabularies to translate
certain values. It requires a dictionary to define the mapping {from : to}.
Warning: There is also a replace() method available to replace a specific set of characters. However, when
having a mapping of multiple values, this would become:
titanic["Sex_short"] = titanic["Sex"].str.replace("female", "F")
titanic["Sex_short"] = titanic["Sex_short"].str.replace("male", "M")
This would become cumbersome and easily lead to mistakes. Just think (or try out yourself) what would happen if
those two statements are applied in the opposite order. . .
Since pandas aims to provide a lot of the data manipulation and analysis functionality that people use R for, this page
was started to provide a more detailed look at the R language and its many third party libraries as they relate to pandas.
In comparisons with R and CRAN libraries, we care about the following things:
• Functionality / flexibility: what can/cannot be done with each tool
• Performance: how fast are operations. Hard numbers/benchmarks are preferable
• Ease-of-use: Is one tool easier/harder to use (you may have to be the judge of this, given side-by-side code
comparisons)
This page is also here to offer a bit of a translation guide for users of these R packages.
For transfer of DataFrame objects from pandas to R, one option is to use HDF5 files, see External compatibility for
an example.
Quick reference
We’ll start off with a quick reference guide pairing some common R operations using dplyr with pandas equivalents.
R pandas
dim(df) df.shape
head(df) df.head()
slice(df, 1:10) df.iloc[:9]
filter(df, col1 == 1, col2 == 1) df.query('col1 == 1 & col2 == 1')
df[df$col1 == 1 & df$col2 == 1,] df[(df.col1 == 1) & (df.col2 == 1)]
select(df, col1, col2) df[['col1', 'col2']]
select(df, col1:col3) df.loc[:, 'col1':'col3']
select(df, -(col1:col3)) df.drop(cols_to_drop, axis=1) but see1
distinct(select(df, col1)) df[['col1']].drop_duplicates()
distinct(select(df, col1, col2)) df[['col1', 'col2']].drop_duplicates()
sample_n(df, 10) df.sample(n=10)
sample_frac(df, 0.01) df.sample(frac=0.01)
Sorting
R pandas
arrange(df, col1, col2) df.sort_values(['col1', 'col2'])
arrange(df, desc(col1)) df.sort_values('col1', ascending=False)
Transforming
R                            pandas
select(df, col_one = col1)   df.rename(columns={'col1': 'col_one'})['col_one']
rename(df, col_one = col1)   df.rename(columns={'col1': 'col_one'})
mutate(df, c=a-b)            df.assign(c=df['a']-df['b'])
1 R’s shorthand for a subrange of columns (select(df, col1:col3)) can be approached cleanly in pandas, if you have the list of columns,
for example df[cols[1:3]] or df.drop(cols[1:3]), but doing this by column name is a bit messy.
R                                            pandas
summary(df)                                  df.describe()
gdf <- group_by(df, col1)                    gdf = df.groupby('col1')
summarise(gdf, avg=mean(col1, na.rm=TRUE))   df.groupby('col1').agg({'col1': 'mean'})
summarise(gdf, total=sum(col1))              df.groupby('col1').sum()
Base R
or by integer location
Selecting multiple noncontiguous columns by integer location can be achieved with a combination of the iloc indexer
attribute and numpy.r_.
In [5]: n = 30
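The rest of this example was lost in extraction; a hypothetical frame and selection illustrating iloc with numpy.r_ (the column labels and ranges are illustrative only, and the customary import numpy as np is assumed) could look like:
df = pd.DataFrame(np.random.randn(n, 6), columns=list("abcdef"))
df.iloc[:, np.r_[0:2, 4:6]]  # columns 0-1 together with columns 4-5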
aggregate
In R you may want to split data into subsets and compute the mean for each. Using a data.frame called df and splitting
it into groups by1 and by2:
df <- data.frame(
v1 = c(1,3,5,7,8,3,5,NA,4,5,7,9),
v2 = c(11,33,55,77,88,33,55,NA,44,55,77,99),
by1 = c("red", "blue", 1, 2, NA, "big", 1, 2, "red", 1, NA, 12),
by2 = c("wet", "dry", 99, 95, NA, "damp", 95, 99, "red", 99, NA, NA))
aggregate(x=df[, c("v1", "v2")], by=list(df$by1, df$by2), FUN=mean)
In [9]: df = pd.DataFrame(
...: {
...: "v1": [1, 3, 5, 7, 8, 3, 5, np.nan, 4, 5, 7, 9],
...: "v2": [11, 33, 55, 77, 88, 33, 55, np.nan, 44, 55, 77, 99],
...: "by1": ["red", "blue", 1, 2, np.nan, "big", 1, 2, "red", 1, np.nan, 12],
...: "by2": [
...: "wet",
...: "dry",
...: 99,
...: 95,
...: np.nan,
...: "damp",
...: 95,
...: 99,
...: "red",
...: 99,
...: np.nan,
...: np.nan,
...: ],
...: }
...: )
...:
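The grouping step itself was elided; a plausible sketch, assuming the frame holds the v1, v2, by1 and by2 columns from the R example, is:
g = df.groupby(["by1", "by2"])
g[["v1", "v2"]].mean()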
match / %in%
A common way to select data in R is using %in% which is defined using the function match. The operator %in% is
used to return a logical vector indicating if there is a match or not:
s <- 0:4
s %in% c(2,4)
The match function returns a vector of the positions of matches of its first argument in its second:
s <- 0:4
match(s, c(2,4))
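The pandas equivalents were elided; a short sketch using isin() (for %in%) and Index.get_indexer() (for match, zero-based with -1 meaning no match) is:
s = pd.Series(range(5))
s.isin([2, 4])
pd.Index([2, 4]).get_indexer(s)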
tapply
tapply is similar to aggregate, but data can be in a ragged array, since the subclass sizes are possibly irregular.
Using a data.frame called baseball, and retrieving information based on the array team:
baseball <-
data.frame(team = gl(5, 5,
labels = paste("Team", LETTERS[1:5])),
player = sample(letters, 25),
batting.average = runif(25, .200, .400))
tapply(baseball$batting.average, baseball$team, max)
subset
The query() method is similar to the base R subset function. In R you might want to get the rows of a
data.frame where one column's values are less than another column's values:
In pandas, there are a few ways to perform subsetting. You can use query() or pass an expression as if it were an
index/slice as well as standard boolean indexing:
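A minimal sketch, assuming a DataFrame df with columns a and b:
df.query("a <= b")
df[df["a"] <= df["b"]]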
with
An expression using a data.frame called df in R with the columns a and b would be evaluated using with like so:
In pandas the equivalent expression, using the eval() method, would be:
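A minimal sketch, again assuming columns a and b:
df.eval("a + b")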
In certain cases eval() will be much faster than evaluation in pure Python. For more details and examples see the
eval documentation.
plyr
plyr is an R library for the split-apply-combine strategy for data analysis. The functions revolve around three data
structures in R: a for arrays, l for lists, and d for data.frame. The table below shows how these data
structures could be mapped in Python.
R Python
array list
lists dictionary or list of objects
data.frame dataframe
ddply
require(plyr)
df <- data.frame(
x = runif(120, 1, 168),
y = runif(120, 7, 334),
z = runif(120, 1.7, 20.7),
month = rep(c(5,6,7,8),30),
week = sample(1:4, 120, TRUE)
)
In pandas the equivalent expression, using the groupby() method, would be:
In [25]: df = pd.DataFrame(
....: {
....: "x": np.random.uniform(1.0, 168.0, 120),
....: "y": np.random.uniform(7.0, 334.0, 120),
....: "z": np.random.uniform(1.7, 20.7, 120),
....: "month": [5, 6, 7, 8] * 30,
....: "week": np.random.randint(1, 4, 120),
....: }
....: )
....:
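The aggregation itself was elided; an illustrative sketch (the choice of mean and standard deviation of x is an assumption, since the original ddply call is not shown) is:
df.groupby(["month", "week"]).agg({"x": ["mean", "std"]})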
reshape / reshape2
melt.array
An expression using a 3 dimensional array called a in R where you want to melt it into a data.frame:
melt.list
An expression using a list called a in R where you want to melt it into a data.frame:
In Python, this list would be a list of tuples, so the DataFrame() constructor would convert it to a DataFrame as required.
In [31]: pd.DataFrame(a)
Out[31]:
0 1
0 0 1.0
1 1 2.0
2 2 3.0
3 3 4.0
4 4 NaN
For more details and examples see the Intro to Data Structures documentation.
melt.data.frame
An expression using a data.frame called cheese in R where you want to reshape the data.frame:
cast
In R, acast is an expression using a data.frame called df to cast it into a higher-dimensional array:
df <- data.frame(
x = runif(12, 1, 168),
y = runif(12, 7, 334),
z = runif(12, 1.7, 20.7),
month = rep(c(5,6,7),4),
week = rep(c(1,2), 6)
)
In [35]: df = pd.DataFrame(
....: {
....: "x": np.random.uniform(1.0, 168.0, 12),
....: "y": np.random.uniform(7.0, 334.0, 12),
....: "z": np.random.uniform(1.7, 20.7, 12),
....: "month": [5, 6, 7] * 4,
....: "week": [1, 2] * 6,
....: }
....: )
....:
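The melt step that produces the mdf frame used below was elided; a plausible sketch is:
mdf = pd.melt(df, id_vars=["month", "week"])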
In [37]: pd.pivot_table(
....: mdf,
....: values="value",
....: index=["variable", "week"],
....: columns=["month"],
....: aggfunc=np.mean,
....: )
....:
Out[37]:
month 5 6 7
variable week
x 1 93.888747 98.762034 55.219673
2 94.391427 38.112932 83.942781
y 1 94.306912 279.454811 227.840449
2 87.392662 193.028166 173.899260
z 1 11.016009 10.079307 16.170549
2 8.476111 17.638509 19.003494
Similarly for dcast which uses a data.frame called df in R to aggregate information based on Animal and
FeedType:
df <- data.frame(
Animal = c('Animal1', 'Animal2', 'Animal3', 'Animal2', 'Animal1',
'Animal2', 'Animal3'),
FeedType = c('A', 'B', 'A', 'A', 'B', 'B', 'A'),
Amount = c(10, 7, 4, 2, 5, 6, 2)
)
Python can approach this in two different ways. Firstly, similar to above using pivot_table():
In [38]: df = pd.DataFrame(
....: {
....: "Animal": [
....: "Animal1",
....: "Animal2", "Animal3", "Animal2", "Animal1", "Animal2", "Animal3",
....: ],
....: "FeedType": ["A", "B", "A", "A", "B", "B", "A"],
....: "Amount": [10, 7, 4, 2, 5, 6, 2],
....: }
....: )
....:
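The pivot_table call that produced the output below was elided; a plausible reconstruction (the sum aggregation matches the totals shown) is:
df.pivot_table(values="Amount", index="Animal", columns="FeedType", aggfunc="sum")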
Out[39]:
FeedType A B
Animal
Animal1 10.0 5.0
Animal2 2.0 13.0
Animal3 6.0 NaN
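Secondly, the same result can be approached with groupby(); a short sketch:
df.groupby(["Animal", "FeedType"])["Amount"].sum()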
For more details and examples see the reshaping documentation or the groupby documentation.
factor
cut(c(1,2,3,4,5,6), 3)
factor(c(1,2,3,2,2,3))
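The pandas equivalents were elided; a short sketch using cut() and the categorical dtype is:
pd.cut(pd.Series([1, 2, 3, 4, 5, 6]), 3)
pd.Series([1, 2, 3, 2, 2, 3]).astype("category")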
For more details and examples see the categorical introduction and the API documentation. There is also documentation
on the differences relative to R's factor.
Since many potential pandas users have some familiarity with SQL, this page is meant to provide some examples of
how various SQL operations would be performed using pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows:
Most of the examples will utilize the tips dataset found within pandas tests. We’ll read the data into a DataFrame
called tips and assume we have a database table of the same name and structure.
In [3]: url = (
...: "https://raw.github.com/pandas-dev"
...: "/pandas/master/pandas/tests/io/data/csv/tips.csv"
...: )
...:
In [5]: tips.head()
Out[5]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
SELECT
In SQL, selection is done using a comma-separated list of columns you’d like to select (or a * to select all columns):
With pandas, column selection is done by passing a list of column names to your DataFrame:
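For example (the selected columns are illustrative):
tips[["total_bill", "tip", "smoker", "time"]].head(5)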
Calling the DataFrame without the list of column names would display all columns (akin to SQL’s *).
In SQL, you can add a calculated column:
With pandas, you can use the DataFrame.assign() method of a DataFrame to append a new column:
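For example (the derived tip_rate column is illustrative):
tips.assign(tip_rate=tips["tip"] / tips["total_bill"]).head(5)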
WHERE
SELECT *
FROM tips
WHERE time = 'Dinner'
LIMIT 5;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing
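A plausible reconstruction of the elided filtering step (the is_dinner Series is reused below):
is_dinner = tips["time"] == "Dinner"
tips[is_dinner].head(5)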
The above statement is simply passing a Series of True/False objects to the DataFrame, returning all rows with
True.
In [10]: is_dinner.value_counts()
Out[10]:
True 176
False 68
Name: time, dtype: int64
In [11]: tips[is_dinner].head(5)
Out[11]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Just like SQL’s OR and AND, multiple conditions can be passed to a DataFrame using | (OR) and & (AND).
-- tips by parties of at least 5 diners OR bill total was more than $45
SELECT *
FROM tips
WHERE size >= 5 OR total_bill > 45;
# tips by parties of at least 5 diners OR bill total was more than $45
In [13]: tips[(tips["size"] >= 5) | (tips["total_bill"] > 45)]
Out[13]:
total_bill tip sex smoker day time size
59 48.27 6.73 Male No Sat Dinner 4
125 29.80 4.20 Female No Thur Lunch 6
141 34.30 6.70 Male No Thur Lunch 6
142 41.19 5.00 Male No Thur Lunch 5
143 27.05 5.00 Female No Thur Lunch 6
155 29.85 5.14 Female No Sun Dinner 5
156 48.17 5.00 Male No Sun Dinner 6
170 50.81 10.00 Male Yes Sat Dinner 3
182 45.35 3.50 Male Yes Sun Dinner 3
185 20.69 5.00 Male No Sun Dinner 5
187 30.46 2.00 Male Yes Sun Dinner 5
212 48.33 9.00 Male No Sat Dinner 4
216 28.15 3.00 Male Yes Sat Dinner 5
In [14]: frame = pd.DataFrame(
....: {"col1": ["A", "B", np.nan, "C", "D"], "col2": ["F", np.nan, "G", "H", "I"]}
....: )
....:
In [15]: frame
Out[15]:
col1 col2
0 A F
1 B NaN
2 NaN G
3 C H
4 D I
Assume we have a table of the same structure as our DataFrame above. We can see only the records where col2 IS
NULL with the following query:
SELECT *
FROM frame
WHERE col2 IS NULL;
In [16]: frame[frame["col2"].isna()]
Out[16]:
col1 col2
1 B NaN
Getting items where col1 IS NOT NULL can be done with notna().
SELECT *
FROM frame
WHERE col1 IS NOT NULL;
In [17]: frame[frame["col1"].notna()]
Out[17]:
col1 col2
0 A F
1 B NaN
3 C H
4 D I
GROUP BY
In pandas, SQL's GROUP BY operations are performed using the similarly named groupby() method.
groupby() typically refers to a process where we'd like to split a dataset into groups, apply some function (typically
aggregation), and then combine the groups together.
A common SQL operation would be getting the count of records in each group throughout a dataset. For instance, a
query getting us the number of tips left by sex:
In [18]: tips.groupby("sex").size()
Out[18]:
sex
Female 87
Male 157
dtype: int64
Notice that in the pandas code we used size() and not count(). This is because count() applies the function
to each column, returning the number of not null records within each.
In [19]: tips.groupby("sex").count()
Out[19]:
total_bill tip smoker day time size
sex
Female 87 87 87 87 87 87
Male 157 157 157 157 157 157
In [20]: tips.groupby("sex")["total_bill"].count()
Out[20]:
sex
Female 87
Male 157
Name: total_bill, dtype: int64
Multiple functions can also be applied at once. For instance, say we’d like to see how tip amount differs by day of
the week - agg() allows you to pass a dictionary to your grouped DataFrame, indicating which functions to apply to
specific columns.
Grouping by more than one column is done by passing a list of columns to the groupby() method.
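Short sketches of both variants (the chosen aggregations are illustrative):
tips.groupby("day").agg({"tip": np.mean, "day": np.size})
tips.groupby(["smoker", "day"]).agg({"tip": [np.size, np.mean]})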
JOIN
JOINs can be performed with join() or merge(). By default, join() will join the DataFrames on their indices.
Each method has parameters allowing you to specify the type of join to perform (LEFT, RIGHT, INNER, FULL) or
the columns to join on (column names or indices).
Assume we have two database tables of the same name and structure as our DataFrames.
Now let’s go over the various types of JOINs.
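The construction of the two example frames was elided; a plausible sketch (the key values are illustrative, the numeric values random) is:
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})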
INNER JOIN
SELECT *
FROM df1
INNER JOIN df2
ON df1.key = df2.key;
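The pandas equivalent was elided; merge() performs an INNER JOIN by default:
pd.merge(df1, df2, on="key")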
merge() also offers parameters for cases when you’d like to join one DataFrame’s column with another DataFrame’s
index.
RIGHT JOIN
FULL JOIN
pandas also allows for FULL JOINs, which display both sides of the dataset, whether or not the joined columns find a
match. As of this writing, FULL JOINs are not supported in all RDBMS (e.g. MySQL).
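The pandas examples for the LEFT, RIGHT and FULL JOIN variants were lost in extraction; short sketches are:
pd.merge(df1, df2, on="key", how="left")   # LEFT JOIN
pd.merge(df1, df2, on="key", how="right")  # RIGHT JOIN
pd.merge(df1, df2, on="key", how="outer")  # FULL JOIN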
UNION
SQL’s UNION is similar to UNION ALL, however UNION will remove duplicate rows.
SELECT city, rank
FROM df1
UNION
SELECT city, rank
FROM df2;
-- notice that there is only one Chicago record this time
/*
city rank
Chicago 1
San Francisco 2
New York City 3
Boston 4
Los Angeles 5
*/
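The pandas equivalent was elided; a plausible sketch (the city/rank frames are reconstructed from the SQL result above) is:
df1 = pd.DataFrame({"city": ["Chicago", "San Francisco", "New York City"], "rank": range(1, 4)})
df2 = pd.DataFrame({"city": ["Chicago", "Boston", "Los Angeles"], "rank": [1, 4, 5]})
pd.concat([df1, df2]).drop_duplicates()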
-- MySQL
SELECT * FROM tips
ORDER BY tip DESC
LIMIT 10 OFFSET 5;
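A pandas sketch of the same "skip 5, take 10, ordered by tip" query:
tips.nlargest(10 + 5, columns="tip").tail(10)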
In [36]: (
....: tips.assign(
....: rn=tips.sort_values(["total_bill"], ascending=False)
....: .groupby(["day"])
....: .cumcount()
....: + 1
....: )
....: .query("rn < 3")
....: .sort_values(["day", "rn"])
....: )
....:
In [37]: (
....: tips.assign(
....: rnk=tips.groupby(["day"])["total_bill"].rank(
....: method="first", ascending=False
....: )
....: )
....: .query("rnk < 3")
....: .sort_values(["day", "rnk"])
....: )
....:
Out[37]:
total_bill tip sex smoker day time size rnk
95 40.17 4.73 Male Yes Fri Dinner 4 1.0
90 28.97 3.00 Male Yes Fri Dinner 2 2.0
170 50.81 10.00 Male Yes Sat Dinner 3 1.0
212 48.33 9.00 Male No Sat Dinner 4 2.0
156 48.17 5.00 Male No Sun Dinner 6 1.0
182 45.35 3.50 Male Yes Sun Dinner 3 2.0
197 43.11 5.00 Female Yes Thur Lunch 4 1.0
142 41.19 5.00 Male No Thur Lunch 5 2.0
Let's find tips with (rank < 3) per gender group for (tips < 2). Notice that when using rank(method='min'),
rnk_min remains the same for the same tip (as with Oracle's RANK() function).
In [38]: (
....: tips[tips["tip"] < 2]
....: .assign(rnk_min=tips.groupby(["sex"])["tip"].rank(method="min"))
....: .query("rnk_min < 3")
....: .sort_values(["sex", "rnk_min"])
....: )
....:
UPDATE
UPDATE tips
SET tip = tip*2
WHERE tip < 2;
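In pandas the same update can be expressed with boolean selection and loc (a minimal sketch):
tips.loc[tips["tip"] < 2, "tip"] *= 2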
DELETE
In pandas we select the rows that should remain instead of deleting them:
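For example, keeping only the rows that would survive a DELETE of tips above 9 dollars (the threshold is illustrative):
tips = tips.loc[tips["tip"] <= 9]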
For potential users coming from SAS this page is meant to demonstrate how different SAS operations would be
performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows:
Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays
the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter notebook or
terminal) - the equivalent in SAS would be:
Data structures
pandas SAS
DataFrame data set
column variable
row observation
groupby BY-group
NaN .
DataFrame / Series
A DataFrame in pandas is analogous to a SAS data set - a two-dimensional data source with labeled columns that
can be of different types. As will be shown in this document, almost any operation that can be applied to a data set
using SAS’s DATA step, can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. SAS doesn’t have a separate data
structure for a single column, but in general, working with a Series is analogous to referencing a column in the
DATA step.
Index
Every DataFrame and Series has an Index - which are labels on the rows of the data. SAS does not have an
exactly analogous concept. A data set’s rows are essentially unlabeled, other than an implicit integer index that can be
accessed during the DATA step (_N_).
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on).
While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part
of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as
a collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.
A SAS data set can be built from specified values by placing the data after a datalines statement and specifying
the column names.
data df;
input x y;
datalines;
1 2
3 4
5 6
;
run;
A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the data.
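A minimal sketch of such a construction (matching the frame displayed below):
df = pd.DataFrame({"x": [1, 3, 5], "y": [2, 4, 6]})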
In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6
Like SAS, pandas provides utilities for reading in data from many formats. The tips dataset, found within the pandas
tests (csv) will be used in many of the following examples.
SAS provides PROC IMPORT to read csv data into a data set.
In [5]: url = (
...: "https://raw.github.com/pandas-dev/"
...: "pandas/master/pandas/tests/io/data/csv/tips.csv"
...: )
...:
In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Like PROC IMPORT, read_csv can take a number of parameters to specify how the data should be parsed. For
example, if the data was instead tab delimited, and did not have column names, the pandas command would be:
In addition to text/csv, pandas supports a variety of other data formats such as Excel, HDF5, and SQL databases. These
are all read via a pd.read_* function. See the IO documentation for more details.
Exporting data
Similarly in pandas, the opposite of read_csv is to_csv(), and other data formats follow a similar API.
tips.to_csv("tips2.csv")
Data operations
Operations on columns
In the DATA step, arbitrary math expressions can be used on new or existing columns.
data tips;
set tips;
total_bill = total_bill - 2;
new_bill = total_bill / 2;
run;
pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way.
In [8]: tips["total_bill"] = tips["total_bill"] - 2
In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295
Filtering
data tips;
set tips;
where total_bill > 10;
/* equivalent in this case - where happens before the
DATA step begins and can also be used in PROC statements */
run;
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing
If/then logic
data tips;
set tips;
format bucket $4.;
The same operation in pandas can be accomplished using the where method from numpy.
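A plausible sketch of the elided assignment (the 9.50 cut-off is an assumption consistent with the output below):
tips["bucket"] = np.where(tips["total_bill"] < 9.50, "low", "high")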
In [13]: tips.head()
Out[13]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
Date functionality
data tips;
set tips;
format date1 date2 date1_plusmonth mmddyy10.;
date1 = mdy(1, 15, 2013);
date2 = mdy(2, 15, 2015);
date1_year = year(date1);
date2_month = month(date2);
* shift date to beginning of next interval;
date1_next = intnx('MONTH', date1, 1);
* count intervals between dates;
months_between = intck('MONTH', date1, date2);
run;
The equivalent pandas operations are shown below. In addition to these functions pandas supports other Time Series
features not available in Base SAS (such as resampling and custom offsets) - see the timeseries documentation for
more details.
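A plausible sketch of the elided assignments (consistent with the columns displayed below):
tips["date1"] = pd.Timestamp("2013-01-15")
tips["date2"] = pd.Timestamp("2015-02-15")
tips["date1_year"] = tips["date1"].dt.year
tips["date2_month"] = tips["date2"].dt.month
tips["date1_next"] = tips["date1"] + pd.offsets.MonthBegin()
tips["months_between"] = tips["date2"].dt.to_period("M") - tips["date1"].dt.to_period("M")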
In [20]: tips[
....: ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
....: ].head()
....:
Out[20]:
date1 date2 date1_year date2_month date1_next months_between
0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
Selection of columns
SAS provides keywords in the DATA step to select, drop, and rename columns.
data tips;
set tips;
keep sex total_bill tip;
run;
data tips;
set tips;
drop sex;
run;
data tips;
set tips;
rename total_bill=total_bill_2;
run;
# keep
In [21]: tips[["sex", "total_bill", "tip"]].head()
Out[21]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61
# drop
In [22]: tips.drop("sex", axis=1).head()
Out[22]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4
# rename
In [23]: tips.rename(columns={"total_bill": "total_bill_2"}).head()
Out[23]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
Sorting by values
pandas objects have a sort_values() method, which takes a list of columns to sort by.
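A minimal sketch of the elided call (sorting by sex, then total_bill, consistent with the output below):
tips = tips.sort_values(["sex", "total_bill"])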
In [25]: tips.head()
Out[25]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
String processing
Length
SAS determines the length of a character string with the LENGTHN and LENGTHC functions. LENGTHN excludes
trailing blanks and LENGTHC includes trailing blanks.
data _null_;
set tips;
put(LENGTHN(time));
put(LENGTHC(time));
run;
Python determines the length of a character string with the len function. len includes trailing blanks. Use len and
rstrip to exclude trailing blanks.
In [26]: tips["time"].str.len().head()
Out[26]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
In [27]: tips["time"].str.rstrip().str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
Find
SAS determines the position of a character in a string with the FINDW function. FINDW takes the string defined by
the first argument and searches for the first position of the substring you supply as the second argument.
data _null_;
set tips;
put(FINDW(sex,'ale'));
run;
Python determines the position of a character in a string with the find function. find searches for the first position
of the substring. If the substring is found, the function returns its position. Keep in mind that Python indexes are
zero-based and the function will return -1 if it fails to find the substring.
In [28]: tips["sex"].str.find("ale").head()
Out[28]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64
Substring
SAS extracts a substring from a string based on its position with the SUBSTR function.
data _null_;
set tips;
put(substr(sex,1,1));
run;
With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that
Python indexes are zero-based.
In [29]: tips["sex"].str[0:1].head()
Out[29]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object
Scan
The SAS SCAN function returns the nth word from a string. The first argument is the string you want to parse and the
second argument specifies which word you want to extract.
data firstlast;
input String $60.;
First_Name = scan(string, 1);
Last_Name = scan(string, -1);
datalines2;
John Smith;
Jane Cook;
;;;
run;
Python extracts a substring from a string based on its text by using regular expressions. There are much more powerful
approaches, but this just shows a simple approach.
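A plausible sketch of the elided steps (selecting element 0 of the right split is why Last_Name repeats the first name in the output below):
firstlast = pd.DataFrame({"String": ["John Smith", "Jane Cook"]})
firstlast["First_Name"] = firstlast["String"].str.split(" ", expand=True)[0]
firstlast["Last_Name"] = firstlast["String"].str.rsplit(" ", expand=True)[0]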
In [33]: firstlast
Out[33]:
String First_Name Last_Name
0 John Smith John John
1 Jane Cook Jane Jane
The SAS UPCASE LOWCASE and PROPCASE functions change the case of the argument.
data firstlast;
input String $60.;
string_up = UPCASE(string);
string_low = LOWCASE(string);
string_prop = PROPCASE(string);
datalines2;
John Smith;
Jane Cook;
;;;
run;
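The pandas equivalents were elided; a short sketch consistent with the output below:
firstlast["string_up"] = firstlast["String"].str.upper()
firstlast["string_low"] = firstlast["String"].str.lower()
firstlast["string_prop"] = firstlast["String"].str.title()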
In [38]: firstlast
Out[38]:
String string_up string_low string_prop
0 John Smith JOHN SMITH john smith John Smith
1 Jane Cook JANE COOK jane cook Jane Cook
Merging
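The construction of the two example frames was elided; a plausible sketch (key values match the frames displayed below, the numeric values are random) is:
df1 = pd.DataFrame({"key": ["A", "B", "C", "D"], "value": np.random.randn(4)})
df2 = pd.DataFrame({"key": ["B", "D", "D", "E"], "value": np.random.randn(4)})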
In [40]: df1
Out[40]:
key value
0 A 0.469112
1 B -0.282863
2 C -1.509059
3 D -1.135632
In [42]: df2
Out[42]:
key value
0 B 1.212112
1 D -0.173215
2 D 0.119209
3 E -1.044236
In SAS, data must be explicitly sorted before merging. Different types of joins are accomplished using the in= dummy
variables to track whether a match was found in one or both input frames.
proc sort data=df1;
by key;
run;
pandas DataFrames have a merge() method, which provides similar functionality. Note that the data does not have
to be sorted ahead of time, and different join types are accomplished via the how keyword.
In [43]: inner_join = df1.merge(df2, on=["key"], how="inner")
In [44]: inner_join
Out[44]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
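The remaining join variants were elided; plausible sketches consistent with the outputs below:
left_join = df1.merge(df2, on=["key"], how="left")
right_join = df1.merge(df2, on=["key"], how="right")
outer_join = df1.merge(df2, on=["key"], how="outer")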
In [46]: left_join
Out[46]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
In [48]: right_join
Out[48]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3 E NaN -1.044236
In [50]: outer_join
Out[50]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
Missing data
Like SAS, pandas has a representation for missing data - which is the special float value NaN (not a number). Many
of the semantics are the same, for example missing data propagates through numeric operations, and is ignored by
default for aggregations.
In [51]: outer_join
Out[51]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
In [53]: outer_join["value_x"].sum()
Out[53]: -3.5940742896293765
One difference is that missing data cannot be compared to its sentinel value. For example, in SAS you could do this
to filter missing values.
data outer_join_nulls;
set outer_join;
if value_x = .;
run;
data outer_join_no_nulls;
set outer_join;
if value_x ^= .;
run;
This doesn't work in pandas. Instead, the pd.isna or pd.notna functions should be used for comparisons.
In [54]: outer_join[pd.isna(outer_join["value_x"])]
Out[54]:
key value_x value_y
5 E NaN -1.044236
In [55]: outer_join[pd.notna(outer_join["value_x"])]
Out[55]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
pandas also provides a variety of methods to work with missing data, some of which would be challenging to express
in SAS. For example, there are methods to drop all rows with any missing values, to replace missing values with a
specified value (like the mean), or to forward fill from previous rows. See the missing data documentation for more.
In [56]: outer_join.dropna()
Out[56]:
key value_x value_y
1 B -0.282863 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
In [57]: outer_join.fillna(method="ffill")
Out[57]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E -1.135632 -1.044236
In [58]: outer_join["value_x"].fillna(outer_join["value_x"].mean())
Out[58]:
0 0.469112
1 -0.282863
2 -1.509059
3 -1.135632
4 -1.135632
5 -0.718815
Name: value_x, dtype: float64
GroupBy
Aggregation
SAS’s PROC SUMMARY can be used to group by one or more key variables and compute aggregations on numeric
columns.
pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for
more details and examples.
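A plausible sketch of the elided aggregation (consistent with the output below):
tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()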
In [60]: tips_summed.head()
Out[60]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07
Transformation
In SAS, if the group aggregations need to be used with the original frame, it must be merged back together. For
example, to subtract the mean for each observation by smoker group.
data tips;
merge tips(in=a) smoker_means(in=b);
by smoker;
adj_total_bill = total_bill - group_bill;
if a and b;
run;
pandas groupby provides a transform mechanism that allows these type of operations to be succinctly expressed
in one operation.
In [61]: gb = tips.groupby("smoker")["total_bill"]
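A plausible sketch of the elided transform step (subtracting each group's mean bill):
tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")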
In [63]: tips.head()
Out[63]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
By group processing
In addition to aggregation, pandas groupby can be used to replicate most other by group processing from SAS. For
example, this DATA step reads the data by sex/smoker group and filters to the first entry for each.
data tips_first;
set tips;
by sex smoker;
if FIRST.sex or FIRST.smoker then output;
run;
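A short pandas sketch of the same by-group "first entry" selection:
tips.groupby(["sex", "smoker"]).first()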
Other considerations
Disk vs memory
pandas operates exclusively in memory, whereas a SAS data set exists on disk. This means that the size of data able to
be loaded in pandas is limited by your machine's memory, but also that the operations on that data may be faster.
If out-of-core processing is needed, one possibility is the dask.dataframe library (currently in development), which
provides a subset of pandas functionality for an on-disk DataFrame.
Data interop
pandas provides a read_sas() method that can read SAS data saved in the XPORT or SAS7BDAT binary format.
df = pd.read_sas("transport-file.xpt")
df = pd.read_sas("binary-file.sas7bdat")
You can also specify the file format directly. By default, pandas will try to infer the file format based on its extension.
df = pd.read_sas("transport-file.xpt", format="xport")
df = pd.read_sas("binary-file.sas7bdat", format="sas7bdat")
XPORT is a relatively limited format and the parsing of it is not as optimized as some of the other pandas readers. An
alternative way to interchange data between SAS and pandas is to serialize to csv.
For potential users coming from Stata this page is meant to demonstrate how different Stata operations would be
performed in pandas.
If you’re new to pandas, you might want to first read through 10 Minutes to pandas to familiarize yourself with the
library.
As is customary, we import pandas and NumPy as follows. This means that we can refer to the libraries as pd and np,
respectively, for the rest of the document.
Note: Throughout this tutorial, the pandas DataFrame will be displayed by calling df.head(), which displays
the first N (default 5) rows of the DataFrame. This is often used in interactive work (e.g. Jupyter notebook or
terminal) – the equivalent in Stata would be:
list in 1/5
Data structures
pandas Stata
DataFrame data set
column variable
row observation
groupby bysort
NaN .
DataFrame / Series
A DataFrame in pandas is analogous to a Stata data set – a two-dimensional data source with labeled columns that
can be of different types. As will be shown in this document, almost any operation that can be applied to a data set in
Stata can also be accomplished in pandas.
A Series is the data structure that represents one column of a DataFrame. Stata doesn’t have a separate data
structure for a single column, but in general, working with a Series is analogous to referencing a column of a data
set in Stata.
Index
Every DataFrame and Series has an Index – labels on the rows of the data. Stata does not have an exactly
analogous concept. In Stata, a data set’s rows are essentially unlabeled, other than an implicit integer index that can
be accessed with _n.
In pandas, if no index is specified, an integer index is also used by default (first row = 0, second row = 1, and so on).
While using a labeled Index or MultiIndex can enable sophisticated analyses and is ultimately an important part
of pandas to understand, for this comparison we will essentially ignore the Index and just treat the DataFrame as
a collection of columns. Please see the indexing documentation for much more on how to use an Index effectively.
A Stata data set can be built from specified values by placing the data after an input statement and specifying the
column names.
input x y
1 2
3 4
5 6
end
A pandas DataFrame can be constructed in many different ways, but for a small number of values, it is often
convenient to specify it as a Python dictionary, where the keys are the column names and the values are the data.
In [4]: df
Out[4]:
x y
0 1 2
1 3 4
2 5 6
Like Stata, pandas provides utilities for reading in data from many formats. The tips data set, found within the
pandas tests (csv) will be used in many of the following examples.
Stata provides import delimited to read csv data into a data set in memory. If the tips.csv file is in the
current working directory, we can import it as follows.
The pandas method is read_csv(), which works similarly. Additionally, it will automatically download the data
set if presented with a url.
In [5]: url = (
...: "https://raw.github.com/pandas-dev"
...: "/pandas/master/pandas/tests/io/data/csv/tips.csv"
...: )
...:
In [7]: tips.head()
Out[7]:
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
3 23.68 3.31 Male No Sun Dinner 2
4 24.59 3.61 Female No Sun Dinner 4
Like import delimited, read_csv() can take a number of parameters to specify how the data should be
parsed. For example, if the data were instead tab delimited, did not have column names, and existed in the current
working directory, the pandas command would be:
pandas can also read Stata data sets in .dta format with the read_stata() function.
df = pd.read_stata("data.dta")
In addition to text/csv and Stata files, pandas supports a variety of other data formats such as Excel, SAS, HDF5,
Parquet, and SQL databases. These are all read via a pd.read_* function. See the IO documentation for more
details.
Exporting data
tips.to_csv("tips2.csv")
pandas can also export to Stata file format with the DataFrame.to_stata() method.
tips.to_stata("tips2.dta")
Data operations
Operations on columns
In Stata, arbitrary math expressions can be used with the generate and replace commands on new or existing
columns. The drop command drops the column from the data set.
pandas provides similar vectorized operations by specifying the individual Series in the DataFrame. New
columns can be assigned in the same way. The DataFrame.drop() method drops a column from the DataFrame.
In [10]: tips.head()
Out[10]:
total_bill tip sex smoker day time size new_bill
0 14.99 1.01 Female No Sun Dinner 2 7.495
1 8.34 1.66 Male No Sun Dinner 3 4.170
2 19.01 3.50 Male No Sun Dinner 3 9.505
3 21.68 3.31 Male No Sun Dinner 2 10.840
4 22.59 3.61 Female No Sun Dinner 4 11.295
Filtering
DataFrames can be filtered in multiple ways; the most intuitive of which is using boolean indexing.
If/then logic
The same operation in pandas can be accomplished using the where method from numpy.
In [14]: tips.head()
Out[14]:
total_bill tip sex smoker day time size bucket
0 14.99 1.01 Female No Sun Dinner 2 high
1 8.34 1.66 Male No Sun Dinner 3 low
2 19.01 3.50 Male No Sun Dinner 3 high
3 21.68 3.31 Male No Sun Dinner 2 high
4 22.59 3.61 Female No Sun Dinner 4 high
Date functionality
The equivalent pandas operations are shown below. In addition to these functions, pandas supports other Time Series
features not available in Stata (such as time zone handling and custom offsets) – see the timeseries documentation for
more details.
In [21]: tips[
....: ["date1", "date2", "date1_year", "date2_month", "date1_next", "months_between"]
....: ].head()
....:
Out[21]:
date1 date2 date1_year date2_month date1_next months_between
0 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
1 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
2 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
3 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
4 2013-01-15 2015-02-15 2013 2 2013-02-01 <25 * MonthEnds>
Selection of columns
drop sex
The same operations are expressed in pandas below. Note that in contrast to Stata, these operations do not happen in
place. To make these changes persist, assign the operation back to a variable.
# keep
In [22]: tips[["sex", "total_bill", "tip"]].head()
Out[22]:
sex total_bill tip
0 Female 14.99 1.01
1 Male 8.34 1.66
2 Male 19.01 3.50
3 Male 21.68 3.31
4 Female 22.59 3.61
# drop
In [23]: tips.drop("sex", axis=1).head()
Out[23]:
total_bill tip smoker day time size
0 14.99 1.01 No Sun Dinner 2
1 8.34 1.66 No Sun Dinner 3
2 19.01 3.50 No Sun Dinner 3
3 21.68 3.31 No Sun Dinner 2
4 22.59 3.61 No Sun Dinner 4
# rename
In [24]: tips.rename(columns={"total_bill": "total_bill_2"}).head()
Out[24]:
total_bill_2 tip sex smoker day time size
0 14.99 1.01 Female No Sun Dinner 2
1 8.34 1.66 Male No Sun Dinner 3
2 19.01 3.50 Male No Sun Dinner 3
3 21.68 3.31 Male No Sun Dinner 2
4 22.59 3.61 Female No Sun Dinner 4
Sorting by values
pandas objects have a DataFrame.sort_values() method, which takes a list of columns to sort by.
In [26]: tips.head()
Out[26]:
total_bill tip sex smoker day time size
67 1.07 1.00 Female Yes Sat Dinner 1
92 3.75 1.00 Female Yes Fri Dinner 2
111 5.25 1.00 Female No Sat Dinner 1
145 6.35 1.50 Female No Thur Lunch 2
135 6.51 1.25 Female No Thur Lunch 2
String processing
Stata determines the length of a character string with the strlen() and ustrlen() functions for ASCII and
Unicode strings, respectively.
Python determines the length of a character string with the len function. In Python 3, all strings are Unicode strings.
len includes trailing blanks. Use len and rstrip to exclude trailing blanks.
In [27]: tips["time"].str.len().head()
Out[27]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
In [28]: tips["time"].str.rstrip().str.len().head()
Out[28]:
67 6
92 6
111 6
145 5
135 5
Name: time, dtype: int64
Stata determines the position of a character in a string with the strpos() function. This takes the string defined by
the first argument and searches for the first position of the substring you supply as the second argument.
Python determines the position of a character in a string with the find() function. find searches for the first
position of the substring. If the substring is found, the function returns its position. Keep in mind that Python indexes
are zero-based and the function will return -1 if it fails to find the substring.
In [29]: tips["sex"].str.find("ale").head()
Out[29]:
67 3
92 3
111 3
145 3
135 3
Name: sex, dtype: int64
Stata extracts a substring from a string based on its position with the substr() function.
With pandas you can use [] notation to extract a substring from a string by position locations. Keep in mind that
Python indexes are zero-based.
In [30]: tips["sex"].str[0:1].head()
Out[30]:
67 F
92 F
111 F
145 F
135 F
Name: sex, dtype: object
The Stata word() function returns the nth word from a string. The first argument is the string you want to parse and
the second argument specifies which word you want to extract.
clear
input str20 string
"John Smith"
"Jane Cook"
end
Python extracts a substring from a string based on its text by using regular expressions. There are much more powerful
approaches, but this just shows a simple approach.
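A sketch of how the firstlast frame below might be built with vectorized split/rsplit (note that, matching the output, both derived columns here end up holding the first word, since element [0] of an rsplit is still the leftmost piece):

firstlast = pd.DataFrame({"string": ["John Smith", "Jane Cook"]})
firstlast["First_Name"] = firstlast["string"].str.split(" ", expand=True)[0]
firstlast["Last_Name"] = firstlast["string"].str.rsplit(" ", expand=True)[0]   # [0] is still the first word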
In [34]: firstlast
Out[34]:
string First_Name Last_Name
0 John Smith John John
1 Jane Cook Jane Jane
Changing case
clear
input str20 string
"John Smith"
"Jane Cook"
end
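The pandas equivalents for changing case are the vectorized string methods; a minimal sketch using the firstlast frame from above:

firstlast["string"].str.upper()
firstlast["string"].str.lower()
firstlast["string"].str.title()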
Merging
In [41]: df1
Out[41]:
key value
0 A 0.469112
1 B -0.282863
2 C -1.509059
3 D -1.135632
In [43]: df2
Out[43]:
key value
0 B 1.212112
1 D -0.173215
2 D 0.119209
3 E -1.044236
In Stata, to perform a merge, one data set must be in memory and the other must be referenced as a file name on disk.
In contrast, Python must have both DataFrames already in memory.
By default, Stata performs an outer join, where all observations from both data sets are left in memory after the merge.
One can keep only observations from the initial data set, the merged data set, or the intersection of the two by using
the values created in the _merge variable.
preserve
* Left join
merge 1:n key using df2.dta
keep if _merge == 1
* Right join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 2
* Inner join
restore, preserve
merge 1:n key using df2.dta
keep if _merge == 3
* Outer join
restore
merge 1:n key using df2.dta
pandas DataFrames have a DataFrame.merge() method, which provides similar functionality. Note that different
join types are accomplished via the how keyword.
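The joined frames shown below could be produced along these lines:

inner_join = df1.merge(df2, on=["key"], how="inner")
left_join = df1.merge(df2, on=["key"], how="left")
right_join = df1.merge(df2, on=["key"], how="right")
outer_join = df1.merge(df2, on=["key"], how="outer")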
In [45]: inner_join
Out[45]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
In [47]: left_join
Out[47]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
In [49]: right_join
Out[49]:
key value_x value_y
0 B -0.282863 1.212112
1 D -1.135632 -0.173215
2 D -1.135632 0.119209
3   E       NaN -1.044236
In [51]: outer_join
Out[51]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
Missing data
Like Stata, pandas has a representation for missing data – the special float value NaN (not a number). Many of the
semantics are the same; for example missing data propagates through numeric operations, and is ignored by default
for aggregations.
In [52]: outer_join
Out[52]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E NaN -1.044236
In [54]: outer_join["value_x"].sum()
Out[54]: -3.5940742896293765
One difference is that missing data cannot be compared to its sentinel value. For example, in Stata you could do this
to filter missing values.
This doesn’t work in pandas. Instead, the pd.isna() or pd.notna() functions should be used for comparisons.
In [55]: outer_join[pd.isna(outer_join["value_x"])]
Out[55]:
  key  value_x   value_y
5   E      NaN -1.044236
In [56]: outer_join[pd.notna(outer_join["value_x"])]
Out[56]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 NaN
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
pandas also provides a variety of methods to work with missing data – some of which would be challenging to express
in Stata. For example, there are methods to drop all rows with any missing values, to replace missing values with a
specified value (such as the mean), or to forward fill from previous rows. See the missing data documentation for more.
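For example (a sketch, not necessarily the exact commands used to produce the surrounding output):

outer_join.dropna()                                            # drop all rows with any missing value
outer_join["value_x"].fillna(outer_join["value_x"].mean())     # replace missing values with the column mean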
# Fill forwards
In [58]: outer_join.fillna(method="ffill")
Out[58]:
key value_x value_y
0 A 0.469112 NaN
1 B -0.282863 1.212112
2 C -1.509059 1.212112
3 D -1.135632 -0.173215
4 D -1.135632 0.119209
5 E -1.135632 -1.044236
GroupBy
Aggregation
Stata’s collapse can be used to group by one or more key variables and compute aggregations on numeric columns.
pandas provides a flexible groupby mechanism that allows similar aggregations. See the groupby documentation for
more details and examples.
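The summary below could be produced with, for example:

tips_summed = tips.groupby(["sex", "smoker"])[["total_bill", "tip"]].sum()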
In [61]: tips_summed.head()
Out[61]:
total_bill tip
sex smoker
Female No 869.68 149.77
Yes 527.27 96.74
Male No 1725.75 302.00
Yes 1217.07 183.07
Transformation
In Stata, if the group aggregations need to be used with the original data set, one would usually use bysort with
egen(); for example, to subtract the mean for each observation by smoker group.
pandas groupby provides a transform mechanism that allows these types of operations to be succinctly expressed
in one operation.
In [62]: gb = tips.groupby("smoker")["total_bill"]
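The adj_total_bill column shown below would then be created with something like:

tips["adj_total_bill"] = tips["total_bill"] - gb.transform("mean")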
In [64]: tips.head()
Out[64]:
total_bill tip sex smoker day time size adj_total_bill
67 1.07 1.00 Female Yes Sat Dinner 1 -17.686344
92 3.75 1.00 Female Yes Fri Dinner 2 -15.006344
111 5.25 1.00 Female No Sat Dinner 1 -11.938278
145 6.35 1.50 Female No Thur Lunch 2 -10.838278
135 6.51 1.25 Female No Thur Lunch 2 -10.678278
By group processing
In addition to aggregation, pandas groupby can be used to replicate most other bysort processing from Stata. For
example, the following example lists the first observation in the current sort order by sex/smoker group.
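A minimal sketch of that by-group processing:

tips.groupby(["sex", "smoker"]).first()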
Other considerations
Disk vs memory
pandas and Stata both operate exclusively in memory. This means that the size of data able to be loaded in pandas is
limited by your machine’s memory. If out of core processing is needed, one possibility is the dask.dataframe library,
which provides a subset of pandas functionality for an on-disk DataFrame.
Community tutorials
This is a guide to many pandas tutorials contributed by the community, geared mainly for new users.
The goal of this 2015 cookbook (by Julia Evans) is to give you some concrete examples for getting started with pandas.
These are examples with real-world data, and all the bugs and weirdness that entails. For the table of contents, see the
pandas-cookbook GitHub repository.
This guide is an introduction to the data analysis process using the Python data ecosystem and an interesting open
dataset. There are four sections covering selected topics such as munging data, aggregating data, visualizing data, and time
series.
Practice your skills with real data sets and exercises. For more resources, please visit the main repository.
Modern pandas
Tutorial series written in 2016 by Tom Augspurger. The source may be found in the GitHub repository
TomAugspurger/effective-pandas.
• Modern Pandas
• Method Chaining
• Indexes
• Performance
• Tidy Data
• Visualization
• Timeseries
Video tutorials
Various tutorials
USER GUIDE
The User Guide covers all of pandas by topic area. Each of the subsections introduces a topic (such as “working with
missing data”), and discusses how pandas approaches the problem, with many examples throughout.
Users brand-new to pandas should start with 10min.
For a high level summary of the pandas fundamentals, see Intro to data structures and Essential basic functionality.
Further information on any specific method can be obtained in the API reference.
This is a short introduction to pandas, geared mainly for new users. You can see more complex recipes in the Cookbook.
Customarily, we import as follows:
In [1]: import numpy as np
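The companion pandas import, and a Series like the one displayed below, would conventionally be created as:

import pandas as pd
s = pd.Series([1, 3, 5, np.nan, 6, 8])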
In [4]: s
Out[4]:
0 1.0
1 3.0
2 5.0
3 NaN
4 6.0
5 8.0
dtype: float64
Creating a DataFrame by passing a NumPy array, with a datetime index and labeled columns:
In [5]: dates = pd.date_range("20130101", periods=6)
In [6]: dates
Out[6]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
               '2013-01-05', '2013-01-06'],
              dtype='datetime64[ns]', freq='D')
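The DataFrame displayed below would be built along these lines (the random values themselves will of course differ from run to run):

df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))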
In [8]: df
Out[8]:
A B C D
2013-01-01 -0.589048 1.169468 -0.566654 0.426696
2013-01-02 0.918426 -1.397296 1.858663 0.841534
2013-01-03 -1.345281 0.007690 -0.033279 1.012771
2013-01-04 -0.233397 -0.599248 0.341060 0.865200
2013-01-05 0.135838 0.236250 0.760612 -0.240318
2013-01-06 -1.414459 -0.343968 0.026729 0.244390
In [10]: df2
Out[10]:
A B C D E F
0 1.0 2013-01-02 1.0 3 test foo
1 1.0 2013-01-02 1.0 3 train foo
2 1.0 2013-01-02 1.0 3 test foo
3 1.0 2013-01-02 1.0 3 train foo
In [11]: df2.dtypes
Out[11]:
A float64
B datetime64[ns]
C float32
D int32
E category
F object
dtype: object
If you’re using IPython, tab completion for column names (as well as public attributes) is automatically enabled.
Here’s a subset of the attributes that will be completed:
As you can see, the columns A, B, C, and D are automatically tab completed. E and F are there as well; the rest of the
attributes have been truncated for brevity.
In [13]: df.head()
Out[13]:
A B C D
2013-01-01 -0.589048 1.169468 -0.566654 0.426696
2013-01-02 0.918426 -1.397296 1.858663 0.841534
2013-01-03 -1.345281 0.007690 -0.033279 1.012771
2013-01-04 -0.233397 -0.599248 0.341060 0.865200
2013-01-05 0.135838 0.236250 0.760612 -0.240318
In [14]: df.tail(3)
Out[14]:
A B C D
2013-01-04 -0.233397 -0.599248 0.341060 0.865200
2013-01-05 0.135838 0.236250 0.760612 -0.240318
2013-01-06 -1.414459 -0.343968 0.026729 0.244390
In [15]: df.index
Out[15]:
DatetimeIndex(['2013-01-01', '2013-01-02', '2013-01-03', '2013-01-04',
'2013-01-05', '2013-01-06'],
dtype='datetime64[ns]', freq='D')
In [16]: df.columns
Out[16]: Index(['A', 'B', 'C', 'D'], dtype='object')
DataFrame.to_numpy() gives a NumPy representation of the underlying data. Note that this can be an expensive
operation when your DataFrame has columns with different data types, which comes down to a fundamental differ-
ence between pandas and NumPy: NumPy arrays have one dtype for the entire array, while pandas DataFrames
have one dtype per column. When you call DataFrame.to_numpy(), pandas will find the NumPy dtype that
can hold all of the dtypes in the DataFrame. This may end up being object, which requires casting every value to a
Python object.
For df, our DataFrame of all floating-point values, DataFrame.to_numpy() is fast and doesn’t require copying
data.
In [17]: df.to_numpy()
Out[17]:
array([[-0.58904772, 1.16946809, -0.56665398, 0.42669597],
[ 0.91842619, -1.39729612, 1.85866285, 0.84153428],
[-1.34528109, 0.00768962, -0.03327893, 1.0127712 ],
[-0.23339659, -0.59924775, 0.34106029, 0.86520026],
[ 0.13583795, 0.23625047, 0.76061191, -0.2403179 ],
[-1.41445893, -0.34396829, 0.02672904, 0.24438951]])
For df2, the DataFrame with multiple dtypes, DataFrame.to_numpy() is relatively expensive.
In [18]: df2.to_numpy()
Out[18]:
array([[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'test', 'foo'],
[1.0, Timestamp('2013-01-02 00:00:00'), 1.0, 3, 'train', 'foo']],
dtype=object)
Note: DataFrame.to_numpy() does not include the index or column labels in the output.
Sorting by an axis:
In [21]: df.sort_index(axis=1, ascending=False)
Out[21]:
D C B A
2013-01-01 0.426696 -0.566654 1.169468 -0.589048
2013-01-02 0.841534 1.858663 -1.397296 0.918426
2013-01-03 1.012771 -0.033279 0.007690 -1.345281
2013-01-04  0.865200  0.341060 -0.599248 -0.233397
2013-01-05 -0.240318  0.760612  0.236250  0.135838
2013-01-06  0.244390  0.026729 -0.343968 -1.414459
Sorting by values:
In [22]: df.sort_values(by="B")
Out[22]:
A B C D
2013-01-02 0.918426 -1.397296 1.858663 0.841534
2013-01-04 -0.233397 -0.599248 0.341060 0.865200
2013-01-06 -1.414459 -0.343968 0.026729 0.244390
2013-01-03 -1.345281 0.007690 -0.033279 1.012771
2013-01-05 0.135838 0.236250 0.760612 -0.240318
2013-01-01 -0.589048 1.169468 -0.566654 0.426696
2.1.3 Selection
Note: While standard Python / NumPy expressions for selecting and setting are intuitive and come in handy for
interactive work, for production code, we recommend the optimized pandas data access methods, .at, .iat, .loc
and .iloc.
See the indexing documentation Indexing and Selecting Data and MultiIndex / Advanced Indexing.
Getting
In [23]: df["A"]
Out[23]:
2013-01-01 -0.589048
2013-01-02 0.918426
2013-01-03 -1.345281
2013-01-04 -0.233397
2013-01-05 0.135838
2013-01-06 -1.414459
Freq: D, Name: A, dtype: float64
In [24]: df[0:3]
Out[24]:
A B C D
2013-01-01 -0.589048 1.169468 -0.566654 0.426696
2013-01-02 0.918426 -1.397296 1.858663 0.841534
2013-01-03 -1.345281 0.007690 -0.033279 1.012771
In [25]: df["20130102":"20130104"]
Out[25]:
A B C D
2013-01-02 0.918426 -1.397296 1.858663 0.841534
2013-01-03 -1.345281  0.007690 -0.033279  1.012771
2013-01-04 -0.233397 -0.599248  0.341060  0.865200
Selection by label
In [26]: df.loc[dates[0]]
Out[26]:
A -0.589048
B 1.169468
C -0.566654
D 0.426696
Name: 2013-01-01 00:00:00, dtype: float64
Selection by position
In [32]: df.iloc[3]
Out[32]:
A -0.233397
B -0.599248
C 0.341060
D 0.865200
Name: 2013-01-04 00:00:00, dtype: float64
In [35]: df.iloc[1:3, :]
Out[35]:
A B C D
2013-01-02 0.918426 -1.397296 1.858663 0.841534
2013-01-03 -1.345281 0.007690 -0.033279 1.012771
In [37]: df.iloc[1, 1]
Out[37]: -1.397296123766777
In [38]: df.iat[1, 1]
Out[38]: -1.397296123766777
Boolean indexing
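A sketch of the selections this section is built around (hypothetical but consistent with the frames shown here):

df[df["A"] > 0]    # rows where a single column's values are positive
df[df > 0]         # keep values where the condition holds; other cells become NaN

# The df2 shown below could have been built along these lines, then filtered with isin():
df2 = df.copy()
df2["E"] = ["one", "one", "two", "three", "four", "three"]
df2[df2["E"].isin(["two", "four"])]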
In [43]: df2
Out[43]:
A B C D E
2013-01-01 -0.589048 1.169468 -0.566654 0.426696 one
2013-01-02 0.918426 -1.397296 1.858663 0.841534 one
2013-01-03 -1.345281 0.007690 -0.033279 1.012771 two
2013-01-04 -0.233397 -0.599248 0.341060 0.865200 three
2013-01-05 0.135838 0.236250 0.760612 -0.240318 four
2013-01-06 -1.414459 -0.343968 0.026729 0.244390 three
Setting
In [46]: s1
Out[46]:
2013-01-02 1
2013-01-03 2
2013-01-04 3
2013-01-05 4
2013-01-06 5
2013-01-07 6
Freq: D, dtype: int64
In [47]: df["F"] = s1
In [49]: df.iat[0, 1] = 0
In [51]: df
Out[51]:
A B C D F
2013-01-01 0.000000 0.000000 -0.566654 5 NaN
2013-01-02 0.918426 -1.397296 1.858663 5 1.0
2013-01-03 -1.345281 0.007690 -0.033279 5 2.0
2013-01-04 -0.233397 -0.599248 0.341060 5 3.0
2013-01-05 0.135838 0.236250 0.760612 5 4.0
2013-01-06 -1.414459 -0.343968 0.026729 5 5.0
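The df2 displayed next is presumably a copy of df with a where-style assignment applied, e.g.:

df2 = df.copy()
df2[df2 > 0] = -df2   # flip the sign of all positive values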
In [54]: df2
Out[54]:
A B C D F
2013-01-01 0.000000 0.000000 -0.566654 -5 NaN
2013-01-02 -0.918426 -1.397296 -1.858663 -5 -1.0
2013-01-03 -1.345281 -0.007690 -0.033279 -5 -2.0
2013-01-04 -0.233397 -0.599248 -0.341060 -5 -3.0
2013-01-05 -0.135838 -0.236250 -0.760612 -5 -4.0
2013-01-06 -1.414459 -0.343968 -0.026729 -5 -5.0
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See
the Missing Data section.
Reindexing allows you to change/add/delete the index on a specified axis. This returns a copy of the data.
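A sketch of the reindexing that produces the df1 shown below:

df1 = df.reindex(index=dates[0:4], columns=list(df.columns) + ["E"])
df1.loc[dates[0]:dates[1], "E"] = 1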
In [57]: df1
Out[57]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.566654 5 NaN 1.0
2013-01-02 0.918426 -1.397296 1.858663 5 1.0 1.0
2013-01-03 -1.345281 0.007690 -0.033279 5 2.0 NaN
2013-01-04 -0.233397 -0.599248 0.341060 5 3.0 NaN
In [58]: df1.dropna(how="any")
Out[58]:
A B C D F E
2013-01-02 0.918426 -1.397296 1.858663 5 1.0 1.0
In [59]: df1.fillna(value=5)
Out[59]:
A B C D F E
2013-01-01 0.000000 0.000000 -0.566654 5 5.0 1.0
2013-01-02 0.918426 -1.397296 1.858663 5 1.0 1.0
2013-01-03 -1.345281 0.007690 -0.033279 5 2.0 5.0
2013-01-04 -0.233397 -0.599248 0.341060 5 3.0 5.0
In [60]: pd.isna(df1)
Out[60]:
A B C D F E
2013-01-01 False False False False True False
2013-01-02 False False False False False False
2013-01-03 False False False False False True
2013-01-04 False False False False False True
2.1.5 Operations
Stats
In [61]: df.mean()
Out[61]:
A -0.323145
B -0.349429
C 0.397855
D 5.000000
F 3.000000
dtype: float64
In [62]: df.mean(1)
Out[62]:
2013-01-01 1.108337
2013-01-02 1.475959
2013-01-03 1.125826
2013-01-04 1.501683
2013-01-05 2.026540
2013-01-06 1.653660
Freq: D, dtype: float64
Operating with objects that have different dimensionality and need alignment: pandas automatically
broadcasts along the specified dimension.
In [64]: s
Out[64]:
2013-01-01 NaN
2013-01-02 NaN
2013-01-03 1.0
2013-01-04 3.0
2013-01-05 5.0
2013-01-06 NaN
Freq: D, dtype: float64
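The shifted Series above can then be subtracted from df along the row index, for example:

df.sub(s, axis="index")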
Apply
In [66]: df.apply(np.cumsum)
Out[66]:
A B C D F
2013-01-01 0.000000 0.000000 -0.566654 5 NaN
2013-01-02 0.918426 -1.397296 1.292009 10 1.0
2013-01-03 -0.426855 -1.389606 1.258730 15 3.0
2013-01-04 -0.660251 -1.988854 1.599790 20 6.0
2013-01-05 -0.524414 -1.752604 2.360402 25 10.0
2013-01-06 -1.938872 -2.096572 2.387131 30 15.0
Histogramming
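The integer Series displayed below could be generated with something like:

s = pd.Series(np.random.randint(0, 7, size=10))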
In [69]: s
Out[69]:
0 2
1 2
2 3
3 0
4 4
5 3
6 3
7 4
8 5
9 5
dtype: int64
In [70]: s.value_counts()
Out[70]:
3 3
2 2
4 2
5 2
0 1
dtype: int64
String Methods
Series is equipped with a set of string processing methods in the str attribute that make it easy to operate on each
element of the array, as in the code snippet below. Note that pattern-matching in str generally uses regular expressions
by default (and in some cases always uses them). See more at Vectorized String Methods.
In [71]: s = pd.Series(["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"])
In [72]: s.str.lower()
Out[72]:
0 a
1 b
2 c
3 aaba
4 baca
5 NaN
6 caba
7 dog
8 cat
dtype: object
2.1.6 Merge
Concat
pandas provides various facilities for easily combining Series and DataFrame objects with various kinds of
set logic for the indexes and relational algebra functionality in the case of join / merge-type operations.
See the Merging section.
Concatenating pandas objects together with concat():
In [74]: df
Out[74]:
0 1 2 3
0 -0.173796 0.294287 1.035954 0.486969
1 0.756078 -1.290473 0.264892 -1.223834
2 0.299342 0.923117 0.501113 -1.138437
3 -1.328275 1.766024 -0.616740 -2.408952
4 -0.459283 1.250767 1.152071 0.474141
5 1.567727 -0.488712 -0.025496 -0.729733
6 0.308275 -0.458486 -0.185078 0.084662
7 -0.433103 -2.417387 0.789715 0.851223
8 0.429768 0.330878 -1.232125 -0.946525
9 -0.346267 0.860548 -0.072269 1.365033
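Breaking the frame into pieces and concatenating them back together, matching the output below:

pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)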
In [76]: pd.concat(pieces)
Out[76]:
0 1 2 3
0 -0.173796 0.294287 1.035954 0.486969
1 0.756078 -1.290473 0.264892 -1.223834
2  0.299342  0.923117  0.501113 -1.138437
3 -1.328275  1.766024 -0.616740 -2.408952
4 -0.459283  1.250767  1.152071  0.474141
5  1.567727 -0.488712 -0.025496 -0.729733
6  0.308275 -0.458486 -0.185078  0.084662
7 -0.433103 -2.417387  0.789715  0.851223
8  0.429768  0.330878 -1.232125 -0.946525
9 -0.346267  0.860548 -0.072269  1.365033
Note: Adding a column to a DataFrame is relatively fast. However, adding a row requires a copy, and may be
expensive. We recommend passing a pre-built list of records to the DataFrame constructor instead of building a
DataFrame by iteratively appending records to it. See Appending to dataframe for more.
Join
In [79]: left
Out[79]:
key lval
0 foo 1
1 foo 2
In [80]: right
Out[80]:
key rval
0 foo 4
1 foo 5
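SQL-style merging of these two frames is a single call; because the key is duplicated, the result has four rows:

pd.merge(left, right, on="key")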
In [84]: left
Out[84]:
key lval
0 foo 1
1 bar 2
2.1.7 Grouping
By “group by” we are referring to a process involving one or more of the following steps:
• Splitting the data into groups based on some criteria
• Applying a function to each group independently
• Combining the results into a data structure
See the Grouping section.
In [87]: df = pd.DataFrame(
....: {
....: "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
....: "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
....: "C": np.random.randn(8),
....: "D": np.random.randn(8),
....: }
....: )
....:
In [88]: df
Out[88]:
A B C D
0 foo one 1.061156 -0.119197
1 bar one 1.125236 -0.042551
2 foo two -0.922532 -1.329419
3 bar three 0.870067 0.710917
4 foo two 0.829397 0.334992
5 bar two -0.567249 0.153956
6 foo one -2.256723 -0.302493
7 foo three -1.562813 0.437561
Grouping and then applying the sum() function to the resulting groups.
In [89]: df.groupby("A").sum()
Out[89]:
C D
A
bar 1.428054 0.822321
foo -2.851516 -0.978555
Grouping by multiple columns forms a hierarchical index, and again we can apply the sum() function.
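For example:

df.groupby(["A", "B"]).sum()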
2.1.8 Reshaping
Stack
In [95]: df2
Out[95]:
A B
first second
bar one 0.230999 0.846525
two -0.069210 0.395198
baz one -0.863593 -0.931998
two 0.538124 -1.513383
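The stack() method "compresses" a level in the DataFrame's columns; the stacked object shown next could be produced with:

stacked = df2.stack()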
In [97]: stacked
Out[97]:
first second
bar one A 0.230999
B 0.846525
two A -0.069210
B 0.395198
baz one A -0.863593
            B   -0.931998
     two    A    0.538124
            B   -1.513383
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack() is
unstack(), which by default unstacks the last level:
In [98]: stacked.unstack()
Out[98]:
A B
first second
bar one 0.230999 0.846525
two -0.069210 0.395198
baz one -0.863593 -0.931998
two 0.538124 -1.513383
In [99]: stacked.unstack(1)
Out[99]:
second one two
first
bar A 0.230999 -0.069210
B 0.846525 0.395198
baz A -0.863593 0.538124
B -0.931998 -1.513383
In [100]: stacked.unstack(0)
Out[100]:
first bar baz
second
one A 0.230999 -0.863593
B 0.846525 -0.931998
two A -0.069210 0.538124
B 0.395198 -1.513383
Pivot tables
In [101]: df = pd.DataFrame(
.....: {
.....: "A": ["one", "one", "two", "three"] * 3,
.....: "B": ["A", "B", "C"] * 4,
.....: "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 2,
.....: "D": np.random.randn(12),
.....: "E": np.random.randn(12),
.....: }
.....: )
.....:
In [102]: df
Out[102]:
A B C D E
0 one A foo -0.101231 0.875230
1 one B foo -0.603943 0.184850
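A pivot table over this data could be produced along these lines (a sketch of a typical call; the original command was not preserved in this excerpt):

pd.pivot_table(df, values="D", index=["A", "B"], columns=["C"])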
2.1.9 Time series
pandas has simple, powerful, and efficient functionality for performing resampling operations during frequency con-
version (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to, financial
applications. See the Time Series section.
In [104]: rng = pd.date_range("1/1/2012", periods=100, freq="S")
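A Series of random counts indexed by rng, like the one resampled below, could be built with:

ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)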
In [106]: ts.resample("5Min").sum()
Out[106]:
2012-01-01 25806
Freq: 5T, dtype: int64
In [109]: ts
Out[109]:
2012-03-06 0.228353
2012-03-07 -0.625865
2012-03-08 1.527161
2012-03-09 -0.033187
2012-03-10 -0.232993
Freq: D, dtype: float64
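Time zone localization, producing the ts_utc shown next, is a one-liner:

ts_utc = ts.tz_localize("UTC")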
In [111]: ts_utc
Out[111]:
2012-03-06 00:00:00+00:00 0.228353
2012-03-07 00:00:00+00:00 -0.625865
2012-03-08 00:00:00+00:00 1.527161
2012-03-09 00:00:00+00:00 -0.033187
2012-03-10 00:00:00+00:00 -0.232993
Freq: D, dtype: float64
In [115]: ts
Out[115]:
2012-01-31 -0.630047
2012-02-29 0.017854
2012-03-31 -0.448239
2012-04-30 -2.580608
2012-05-31 -1.549450
Freq: M, dtype: float64
In [116]: ps = ts.to_period()
In [117]: ps
Out[117]:
2012-01 -0.630047
2012-02 0.017854
2012-03 -0.448239
2012-04 -2.580608
2012-05 -1.549450
Freq: M, dtype: float64
In [118]: ps.to_timestamp()
Out[118]:
2012-01-01 -0.630047
2012-02-01 0.017854
2012-03-01 -0.448239
2012-04-01 -2.580608
2012-05-01 -1.549450
Freq: MS, dtype: float64
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following
example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following
the quarter end:
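A sketch of that conversion, assuming quarterly data like the ts shown below:

prng = pd.period_range("1990Q1", "2000Q4", freq="Q-NOV")
ts = pd.Series(np.random.randn(len(prng)), prng)
ts.index = (prng.asfreq("M", "e") + 1).asfreq("H", "s") + 9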
In [122]: ts.head()
Out[122]:
1990-03-01 09:00 -2.156529
1990-06-01 09:00 -1.997974
1990-09-01 09:00 -0.017075
1990-12-01 09:00 0.116179
1991-03-01 09:00 0.651319
Freq: H, dtype: float64
2.1.10 Categoricals
pandas can include categorical data in a DataFrame. For full docs, see the categorical introduction and the API
documentation.
In [123]: df = pd.DataFrame(
.....: {"id": [1, 2, 3, 4, 5, 6], "raw_grade": ["a", "b", "b", "a", "a", "e"]}
.....: )
.....:
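Converting the raw grades to a categorical data type produces the grade column displayed next:

df["grade"] = df["raw_grade"].astype("category")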
In [125]: df["grade"]
Out[125]:
0 a
1 b
2 b
3 a
4 a
5 e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
Reorder the categories and simultaneously add the missing categories (methods under Series.cat() return a new
Series by default).
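A sketch of the renaming and reordering that yields the categories shown below:

df["grade"] = df["grade"].cat.rename_categories(["very good", "good", "very bad"])
df["grade"] = df["grade"].cat.set_categories(
    ["very bad", "bad", "medium", "good", "very good"]
)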
In [128]: df["grade"]
Out[128]:
0 very good
1 good
2 good
3 very good
4 very good
5 very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
In [129]: df.sort_values(by="grade")
Out[129]:
id raw_grade grade
5 6 e very bad
1 2 b good
2 3 b good
0 1 a very good
3 4 a very good
4 5 a very good
In [130]: df.groupby("grade").size()
Out[130]:
grade
very bad 1
bad 0
medium 0
good 2
very good 3
dtype: int64
2.1.11 Plotting
In [132]: plt.close("all")
In [134]: ts = ts.cumsum()
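Here ts is a Series of random values over a daily date_range, e.g. pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000)); plotting it is then a single call:

ts.plot()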
On a DataFrame, the plot() method is a convenience to plot all of the columns with labels:
In [136]: df = pd.DataFrame(
.....: np.random.randn(1000, 4), index=ts.index, columns=["A", "B", "C", "D"]
.....: )
.....:
In [137]: df = df.cumsum()
In [138]: plt.figure()
Out[138]: <Figure size 640x480 with 0 Axes>
In [139]: df.plot()
Out[139]: <AxesSubplot:>
In [140]: plt.legend(loc='best')
Out[140]: <matplotlib.legend.Legend at 0x7f472de49ac0>
2.1.12 Getting data in/out
CSV
In [141]: df.to_csv("foo.csv")
In [142]: pd.read_csv("foo.csv")
Out[142]:
Unnamed: 0 A B C D
0 2000-01-01 -0.141576 -1.249164 -0.509289 0.863434
1 2000-01-02 2.280935 -0.557876 -1.164561 0.719622
2 2000-01-03 2.847338 -0.325527 -0.030954 -1.329971
3 2000-01-04 2.520513 -0.430386 0.077867 0.315138
4 2000-01-05 3.471048 0.250893 0.436884 2.036795
.. ... ... ... ... ...
995 2002-09-22 -35.593843 75.106278 -28.061908 -33.244577
996 2002-09-23 -37.146841 76.676146 -28.548138 -33.440373
997 2002-09-24 -38.244108 76.902454 -28.967187 -34.192784
998 2002-09-25 -39.334166 76.112146 -29.071529 -33.432710
HDF5
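Reading from and writing to an HDF5 store follows the same pattern (requires PyTables; a minimal sketch):

df.to_hdf("foo.h5", "df")
pd.read_hdf("foo.h5", "df")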
Excel
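Reading and writing Excel files likewise (requires an Excel engine such as openpyxl; a minimal sketch):

df.to_excel("foo.xlsx", sheet_name="Sheet1")
pd.read_excel("foo.xlsx", "Sheet1", index_col=None, na_values=["NA"])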
2.1.13 Gotchas
If you are attempting to perform an operation, you might see an exception like:
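For example, using a Series in a boolean context raises an exception of this kind:

>>> if pd.Series([False, True, False]):
...     print("I was true")
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().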
2.2 Intro to data structures
We'll start with a quick, non-comprehensive overview of the fundamental data structures in pandas to get you started.
The fundamental behaviors regarding data types, indexing, and axis labeling / alignment apply across all of the objects. To
get started, import NumPy and load pandas into your namespace:
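That is:

import numpy as np
import pandas as pd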
Here is a basic tenet to keep in mind: data alignment is intrinsic. The link between labels and data will not be broken
unless done so explicitly by you.
We’ll give a brief intro to the data structures, then consider all of the broad categories of functionality and methods in
separate sections.
2.2.1 Series
Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers,
Python objects, etc.). The axis labels are collectively referred to as the index. The basic method to create a Series is
to call:
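That is (data can be many things, as described next, and index is a list of axis labels):

s = pd.Series(data, index=index)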
If data is an ndarray, index must be the same length as data. If no index is passed, one will be created having values
[0, ..., len(data) - 1].
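For example, the Series displayed below could be created from a random ndarray with a string index:

s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])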
In [4]: s
Out[4]:
a 0.469112
b -0.282863
c -1.509059
d -1.135632
e 1.212112
dtype: float64
In [5]: s.index
Out[5]: Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
In [6]: pd.Series(np.random.randn(5))
Out[6]:
0 -0.173215
1 0.119209
2 -1.044236
3 -0.861849
4 -2.104569
dtype: float64
Note: pandas supports non-unique index values. If an operation that does not support duplicate index values is
attempted, an exception will be raised at that time. The reason for being lazy is nearly all performance-based (there
are many instances in computations, like parts of GroupBy, where the index is not used).
From dict
Series can be instantiated from dicts:
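For example, for the output shown below:

d = {"b": 1, "a": 0, "c": 2}
pd.Series(d)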
In [8]: pd.Series(d)
Out[8]:
b 1
a 0
c 2
dtype: int64
Note: When the data is a dict, and an index is not passed, the Series index will be ordered by the dict’s insertion
order, if you’re using Python version >= 3.6 and pandas version >= 0.23.
If you’re using Python < 3.6 or pandas < 0.23, and an index is not passed, the Series index will be the lexically
ordered list of dict keys.
In the example above, if you were on a Python version lower than 3.6 or a pandas version lower than 0.23, the Series
would be ordered by the lexical order of the dict keys (i.e. ['a', 'b', 'c'] rather than ['b', 'a', 'c']).
If an index is passed, the values in data corresponding to the labels in the index will be pulled out.
In [10]: pd.Series(d)
Out[10]:
a 0.0
b 1.0
c 2.0
dtype: float64
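For example, labels not present in the dict come back as NaN:

pd.Series(d, index=["b", "c", "d", "a"])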
Note: NaN (not a number) is the standard missing data marker used in pandas.
Series is ndarray-like
Series acts very similarly to an ndarray and is a valid argument to most NumPy functions. However, operations
such as slicing will also slice the index.
In [13]: s[0]
Out[13]: 0.4691122999071863
In [14]: s[:3]
Out[14]:
a 0.469112
b -0.282863
c -1.509059
dtype: float64
In [17]: np.exp(s)
Out[17]:
a 1.598575
b 0.753623
c 0.221118
d 0.321219
e 3.360575
dtype: float64
In [18]: s.dtype
Out[18]: dtype('float64')
This is often a NumPy dtype. However, pandas and 3rd-party libraries extend NumPy’s type system in a few places,
in which case the dtype would be an ExtensionDtype. Some examples within pandas are Categorical data and
Nullable integer data type. See dtypes for more.
If you need the actual array backing a Series, use Series.array.
In [19]: s.array
Out[19]:
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
-1.1356323710171934, 1.2121120250208506]
Length: 5, dtype: float64
Accessing the array can be useful when you need to do some operation without the index (to disable automatic
alignment, for example).
Series.array will always be an ExtensionArray. Briefly, an ExtensionArray is a thin wrapper around one
or more concrete arrays like a numpy.ndarray. pandas knows how to take an ExtensionArray and store it in
a Series or a column of a DataFrame. See dtypes for more.
While Series is ndarray-like, if you need an actual ndarray, then use Series.to_numpy().
In [20]: s.to_numpy()
Out[20]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])
Even if the Series is backed by an ExtensionArray, Series.to_numpy() will return a NumPy ndarray.
Series is dict-like
A Series is like a fixed-size dict in that you can get and set values by index label:
In [21]: s["a"]
Out[21]: 0.4691122999071863
In [23]: s
Out[23]:
a 0.469112
b -0.282863
c -1.509059
d -1.135632
e 12.000000
dtype: float64
In [24]: "e" in s
Out[24]: True
In [25]: "f" in s
Out[25]: False
>>> s["f"]
KeyError: 'f'
Using the get method, a missing label will return None or a specified default:
In [26]: s.get("f")
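For example, with an explicit default:

s.get("f", np.nan)   # returns nan instead of None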
When working with raw NumPy arrays, looping through the values one by one is usually not necessary. The same is true
when working with Series in pandas. Series can also be passed into most NumPy methods expecting an ndarray.
In [28]: s + s
Out[28]:
a 0.938225
b -0.565727
c -3.018117
d -2.271265
e 24.000000
dtype: float64
In [29]: s * 2
Out[29]:
a 0.938225
b -0.565727
c    -3.018117
d    -2.271265
e    24.000000
dtype: float64
In [30]: np.exp(s)
Out[30]:
a 1.598575
b 0.753623
c 0.221118
d 0.321219
e 162754.791419
dtype: float64
A key difference between Series and ndarray is that operations between Series automatically align the data based on
label. Thus, you can write computations without giving consideration to whether the Series involved have the same
labels.
In [31]: s[1:] + s[:-1]
Out[31]:
a NaN
b -0.565727
c -3.018117
d -2.271265
e NaN
dtype: float64
The result of an operation between unaligned Series will have the union of the indexes involved. If a label is not found
in one Series or the other, the result will be marked as missing (NaN). Being able to write code without doing any explicit
data alignment grants immense freedom and flexibility in interactive data analysis and research. The integrated data
alignment features of the pandas data structures set pandas apart from the majority of related tools for working with
labeled data.
Note: In general, we chose to make the default result of operations between differently indexed objects yield the
union of the indexes in order to avoid loss of information. Having an index label, though the data is missing, is
typically important information as part of a computation. You of course have the option of dropping labels with
missing data via the dropna function.
Name attribute
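A Series can be given a name at construction time, which is how the Series shown below got its name:

s = pd.Series(np.random.randn(5), name="something")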
In [33]: s
Out[33]:
0 -0.494929
1 1.071804
2 0.721555
3 -0.706771
4 -1.039575
Name: something, dtype: float64
The Series name will be assigned automatically in many cases, in particular when taking 1D slices of DataFrame as
you will see below.
You can rename a Series with the pandas.Series.rename() method.
In [35]: s2 = s.rename("different")
In [36]: s2.name
Out[36]: 'different'
2.2.2 DataFrame
DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it
like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.
Like Series, DataFrame accepts many different kinds of input:
• Dict of 1D ndarrays, lists, dicts, or Series
• 2-D numpy.ndarray
• Structured or record ndarray
• A Series
• Another DataFrame
Along with the data, you can optionally pass index (row labels) and columns (column labels) arguments. If you pass
an index and / or columns, you are guaranteeing the index and / or columns of the resulting DataFrame. Thus, a dict
of Series plus a specific index will discard all data not matching up to the passed index.
If axis labels are not passed, they will be constructed from the input data based on common sense rules.
Note: When the data is a dict, and columns is not specified, the DataFrame columns will be ordered by the dict’s
insertion order, if you are using Python version >= 3.6 and pandas >= 0.23.
If you are using Python < 3.6 or pandas < 0.23, and columns is not specified, the DataFrame columns will be the
lexically ordered list of dict keys.
The resulting index will be the union of the indexes of the various Series. If there are any nested dicts, these will first
be converted to Series. If no columns are passed, the columns will be the ordered list of dict keys.
In [37]: d = {
....: "one": pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"]),
....: "two": pd.Series([1.0, 2.0, 3.0, 4.0], index=["a", "b", "c", "d"]),
....: }
....:
In [38]: df = pd.DataFrame(d)
In [39]: df
Out[39]:
one two
a 1.0 1.0
b 2.0 2.0
c 3.0 3.0
d NaN 4.0
The row and column labels can be accessed respectively by accessing the index and columns attributes:
Note: When a particular set of columns is passed along with a dict of data, the passed columns override the keys in
the dict.
In [42]: df.index
Out[42]: Index(['a', 'b', 'c', 'd'], dtype='object')
In [43]: df.columns
Out[43]: Index(['one', 'two'], dtype='object')
The ndarrays must all be the same length. If an index is passed, it must clearly also be the same length as the arrays.
If no index is passed, the result will be range(n), where n is the array length.
In [44]: d = {"one": [1.0, 2.0, 3.0, 4.0], "two": [4.0, 3.0, 2.0, 1.0]}
In [45]: pd.DataFrame(d)
Out[45]:
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0
In [49]: pd.DataFrame(data)
Out[49]:
A B C
0 1 2.0 b'Hello'
1 2 3.0 b'World'
Note: DataFrame is not intended to work exactly like a 2-dimensional NumPy ndarray.
In [52]: data2 = [{"a": 1, "b": 2}, {"a": 5, "b": 10, "c": 20}]
In [53]: pd.DataFrame(data2)
Out[53]:
a b c
0 1 2 NaN
1 5 10 20.0
In [56]: pd.DataFrame(
....: {
....: ("a", "b"): {("A", "B"): 1, ("A", "C"): 2},
....: ("a", "a"): {("A", "C"): 3, ("A", "B"): 4},
....: ("a", "c"): {("A", "B"): 5, ("A", "C"): 6},
....: ("b", "a"): {("A", "C"): 7, ("A", "B"): 8},
....: ("b", "b"): {("A", "D"): 9, ("A", "B"): 10},
....: }
....: )
....:
Out[56]:
a b
b a c a b
A B 1.0 4.0 5.0 8.0 10.0
C 2.0 3.0 6.0 7.0 NaN
D NaN NaN NaN NaN 9.0
From a Series
The result will be a DataFrame with the same index as the input Series, and with one column whose name is the
original name of the Series (only if no other column name provided).
From a list of namedtuples
The field names of the first namedtuple in the list determine the columns of the DataFrame. The remaining
namedtuples (or tuples) are simply unpacked and their values are fed into the rows of the DataFrame. If any of those
tuples is shorter than the first namedtuple then the later columns in the corresponding row are marked as missing
values. If any are longer than the first namedtuple, a ValueError is raised.
Missing data
Much more will be said on this topic in the Missing data section. To construct a DataFrame with missing data, we use
np.nan to represent missing values. Alternatively, you may pass a numpy.MaskedArray as the data argument to
the DataFrame constructor, and its masked entries will be considered missing.
Alternate constructors
DataFrame.from_dict
DataFrame.from_dict takes a dict of dicts or a dict of array-like sequences and returns a DataFrame. It operates
like the DataFrame constructor except for the orient parameter which is 'columns' by default, but which can
be set to 'index' in order to use the dict keys as row labels.
In [65]: pd.DataFrame.from_dict(dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]))
Out[65]:
A B
0 1 4
1 2 5
2 3 6
If you pass orient='index', the keys will be the row labels. In this case, you can also pass the desired column
names:
In [66]: pd.DataFrame.from_dict(
....: dict([("A", [1, 2, 3]), ("B", [4, 5, 6])]),
....: orient="index",
....: columns=["one", "two", "three"],
....: )
....:
Out[66]:
   one  two  three
A    1    2      3
B    4    5      6
DataFrame.from_records
DataFrame.from_records takes a list of tuples or an ndarray with structured dtype. It works analogously to the
normal DataFrame constructor, except that the resulting DataFrame index may be a specific field of the structured
dtype. For example:
In [67]: data
Out[67]:
array([(1, 2., b'Hello'), (2, 3., b'World')],
dtype=[('A', '<i4'), ('B', '<f4'), ('C', 'S10')])
Column selection, addition, deletion
You can treat a DataFrame semantically like a dict of like-indexed Series objects. Getting, setting, and deleting
columns works with the same syntax as the analogous dict operations:
In [69]: df["one"]
Out[69]:
a 1.0
b 2.0
c 3.0
d NaN
Name: one, dtype: float64
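The three and flag columns in the frame shown next would be added like dict entries, e.g.:

df["three"] = df["one"] * df["two"]
df["flag"] = df["one"] > 2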
In [72]: df
Out[72]:
one two three flag
a 1.0 1.0 1.0 False
b 2.0 2.0 4.0 False
c 3.0 3.0 9.0 True
d NaN 4.0 NaN False
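Columns can likewise be deleted or popped like with a dict, which is what leads to the frame shown next:

del df["two"]
three = df.pop("three")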
In [75]: df
Out[75]:
   one   flag
a  1.0  False
b  2.0  False
c  3.0   True
d  NaN  False
When inserting a scalar value, it will naturally be propagated to fill the column:
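For example, the foo column below:

df["foo"] = "bar"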
In [77]: df
Out[77]:
one flag foo
a 1.0 False bar
b 2.0 False bar
c 3.0 True bar
d NaN False bar
When inserting a Series that does not have the same index as the DataFrame, it will be conformed to the DataFrame’s
index:
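For example, the one_trunc column below was presumably created from a truncated slice of one:

df["one_trunc"] = df["one"][:2]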
In [79]: df
Out[79]:
one flag foo one_trunc
a 1.0 False bar 1.0
b 2.0 False bar 2.0
c 3.0 True bar NaN
d NaN False bar NaN
You can insert raw ndarrays but their length must match the length of the DataFrame’s index.
By default, columns get inserted at the end. The insert function is available to insert at a particular location in the
columns:
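For example, the bar column in the frame shown next sits in position 1:

df.insert(1, "bar", df["one"])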
In [81]: df
Out[81]:
one bar flag foo one_trunc
a 1.0 1.0 False bar 1.0
b 2.0 2.0 False bar 2.0
c 3.0 3.0 True bar NaN
d NaN NaN False bar NaN
Inspired by dplyr’s mutate verb, DataFrame has an assign() method that allows you to easily create new columns
that are potentially derived from existing columns.
In [83]: iris.head()
Out[83]:
SepalLength SepalWidth PetalLength PetalWidth Name
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
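For example, a sepal_ratio column can be assigned from a precomputed Series:

iris.assign(sepal_ratio=iris["SepalWidth"] / iris["SepalLength"]).head()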
In the example above, we inserted a precomputed value. We can also pass in a function of one argument to be evaluated
on the DataFrame being assigned to.
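For example, an input along these lines produces the output shown next:

iris.assign(sepal_ratio=lambda x: x["SepalWidth"] / x["SepalLength"]).head()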
Out[85]:
SepalLength SepalWidth PetalLength PetalWidth Name sepal_ratio
0 5.1 3.5 1.4 0.2 Iris-setosa 0.686275
1 4.9 3.0 1.4 0.2 Iris-setosa 0.612245
2 4.7 3.2 1.3 0.2 Iris-setosa 0.680851
3 4.6 3.1 1.5 0.2 Iris-setosa 0.673913
4 5.0 3.6 1.4 0.2 Iris-setosa 0.720000
assign always returns a copy of the data, leaving the original DataFrame untouched.
Passing a callable, as opposed to an actual value to be inserted, is useful when you don’t have a reference to the
DataFrame at hand. This is common when using assign in a chain of operations. For example, we can limit the
DataFrame to just those observations with a Sepal Length greater than 5, calculate the ratio, and plot:
In [86]: (
....: iris.query("SepalLength > 5")
....: .assign(
....: SepalRatio=lambda x: x.SepalWidth / x.SepalLength,
....: PetalRatio=lambda x: x.PetalWidth / x.PetalLength,
....: )
....: .plot(kind="scatter", x="SepalRatio", y="PetalRatio")
....: )
....:
Out[86]: <AxesSubplot:xlabel='SepalRatio', ylabel='PetalRatio'>
Since a function is passed in, the function is computed on the DataFrame being assigned to. Importantly, this is the
DataFrame that’s been filtered to those rows with sepal length greater than 5. The filtering happens first, and then the
ratio calculations. This is an example where we didn’t have a reference to the filtered DataFrame available.
The function signature for assign is simply **kwargs. The keys are the column names for the new fields, and the
values are either a value to be inserted (for example, a Series or NumPy array), or a function of one argument to be
called on the DataFrame. A copy of the original DataFrame is returned, with the new values inserted.
Starting with Python 3.6 the order of **kwargs is preserved. This allows for dependent assignment, where an
expression later in **kwargs can refer to a column created earlier in the same assign().
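A sketch of dependent assignment (dfa here is a small hypothetical frame with columns A and B):

dfa = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
dfa.assign(C=lambda x: x["A"] + x["B"], D=lambda x: x["A"] + x["C"])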
In the second expression, x['C'] will refer to the newly created column, which is equal to dfa['A'] + dfa['B'].
Indexing / selection
Row selection, for example, returns a Series whose index is the columns of the DataFrame:
In [89]: df.loc["b"]
Out[89]:
one 2.0
bar 2.0
flag False
foo bar
one_trunc 2.0
Name: b, dtype: object
In [90]: df.iloc[2]
Out[90]:
one 3.0
bar 3.0
flag True
foo bar
one_trunc NaN
Name: c, dtype: object
For a more exhaustive treatment of sophisticated label-based indexing and slicing, see the section on indexing. We
will address the fundamentals of reindexing / conforming to new sets of labels in the section on reindexing.
Data alignment between DataFrame objects automatically align on both the columns and the index (row labels).
Again, the resulting object will have the union of the column and row labels.
In [93]: df + df2
Out[93]:
A B C D
0 0.045691 -0.014138 1.380871 NaN
1 -0.955398 -1.501007 0.037181 NaN
2 -0.662690 1.534833 -0.859691 NaN
3 -2.452949 1.237274 -0.133712 NaN
4 1.414490 1.951676 -2.320422 NaN
5 -0.494922 -1.649727 -1.084601 NaN
6 -1.047551 -0.748572 -0.805479 NaN
7 NaN NaN NaN NaN
8       NaN       NaN       NaN NaN
9       NaN       NaN       NaN NaN
When doing an operation between DataFrame and Series, the default behavior is to align the Series index on the
DataFrame columns, thus broadcasting row-wise. For example:
In [94]: df - df.iloc[0]
Out[94]:
A B C D
0 0.000000 0.000000 0.000000 0.000000
1 -1.359261 -0.248717 -0.453372 -1.754659
2 0.253128 0.829678 0.010026 -1.991234
3 -1.311128 0.054325 -1.724913 -1.620544
4 0.573025 1.500742 -0.676070 1.367331
5 -1.741248 0.781993 -1.241620 -2.053136
6 -1.240774 -0.869551 -0.153282 0.000430
7 -0.743894 0.411013 -0.929563 -0.282386
8 -1.194921 1.320690 0.238224 -1.482644
9 2.293786 1.856228 0.773289 -1.446531
For explicit control over the matching and broadcasting behavior, see the section on flexible binary operations.
Operations with scalars are just as you would expect:
In [95]: df * 5 + 2
Out[95]:
A B C D
0 3.359299 -0.124862 4.835102 3.381160
1 -3.437003 -1.368449 2.568242 -5.392133
2 4.624938 4.023526 4.885230 -6.575010
3 -3.196342 0.146766 -3.789461 -4.721559
4 6.224426 7.378849 1.454750 10.217815
5 -5.346940 3.785103 -1.373001 -6.884519
6 -2.844569 -4.472618 4.068691 3.383309
7 -0.360173 1.930201 0.187285 1.969232
8 -2.615303 6.478587 6.026220 -4.032059
9 14.828230 9.156280 8.701544 -3.851494
In [96]: 1 / df
Out[96]:
A B C D
0 3.678365 -2.353094 1.763605 3.620145
1 -0.919624 -1.484363 8.799067 -0.676395
2 1.904807 2.470934 1.732964 -0.583090
3 -0.962215 -2.697986 -0.863638 -0.743875
4 1.183593 0.929567 -9.170108 0.608434
5 -0.680555 2.800959 -1.482360 -0.562777
6 -1.032084 -0.772485 2.416988 3.614523
7 -2.118489 -71.634509 -2.758294 -162.507295
8 -1.083352 1.116424 1.241860 -0.828904
9 0.389765 0.698687 0.746097 -0.854483
In [97]: df ** 4
Out[97]:
A B C D
0 0.005462 3.261689e-02 0.103370 5.822320e-03
1 1.398165 2.059869e-01 0.000167 4.777482e+00
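Boolean operators work element-wise as well. The df1 negated below is a small boolean frame; it and a companion df2 might look like this (values are hypothetical but consistent with the output shown):

df1 = pd.DataFrame({"a": [1, 0, 1], "b": [0, 1, 1]}, dtype=bool)
df2 = pd.DataFrame({"a": [0, 1, 1], "b": [1, 1, 0]}, dtype=bool)
df1 & df2   # the operators & | ^ combine boolean frames element-wise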
In [103]: -df1
Out[103]:
a b
0 False True
1 True False
2 False False
Transposing
To transpose, access the T attribute (also the transpose function), similar to an ndarray:
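For example:

df[:5].T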
Elementwise NumPy ufuncs (log, exp, sqrt, . . . ) and various other NumPy functions can be used with no issues on
Series and DataFrame, assuming the data within are numeric:
In [105]: np.exp(df)
Out[105]:
A B C D
0 1.312403 0.653788 1.763006 1.318154
1 0.337092 0.509824 1.120358 0.227996
2 1.690438 1.498861 1.780770 0.179963
3 0.353713 0.690288 0.314148 0.260719
4 2.327710 2.932249 0.896686 5.173571
5 0.230066 1.429065 0.509360 0.169161
6 0.379495 0.274028 1.512461 1.318720
7 0.623732 0.986137 0.695904 0.993865
8 0.397301 2.449092 2.237242 0.299269
9 13.009059 4.183951 3.820223 0.310274
In [106]: np.asarray(df)
Out[106]:
array([[ 0.2719, -0.425 , 0.567 , 0.2762],
[-1.0874, -0.6737, 0.1136, -1.4784],
[ 0.525 , 0.4047, 0.577 , -1.715 ],
[-1.0393, -0.3706, -1.1579, -1.3443],
[ 0.8449, 1.0758, -0.109 , 1.6436],
[-1.4694, 0.357 , -0.6746, -1.7769],
[-0.9689, -1.2945, 0.4137, 0.2767],
[-0.472 , -0.014 , -0.3625, -0.0062],
[-0.9231, 0.8957, 0.8052, -1.2064],
[ 2.5656, 1.4313, 1.3403, -1.1703]])
DataFrame is not intended to be a drop-in replacement for ndarray as its indexing semantics and data model are quite
different in places from an n-dimensional array.
Series implements __array_ufunc__, which allows it to work with NumPy’s universal functions.
The ufunc is applied to the underlying array in a Series.
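For example, with a small integer Series like the ser used below:

ser = pd.Series([1, 2, 3, 4])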
In [108]: np.exp(ser)
Out[108]:
0 2.718282
1 7.389056
2 20.085537
3 54.598150
dtype: float64
Changed in version 0.25.0: When multiple Series are passed to a ufunc, they are aligned before performing the
operation.
Like other parts of the library, pandas will automatically align labeled inputs as part of a ufunc with multiple inputs.
For example, using numpy.remainder() on two Series with differently ordered labels will align before the
operation.
In [111]: ser1
Out[111]:
a 1
b 2
c 3
dtype: int64
In [112]: ser2
Out[112]:
b 1
a 3
c 5
dtype: int64
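Applying the ufunc aligns the two Series on their labels first, for example:

np.remainder(ser1, ser2)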
As usual, the union of the two indices is taken, and non-overlapping values are filled with missing values.
In [115]: ser3
Out[115]:
b 2
c 4
d 6
dtype: int64
When a binary ufunc is applied to a Series and Index, the Series implementation takes precedence and a Series is
returned.
NumPy ufuncs are safe to apply to Series backed by non-ndarray arrays, for example arrays.SparseArray
(see Sparse calculation). If possible, the ufunc is applied without converting the underlying data to an ndarray.
Console display
Very large DataFrames will be truncated to display them in the console. You can also get a summary using info().
(Here I am reading a CSV version of the baseball dataset from the plyr R package):
In [120]: baseball = pd.read_csv("data/baseball.csv")
In [121]: print(baseball)
       id     player  year  stint team  lg    g   ab   r    h  X2b  X3b  hr   rbi   sb   cs  bb    so  ibb  hbp   sh   sf  gidp
0   88641  womacto01  2006      2  CHN  NL   19   50   6   14    1    0   1   2.0  1.0  1.0   4   4.0  0.0  0.0  3.0  0.0   0.0
1   88643  schilcu01  2006      1  BOS  AL   31    2   0    1    0    0   0   0.0  0.0  0.0   0   1.0  0.0  0.0  0.0  0.0   0.0
..    ...        ...   ...    ...  ...  ..  ...  ...  ..  ...  ...  ...  ..   ...  ...  ...  ..   ...  ...  ...  ...  ...   ...
98  89533   aloumo01  2007      1  NYN  NL   87  328  51  112   19    1  13  49.0  3.0  0.0  27  30.0  5.0  2.0  0.0  3.0  13.0
99  89534  alomasa02  2007      1  NYN  NL    8   22   1    3    1    0   0   0.0  0.0  0.0   0   3.0  0.0  0.0  0.0  0.0   0.0
In [122]: baseball.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 100 non-null int64
1 player 100 non-null object
2 year 100 non-null int64
3 stint 100 non-null int64
4 team 100 non-null object
5 lg 100 non-null object
6 g 100 non-null int64
7 ab 100 non-null int64
8 r 100 non-null int64
9 h 100 non-null int64
10 X2b 100 non-null int64
11 X3b 100 non-null int64
12 hr 100 non-null int64
13 rbi 100 non-null float64
14 sb 100 non-null float64
15 cs 100 non-null float64
16 bb 100 non-null int64
17 so 100 non-null float64
18 ibb 100 non-null float64
19 hbp 100 non-null float64
20 sh 100 non-null float64
21 sf 100 non-null float64
22 gidp 100 non-null float64
dtypes: float64(9), int64(11), object(3)
memory usage: 18.1+ KB
However, using to_string will return a string representation of the DataFrame in tabular form, though it won't always fit the console width:
You can change how much to print on a single row by setting the display.width option:
In [125]: pd.set_option("display.width", 40) # default is 80
You can adjust the max width of the individual columns by setting display.max_colwidth
In [127]: datafile = {
.....: "filename": ["filename_01", "filename_02"],
.....: "path": [
   .....:         "media/user_name/storage/folder_01/filename_01",
   .....:         "media/user_name/storage/folder_02/filename_02",
   .....:     ],
   .....: }
   .....:
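With a small display.max_colwidth, the paths get truncated in the repr shown next (the exact value originally used is not preserved here; 30 is illustrative):

pd.set_option("display.max_colwidth", 30)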
In [129]: pd.DataFrame(datafile)
Out[129]:
filename path
0 filename_01 media/user_name/storage/fo...
1 filename_02 media/user_name/storage/fo...
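Raising the setting shows the full paths again (100 is an illustrative value):

pd.set_option("display.max_colwidth", 100)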
In [131]: pd.DataFrame(datafile)
Out[131]:
filename path
0 filename_01 media/user_name/storage/folder_01/filename_01
1 filename_02 media/user_name/storage/folder_02/filename_02
You can also disable this feature via the expand_frame_repr option. This will print the table in one block.
If a DataFrame column label is a valid Python variable name, the column can be accessed like an attribute:
In [133]: df
Out[133]:
foo1 foo2
0 1.126203 0.781836
1 -0.977349 -1.071357
2 1.474071 0.441153
3 -0.064034 2.353925
4 -1.282782 0.583787
In [134]: df.foo1
Out[134]:
0 1.126203
1 -0.977349
2 1.474071
3 -0.064034
4 -1.282782
Name: foo1, dtype: float64
The columns are also connected to the IPython completion mechanism so they can be tab-completed:
2.3 Essential basic functionality
Here we discuss a lot of the essential functionality common to the pandas data structures. To begin, let's create some
example objects like we did in the 10 minutes to pandas section:
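For example, objects along these lines (the random values will differ):

index = pd.date_range("1/1/2000", periods=8)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
long_series = pd.Series(np.random.randn(1000))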
To view a small sample of a Series or DataFrame object, use the head() and tail() methods. The default number
of elements to display is five, but you may pass a custom number.
In [5]: long_series.head()
Out[5]:
0 -1.157892
1 -1.344312
2 0.844885
3 1.075770
4 -0.109050
dtype: float64
In [6]: long_series.tail(3)
Out[6]:
997 -0.289388
998 -1.020544
999 0.589993
dtype: float64
pandas objects have a number of attributes enabling you to access the metadata
• shape: gives the axis dimensions of the object, consistent with ndarray
• Axis labels
– Series: index (only axis)
– DataFrame: index (rows) and columns
Note, these attributes can be safely assigned to!
In [7]: df[:2]
Out[7]:
A B C
2000-01-01 -0.173215 0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
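The lowercase column labels in the next display come from assigning directly to the columns attribute, e.g.:

df.columns = [x.lower() for x in df.columns]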
In [9]: df
Out[9]:
a b c
2000-01-01 -0.173215 0.119209 -1.044236
2000-01-02 -0.861849 -2.104569 -0.494929
2000-01-03 1.071804 0.721555 -0.706771
2000-01-04 -1.039575 0.271860 -0.424972
2000-01-05 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427
2000-01-07 0.524988 0.404705 0.577046
2000-01-08 -1.715002 -1.039268 -0.370647
pandas objects (Index, Series, DataFrame) can be thought of as containers for arrays, which hold the actual
data and do the actual computation. For many types, the underlying array is a numpy.ndarray. However, pandas
and 3rd party libraries may extend NumPy’s type system to add support for custom arrays (see dtypes).
To get the actual data inside a Index or Series, use the .array property
In [10]: s.array
Out[10]:
<PandasArray>
[ 0.4691122999071863, -0.2828633443286633, -1.5090585031735124,
-1.1356323710171934, 1.2121120250208506]
Length: 5, dtype: float64
In [11]: s.index.array
Out[11]:
<PandasArray>
['a', 'b', 'c', 'd', 'e']
Length: 5, dtype: object
array will always be an ExtensionArray. The exact details of what an ExtensionArray is and why pandas
uses them are a bit beyond the scope of this introduction. See dtypes for more.
If you know you need a NumPy array, use to_numpy() or numpy.asarray().
In [12]: s.to_numpy()
Out[12]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])
In [13]: np.asarray(s)
Out[13]: array([ 0.4691, -0.2829, -1.5091, -1.1356, 1.2121])
When the Series or Index is backed by an ExtensionArray, to_numpy() may involve copying data and coercing
values. See dtypes for more.
to_numpy() gives some control over the dtype of the resulting numpy.ndarray. For example, consider date-
times with timezones. NumPy doesn’t have a dtype to represent timezone-aware datetimes, so there are two possibly
useful representations:
1. An object-dtype numpy.ndarray with Timestamp objects, each with the correct tz
2. A datetime64[ns] -dtype numpy.ndarray, where the values have been converted to UTC and the time-
zone discarded
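Consider a Series of timezone-aware timestamps such as the one below (a sketch matching the output that follows):

ser = pd.Series(pd.date_range("2000", periods=2, tz="CET"))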
Timezones may be preserved with dtype=object
In [15]: ser.to_numpy(dtype=object)
Out[15]:
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
dtype=object)
In [16]: ser.to_numpy(dtype="datetime64[ns]")
Out[16]:
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00.000000000'],
dtype='datetime64[ns]')
Getting the “raw data” inside a DataFrame is possibly a bit more complex. When your DataFrame only has a
single data type for all the columns, DataFrame.to_numpy() will return the underlying data:
In [17]: df.to_numpy()
Out[17]:
array([[-0.1732, 0.1192, -1.0442],
[-0.8618, -2.1046, -0.4949],
[ 1.0718, 0.7216, -0.7068],
[-1.0396, 0.2719, -0.425 ],
[ 0.567 , 0.2762, -1.0874],
[-0.6737, 0.1136, -1.4784],
[ 0.525 , 0.4047, 0.577 ],
[-1.715 , -1.0393, -0.3706]])
If a DataFrame contains homogeneously-typed data, the ndarray can actually be modified in-place, and the changes
will be reflected in the data structure. For heterogeneous data (e.g. some of the DataFrame’s columns are not all the
same dtype), this will not be the case. The values attribute itself, unlike the axis labels, cannot be assigned to.
Note: When working with heterogeneous data, the dtype of the resulting ndarray will be chosen to accommodate all
of the data involved. For example, if strings are involved, the result will be of object dtype. If there are only floats and
integers, the resulting array will be of float dtype.
In the past, pandas recommended Series.values or DataFrame.values for extracting the data from a Series
or DataFrame. You’ll still find references to these in old code bases and online. Going forward, we recommend
avoiding .values and using .array or .to_numpy(). .values has the following drawbacks:
1. When your Series contains an extension type, it’s unclear whether Series.values returns a NumPy array
or the extension array. Series.array will always return an ExtensionArray, and will never copy data.
Series.to_numpy() will always return a NumPy array, potentially at the cost of copying / coercing values.
2. When your DataFrame contains a mixture of data types, DataFrame.values may involve copying data and
coercing values to a common dtype, a relatively expensive operation. DataFrame.to_numpy(), being a
method, makes it clearer that the returned NumPy array may not be a view on the same data in the DataFrame.
pandas has support for accelerating certain types of binary numerical and boolean operations using the numexpr
library and the bottleneck libraries.
These libraries are especially useful when dealing with large data sets, and provide large speedups. numexpr uses
smart chunking, caching, and multiple cores. bottleneck is a set of specialized cython routines that are especially
fast when dealing with arrays that have nans.
Here is a sample (using 100 column x 100,000 row DataFrames):
You are highly encouraged to install both libraries. See the section Recommended Dependencies for more installation
info.
These are both enabled to be used by default; you can control this by setting the options:
pd.set_option("compute.use_bottleneck", False)
pd.set_option("compute.use_numexpr", False)
With binary operations between pandas data structures, there are two key points of interest:
• Broadcasting behavior between higher- (e.g. DataFrame) and lower-dimensional (e.g. Series) objects.
• Missing data in computations.
We will demonstrate how to manage these issues independently, though they can be handled simultaneously.
DataFrame has the methods add(), sub(), mul(), div() and related functions radd(), rsub(), . . . for
carrying out binary operations. For broadcasting behavior, Series input is of primary interest. Using these functions,
you can either match on the index or columns via the axis keyword:
In [18]: df = pd.DataFrame(
....: {
....: "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
....: "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
....: "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
....: }
....: )
....:
In [19]: df
Out[19]:
one two three
a 1.394981 1.772517 NaN
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d       NaN  0.279344 -0.613172
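For example, with this frame (a sketch of the typical calls):

row = df.iloc[1]
column = df["two"]
df.sub(row, axis="columns")   # equivalently axis=1: match the Series labels on the columns
df.sub(column, axis="index")  # equivalently axis=0: match the Series labels on the index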
Series and Index also support the divmod() builtin. This function takes the floor division and modulo operation at
the same time returning a two-tuple of the same type as the left hand side. For example:
In [29]: s = pd.Series(np.arange(10))
In [30]: s
Out[30]:
0 0
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
dtype: int64
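The div and rem shown below come from a single divmod call, e.g.:

div, rem = divmod(s, 3)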
In [32]: div
Out[32]:
0 0
1 0
2 0
3 1
4 1
5 1
6 2
7 2
8 2
9 3
dtype: int64
In [33]: rem
Out[33]:
0 0
1 1
2 2
3 0
4 1
5 2
6 0
7 1
8 2
9 0
dtype: int64
In [35]: idx
Out[35]: Int64Index([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype='int64')
In [37]: div
Out[37]: Int64Index([0, 0, 0, 1, 1, 1, 2, 2, 2, 3], dtype='int64')
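Elementwise divmod with another array-like also works, which is what produces the div and rem shown next:

div, rem = divmod(s, [2, 2, 3, 3, 4, 4, 5, 5, 6, 6])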
In [40]: div
Out[40]:
0 0
1 0
2 0
3 1
4 1
5 1
6 1
7 1
8 1
9 1
dtype: int64
In [41]: rem
Out[41]:
0 0
1 1
2 2
3 0
4 0
5 1
6 1
7 2
8 2
9 3
dtype: int64
In Series and DataFrame, the arithmetic functions accept a fill_value option, namely a value to substitute
when at most one of the values at a location is missing. For example, when adding two DataFrame objects, you may
wish to treat NaN as 0 unless both DataFrames are missing that value, in which case the result will be NaN (you can
later replace NaN with some other value using fillna if you wish).
In [42]: df
Out[42]:
one two three
a 1.394981 1.772517 NaN
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d NaN 0.279344 -0.613172
In [43]: df2
Out[43]:
one two three
a 1.394981 1.772517 1.000000
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d NaN 0.279344 -0.613172
In [44]: df + df2
Out[44]:
one two three
a 2.789963 3.545034 NaN
b 0.686107 3.824246 -0.100780
c 1.390491 2.956737 2.454870
d NaN 0.558688 -1.226343
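With fill_value, the same addition treats a value that is missing in only one of the operands as 0:

df.add(df2, fill_value=0)
# (a, "three") becomes 1.0, since it is missing only in df;
# (d, "one") stays NaN because it is missing in both frames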
Flexible comparisons
Series and DataFrame have the binary comparison methods eq, ne, lt, gt, le, and ge whose behavior is analogous
to the binary arithmetic operations described above:
In [46]: df.gt(df2)
Out[46]:
one two three
a False False False
b False False False
c False False False
d False False False
In [47]: df2.ne(df)
Out[47]:
one two three
a False False True
b False False False
c False False False
d True False False
These operations produce a pandas object of the same type as the left-hand-side input that is of dtype bool. These
boolean objects can be used in indexing operations, see the section on Boolean indexing.
Boolean reductions
You can apply the reductions: empty, any(), all(), and bool() to provide a way to summarize a boolean result.
In [48]: (df > 0).all()
Out[48]:
one False
two True
three False
dtype: bool
You can test if a pandas object is empty, via the empty property.
In [51]: df.empty
Out[51]: False
In [52]: pd.DataFrame(columns=list("ABC")).empty
Out[52]: True
To evaluate single-element pandas objects in a boolean context, use the method bool():
In [53]: pd.Series([True]).bool()
Out[53]: True
In [54]: pd.Series([False]).bool()
Out[54]: False
In [55]: pd.DataFrame([[True]]).bool()
Out[55]: True
In [56]: pd.DataFrame([[False]]).bool()
Out[56]: False
The following will raise an error, because you are trying to evaluate an object with more than one value in a boolean context:
>>> df and df2
ValueError: The truth value of an array is ambiguous. Use a.empty, a.any() or a.all().
Often you may find that there is more than one way to compute the same result. As a simple example, consider df
+ df and df * 2. To test that these two computations produce the same result, given the tools shown above, you
might imagine using (df + df == df * 2).all(). But in fact, this expression is False:
In [57]: df + df == df * 2
Out[57]:
one two three
a True True False
b True True True
c True True True
d False True True
Notice that the boolean DataFrame df + df == df * 2 contains some False values! This is because NaNs do
not compare as equals:
In [59]: np.nan == np.nan
Out[59]: False
So, NDFrames (such as Series and DataFrames) have an equals() method for testing equality, with NaNs in corre-
sponding locations treated as equal.
In [60]: (df + df).equals(df * 2)
Out[60]: True
Note that the Series or DataFrame index needs to be in the same order for equality to be True:
In [61]: df1 = pd.DataFrame({"col": ["foo", 0, np.nan]})
In [62]: df2 = pd.DataFrame({"col": [np.nan, 0, "foo"]}, index=[2, 1, 0])

In [63]: df1.equals(df2)
Out[63]: False
In [64]: df1.equals(df2.sort_index())
Out[64]: True
You can conveniently perform element-wise comparisons when comparing a pandas data structure with a scalar value:
In [65]: pd.Series(["foo", "bar", "baz"]) == "foo"
Out[65]:
0 True
1 False
2 False
dtype: bool
pandas also handles element-wise comparisons between different array-like objects of the same length. Trying to
compare Index or Series objects of different lengths will, however, raise a ValueError; note that this is different from
the NumPy behavior, where a comparison can be broadcast. For example:
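A short illustration of these points:

pd.Series(["foo", "bar", "baz"]) == pd.Index(["foo", "bar", "qux"])   # elementwise, same length
pd.Series(["foo", "bar", "baz"]) == np.array(["foo", "bar", "qux"])   # also works against a NumPy array

# Comparing pandas objects of different lengths raises:
# pd.Series(["foo", "bar", "baz"]) == pd.Series(["foo"])  ->  ValueError

# whereas NumPy broadcasts a length-1 operand:
np.array([1, 2, 3]) == np.array([2])   # array([False,  True, False])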
A problem occasionally arising is the combination of two similar data sets where values in one are preferred over the
other. An example would be two data series representing a particular economic indicator where one is considered to
be of “higher quality”. However, the lower quality series might extend further back in history or have more complete
data coverage. As such, we would like to combine two DataFrame objects where missing values in one DataFrame
are conditionally filled with like-labeled values from the other DataFrame. The function implementing this operation
is combine_first(), which we illustrate:
....: )
....:
In [73]: df1
Out[73]:
A B
0 1.0 NaN
1 NaN 2.0
2 3.0 3.0
3 5.0 NaN
4 NaN 6.0
In [74]: df2
Out[74]:
A B
0 5.0 NaN
1 2.0 NaN
2 4.0 3.0
3 NaN 4.0
4 3.0 6.0
5 7.0 8.0
In [75]: df1.combine_first(df2)
Out[75]:
A B
0 1.0 NaN
1 2.0 2.0
2 3.0 3.0
3 5.0 4.0
4 3.0 6.0
5 7.0 8.0
The combine_first() method above calls the more general DataFrame.combine(). This method takes
another DataFrame and a combiner function, aligns the input DataFrame and then passes the combiner function pairs
of Series (i.e., columns whose names are the same).
So, for instance, to reproduce combine_first() as above:
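A combiner that takes values from the calling frame and falls back to the other where they are missing could look like this (a minimal sketch):

def combiner(x, y):
    # prefer x; use y only where x is missing
    return np.where(pd.isna(x), y, x)

df1.combine(df2, combiner)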
There exists a large number of methods for computing descriptive statistics and other related operations on Series and
DataFrame. Most of these are aggregations (hence producing a lower-dimensional result) like sum(), mean(), and
quantile(), but some of them, like cumsum() and cumprod(), produce an object of the same size. Generally
speaking, these methods take an axis argument, just like ndarray.{sum, std, . . . }, but the axis can be specified by name
or integer:
• Series: no axis argument needed
• DataFrame: “index” (axis=0, default), “columns” (axis=1)
For example:
In [78]: df
Out[78]:
one two three
a 1.394981 1.772517 NaN
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d NaN 0.279344 -0.613172
In [79]: df.mean(0)
Out[79]:
one 0.811094
two 1.360588
three 0.187958
dtype: float64
In [80]: df.mean(1)
Out[80]:
a 1.583749
b 0.734929
c 1.133683
d -0.166914
dtype: float64
All such methods have a skipna option signaling whether to exclude missing data (True by default):
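For example:

df.sum(0, skipna=False)       # a column containing NaN sums to NaN
df.sum(axis=1, skipna=True)   # the default: missing values are excluded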
Combined with the broadcasting / arithmetic behavior, one can describe various statistical procedures, like standard-
ization (rendering data zero mean and standard deviation of 1), very concisely:
In [83]: ts_stand = (df - df.mean()) / df.std()

In [84]: ts_stand.std()
Out[84]:
one 1.0
two 1.0
three 1.0
dtype: float64
In [85]: xs_stand = df.sub(df.mean(1), axis=0).div(df.std(1), axis=0)

In [86]: xs_stand.std(1)
Out[86]:
a 1.0
b 1.0
c 1.0
d 1.0
dtype: float64
Note that methods like cumsum() and cumprod() preserve the location of NaN values. This is somewhat different
from expanding() and rolling() since NaN behavior is furthermore dictated by a min_periods parameter.
In [87]: df.cumsum()
Out[87]:
one two three
a 1.394981 1.772517 NaN
b 1.738035 3.684640 -0.050390
c 2.433281 5.163008 1.177045
d NaN 5.442353 0.563873
Here is a quick reference summary table of common functions. Each also takes an optional level parameter which
applies only if the object has a hierarchical index.
Function Description
count Number of non-NA observations
sum Sum of values
mean Mean of values
mad Mean absolute deviation
median Arithmetic median of values
min Minimum
max Maximum
mode Mode
abs Absolute Value
prod Product of values
std Bessel-corrected sample standard deviation
var Unbiased variance
sem Standard error of the mean
skew Sample skewness (3rd moment)
kurt Sample kurtosis (4th moment)
quantile Sample quantile (value at %)
cumsum Cumulative sum
cumprod Cumulative product
cummax Cumulative maximum
cummin Cumulative minimum
Note that by chance some NumPy methods, like mean, std, and sum, will exclude NAs on Series input by default:
In [88]: np.mean(df["one"])
Out[88]: 0.8110935116651192
In [89]: np.mean(df["one"].to_numpy())
Out[89]: nan
Series.nunique() will return the number of unique non-NA values in a Series:

In [92]: series[10:20] = 5
In [93]: series.nunique()
Out[93]: 11
There is a convenient describe() function which computes a variety of summary statistics about a Series or the
columns of a DataFrame (excluding NAs of course):
In [96]: series.describe()
Out[96]:
count 500.000000
mean -0.021292
std 1.015906
min -2.683763
25% -0.699070
50% -0.069718
75% 0.714483
max 3.160915
dtype: float64
In [99]: frame.describe()
Out[99]:
a b c d e
count 500.000000 500.000000 500.000000 500.000000 500.000000
mean 0.033387 0.030045 -0.043719 -0.051686 0.005979
std 1.017152 0.978743 1.025270 1.015988 1.006695
min -3.000951 -2.637901 -3.303099 -3.159200 -3.188821
25% -0.647623 -0.576449 -0.712369 -0.691338 -0.691115
50% 0.047578 -0.021499 -0.023888 -0.032652 -0.025363
75% 0.729907 0.775880 0.618896 0.670047 0.649748
max 2.740139 2.752332 3.004229 2.728702 3.240991
In [101]: s = pd.Series(["a", "a", "b", "b", "a", "a", np.nan, "c", "d", "a"])
In [102]: s.describe()
Out[102]:
count 9
unique 4
top a
freq 5
dtype: object
Note that on a mixed-type DataFrame object, describe() will restrict the summary to include only numerical
columns or, if none are, only categorical columns:
In [104]: frame.describe()
Out[104]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
This behavior can be controlled by providing a list of types as include/exclude arguments. The special value
all can also be used:
In [105]: frame.describe(include=["object"])
Out[105]:
a
count 4
unique 2
top No
freq 2
In [106]: frame.describe(include=["number"])
Out[106]:
b
count 4.000000
mean 1.500000
std 1.290994
min 0.000000
25% 0.750000
50% 1.500000
75% 2.250000
max 3.000000
In [107]: frame.describe(include="all")
Out[107]:
a b
count 4 4.000000
unique 2 NaN
top No NaN
freq 2 NaN
mean NaN 1.500000
std NaN 1.290994
min NaN 0.000000
25% NaN 0.750000
50% NaN 1.500000
75% NaN 2.250000
max NaN 3.000000
That feature relies on select_dtypes. Refer to that section for details about accepted inputs.
The idxmin() and idxmax() functions on Series and DataFrame compute the index labels with the minimum and
maximum corresponding values:
In [108]: s1 = pd.Series(np.random.randn(5))
In [109]: s1
Out[109]:
0 1.118076
1 -0.352051
2 -1.242883
3 -1.277155
4 -0.641184
dtype: float64
In [112]: df1
Out[112]:
A B C
0 -0.327863 -0.946180 -0.137570
In [113]: df1.idxmin(axis=0)
Out[113]:
A 2
B 3
C 1
dtype: int64
In [114]: df1.idxmax(axis=1)
Out[114]:
0 C
1 A
2 C
3 A
4 C
dtype: object
When there are multiple rows (or columns) matching the minimum or maximum value, idxmin() and idxmax()
return the first matching index:
In [116]: df3
Out[116]:
A
e 2.0
d 1.0
c 1.0
b 3.0
a NaN
In [117]: df3["A"].idxmin()
Out[117]: 'd'
Note: idxmin and idxmax are called argmin and argmax in NumPy.
The value_counts() Series method and top-level function computes a histogram of a 1D array of values. It can
also be used as a function on regular arrays:
In [118]: data = np.random.randint(0, 7, size=50)

In [119]: data
Out[119]:
array([6, 6, 2, 3, 5, 3, 2, 5, 4, 5, 4, 3, 4, 5, 0, 2, 0, 4, 2, 0, 3, 2,
2, 5, 6, 5, 3, 4, 6, 4, 3, 5, 6, 4, 3, 6, 2, 6, 6, 2, 3, 4, 2, 1,
6, 2, 6, 1, 5, 4])
In [120]: s = pd.Series(data)

In [121]: s.value_counts()
Out[121]:
2 10
6 10
4 9
3 8
5 8
0 3
1 2
dtype: int64
In [122]: pd.value_counts(data)
Out[122]:
2 10
6 10
4 9
3 8
5 8
0 3
1 2
dtype: int64
The value_counts() method can also be used on a DataFrame to count combinations across multiple columns:

In [125]: frame.value_counts()
Out[125]:
a b
1 x 1
2 x 1
3 y 1
4 y 1
dtype: int64
Similarly, you can get the most frequently occurring value(s), i.e. the mode, of the values in a Series or DataFrame:
In [126]: s5 = pd.Series([1, 1, 3, 3, 3, 5, 5, 7, 7, 7])
In [127]: s5.mode()
Out[127]:
0 3
1 7
dtype: int64
In [129]: df5.mode()
Out[129]:
A B
0 1.0 -9
1 NaN 10
2 NaN 13
Continuous values can be discretized using the cut() (bins based on values) and qcut() (bins based on sample
quantiles) functions:
In [130]: arr = np.random.randn(20)
In [131]: factor = pd.cut(arr, 4)

In [132]: factor
Out[132]:
[(-0.251, 0.464], (-0.968, -0.251], (0.464, 1.179], (-0.251, 0.464], (-0.968, -0.251], ..., (-0.251, 0.464], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251], (-0.968, -0.251]]
Length: 20
Categories (4, interval[float64]): [(-0.968, -0.251] < (-0.251, 0.464] < (0.464, 1.179] < (1.179, 1.893]]
In [133]: factor = pd.cut(arr, [-5, -1, 0, 1, 5])

In [134]: factor
Out[134]:
[(0, 1], (-1, 0], (0, 1], (0, 1], (-1, 0], ..., (-1, 0], (-1, 0], (-1, 0], (-1, 0], (-1, 0]]
Length: 20
Categories (4, interval[int64]): [(-5, -1] < (-1, 0] < (0, 1] < (1, 5]]
qcut() computes sample quantiles. For example, we could slice up some normally distributed data into equal-size
quartiles like so:
In [135]: arr = np.random.randn(30)
In [136]: factor = pd.qcut(arr, [0, 0.25, 0.5, 0.75, 1])

In [137]: factor
Out[137]:
[(0.569, 1.184], (-2.278, -0.301], (-2.278, -0.301], (0.569, 1.184], (0.569, 1.184], ..., (-0.301, 0.569], (1.184, 2.346], (1.184, 2.346], (-0.301, 0.569], (-2.278, -0.301]]
Length: 30
Categories (4, interval[float64]): [(-2.278, -0.301] < (-0.301, 0.569] < (0.569, 1.184] < (1.184, 2.346]]
We can also pass infinite values to define the bins:

In [139]: arr = np.random.randn(20)

In [140]: factor = pd.cut(arr, [-np.inf, 0, np.inf])

In [141]: factor
Out[141]:
[(-inf, 0.0], (0.0, inf], (0.0, inf], (-inf, 0.0], (-inf, 0.0], ..., (-inf, 0.0], (-inf, 0.0], (-inf, 0.0], (0.0, inf], (0.0, inf]]
Length: 20
Categories (2, interval[float64]): [(-inf, 0.0] < (0.0, inf]]
To apply your own or another library's functions to pandas objects, you should be aware of the methods below.
The appropriate method to use depends on whether your function expects to operate on an entire DataFrame or
Series, row- or column-wise, or elementwise.
1. Tablewise Function Application: pipe()
2. Row or Column-wise Function Application: apply()
3. Aggregation API: agg() and transform()
4. Applying Elementwise Functions: applymap()
DataFrames and Series can be passed into functions. However, if the function needs to be called in a chain,
consider using the pipe() method. Suppose you have functions that each take a DataFrame as input and return a
DataFrame; the nested calling style and the chained pipe() style shown in the sketch below are equivalent.
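A minimal sketch (the frame df_p, the column names, and the function bodies are illustrative placeholders; each function takes and returns a DataFrame):

def extract_city_name(df):
    # split "Chicago, IL" -> "Chicago" into a new city_name column
    df["city_name"] = df["city_and_code"].str.split(",").str.get(0)
    return df

def add_country_name(df, country_name=None):
    # append the country code to the extracted city name
    df["city_and_country"] = df["city_name"] + country_name
    return df

df_p = pd.DataFrame({"city_and_code": ["Chicago, IL"]})

# nested calls
add_country_name(extract_city_name(df_p), country_name="US")

# the same pipeline written as a chain with pipe()
df_p.pipe(extract_city_name).pipe(add_country_name, country_name="US")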
pandas encourages the second style, which is known as method chaining. pipe makes it easy to use your own or
another library’s functions in method chains, alongside pandas’ methods.
In the example above, the functions extract_city_name and add_country_name each expected a
DataFrame as the first positional argument. What if the function you wish to apply takes its data as, say, the
second argument? In this case, provide pipe with a tuple of (callable, data_keyword). .pipe will route
the DataFrame to the argument specified in the tuple.
For example, we can fit a regression using statsmodels. Their API expects a formula first and a DataFrame as the
second argument, data. We pass in the function, keyword pair (sm.ols, 'data') to pipe:
In [149]: (
.....: bb.query("h > 0")
.....: .assign(ln_h=lambda df: np.log(df.h))
.....: .pipe((sm.ols, "data"), "hr ~ ln_h + year + g + C(lg)")
.....: .fit()
.....: .summary()
.....: )
.....:
Out[149]:
<class 'statsmodels.iolib.summary.Summary'>
"""
OLS Regression Results
==============================================================================
Dep. Variable: hr R-squared: 0.685
Model: OLS Adj. R-squared: 0.665
Method: Least Squares F-statistic: 34.28
Date: Sat, 26 Dec 2020 Prob (F-statistic): 3.48e-15
Time: 14:28:19 Log-Likelihood: -205.92
No. Observations: 68 AIC: 421.8
Df Residuals: 63 BIC: 432.9
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.49e+07. This might indicate that there are
strong multicollinearity or other numerical problems.
"""
The pipe method is inspired by unix pipes and more recently dplyr and magrittr, which have introduced the popular
(%>%) (read pipe) operator for R. The implementation of pipe here is quite clean and feels right at home in Python.
We encourage you to view the source code of pipe().
Arbitrary functions can be applied along the axes of a DataFrame using the apply() method, which, like the de-
scriptive statistics methods, takes an optional axis argument:
In [150]: df.apply(np.mean)
Out[150]:
one 0.811094
two 1.360588
three 0.187958
dtype: float64
In [154]: df.apply(np.exp)
Out[154]:
one two three
a 4.034899 5.885648 NaN
b 1.409244 6.767440 0.950858
c 2.004201 4.385785 3.412466
d NaN 1.322262 0.541630
The return type of the function passed to apply() affects the type of the final output from DataFrame.apply for
the default behaviour:
• If the applied function returns a Series, the final output is a DataFrame. The columns match the index of
the Series returned by the applied function.
• If the applied function returns any other type, the final output is a Series.
This default behaviour can be overridden using result_type, which accepts three options: reduce,
broadcast, and expand. These determine how list-like return values expand (or not) into a DataFrame.
apply() combined with some cleverness can be used to answer many questions about a data set. For example,
suppose we wanted to extract the date where the maximum value for each column occurred:
In [157]: tsdf = pd.DataFrame(
.....: np.random.randn(1000, 3),
.....: columns=["A", "B", "C"],
.....: index=pd.date_range("1/1/2000", periods=1000),
.....: )
.....:
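The question can then be answered with a lambda applied to each column:

tsdf.apply(lambda x: x.idxmax())   # the index label (here, the date) of each column's maximum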
You may also pass additional arguments and keyword arguments to the apply() method. For instance, consider the
following function you would like to apply:
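A sketch of such a function (the name and arguments are illustrative) and how extra positional and keyword arguments are routed through apply():

def subtract_and_divide(x, sub, divide=1):
    return (x - sub) / divide

# positional arguments go through args=, keyword arguments are passed directly
df.apply(subtract_and_divide, args=(5,), divide=3)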
Another useful feature is the ability to pass Series methods to carry out some Series operation on each column or row:
In [159]: tsdf
Out[159]:
A B C
2000-01-01 -0.158131 -0.232466 0.321604
2000-01-02 -1.810340 -3.105758 0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 -0.653602 0.178875 1.008298
2000-01-09 1.007996 0.462824 0.254472
2000-01-10 0.307473 0.600337 1.643950
In [160]: tsdf.apply(pd.Series.interpolate)
Out[160]:
A B C
2000-01-01 -0.158131 -0.232466 0.321604
2000-01-02 -1.810340 -3.105758 0.433834
2000-01-03 -1.209847 -1.156793 -0.136794
2000-01-04 -1.098598 -0.889659 0.092225
2000-01-05 -0.987349 -0.622526 0.321243
2000-01-06 -0.876100 -0.355392 0.550262
2000-01-07 -0.764851 -0.088259 0.779280
2000-01-08 -0.653602 0.178875 1.008298
2000-01-09 1.007996 0.462824 0.254472
2000-01-10 0.307473 0.600337 1.643950
Finally, apply() takes an argument raw which is False by default, which converts each row or column into a Series
before applying the function. When set to True, the passed function will instead receive an ndarray object, which has
positive performance implications if you do not need the indexing functionality.
Aggregation API
The aggregation API allows one to express possibly multiple aggregation operations in a single concise way. This
API is similar across pandas objects, see groupby API, the window API, and the resample API. The entry point for
aggregation is DataFrame.aggregate(), or the alias DataFrame.agg().
We will use a similar starting frame from above:
In [163]: tsdf
Out[163]:
A B C
2000-01-01 1.257606 1.004194 0.167574
2000-01-02 -0.749892 0.288112 -0.757304
2000-01-03 -0.207550 -0.298599 0.116018
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.814347 -0.257623 0.869226
2000-01-09 -0.250663 -1.206601 0.896839
2000-01-10 2.169758 -1.333363 0.283157
Using a single function is equivalent to apply(). You can also pass named methods as strings. These will return a
Series of the aggregated output:
In [164]: tsdf.agg(np.sum)
Out[164]:
A 3.033606
B -1.803879
C 1.575510
dtype: float64
In [165]: tsdf.agg("sum")
Out[165]:
A 3.033606
B -1.803879
C 1.575510
dtype: float64
In [167]: tsdf["A"].agg("sum")
Out[167]: 3.033606102414146
You can pass multiple aggregation arguments as a list. The results of each of the passed functions will be a row in the
resulting DataFrame. These are naturally named from the aggregation function.
In [168]: tsdf.agg(["sum"])
Out[168]:
A B C
sum 3.033606 -1.803879 1.57551
Passing a named function will yield that name for the row:
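For example (the helper name is illustrative):

def mymean(x):
    return x.mean()

tsdf.agg(["sum", mymean])   # rows labelled "sum" and "mymean"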
Passing a dictionary of column names to a scalar or a list of scalars to DataFrame.agg allows you to customize
which functions are applied to which columns. Note that the results are not in any particular order; you can use an
OrderedDict instead to guarantee ordering.
Passing a list-like will generate a DataFrame output. You will get a matrix-like output of all of the aggregators. The
output will consist of all unique functions. Those that are not noted for a particular column will be NaN:
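For example:

tsdf.agg({"A": "mean", "B": "sum"})            # per-column functions, returns a Series
tsdf.agg({"A": ["mean", "min"], "B": "sum"})   # mixing lists and scalars returns a DataFrame,
                                               # with NaN where a function was not requested for a column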
Mixed dtypes
When presented with mixed dtypes that cannot aggregate, .agg will only take the valid aggregations. This is similar
to how .groupby.agg works.
In [177]: mdf.dtypes
Out[177]:
A int64
B float64
C object
D datetime64[ns]
dtype: object
Custom describe
With .agg() it is possible to easily create a custom describe function, similar to the built-in describe function.
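One possible sketch, using functools.partial to build named quantile helpers:

from functools import partial

q_25 = partial(pd.Series.quantile, q=0.25)
q_25.__name__ = "25%"
q_75 = partial(pd.Series.quantile, q=0.75)
q_75.__name__ = "75%"

tsdf.agg(["count", "mean", "std", "min", q_25, "median", q_75, "max"])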
Transform API
The transform() method returns an object that is indexed the same (same size) as the original. This API allows
you to provide multiple operations at the same time rather than one-by-one. Its API is quite similar to the .agg API.
We create a frame similar to the one used in the above sections.
In [187]: tsdf
Out[187]:
A B C
2000-01-01 -0.428759 -0.864890 -0.675341
2000-01-02 -0.168731 1.338144 -1.279321
2000-01-03 -1.621034 0.438107 0.903794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 -1.240447 -0.201052
2000-01-09 -0.157795 0.791197 -1.144209
2000-01-10 -0.030876 0.371900 0.061932
Transform the entire frame. .transform() allows input functions as: a NumPy function, a string function name or
a user defined function.
In [188]: tsdf.transform(np.abs)
Out[188]:
A B C
2000-01-01 0.428759 0.864890 0.675341
2000-01-02 0.168731 1.338144 1.279321
2000-01-03 1.621034 0.438107 0.903794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 1.240447 0.201052
2000-01-09 0.157795 0.791197 1.144209
2000-01-10 0.030876 0.371900 0.061932
In [189]: tsdf.transform("abs")
Out[189]:
A B C
2000-01-01 0.428759 0.864890 0.675341
2000-01-02 0.168731 1.338144 1.279321
2000-01-03 1.621034 0.438107 0.903794
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 1.240447 0.201052
2000-01-09 0.157795 0.791197 1.144209
2000-01-10 0.030876 0.371900 0.061932
Passing a single function to .transform() with a Series will yield a single Series in return.
In [192]: tsdf["A"].transform(np.abs)
Out[192]:
2000-01-01 0.428759
2000-01-02 0.168731
2000-01-03 1.621034
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 NaN
2000-01-07 NaN
2000-01-08 0.254374
2000-01-09 0.157795
2000-01-10 0.030876
Freq: D, Name: A, dtype: float64
Passing multiple functions will yield a column MultiIndexed DataFrame. The first level will be the original frame
column names; the second level will be the names of the transforming functions.
Passing multiple functions to a Series will yield a DataFrame. The resulting column names will be the transforming
functions.
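For example:

tsdf.transform([np.abs, lambda x: x + 1])        # MultiIndex columns, one level per original column
tsdf["A"].transform([np.abs, lambda x: x + 1])   # a DataFrame with one column per function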
Passing a dict of lists will generate a MultiIndexed DataFrame with these selective transforms.
In [196]: tsdf.transform({"A": np.abs, "B": [lambda x: x + 1, "sqrt"]})
Out[196]:
A B
A <lambda> sqrt
2000-01-01 0.428759 0.135110 NaN
2000-01-02 0.168731 2.338144 1.156782
2000-01-03 1.621034 1.438107 0.661897
2000-01-04 NaN NaN NaN
2000-01-05 NaN NaN NaN
2000-01-06 NaN NaN NaN
2000-01-07 NaN NaN NaN
2000-01-08 0.254374 -0.240447 NaN
2000-01-09 0.157795 1.791197 0.889493
2000-01-10 0.030876 1.371900 0.609836
Since not all functions can be vectorized (accept NumPy arrays and return another array or value), the methods
applymap() on DataFrame and analogously map() on Series accept any Python function taking a single value and
returning a single value. For example:
In [197]: df4
Out[197]:
one two three
a 1.394981 1.772517 NaN
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d NaN 0.279344 -0.613172
In [198]: def f(x):
   .....:     return len(str(x))
   .....:

In [199]: df4["one"].map(f)
Out[199]:
a 18
b 19
c 18
d 3
Name: one, dtype: int64
In [200]: df4.applymap(f)
Out[200]:
one two three
a 18 17 3
b 19 18 20
c 18 18 16
d 3 19 19
Series.map() has an additional feature; it can be used to easily “link” or “map” values defined by a secondary
series. This is closely related to merging/joining functionality:
In [201]: s = pd.Series(
.....: ["six", "seven", "six", "seven", "six"], index=["a", "b", "c", "d", "e"]
.....: )
.....:
In [203]: s
Out[203]:
a six
b seven
c six
d seven
e six
dtype: object
In [202]: t = pd.Series({"six": 6.0, "seven": 7.0})

In [204]: s.map(t)
Out[204]:
a 6.0
b 7.0
c 6.0
d 7.0
e 6.0
dtype: float64
reindex() is the fundamental data alignment method in pandas. It is used to implement nearly all other features
relying on label-alignment functionality. To reindex means to conform the data to match a given set of labels along a
particular axis. This accomplishes several things:
• Reorders the existing data to match a new set of labels
• Inserts missing value (NA) markers in label locations where no data for that label existed
• If specified, fill data for missing labels using logic (highly relevant to working with time series data)
Here is a simple example:
In [206]: s
Out[206]:
a 1.695148
b 1.328614
c 1.234686
d -0.385845
e -1.326508
dtype: float64
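The reindexing call itself could look like the following (the exact label list is illustrative; it includes the new label f):

s.reindex(["e", "b", "f", "d"])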
Here, the f label was not contained in the Series and hence appears as NaN in the result.
With a DataFrame, you can simultaneously reindex the index and columns:
In [208]: df
Out[208]:
one two three
a 1.394981 1.772517 NaN
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d NaN 0.279344 -0.613172
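For example (the label lists are illustrative):

df.reindex(index=["c", "f", "b"], columns=["three", "two", "one"])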
Note that the Index objects containing the actual axis labels can be shared between objects. So if we have a Series
and a DataFrame, the following can be done:
In [211]: rs = s.reindex(df.index)
In [212]: rs
Out[212]:
a 1.695148
b 1.328614
c 1.234686
d -0.385845
dtype: float64

In [213]: rs.index is df.index
Out[213]: True
This means that the reindexed Series’s index is the same Python object as the DataFrame’s index.
DataFrame.reindex() also supports an “axis-style” calling convention, where you specify a single labels
argument and the axis it applies to.
In [214]: df.reindex(["c", "f", "b"], axis="index")
Out[214]:
one two three
c 0.695246 1.478369 1.227435
f NaN NaN NaN
b 0.343054 1.912123 -0.050390
See also:
MultiIndex / Advanced Indexing is an even more concise way of doing reindexing.
Note: When writing performance-sensitive code, there is a good reason to spend some time becoming a reindexing
ninja: many operations are faster on pre-aligned data. Adding two unaligned DataFrames internally triggers a
reindexing step. For exploratory analysis you will hardly notice the difference (because reindex has been heavily
optimized), but when CPU cycles matter sprinkling a few explicit reindex calls here and there can have an impact.
You may wish to take an object and reindex its axes to be labeled the same as another object. While the syntax for this
is straightforward albeit verbose, it is a common enough operation that the reindex_like() method is available
to make this simpler:
In [216]: df2
Out[216]:
one two
a 1.394981 1.772517
b 0.343054 1.912123
c 0.695246 1.478369
In [217]: df3
Out[217]:
one two
a 0.583888 0.051514
b -0.468040 0.191120
c -0.115848 -0.242634
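Reindexing df to share df2's (smaller) set of labels is then simply:

df.reindex_like(df2)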
The align() method is the fastest way to simultaneously align two objects. It supports a join argument (related to
joining and merging):
• join='outer': take the union of the indexes (default)
• join='left': use the calling object’s index
• join='right': use the passed object’s index
• join='inner': intersect the indexes
It returns a tuple with both of the reindexed Series:
In [219]: s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
In [220]: s1 = s[:4]
In [221]: s2 = s[1:]
In [222]: s1.align(s2)
Out[222]:
(a -0.186646
b -1.692424
c -0.303893
d -1.425662
e NaN
dtype: float64,
a NaN
b -1.692424
c -0.303893
d -1.425662
e 1.114285
dtype: float64)
For DataFrames, the join method will be applied to both the index and the columns by default:
You can also pass an axis option to only align on the specified axis:
If you pass a Series to DataFrame.align(), you can choose to align both objects either on the DataFrame’s index
or columns using the axis argument:
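For example (sketches; df and df2 here are any two frames with overlapping labels):

df.align(df2, join="inner")            # align on both the index and the columns
df.align(df2, join="inner", axis=0)    # only align the index
df.align(df2.iloc[0], axis=1)          # align a Series to the DataFrame's columns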
reindex() takes an optional parameter method which is a filling method chosen from the following table:
Method Action
pad / ffill Fill values forward
bfill / backfill Fill values backward
nearest Fill from the nearest index value
In [231]: ts
Out[231]:
2000-01-03 0.183051
2000-01-04 0.400528
2000-01-05 -0.015083
2000-01-06 2.395489
2000-01-07 1.414806
2000-01-08 0.118428
2000-01-09 0.733639
2000-01-10 -0.936077
Freq: D, dtype: float64
In [232]: ts2
Out[232]:
2000-01-03 0.183051
2000-01-06 2.395489
2000-01-09 0.733639
Freq: 3D, dtype: float64
In [233]: ts2.reindex(ts.index)
Out[233]:
2000-01-03 0.183051
2000-01-04 NaN
2000-01-05 NaN
2000-01-06 2.395489
2000-01-07 NaN
2000-01-08 NaN
2000-01-09 0.733639
2000-01-10 NaN
Freq: D, dtype: float64
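These fill methods can be illustrated on the Series above:

ts2.reindex(ts.index, method="ffill")    # propagate the last valid observation forward
ts2.reindex(ts.index, method="bfill")    # use the next valid observation
ts2.reindex(ts.index, method="nearest")  # use the nearest valid observation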
These methods require that the indexes are ordered increasing or decreasing.
Note that the same result could have been achieved using fillna (except for method='nearest') or interpolate:
In [237]: ts2.reindex(ts.index).fillna(method="ffill")
Out[237]:
2000-01-03 0.183051
2000-01-04 0.183051
2000-01-05 0.183051
2000-01-06 2.395489
2000-01-07 2.395489
2000-01-08 2.395489
2000-01-09 0.733639
2000-01-10 0.733639
Freq: D, dtype: float64
reindex() will raise a ValueError if the index is not monotonically increasing or decreasing. fillna() and
interpolate() will not perform any checks on the order of the index.
The limit and tolerance arguments provide additional control over filling while reindexing. Limit specifies the
maximum count of consecutive matches:
In contrast, tolerance specifies the maximum distance between the index and indexer values:
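For example, continuing with ts and ts2 from above:

ts2.reindex(ts.index, method="ffill", limit=1)            # fill at most one consecutive missing label
ts2.reindex(ts.index, method="ffill", tolerance="1 day")  # only fill labels within one day of a valid one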
Notice that when used on a DatetimeIndex, TimedeltaIndex or PeriodIndex, tolerance will be coerced
into a Timedelta if possible. This allows you to specify tolerance with appropriate strings.
A method closely related to reindex is the drop() function. It removes a set of labels from an axis:
In [240]: df
Out[240]:
one two three
a 1.394981 1.772517 NaN
b 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
d NaN 0.279344 -0.613172
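For example:

df.drop(["a", "d"], axis=0)   # drop rows by label
df.drop(["one"], axis=1)      # drop a column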
Note that the following also works, but is a bit less obvious / clean:
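That is, reindexing on the set difference of the labels:

df.reindex(df.index.difference(["a", "d"]))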
The rename() method allows you to relabel an axis based on some mapping (a dict or Series) or an arbitrary function.
In [244]: s
Out[244]:
a -0.186646
b -1.692424
c -0.303893
d -1.425662
e 1.114285
dtype: float64
In [245]: s.rename(str.upper)
Out[245]:
A -0.186646
B -1.692424
C -0.303893
D -1.425662
E 1.114285
dtype: float64
If you pass a function, it must return a value when called with any of the labels (and must produce a set of unique
values). A dict or Series can also be used:
In [246]: df.rename(
.....: columns={"one": "foo", "two": "bar"},
.....: index={"a": "apple", "b": "banana", "d": "durian"},
.....: )
.....:
Out[246]:
foo bar three
apple 1.394981 1.772517 NaN
banana 0.343054 1.912123 -0.050390
c 0.695246 1.478369 1.227435
durian NaN 0.279344 -0.613172
If the mapping doesn’t include a column/index label, it isn’t renamed. Note that extra labels in the mapping don’t
throw an error.
DataFrame.rename() also supports an “axis-style” calling convention, where you specify a single mapper and
the axis to apply that mapping to.
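For example:

df.rename({"one": "foo", "two": "bar"}, axis="columns")
df.rename({"a": "apple", "b": "banana", "d": "durian"}, axis="index")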
The rename() method also provides an inplace named parameter that is by default False and copies the under-
lying data. Pass inplace=True to rename the data in place.
Finally, rename() also accepts a scalar or list-like for altering the Series.name attribute.
In [249]: s.rename("scalar-name")
Out[249]:
a -0.186646
b -1.692424
c -0.303893
d -1.425662
e 1.114285
Name: scalar-name, dtype: float64
In [251]: df
Out[251]:
x y
let num
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60
In [253]: df.rename_axis(index=str.upper)
Out[253]:
x y
LET NUM
a 1 1 10
2 2 20
b 1 3 30
2 4 40
c 1 5 50
2 6 60
2.3.8 Iteration
The behavior of basic iteration over pandas objects depends on the type. When iterating over a Series, it is regarded
as array-like, and basic iteration produces the values. DataFrames follow the dict-like convention of iterating over the
“keys” of the objects.
In short, basic iteration (for i in object) produces:
• Series: values
• DataFrame: column labels
Thus, for example, iterating over a DataFrame gives you the column names:
In [254]: df = pd.DataFrame(
.....: {"col1": np.random.randn(3), "col2": np.random.randn(3)}, index=["a", "b
˓→", "c"]
.....: )
.....:
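For example:

for col in df:
    print(col)
# col1
# col2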
pandas objects also have the dict-like items() method to iterate over the (key, value) pairs.
To iterate over the rows of a DataFrame, you can use the following methods:
• iterrows(): Iterate over the rows of a DataFrame as (index, Series) pairs. This converts the rows to Series
objects, which can change the dtypes and has some performance implications.
• itertuples(): Iterate over the rows of a DataFrame as namedtuples of the values. This is a lot faster than
iterrows(), and is in most cases preferable for iterating over the values of a DataFrame.
Warning: Iterating through pandas objects is generally slow. In many cases, iterating manually over the rows is
not needed and can be avoided with one of the following approaches:
• Look for a vectorized solution: many operations can be performed using built-in methods or NumPy func-
tions, (boolean) indexing, . . .
• When you have a function that cannot work on the full DataFrame/Series at once, it is better to use apply()
instead of iterating over the values. See the docs on function application.
• If you need to do iterative manipulations on the values but performance is important, consider writing the in-
ner loop with cython or numba. See the enhancing performance section for some examples of this approach.
Warning: You should never modify something you are iterating over. This is not guaranteed to work in all cases.
Depending on the data types, the iterator returns a copy and not a view, and writing to it will have no effect!
For example, in the following case setting the value has no effect:
In [256]: df = pd.DataFrame({"a": [1, 2, 3], "b": ["a", "b", "c"]})
In [257]: for index, row in df.iterrows():
   .....:     row["a"] = 10
   .....:

In [258]: df
Out[258]:
a b
0 1 a
1 2 b
2 3 c
items
Consistent with the dict-like interface, items() iterates through key-value pairs:
• Series: (index, scalar value) pairs
• DataFrame: (column, Series) pairs
For example:
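for label, ser in df.items():
    print(label)
    print(ser)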
iterrows
iterrows() allows you to iterate through the rows of a DataFrame as Series objects. It returns an iterator yielding
each index value along with a Series containing the data in each row:
In [260]: for row_index, row in df.iterrows():
.....: print(row_index, row, sep="\n")
.....:
0
a 1
b a
Name: 0, dtype: object
1
a 2
b b
Name: 1, dtype: object
2
a 3
b c
Name: 2, dtype: object
Note: Because iterrows() returns a Series for each row, it does not preserve dtypes across the rows (dtypes are
preserved across columns for DataFrames). For example,
In [261]: df_orig = pd.DataFrame([[1, 1.5]], columns=["int", "float"])
In [262]: df_orig.dtypes
Out[262]:
int int64
float float64
dtype: object
In [263]: row = next(df_orig.iterrows())[1]

In [264]: row
Out[264]:
int 1.0
float 1.5
Name: 0, dtype: float64
All values in row, returned as a Series, are now upcasted to floats, including the original integer value in column int:
In [265]: row["int"].dtype
Out[265]: dtype('float64')
In [266]: df_orig["int"].dtype
Out[266]: dtype('int64')
To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns namedtuples of the
values and which is generally much faster than iterrows().
For instance, a contrived way to transpose the DataFrame would be:

In [267]: df2 = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})

In [268]: print(df2)
   x  y
0  1  4
1  2  5
2  3  6
In [269]: print(df2.T)
0 1 2
x 1 2 3
y 4 5 6
In [270]: df2_t = pd.DataFrame({idx: values for idx, values in df2.itertuples()})

In [271]: print(df2_t)
0 1 2
x 1 2 3
y 4 5 6
itertuples
The itertuples() method will return an iterator yielding a namedtuple for each row in the DataFrame. The first
element of the tuple will be the row’s corresponding index value, while the remaining values are the row values.
For instance:
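Using the small a/b frame from the iteration examples above:

for row in df.itertuples():
    print(row)
# Pandas(Index=0, a=1, b='a')
# Pandas(Index=1, a=2, b='b')
# Pandas(Index=2, a=3, b='c')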
This method does not convert the row to a Series object; it merely returns the values inside a namedtuple. Therefore,
itertuples() preserves the data type of the values and is generally faster than iterrows().
Note: The column names will be renamed to positional names if they are invalid Python identifiers, repeated, or start
with an underscore. With a large number of columns (>255), regular tuples are returned.
Series has an accessor to succinctly return datetime-like properties for the values of the Series, if it is a date-
time/period-like Series. This will return a Series, indexed like the existing Series.
# datetime
In [273]: s = pd.Series(pd.date_range("20130101 09:10:12", periods=4))
In [274]: s
Out[274]:
0 2013-01-01 09:10:12
1 2013-01-02 09:10:12
2 2013-01-03 09:10:12
3 2013-01-04 09:10:12
dtype: datetime64[ns]
In [275]: s.dt.hour
Out[275]:
0 9
1 9
2 9
3 9
dtype: int64
In [276]: s.dt.second
Out[276]:
0 12
1 12
2 12
3 12
dtype: int64
In [277]: s.dt.day
Out[277]:
0 1
1 2
2 3
3 4
dtype: int64
In [278]: s[s.dt.day == 2]
Out[278]:
1 2013-01-02 09:10:12
dtype: datetime64[ns]
In [279]: stz = s.dt.tz_localize("US/Eastern")

In [280]: stz
Out[280]:
0 2013-01-01 09:10:12-05:00
1 2013-01-02 09:10:12-05:00
2 2013-01-03 09:10:12-05:00
3 2013-01-04 09:10:12-05:00
dtype: datetime64[ns, US/Eastern]
In [281]: stz.dt.tz
Out[281]: <DstTzInfo 'US/Eastern' LMT-1 day, 19:04:00 STD>
In [282]: s.dt.tz_localize("UTC").dt.tz_convert("US/Eastern")
Out[282]:
0 2013-01-01 04:10:12-05:00
1 2013-01-02 04:10:12-05:00
2 2013-01-03 04:10:12-05:00
3 2013-01-04 04:10:12-05:00
dtype: datetime64[ns, US/Eastern]
You can also format datetime values as strings with Series.dt.strftime() which supports the same format as
the standard strftime().
# DatetimeIndex
In [283]: s = pd.Series(pd.date_range("20130101", periods=4))
In [284]: s
Out[284]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: datetime64[ns]
In [285]: s.dt.strftime("%Y/%m/%d")
Out[285]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
# PeriodIndex
In [286]: s = pd.Series(pd.period_range("20130101", periods=4))
In [287]: s
Out[287]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]
In [288]: s.dt.strftime("%Y/%m/%d")
Out[288]:
0 2013/01/01
1 2013/01/02
2 2013/01/03
3 2013/01/04
dtype: object
In [290]: s
Out[290]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
3 2013-01-04
dtype: period[D]
In [291]: s.dt.year
Out[291]:
0 2013
1 2013
2 2013
3 2013
dtype: int64
In [292]: s.dt.day
Out[292]:
0 1
1 2
2 3
3 4
dtype: int64
# timedelta
In [293]: s = pd.Series(pd.timedelta_range("1 day 00:00:05", periods=4, freq="s"))
In [294]: s
Out[294]:
0 1 days 00:00:05
1 1 days 00:00:06
2 1 days 00:00:07
3 1 days 00:00:08
dtype: timedelta64[ns]
In [295]: s.dt.days
Out[295]:
0 1
1 1
2 1
3 1
dtype: int64
In [296]: s.dt.seconds
Out[296]:
0 5
1 6
2 7
3 8
dtype: int64
In [297]: s.dt.components
Out[297]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 1 0 0 5 0 0 0
1 1 0 0 6 0 0 0
2 1 0 0 7 0 0 0
3 1 0 0 8 0 0 0
Note: Series.dt will raise a TypeError if you access it with non-datetime-like values.
Series is equipped with a set of string processing methods that make it easy to operate on each element of the array.
Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via the Series’s
str attribute and generally have names matching the equivalent (scalar) built-in string methods. For example:
In [298]: s = pd.Series(
.....: ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"],
˓→dtype="string"
.....: )
.....:
In [299]: s.str.lower()
Out[299]:
0 a
1 b
2 c
3 aaba
4 baca
5 <NA>
6 caba
7 dog
8 cat
dtype: string
Powerful pattern-matching methods are provided as well, but note that pattern-matching generally uses regular expres-
sions by default (and in some cases always uses them).
Note: Prior to pandas 1.0, string methods were only available on object -dtype Series. pandas 1.0 added the
StringDtype which is dedicated to strings. See Text data types for more.
2.3.11 Sorting
pandas supports three kinds of sorting: sorting by index labels, sorting by column values, and sorting by a combination
of both.
By index
The Series.sort_index() and DataFrame.sort_index() methods are used to sort a pandas object by its
index levels.
In [300]: df = pd.DataFrame(
.....: {
.....: "one": pd.Series(np.random.randn(3), index=["a", "b", "c"]),
.....: "two": pd.Series(np.random.randn(4), index=["a", "b", "c", "d"]),
.....: "three": pd.Series(np.random.randn(3), index=["b", "c", "d"]),
.....: }
.....: )
.....:
In [301]: unsorted_df = df.reindex(
   .....:     index=["a", "d", "c", "b"], columns=["three", "two", "one"]
   .....: )
   .....:

In [302]: unsorted_df
Out[302]:
three two one
a NaN -1.152244 0.562973
d -0.252916 -0.109597 NaN
c 1.273388 -0.167123 0.640382
b -0.098217 0.009797 -1.299504
# DataFrame
In [303]: unsorted_df.sort_index()
Out[303]:
three two one
a NaN -1.152244 0.562973
b -0.098217 0.009797 -1.299504
c 1.273388 -0.167123 0.640382
d -0.252916 -0.109597 NaN
In [304]: unsorted_df.sort_index(ascending=False)
Out[304]:
three two one
d -0.252916 -0.109597 NaN
c 1.273388 -0.167123 0.640382
b -0.098217 0.009797 -1.299504
a NaN -1.152244 0.562973
In [305]: unsorted_df.sort_index(axis=1)
Out[305]:
one three two
a 0.562973 NaN -1.152244
d NaN -0.252916 -0.109597
c 0.640382 1.273388 -0.167123
b -1.299504 -0.098217 0.009797
# Series
In [306]: unsorted_df["three"].sort_index()
Out[306]:
a NaN
b -0.098217
c 1.273388
d -0.252916
Name: three, dtype: float64
In [307]: s1 = pd.DataFrame({"a": ["B", "a", "C"], "b": [1, 2, 3], "c": [2, 3, 4]}).set_index(
   .....:     list("ab")
   .....: )
   .....:
In [309]: s1.sort_index(level="a")
Out[309]:
c
a b
B 1 2
C 3 4
a 2 3
By values
The Series.sort_values() method is used to sort a Series by its values. The DataFrame.
sort_values() method is used to sort a DataFrame by its column or row values. The optional by parameter to
DataFrame.sort_values() may be used to specify one or more columns to use to determine the sorted order.
In [311]: df1 = pd.DataFrame(
.....: {"one": [2, 1, 1, 1], "two": [1, 3, 2, 4], "three": [5, 4, 3, 2]}
.....: )
.....:
In [312]: df1.sort_values(by="two")
Out[312]:
one two three
0 2 1 5
2 1 2 3
1 1 3 4
3 1 4 2
These methods have special treatment of NA values via the na_position argument:
In [315]: s.sort_values()
Out[315]:
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
2 <NA>
5 <NA>
dtype: string
In [316]: s.sort_values(na_position="first")
Out[316]:
2 <NA>
5 <NA>
0 A
3 Aaba
1 B
4 Baca
6 CABA
8 cat
7 dog
dtype: string
Sorting also supports a key parameter that takes a callable function to apply to the values being sorted:

In [317]: s1 = pd.Series(["B", "a", "C"])

In [318]: s1.sort_values()
Out[318]:
0 B
2 C
1 a
dtype: object
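With a key, for example, a case-insensitive sort:

s1.sort_values(key=lambda x: x.str.lower())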
key will be given the Series of values and should return a Series or array of the same shape with the transformed
values. For DataFrame objects, the key is applied per column, so the key should still expect a Series and return a
Series, e.g.
In [320]: df = pd.DataFrame({"a": ["B", "a", "C"], "b": [1, 2, 3]})
In [321]: df.sort_values(by="a")
Out[321]:
a b
0 B 1
2 C 3
1 a 2
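Sorting the same frame case-insensitively with a key applied per column:

df.sort_values(by="a", key=lambda col: col.str.lower())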
The name or type of each column can be used to apply different functions to different columns.
Strings passed as the by parameter to DataFrame.sort_values() may refer to either columns or index level
names.
# Build MultiIndex
In [323]: idx = pd.MultiIndex.from_tuples(
.....: [("a", 1), ("a", 2), ("a", 2), ("b", 2), ("b", 1), ("b", 1)]
.....: )
.....:
In [324]: idx.names = ["first", "second"]

# Build DataFrame
In [325]: df_multi = pd.DataFrame({"A": np.arange(6, 0, -1)}, index=idx)
In [326]: df_multi
Out[326]:
A
first second
a 1 6
2 5
2 4
b 2 3
1 2
1 1
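For example, sorting by the second index level and then by column A:

df_multi.sort_values(by=["second", "A"])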
Note: If a string matches both a column name and an index level name then a warning is issued and the column takes precedence.
searchsorted
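Series has the searchsorted() method, which works similarly to numpy.ndarray.searchsorted(); a brief sketch:

ser = pd.Series([1, 2, 3])
ser.searchsorted([0, 3])                 # array([0, 2])
ser.searchsorted([1, 3], side="right")   # array([1, 3])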
Series has the nsmallest() and nlargest() methods which return the smallest or largest 𝑛 values. For a
large Series this can be much faster than sorting the entire Series and calling head(n) on the result.
In [335]: s = pd.Series(np.random.permutation(10))
In [336]: s
Out[336]:
0 2
1 0
2 3
3 7
4 1
5 5
6 9
7 6
8 8
9 4
dtype: int64
In [337]: s.sort_values()
Out[337]:
1 0
4 1
0 2
2 3
9 4
5 5
7 6
3 7
8 8
6 9
dtype: int64
In [338]: s.nsmallest(3)
Out[338]:
1 0
4 1
0 2
dtype: int64
In [339]: s.nlargest(3)
Out[339]:
6 9
8 8
3 7
dtype: int64
You must be explicit about sorting when the column is a MultiIndex, and fully specify all levels to by.
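For example, with MultiIndex columns (a sketch; the column tuples are illustrative):

df1.columns = pd.MultiIndex.from_tuples([("a", "one"), ("a", "two"), ("b", "three")])
df1.sort_values(by=("a", "two"))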
2.3.12 Copying
The copy() method on pandas objects copies the underlying data (though not the axis indexes, since they are im-
mutable) and returns a new object. Note that it is seldom necessary to copy objects. For example, there are only a
handful of ways to alter a DataFrame in-place:
• Inserting, deleting, or modifying a column.
• Assigning to the index or columns attributes.
• For homogeneous data, directly modifying the values via the values attribute or advanced indexing.
To be clear, no pandas method has the side effect of modifying your data; almost every method returns a new object,
leaving the original object untouched. If the data is modified, it is because you did so explicitly.
2.3.13 dtypes
For the most part, pandas uses NumPy arrays and dtypes for Series or individual columns of a DataFrame. NumPy
provides support for float, int, bool, timedelta64[ns] and datetime64[ns] (note that NumPy does not
support timezone-aware datetimes).
pandas and third-party libraries extend NumPy’s type system in a few places. This section describes the extensions
pandas has made internally. See Extension types for how to write your own extension that works with pandas. See
ecosystem.extensions for a list of third-party libraries that have implemented an extension.
pandas defines a number of extension types, such as categorical, nullable integer, string, boolean, period, interval,
sparse, and timezone-aware datetime data. For methods requiring dtype arguments, strings can be specified as
indicated. See the respective documentation sections for more on each type.
In [348]: dft
Out[348]:
A B C D E F G
0 0.035962 1 foo 2001-01-02 1.0 False 1
1 0.701379 1 foo 2001-01-02 1.0 False 1
2 0.281885 1 foo 2001-01-02 1.0 False 1
In [349]: dft.dtypes
Out[349]:
A float64
B int64
C object
D datetime64[ns]
E float32
F bool
G int8
dtype: object
In [350]: dft["A"].dtype
Out[350]: dtype('float64')
If a pandas object contains data with multiple dtypes in a single column, the dtype of the column will be chosen to
accommodate all of the data types (object is the most general).
The number of columns of each type in a DataFrame can be found by calling DataFrame.dtypes.value_counts().
In [353]: dft.dtypes.value_counts()
Out[353]:
float64 1
object 1
int8 1
float32 1
int64 1
bool 1
datetime64[ns] 1
dtype: int64
Numeric dtypes will propagate and can coexist in DataFrames. If a dtype is passed (either directly via the dtype
keyword, a passed ndarray, or a passed Series), then it will be preserved in DataFrame operations. Furthermore,
different numeric dtypes will NOT be combined. The following example will give you a taste.
In [355]: df1
Out[355]:
A
0 0.224364
1 1.890546
2 0.182879
3 0.787847
4 -0.188449
5 0.667715
6 -0.011736
7 -0.399073
In [356]: df1.dtypes
Out[356]:
A float32
dtype: object
In [358]: df2
Out[358]:
A B C
0 0.823242 0.256090 0
1 1.607422 1.426469 0
2 -0.333740 -0.416203 255
3 -0.063477 1.139976 0
4 -1.014648 -1.193477 0
5 0.678711 0.096706 0
6 -0.040863 -1.956850 1
7 -0.357422 -0.714337 0
In [359]: df2.dtypes
Out[359]:
A float16
B float64
C uint8
dtype: object
defaults
By default integer types are int64 and float types are float64, regardless of platform (32-bit or 64-bit). The
following will all result in int64 dtypes.
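For example:

pd.DataFrame([1, 2], columns=["a"]).dtypes             # a    int64
pd.DataFrame({"a": [1, 2]}).dtypes                     # a    int64
pd.DataFrame({"a": 1}, index=list(range(2))).dtypes    # a    int64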
Note that NumPy will choose platform-dependent types when creating arrays. The following WILL result in int32
on a 32-bit platform.
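For example:

frame = pd.DataFrame(np.array([1, 2]))   # dtype taken from the ndarray, hence platform-dependent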
upcasting
Types can potentially be upcasted when combined with other types, meaning they are promoted from the current type
(e.g. int to float).
In [364]: df3 = df1 + df2

In [365]: df3
Out[365]:
A B C
0 1.047606 0.256090 0.0
1 3.497968 1.426469 0.0
2 -0.150862 -0.416203 255.0
3 0.724370 1.139976 0.0
4 -1.203098 -1.193477 0.0
5 1.346426 0.096706 0.0
6 -0.052599 -1.956850 1.0
7 -0.756495 -0.714337 0.0
In [366]: df3.dtypes
Out[366]:
A float32
B float64
C float64
dtype: object
DataFrame.to_numpy() will return the lower-common-denominator of the dtypes, meaning the dtype that can
accommodate ALL of the types in the resulting homogeneous dtyped NumPy array. This can force some upcasting.
In [367]: df3.to_numpy().dtype
Out[367]: dtype('float64')
astype
You can use the astype() method to explicitly convert dtypes from one to another. These will by default return a
copy, even if the dtype was unchanged (pass copy=False to change this behavior). In addition, they will raise an
exception if the astype operation is invalid.
Upcasting is always according to the NumPy rules. If two different dtypes are involved in an operation, then the more
general one will be used as the result of the operation.
In [368]: df3
Out[368]:
A B C
0 1.047606 0.256090 0.0
1 3.497968 1.426469 0.0
2 -0.150862 -0.416203 255.0
3 0.724370 1.139976 0.0
4 -1.203098 -1.193477 0.0
5 1.346426 0.096706 0.0
6 -0.052599 -1.956850 1.0
7 -0.756495 -0.714337 0.0
In [369]: df3.dtypes
Out[369]:
A float32
B float64
C float64
dtype: object
# conversion of dtypes
In [370]: df3.astype("float32").dtypes
Out[370]:
A float32
B float32
C float32
dtype: object
Convert a subset of columns to a specified type using astype():

In [371]: dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
In [372]: dft[["a", "b"]] = dft[["a", "b"]].astype(np.uint8)

In [373]: dft
Out[373]:
a b c
0 1 4 7
1 2 5 8
2 3 6 9
In [374]: dft.dtypes
Out[374]:
a uint8
b uint8
c int64
dtype: object
Convert certain columns to a specific dtype by passing a dict to astype():

In [375]: dft1 = pd.DataFrame({"a": [1, 0, 1], "b": [4, 5, 6], "c": [7, 8, 9]})
In [376]: dft1 = dft1.astype({"a": np.bool_, "c": np.float64})

In [377]: dft1
Out[377]:
a b c
0 True 4 7.0
1 False 5 8.0
2 True 6 9.0
In [378]: dft1.dtypes
Out[378]:
a bool
b int64
c float64
dtype: object
Note: When trying to convert a subset of columns to a specified type using astype() and loc(), upcasting occurs.
loc() tries to fit in what we are assigning to the current dtypes, while [] will overwrite them taking the dtype from
the right hand side. Therefore the following piece of code produces the unintended result.
In [379]: dft = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]})
In [381]: dft.loc[:, ["a", "b"]] = dft.loc[:, ["a", "b"]].astype(np.uint8)

In [382]: dft.dtypes
Out[382]:
a int64
b int64
c int64
dtype: object
object conversion
pandas offers various functions to try to force conversion of types from the object dtype to other types. In cases
where the data is already of the correct type, but stored in an object array, the DataFrame.infer_objects()
and Series.infer_objects() methods can be used to soft convert to the correct type.
In [383]: import datetime
In [384]: df = pd.DataFrame(
   .....:     [
   .....:         [1, 2],
   .....:         ["a", "b"],
   .....:         [datetime.datetime(2016, 3, 2), datetime.datetime(2016, 3, 2)],
   .....:     ]
   .....: )
   .....:

In [385]: df = df.T
In [386]: df
Out[386]:
0 1 2
0 1 a 2016-03-02
1 2 b 2016-03-02
In [387]: df.dtypes
Out[387]:
0 object
1 object
2 datetime64[ns]
dtype: object
Because the data was transposed the original inference stored all columns as object, which infer_objects will
correct.
In [388]: df.infer_objects().dtypes
Out[388]:
0 int64
1 object
2 datetime64[ns]
dtype: object
The following functions are available for one-dimensional object arrays or scalars to perform hard conversion of objects
to a specified type:
• to_numeric() (conversion to numeric dtypes)
• to_datetime() (conversion to datetime objects)
• to_timedelta() (conversion to timedelta objects)
In [389]: m = ["1.1", 2, 3]
In [390]: pd.to_numeric(m)
Out[390]: array([1.1, 2. , 3. ])
In [393]: pd.to_datetime(m)
Out[393]: DatetimeIndex(['2016-07-09', '2016-03-02'], dtype='datetime64[ns]', freq=None)
In [395]: pd.to_timedelta(m)
Out[395]: TimedeltaIndex(['0 days 00:00:00.000005', '1 days 00:00:00'], dtype='timedelta64[ns]', freq=None)
To force a conversion, we can pass in an errors argument, which specifies how pandas should deal with elements
that cannot be converted to desired dtype or object. By default, errors='raise', meaning that any errors encoun-
tered will be raised during the conversion process. However, if errors='coerce', these errors will be ignored
and pandas will convert problematic elements to pd.NaT (for datetime and timedelta) or np.nan (for numeric).
This might be useful if you are reading in data which is mostly of the desired dtype (e.g. numeric, datetime), but
occasionally has non-conforming elements intermixed that you want to represent as missing:
In [399]: m = ["apple", 2, 3]
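For example:

pd.to_numeric(m, errors="coerce")   # array([nan,  2.,  3.]); "apple" becomes NaN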
The errors parameter has a third option of errors='ignore', which will simply return the passed in data if it
encounters any errors with the conversion to a desired data type:
In [406]: m = ["apple", 2, 3]
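For example:

pd.to_numeric(m, errors="ignore")   # array(['apple', 2, 3], dtype=object); the input is returned unchanged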
In addition to object conversion, to_numeric() provides another argument downcast, which gives the option of
downcasting the newly (or already) numeric data to a smaller dtype, which can conserve memory:
In [410]: m = ["1", 2, 3]
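For example:

pd.to_numeric(m, downcast="integer")   # smallest signed integer dtype: array([1, 2, 3], dtype=int8)
pd.to_numeric(m, downcast="float")     # smallest float dtype: array([1., 2., 3.], dtype=float32)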
As these methods apply only to one-dimensional arrays, lists, or scalars, they cannot be used directly on multi-
dimensional objects such as DataFrames. However, with apply(), we can "apply" the function over each column
efficiently:
In [417]: df
Out[417]:
0 1
0 2016-07-09 2016-03-02 00:00:00
1 2016-07-09 2016-03-02 00:00:00
In [418]: df.apply(pd.to_datetime)
Out[418]:
0 1
0 2016-07-09 2016-03-02
1 2016-07-09 2016-03-02
In [420]: df
Out[420]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [421]: df.apply(pd.to_numeric)
Out[421]:
0 1 2
0 1.1 2 3
1 1.1 2 3
In [423]: df
Out[423]:
0 1
0 5us 1 days 00:00:00
1 5us 1 days 00:00:00
In [424]: df.apply(pd.to_timedelta)
Out[424]:
0 1
0 0 days 00:00:00.000005 1 days
1 0 days 00:00:00.000005 1 days
gotchas
Performing selection operations on integer type data can easily upcast the data to floating. The dtype of the
input data will be preserved in cases where nans are not introduced. See also Support for integer NA.
In [425]: dfi = df3.astype("int32")
In [426]: dfi["E"] = 1
In [427]: dfi
Out[427]:
A B C E
0 1 0 0 1
1 3 1 0 1
2 0 0 255 1
3 0 1 0 1
4 -1 -1 0 1
5 1 0 0 1
6 0 -1 1 1
7 0 0 0 1
In [428]: dfi.dtypes
Out[428]:
A int32
B int32
C int32
E int64
dtype: object
In [429]: casted = dfi[dfi > 0]

In [430]: casted
Out[430]:
A B C E
0 1.0 NaN NaN 1
1 3.0 1.0 NaN 1
2 NaN NaN 255.0 1
3 NaN 1.0 NaN 1
4 NaN NaN NaN 1
5 1.0 NaN NaN 1
6 NaN NaN 1.0 1
7 NaN NaN NaN 1
In [431]: casted.dtypes
Out[431]:
A float64
B float64
C float64
E int64
dtype: object
While float dtypes are unchanged:

In [433]: dfa = df3.copy()

In [434]: dfa.dtypes
Out[434]:
A float32
B float64
C float64
dtype: object

In [435]: casted = dfa[df2 > 0]
In [436]: casted
Out[436]:
A B C
0 1.047606 0.256090 NaN
1 3.497968 1.426469 NaN
2 NaN NaN 255.0
3 NaN 1.139976 NaN
4 NaN NaN NaN
5 1.346426 0.096706 NaN
6 NaN NaN 1.0
7 NaN NaN NaN
In [437]: casted.dtypes
Out[437]:
A float32
B float64
C float64
dtype: object
In [438]: df = pd.DataFrame(
.....: {
.....: "string": list("abc"),
.....: "int64": list(range(1, 4)),
.....: "uint8": np.arange(3, 6).astype("u1"),
.....: "float64": np.arange(4.0, 7.0),
.....: "bool1": [True, False, True],
.....: "bool2": [False, True, False],
.....: "dates": pd.date_range("now", periods=3),
.....: "category": pd.Series(list("ABC")).astype("category"),
.....: }
.....: )
.....:
Additional columns of timedelta, unsigned integer, naive datetime, and timezone-aware datetime data are then
added, giving a frame with the following dtypes:
In [444]: df.dtypes
Out[444]:
string object
int64 int64
uint8 uint8
float64 float64
bool1 bool
bool2 bool
dates datetime64[ns]
category category
tdeltas timedelta64[ns]
uint64 uint64
other_dates datetime64[ns]
tz_aware_dates datetime64[ns, US/Eastern]
dtype: object
select_dtypes() has two parameters include and exclude that allow you to say "give me the columns with
these dtypes" (include) and/or "give me the columns without these dtypes" (exclude).
For example, to select bool columns:
In [445]: df.select_dtypes(include=[bool])
Out[445]:
bool1 bool2
0 True False
1 False True
2 True False
You can also pass the name of a dtype in the NumPy dtype hierarchy:
In [446]: df.select_dtypes(include=["bool"])
Out[446]:
bool1 bool2
0 True False
1 False True
2 True False
To select string columns you must use the object dtype:

In [448]: df.select_dtypes(include=["object"])
Out[448]:
string
0 a
1 b
2 c
To see all the child dtypes of a generic dtype like numpy.number you can define a function that returns a tree of
child dtypes:
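One possible implementation is a recursive walk over __subclasses__() (a sketch that produces the tree shown below):

def subdtypes(dtype):
    # recursively collect each dtype together with its subclasses
    subs = dtype.__subclasses__()
    if not subs:
        return dtype
    return [dtype, [subdtypes(dt) for dt in subs]]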
In [450]: subdtypes(np.generic)
Out[450]:
[numpy.generic,
[[numpy.number,
[[numpy.integer,
[[numpy.signedinteger,
[numpy.int8,
numpy.int16,
numpy.int32,
numpy.int64,
numpy.longlong,
numpy.timedelta64]],
[numpy.unsignedinteger,
[numpy.uint8,
numpy.uint16,
numpy.uint32,
numpy.uint64,
numpy.ulonglong]]]],
[numpy.inexact,
[[numpy.floating,
[numpy.float16, numpy.float32, numpy.float64, numpy.float128]],
[numpy.complexfloating,
[numpy.complex64, numpy.complex128, numpy.complex256]]]]]],
[numpy.flexible,
[[numpy.character, [numpy.bytes_, numpy.str_]],
[numpy.void, [numpy.record]]]],
numpy.bool_,
numpy.datetime64,
numpy.object_]]
Note: pandas also defines the types category, and datetime64[ns, tz], which are not integrated into the
normal NumPy hierarchy and won’t show up with the above function.
The pandas I/O API is a set of top-level reader functions accessed like pandas.read_csv() that generally
return a pandas object. The corresponding writer functions are object methods that are accessed like DataFrame.
to_csv(). The available readers and writers are covered in the format-specific sections that follow.
Note: For examples that use the StringIO class, make sure you import it with from io import StringIO
for Python 3.
The workhorse function for reading text files (a.k.a. flat files) is read_csv(). See the cookbook for some advanced
strategies.
Parsing options
Basic
header [int or list of ints, default 'infer'] Row number(s) to use as the column names, and the start of the data.
Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0
and column names are inferred from the first line of the file, if column names are passed explicitly then the
behavior is identical to header=None. Explicitly pass header=0 to be able to replace existing names.
The header can be a list of ints that specify row locations for a MultiIndex on the columns e.g. [0,1,3].
Intervening rows that are not specified will be skipped (e.g. 2 in this example is skipped). Note that this
parameter ignores commented lines and empty lines if skip_blank_lines=True, so header=0 denotes the
first line of data rather than the first line of the file.
names [array-like, default None] List of column names to use. If file contains no header row, then you should
explicitly pass header=None. Duplicates in this list are not allowed.
index_col [int, str, sequence of int / str, or False, default None] Column(s) to use as the row labels of the
DataFrame, either given as string name or column index. If a sequence of int / str is given, a MultiIndex
is used.
Note: index_col=False can be used to force pandas to not use the first column as the index, e.g. when you
have a malformed file with delimiters at the end of each line.
The default value of None instructs pandas to guess. If the number of fields in the column header row is equal
to the number of fields in the body of the data file, then a default index is used. If it is one larger, then the first
field is used as an index.
usecols [list-like or callable, default None] Return a subset of the columns. If list-like, all elements must either
be positional (i.e. integer indices into the document columns) or strings that correspond to column names
provided either by the user in names or inferred from the document header row(s). For example, a valid list-
like usecols parameter would be [0, 1, 2] or ['foo', 'bar', 'baz'].
Element order is ignored, so usecols=[0, 1] is the same as [1, 0]. To instantiate a
DataFrame from data with element order preserved use pd.read_csv(data, usecols=['foo',
'bar'])[['foo', 'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data,
usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
In [4]: pd.read_csv(StringIO(data))
Out[4]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
Out[5]:
col1 col3
0 a 1
1 a 2
2 c 3
Using this parameter results in much faster parsing time and lower memory usage.
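The call that produced Out[5] is elided; it was presumably a usecols selection such as:
pd.read_csv(StringIO(data), usecols=["col1", "col3"])
# or, equivalently, with a callable
pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["COL1", "COL3"])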
squeeze [boolean, default False] If the parsed data only contains one column then return a Series.
prefix [str, default None] Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, . . .
mangle_dupe_cols [boolean, default True] Duplicate columns will be specified as ‘X’, ‘X.1’. . . ’X.N’, rather than
‘X’. . . ’X’. Passing in False will cause data to be overwritten if there are duplicate names in the columns.
dtype [Type name or dict of column -> type, default None] Data type for data or columns. E.g. {'a': np.
float64, 'b': np.int32} (unsupported with engine='python'). Use str or object together
with suitable na_values settings to preserve and not interpret dtype.
engine [{'c', 'python'}] Parser engine to use. The C engine is faster while the Python engine is currently more
feature-complete.
converters [dict, default None] Dict of functions for converting values in certain columns. Keys can either be integers
or column labels.
true_values [list, default None] Values to consider as True.
false_values [list, default None] Values to consider as False.
skipinitialspace [boolean, default False] Skip spaces after delimiter.
skiprows [list-like or integer, default None] Line numbers to skip (0-indexed) or number of lines to skip (int) at the
start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if the row should be
skipped and False otherwise:
In [7]: pd.read_csv(StringIO(data))
Out[7]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
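The callable form itself is elided above; skipping every other row, for example, could be written as:
pd.read_csv(StringIO(data), skiprows=lambda x: x % 2 != 0)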
skipfooter [int, default 0] Number of lines at bottom of file to skip (unsupported with engine=’c’).
nrows [int, default None] Number of rows of file to read. Useful for reading pieces of large files.
low_memory [boolean, default True] Internally process the file in chunks, resulting in lower memory use while
parsing, but possibly mixed type inference. To ensure no mixed types either set False, or specify the type with
the dtype parameter. Note that the entire file is read into a single DataFrame regardless, use the chunksize
or iterator parameter to return the data in chunks. (Only valid with C parser)
memory_map [boolean, default False] If a filepath is provided for filepath_or_buffer, map the file object
directly onto memory and access the data directly from there. Using this option can improve performance
because there is no longer any I/O overhead.
na_values [scalar, str, list-like, or dict, default None] Additional strings to recognize as NA/NaN. If dict passed,
specific per-column NA values. See na values const below for a list of the values interpreted as NaN by default.
keep_default_na [boolean, default True] Whether or not to include the default NaN values when parsing the data.
Depending on whether na_values is passed in, the behavior is as follows:
• If keep_default_na is True, and na_values are specified, na_values is appended to the default
NaN values used for parsing.
• If keep_default_na is True, and na_values are not specified, only the default NaN values are
used for parsing.
• If keep_default_na is False, and na_values are specified, only the NaN values specified
na_values are used for parsing.
• If keep_default_na is False, and na_values are not specified, no strings will be parsed as NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters will
be ignored.
na_filter [boolean, default True] Detect missing value markers (empty strings and the value of na_values). In data
without any NAs, passing na_filter=False can improve the performance of reading a large file.
verbose [boolean, default False] Indicate number of NA values placed in non-numeric columns.
skip_blank_lines [boolean, default True] If True, skip over blank lines rather than interpreting as NaN values.
Datetime handling
parse_dates [boolean or list of ints or names or list of lists or dict, default False.]
• If True -> try parsing the index.
• If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date column.
• If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
• If {'foo': [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’. A fast-path exists for iso8601-
formatted dates.
infer_datetime_format [boolean, default False] If True and parse_dates is enabled for a column, attempt to infer
the datetime format to speed up the processing.
keep_date_col [boolean, default False] If True and parse_dates specifies combining multiple columns then keep
the original columns.
date_parser [function, default None] Function to use for converting a sequence of string columns to an array of
datetime instances. The default uses dateutil.parser.parser to do the conversion. pandas will try to
call date_parser in three different ways, advancing to the next if an exception occurs: 1) Pass one or more arrays
(as defined by parse_dates) as arguments; 2) concatenate (row-wise) the string values from the columns defined
by parse_dates into a single array and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst [boolean, default False] DD/MM format dates, international and European format.
cache_dates [boolean, default True] If True, use a cache of unique, converted dates to apply the datetime conversion.
May produce significant speed-up when parsing duplicate date strings, especially ones with timezone offsets.
New in version 0.25.0.
Iteration
iterator [boolean, default False] Return TextFileReader object for iteration or getting chunks with
get_chunk().
chunksize [int, default None] Return TextFileReader object for iteration. See iterating and chunking below.
compression [{'infer', 'gzip', 'bz2', 'zip', 'xz', None, dict}, default 'infer'] For on-the-fly de-
compression of on-disk data. If ‘infer’, then use gzip, bz2, zip, or xz if filepath_or_buffer is path-like
ending in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’, respectively, and no decompression otherwise. If using ‘zip’, the ZIP
file must contain only one data file to be read in. Set to None for no decompression. Can also be a dict
with key 'method' set to one of {'zip', 'gzip', 'bz2'} and other key-value pairs are forwarded to
zipfile.ZipFile, gzip.GzipFile, or bz2.BZ2File. As an example, the following could be passed
for faster compression and to create a reproducible gzip archive: compression={'method': 'gzip',
'compresslevel': 1, 'mtime': 1}.
Changed in version 0.24.0: ‘infer’ option added and set to default.
Changed in version 1.1.0: dict option extended to support gzip and bz2.
Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to gzip.open.
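For example, the dict form described above could be used for a reproducible round trip (the file name is illustrative):
df.to_csv("out.csv.gz", compression={"method": "gzip", "compresslevel": 1, "mtime": 1})
df = pd.read_csv("out.csv.gz")  # 'infer' picks gzip from the '.gz' suffix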
thousands [str, default None] Thousands separator.
decimal [str, default '.'] Character to recognize as decimal point. E.g. use ',' for European data.
float_precision [string, default None] Specifies which converter the C engine should use for floating-point values.
The options are None for the ordinary converter, high for the high-precision converter, and round_trip for
the round-trip converter.
lineterminator [str (length 1), default None] Character to break file into lines. Only valid with C parser.
quotechar [str (length 1)] The character used to denote the start and end of a quoted item. Quoted items can include
the delimiter and it will be ignored.
quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per csv.QUOTE_* constants.
Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1), QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [boolean, default True] When quotechar is specified and quoting is not QUOTE_NONE, indi-
cate whether or not to interpret two consecutive quotechar elements inside a field as a single quotechar
element.
escapechar [str (length 1), default None] One-character string used to escape delimiter when quoting is
QUOTE_NONE.
comment [str, default None] Indicates remainder of line should not be parsed. If found at the beginning of a line,
the line will be ignored altogether. This parameter must be a single character. Like empty lines (as long
as skip_blank_lines=True), fully commented lines are ignored by the parameter header but not by
skiprows. For example, if comment='#', parsing ‘#empty\na,b,c\n1,2,3’ with header=0 will result in
‘a,b,c’ being treated as the header.
encoding [str, default None] Encoding to use for UTF when reading/writing (e.g. 'utf-8'). List of Python standard
encodings.
dialect [str or csv.Dialect instance, default None] If provided, this parameter will override values (default or
not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace,
quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued. See csv.
Dialect documentation for more details.
Error handling
error_bad_lines [boolean, default True] Lines with too many fields (e.g. a csv line with too many commas) will by
default cause an exception to be raised, and no DataFrame will be returned. If False, then these “bad lines”
will be dropped from the DataFrame that is returned. See bad lines below.
warn_bad_lines [boolean, default True] If error_bad_lines is False, and warn_bad_lines is True, a warning for
each “bad line” will be output.
You can indicate the data type for the whole DataFrame or individual columns:
In [11]: print(data)
a,b,c,d
1,2,3,4
5,6,7,8
9,10,11
In [13]: df
Out[13]:
a b c d
0 1 2 3 4
1 5 6 7 8
2 9 10 11 NaN
In [14]: df["a"][0]
Out[14]: '1'
In [16]: df.dtypes
Out[16]:
a int64
b object
c float64
d Int64
dtype: object
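The reads that produced Out[13] and Out[16] are elided; they presumably passed dtype for the whole frame and
per column:
df = pd.read_csv(StringIO(data), dtype=object)  # everything kept as strings, hence df["a"][0] == '1'
df = pd.read_csv(StringIO(data), dtype={"b": object, "c": np.float64, "d": "Int64"})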
Fortunately, pandas offers more than one way to ensure that your column(s) contain only one dtype. If you’re
unfamiliar with these concepts, you can see here to learn more about dtypes, and here to learn more about object
conversion in pandas.
For instance, you can use the converters argument of read_csv():
In [17]: data = "col_1\n1\n2\n'A'\n4.22"
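The read that produced Out[19] is elided; with converters it would be along the lines of:
df = pd.read_csv(StringIO(data), converters={"col_1": str})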
In [19]: df
Out[19]:
col_1
0 1
1 2
2 'A'
3 4.22
In [20]: df["col_1"].apply(type).value_counts()
Out[20]:
<class 'str'> 4
Name: col_1, dtype: int64
Or you can use the to_numeric() function to coerce the dtypes after reading in the data,
In [21]: df2 = pd.read_csv(StringIO(data))
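The coercion step itself is elided; it presumably looked like:
df2["col_1"] = pd.to_numeric(df2["col_1"], errors="coerce")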
In [23]: df2
Out[23]:
col_1
0 1.00
1 2.00
2 NaN
3 4.22
In [24]: df2["col_1"].apply(type).value_counts()
Out[24]:
<class 'float'> 4
Name: col_1, dtype: int64
which will convert all valid parsing to floats, leaving the invalid parsing as NaN.
Ultimately, how you deal with reading in columns containing mixed dtypes depends on your specific needs. In the case
above, if you wanted to NaN out the data anomalies, then to_numeric() is probably your best option. However, if
you wanted for all the data to be coerced, no matter the type, then using the converters argument of read_csv()
would certainly be worth trying.
Note: In some cases, reading in abnormal data with columns containing mixed dtypes will result in an inconsistent
dataset. If you rely on pandas to infer the dtypes of your columns, the parsing engine will go and infer the dtypes for
different chunks of the data, rather than the whole dataset at once. Consequently, you can end up with column(s) with
mixed dtypes. For example,
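(the construction of df and the read-back into mixed_df are elided here; a sketch consistent with the value counts
below:)
df = pd.DataFrame({"col_1": list(range(500000)) + ["a", "b"] + list(range(500000))})
mixed_df = pd.read_csv("foo.csv")  # read back after the to_csv call below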
In [27]: df.to_csv("foo.csv")
In [29]: mixed_df["col_1"].apply(type).value_counts()
Out[29]:
<class 'int'> 737858
<class 'str'> 262144
Name: col_1, dtype: int64
In [30]: mixed_df["col_1"].dtype
Out[30]: dtype('O')
will result with mixed_df containing an int dtype for certain chunks of the column, and str for others due to the
mixed dtypes from the data that was read in. It is important to note that the overall column will be marked with a
dtype of object, which is used for columns with mixed dtypes.
In [32]: pd.read_csv(StringIO(data))
Out[32]:
col1 col2 col3
0 a b 1
1 a b 2
2 c d 3
In [33]: pd.read_csv(StringIO(data)).dtypes
Out[33]:
col1 object
col2 object
col3 int64
dtype: object
Specifying dtype='category' will result in an unordered Categorical whose categories are the unique
values observed in the data. For more control on the categories and order, create a CategoricalDtype ahead of
time, and pass that for that column’s dtype.
Note: With dtype='category', the resulting categories will always be parsed as strings (object dtype). If the
categories are numeric they can be converted using the to_numeric() function, or as appropriate, another converter
such as to_datetime().
When dtype is a CategoricalDtype with homogeneous categories (all numeric, all datetimes, etc.), the
conversion is done automatically.
In [42]: df.dtypes
Out[42]:
col1 category
col2 category
col3 category
dtype: object
In [43]: df["col3"]
Out[43]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, object): ['1', '2', '3']
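The conversion applied between Out[43] and Out[45] is elided; one way to do it, following the note above, is to
rename the categories with to_numeric() (a sketch, not necessarily the original code):
df["col3"] = df["col3"].cat.rename_categories(pd.to_numeric(df["col3"].cat.categories))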
In [45]: df["col3"]
Out[45]:
0 1
1 2
2 3
Name: col3, dtype: category
Categories (3, int64): [1, 2, 3]
A file may or may not have a header row. pandas assumes the first row should be used as the column names:
In [47]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9
In [48]: pd.read_csv(StringIO(data))
Out[48]:
a b c
0 1 2 3
1 4 5 6
2 7 8 9
By specifying the names argument in conjunction with header you can indicate other names to use and whether or
not to throw away the header row (if any):
In [49]: print(data)
a,b,c
1,2,3
4,5,6
7,8,9
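The reads using names are elided after this print; they would look something like:
pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=0)     # replace the existing header row
pd.read_csv(StringIO(data), names=["foo", "bar", "baz"], header=None)  # keep "a,b,c" as a data row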
If the header is in a row other than the first, pass the row number to header. This will skip the preceding rows:
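The example itself is elided; with a junk first line it would look like:
data = "skip this skip it\na,b,c\n1,2,3\n4,5,6\n7,8,9"
pd.read_csv(StringIO(data), header=1)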
Note: Default behavior is to infer the column names: if no names are passed the behavior is identical to header=0
and column names are inferred from the first non-blank line of the file, if column names are passed explicitly then the
behavior is identical to header=None.
If the file or header contains duplicate names, pandas will by default distinguish between them so as to prevent
overwriting data:
In [55]: pd.read_csv(StringIO(data))
Out[55]:
a b a.1
0 0 1 2
1 3 4 5
There is no more duplicate data because mangle_dupe_cols=True by default, which modifies a series of dupli-
cate columns ‘X’, . . . , ‘X’ to become ‘X’, ‘X.1’, . . . , ‘X.N’. If mangle_dupe_cols=False, duplicate data can
arise:
To prevent users from encountering this problem with duplicate data, a ValueError exception is raised if
mangle_dupe_cols != True:
The usecols argument allows you to select any subset of the columns in a file, either using the column names,
position numbers or a callable:
In [57]: pd.read_csv(StringIO(data))
Out[57]:
a b c d
0 1 2 3 foo
1 4 5 6 bar
2 7 8 9 baz
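The selection calls themselves are elided around Out[57]; column names, positions, or a callable can all be used, e.g.:
pd.read_csv(StringIO(data), usecols=["b", "d"])
pd.read_csv(StringIO(data), usecols=[0, 2, 3])
pd.read_csv(StringIO(data), usecols=lambda x: x.upper() in ["A", "B", "C"])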
The usecols argument can also be used to specify which columns not to use in the final result:
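The elided call was presumably a callable exclusion such as:
pd.read_csv(StringIO(data), usecols=lambda x: x not in ["a", "c"])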
In this case, the callable is specifying that we exclude the “a” and “c” columns from the output.
If the comment parameter is specified, then completely commented lines will be ignored. By default, completely
blank lines will be ignored as well.
In [63]: print(data)
a,b,c
# commented line
1,2,3
4,5,6
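The corresponding read is elided; with the comment character it would be:
pd.read_csv(StringIO(data), comment="#")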
Warning: The presence of ignored lines might create ambiguities involving line numbers; the parameter header
uses row numbers (ignoring commented/empty lines), while skiprows uses line numbers (including com-
mented/empty lines):
In [67]: data = "#comment\na,b,c\nA,B,C\n1,2,3"
If both header and skiprows are specified, header will be relative to the end of skiprows. For example:
In [71]: data = (
....: "# empty\n"
....: "# second empty line\n"
....: "# third emptyline\n"
....: "X,Y,Z\n"
....: "1,2,3\n"
....: "A,B,C\n"
....: "1,2.,4.\n"
....: "5.,NaN,10.0\n"
....: )
....:
In [72]: print(data)
# empty
# second empty line
# third emptyline
X,Y,Z
1,2,3
A,B,C
1,2.,4.
5.,NaN,10.0
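The elided read presumably combined comment, skiprows and header, e.g.:
pd.read_csv(StringIO(data), comment="#", skiprows=4, header=1)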
Comments
In [74]: print(open("tmp.csv").read())
ID,level,category
Patient1,123000,x # really unpleasant
Patient2,23000,y # wouldn't take his medicine
Patient3,1234018,z # awesome
In [75]: df = pd.read_csv("tmp.csv")
In [76]: df
Out[76]:
         ID    level                        category
0  Patient1   123000           x # really unpleasant
1  Patient2    23000  y # wouldn't take his medicine
2  Patient3  1234018                     z # awesome
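The step between Out[76] and Out[78] is elided; dropping the trailing comments presumably used the comment
keyword:
df = pd.read_csv("tmp.csv", comment="#")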
In [78]: df
Out[78]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z
The encoding argument should be used for encoded unicode data, which will result in byte strings being decoded
to unicode in the result:
In [83]: df
Out[83]:
word length
0 Träumen 7
1 Grüße 5
In [84]: df["word"][1]
Out[84]: 'Grüße'
Some formats which encode all characters as multiple bytes, like UTF-16, won’t parse correctly at all without speci-
fying the encoding. Full list of Python standard encodings.
If a file has one more column of data than the number of column names, the first column will be used as the
DataFrame’s row names:
In [86]: pd.read_csv(StringIO(data))
Out[86]:
a b c
4 apple bat 5.7
8 orange cow 10.0
Ordinarily, you can achieve this behavior using the index_col option.
There are some exception cases when a file has been prepared with delimiters at the end of each data line, confusing
the parser. To explicitly disable the index column inference and discard the last column, pass index_col=False:
In [90]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,
In [91]: pd.read_csv(StringIO(data))
Out[91]:
a b c
4 apple bat NaN
8 orange cow NaN
If a subset of data is being parsed using the usecols option, the index_col specification is based on that subset,
not the original data.
In [94]: print(data)
a,b,c
4,apple,bat,
8,orange,cow,
Date Handling
To better facilitate working with datetime data, read_csv() uses the keyword arguments parse_dates and
date_parser to allow users to specify a variety of columns and date/time formats to turn the input text data into
datetime objects.
The simplest case is to just pass in parse_dates=True:
In [98]: df
Out[98]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5
It is often the case that we may want to store date and time data separately, or store various date fields separately. The
parse_dates keyword can be used to specify a combination of columns to parse the dates and/or times from.
You can specify a list of column lists to parse_dates, the resulting date columns will be prepended to the output
(so as to not affect the existing column order) and the new column names will be the concatenation of the component
column names:
In [100]: print(open("tmp.csv").read())
KORD,19990127, 19:00:00, 18:56:00, 0.8100
KORD,19990127, 20:00:00, 19:56:00, 0.0100
KORD,19990127, 21:00:00, 20:56:00, -0.5900
KORD,19990127, 21:00:00, 21:18:00, -0.9900
KORD,19990127, 22:00:00, 21:56:00, -0.5900
KORD,19990127, 23:00:00, 22:56:00, -0.5900
In [102]: df
Out[102]:
1_2 1_3 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
By default the parser removes the component date columns, but you can choose to retain them via the
keep_date_col keyword:
In [103]: df = pd.read_csv(
.....: "tmp.csv", header=None, parse_dates=[[1, 2], [1, 3]], keep_date_col=True
.....: )
.....:
In [104]: df
Out[104]:
1_2 1_3 0 1 2 3 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 19990127 19:00:00 18:56:00 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 19990127 20:00:00 19:56:00 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD 19990127 21:00:00 20:56:00 -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD 19990127 21:00:00 21:18:00 -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD 19990127 22:00:00 21:56:00 -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD 19990127 23:00:00 22:56:00 -0.59
Note that if you wish to combine multiple columns into a single date column, a nested list must be used. In other
words, parse_dates=[1, 2] indicates that the second and third columns should each be parsed as separate date
columns while parse_dates=[[1, 2]] means the two columns should be parsed into a single column.
You can also use a dict to specify custom name columns:
In [107]: df
Out[107]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
It is important to remember that if multiple text columns are to be parsed into a single date column, then a new column
is prepended to the data. The index_col specification is based off of this new set of columns rather than the original
data columns:
In [109]: df = pd.read_csv(
.....: "tmp.csv", header=None, parse_dates=date_spec, index_col=0
.....: ) # index is the nominal column
.....:
In [110]: df
Out[110]:
actual 0 4
nominal
1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
Note: If a column or index contains an unparsable date, the entire column or index will be returned unaltered as an
object data type. For non-standard datetime parsing, use to_datetime() after pd.read_csv.
Note: read_csv has a fast path for parsing datetime strings in ISO8601 format, e.g. “2000-01-01T00:01:02+00:00” and
similar variations. If you can arrange for your data to store datetimes in this format, load times will be significantly
faster; ~20x speed-ups have been observed.
Finally, the parser allows you to specify a custom date_parser function to take full advantage of the flexibility of
the date parsing API:
In [111]: df = pd.read_csv(
.....: "tmp.csv", header=None, parse_dates=date_spec, date_parser=pd.to_datetime
.....: )
.....:
In [112]: df
Out[112]:
nominal actual 0 4
0 1999-01-27 19:00:00 1999-01-27 18:56:00 KORD 0.81
1 1999-01-27 20:00:00 1999-01-27 19:56:00 KORD 0.01
2 1999-01-27 21:00:00 1999-01-27 20:56:00 KORD -0.59
3 1999-01-27 21:00:00 1999-01-27 21:18:00 KORD -0.99
4 1999-01-27 22:00:00 1999-01-27 21:56:00 KORD -0.59
5 1999-01-27 23:00:00 1999-01-27 22:56:00 KORD -0.59
pandas will try to call the date_parser function in three different ways. If an exception is raised, the next one is
tried:
1. date_parser is first called with one or more arrays as arguments, as defined using parse_dates (e.g.,
date_parser(['2013', '2013'], ['1', '2'])).
2. If #1 fails, date_parser is called with all the columns concatenated row-wise into a single array (e.g.,
date_parser(['2013 1', '2013 2'])).
3. If #2 fails, date_parser is called once for every row with one or more string arguments from the columns
indicated with parse_dates (e.g., date_parser('2013', '1') for the first row, date_parser('2013',
'2') for the second, and so on).
Note that performance-wise, you should try these methods of parsing dates in order:
1. Try to infer the format using infer_datetime_format=True (see section below).
2. If you know the format, use pd.to_datetime(): date_parser=lambda x: pd.
to_datetime(x, format=...).
3. If you have a really non-standard format, use a custom date_parser function. For optimal performance, this
should be vectorized, i.e., it should accept arrays as arguments.
pandas cannot natively represent a column or index with mixed timezones. If your CSV file contains columns with a
mixture of timezones, the default result will be an object-dtype column with strings, even with parse_dates.
In [115]: df["a"]
Out[115]:
0 2000-01-01 00:00:00+05:00
1 2000-01-01 00:00:00+06:00
Name: a, dtype: object
To parse the mixed-timezone values as a datetime column, pass a partially-applied to_datetime() with
utc=True as the date_parser.
In [116]: df = pd.read_csv(
.....: StringIO(content),
.....: parse_dates=["a"],
.....: date_parser=lambda col: pd.to_datetime(col, utc=True),
.....: )
.....:
In [117]: df["a"]
Out[117]:
0 1999-12-31 19:00:00+00:00
1 1999-12-31 18:00:00+00:00
Name: a, dtype: datetime64[ns, UTC]
If you have parse_dates enabled for some or all of your columns, and your datetime strings are all formatted the
same way, you may get a large speed up by setting infer_datetime_format=True. If set, pandas will attempt
to guess the format of your datetime strings, and then use a faster means of parsing the strings. 5-10x parsing speeds
have been observed. pandas will fallback to the usual parsing if either the format cannot be guessed or the format that
was guessed cannot properly parse the entire column of strings. So in general, infer_datetime_format should
not have any negative consequences if enabled.
Here are some examples of datetime strings that can be guessed (All representing December 30th, 2011 at 00:00:00):
• “20111230”
• “2011/12/30”
• “20111230 00:00:00”
• “12/30/2011 00:00:00”
• “30/Dec/2011 00:00:00”
• “30/December/2011 00:00:00”
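The read that produced the next output is elided; it presumably enabled the option, e.g.:
df = pd.read_csv("foo.csv", index_col=0, parse_dates=True, infer_datetime_format=True)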
In [119]: df
Out[119]:
A B C
date
2009-01-01 a 1 2
2009-01-02 b 3 4
2009-01-03 c 4 5
While US date formats tend to be MM/DD/YYYY, many international formats use DD/MM/YYYY instead. For
convenience, a dayfirst keyword is provided:
In [120]: print(open("tmp.csv").read())
date,value,cat
1/6/2000,5,a
2/6/2000,10,b
3/6/2000,15,c
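The two elided reads would be, roughly:
pd.read_csv("tmp.csv", parse_dates=[0])                 # month first by default
pd.read_csv("tmp.csv", dayfirst=True, parse_dates=[0])  # day first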
In [123]: import io
The parameter float_precision can be specified in order to use a specific floating-point converter during parsing
with the C engine. The options are the ordinary converter, the high-precision converter, and the round-trip converter
(which is guaranteed to round-trip values after writing to a file). For example:
In [129]: abs(
.....: pd.read_csv(
.....: StringIO(data),
.....: engine="c",
.....: float_precision=None,
.....: )["c"][0] - float(val)
.....: )
.....:
Out[129]: 5.551115123125783e-17
In [130]: abs(
.....: pd.read_csv(
.....: StringIO(data),
.....: engine="c",
.....: float_precision="high",
.....: )["c"][0] - float(val)
.....: )
.....:
Out[130]: 5.551115123125783e-17
In [131]: abs(
.....: pd.read_csv(StringIO(data), engine="c", float_precision="round_trip")["c"][0]
.....: - float(val)
.....: )
.....:
Out[131]: 0.0
Thousand separators
For large numbers that have been written with a thousands separator, you can set the thousands keyword to a string
of length 1 so that integers will be parsed correctly:
By default, numbers with a thousands separator will be parsed as strings:
In [132]: print(open("tmp.csv").read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z
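The read that produced Out[134] is elided; it presumably only set the separator:
df = pd.read_csv("tmp.csv", sep="|")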
In [134]: df
Out[134]:
ID level category
0 Patient1 123,000 x
1 Patient2 23,000 y
2 Patient3 1,234,018 z
In [135]: df.level.dtype
Out[135]: dtype('O')
In [136]: print(open("tmp.csv").read())
ID|level|category
Patient1|123,000|x
Patient2|23,000|y
Patient3|1,234,018|z
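The second read (elided) adds the thousands keyword:
df = pd.read_csv("tmp.csv", sep="|", thousands=",")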
In [138]: df
Out[138]:
ID level category
0 Patient1 123000 x
1 Patient2 23000 y
2 Patient3 1234018 z
In [139]: df.level.dtype
Out[139]: dtype('int64')
NA values
To control which values are parsed as missing values (which are signified by NaN), specify a string in na_values.
If you specify a list of strings, then all values in it are considered to be missing values. If you specify a number (a
float, like 5.0 or an integer like 5), the corresponding equivalent values will also imply a missing value (in this
case effectively [5.0, 5] are recognized as NaN).
To completely override the default values that are recognized as missing, specify keep_default_na=False.
The default NaN recognized values are ['-1.#IND', '1.#QNAN', '1.#IND', '-1.#QNAN', '#N/
A N/A', '#N/A', 'N/A', 'n/a', 'NA', '<NA>', '#NA', 'NULL', 'null', 'NaN',
'-NaN', 'nan', '-nan', ''].
pd.read_csv("path_to_file.csv", na_values=[5])
In the example above 5 and 5.0 will be recognized as NaN, in addition to the defaults. A string "5" will first be
interpreted as the numerical 5, and then as a NaN.
pd.read_csv("path_to_file.csv", na_values=["Nope"])
The default values, in addition to the string "Nope", are recognized as NaN.
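A self-contained sketch of how na_values and keep_default_na interact (the data here is illustrative):
data = "a,b,c\n0,NA,Nope\n1,2,3"
pd.read_csv(StringIO(data), na_values=["Nope"])                         # both "NA" and "Nope" become NaN
pd.read_csv(StringIO(data), keep_default_na=False, na_values=["Nope"])  # only "Nope" becomes NaN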
Infinity
inf like values will be parsed as np.inf (positive infinity), and -inf as -np.inf (negative infinity). These will
ignore the case of the value, meaning Inf will also be parsed as np.inf.
Returning Series
Using the squeeze keyword, the parser will return output with a single column as a Series:
In [140]: print(open("tmp.csv").read())
level
Patient1,123000
Patient2,23000
Patient3,1234018
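The read that produced output is elided; with squeeze it would be:
output = pd.read_csv("tmp.csv", squeeze=True)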
In [142]: output
Out[142]:
Patient1 123000
Patient2 23000
Patient3 1234018
Name: level, dtype: int64
In [143]: type(output)
Out[143]: pandas.core.series.Series
Boolean values
The common values True, False, TRUE, and FALSE are all recognized as boolean. Occasionally you might want to
recognize other values as being boolean. To do this, use the true_values and false_values options as follows:
In [145]: print(data)
a,b,c
1,Yes,2
3,No,4
In [146]: pd.read_csv(StringIO(data))
Out[146]:
a b c
0 1 Yes 2
1 3 No 4
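The second read (elided) maps Yes/No onto booleans:
pd.read_csv(StringIO(data), true_values=["Yes"], false_values=["No"])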
Some files may have malformed lines with too few fields or too many. Lines with too few fields will have NA values
filled in the trailing fields. Lines with too many fields will raise an error by default:
In [149]: pd.read_csv(StringIO(data))
---------------------------------------------------------------------------
ParserError                               Traceback (most recent call last)
<ipython-input-149-6388c394e6b8> in <module>
----> 1 pd.read_csv(StringIO(data))
/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.read()
/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_low_memory()
/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._read_rows()
/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._tokenize_rows()
/pandas/pandas/_libs/parsers.pyx in pandas._libs.parsers.raise_parser_error()
ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 4
You can elect to skip bad lines:
In [29]: pd.read_csv(StringIO(data), error_bad_lines=False)
Out[29]:
a b c
0 1 2 3
1 8 9 10
You can also use the usecols parameter to eliminate extraneous column data that appear in some lines but not others:
In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
Out[30]:
a b c
0 1 2 3
1 4 5 6
2 8 9 10
Dialect
The dialect keyword gives greater flexibility in specifying the file format. By default it uses the Excel dialect but
you can specify either the dialect name or a csv.Dialect instance.
Suppose you had data with unenclosed quotes:
In [150]: print(data)
label1,label2,label3
index1,"a,c,e
index2,b,d,f
By default, read_csv uses the Excel dialect and treats the double quote as the quote character, which causes it to
fail when it finds a newline before it finds the closing double quote.
We can get around this using dialect:
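The elided workaround builds a csv.Dialect with quoting disabled, along these lines:
import csv
dia = csv.excel()
dia.quoting = csv.QUOTE_NONE
pd.read_csv(StringIO(data), dialect=dia)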
Another common dialect option is skipinitialspace, to skip any whitespace after a delimiter:
In [158]: print(data)
a, b, c
1, 2, 3
4, 5, 6
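The elided read simply passes the option:
pd.read_csv(StringIO(data), skipinitialspace=True)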
The parsers make every attempt to “do the right thing” and not be fragile. Type inference is a pretty big deal. If a
column can be coerced to integer dtype without altering the contents, the parser will do so. Any non-numeric columns
will come through as object dtype as with the rest of pandas objects.
Quotes (and other escape characters) in embedded fields can be handled in any number of ways. One way is to use
backslashes; to properly parse this data, you should pass the escapechar option:
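For data like the sample printed below, the elided read passes escapechar:
pd.read_csv(StringIO(data), escapechar="\\")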
In [161]: print(data)
a,b
"hello, \"Bob\", nice to see you",5
While read_csv() reads delimited data, the read_fwf() function works with data files that have known and
fixed column widths. The function parameters to read_fwf are largely the same as read_csv with two extra
parameters, and a different usage of the delimiter parameter:
• colspecs: A list of pairs (tuples) giving the extents of the fixed-width fields of each line as half-open intervals
(i.e., [from, to[ ). String value ‘infer’ can be used to instruct the parser to try detecting the column specifications
from the first 100 rows of the data. Default behavior, if not specified, is to infer.
• widths: A list of field widths which can be used instead of ‘colspecs’ if the intervals are contiguous.
• delimiter: Characters to consider as filler characters in the fixed-width file. Can be used to specify the filler
character of the fields if it is not spaces (e.g., ‘~’).
Consider a typical fixed-width data file:
In [163]: print(open("bar.csv").read())
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
In order to parse this file into a DataFrame, we simply need to supply the column specifications to the read_fwf
function along with the file name:
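The elided specification and call are roughly as follows (the exact extents are illustrative, chosen to match the
fixed-width layout above):
colspecs = [(0, 6), (8, 20), (21, 33), (34, 43)]
df = pd.read_fwf("bar.csv", colspecs=colspecs, header=None, index_col=0)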
In [166]: df
Out[166]:
1 2 3
0
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
Note how the parser automatically picks column names X.<column number> when header=None argument is spec-
ified. Alternatively, you can supply just the column widths for contiguous columns:
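The widths variant (elided) would be:
widths = [6, 14, 13, 10]
df = pd.read_fwf("bar.csv", widths=widths, header=None)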
In [169]: df
Out[169]:
0 1 2 3
0 id8141 360.242940 149.910199 11950.7
1 id1594 444.953632 166.985655 11788.4
2 id1849 364.136849 183.628767 11806.2
3 id1230 413.836124 184.375703 11916.8
4 id1948 502.953953 173.237159 12468.3
The parser will take care of extra white spaces around the columns so it’s ok to have extra separation between the
columns in the file.
By default, read_fwf will try to infer the file’s colspecs by using the first 100 rows of the file. It can do it
only in cases when the columns are aligned and correctly separated by the provided delimiter (default delimiter is
whitespace).
In [171]: df
Out[171]:
1 2 3
0
id8141 360.242940 149.910199 11950.7
id1594 444.953632 166.985655 11788.4
id1849 364.136849 183.628767 11806.2
id1230 413.836124 184.375703 11916.8
id1948 502.953953 173.237159 12468.3
read_fwf supports the dtype parameter for specifying the types of parsed columns to be different from the inferred
type.
Indexes
Consider a file with one less entry in the header than the number of data columns:
In [174]: print(open("foo.csv").read())
A,B,C
20090101,a,1,2
20090102,b,3,4
20090103,c,4,5
In this special case, read_csv assumes that the first column is to be used as the index of the DataFrame:
In [175]: pd.read_csv("foo.csv")
Out[175]:
A B C
20090101 a 1 2
20090102 b 3 4
20090103 c 4 5
Note that the dates weren’t automatically parsed. In that case you would need to do as before:
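The elided read would be:
df = pd.read_csv("foo.csv", parse_dates=True)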
In [177]: df.index
Out[177]: DatetimeIndex(['2009-01-01', '2009-01-02', '2009-01-03'], dtype='datetime64[ns]', freq=None)
In [178]: print(open("data/mindex_ex.csv").read())
year,indiv,zit,xit
1977,"A",1.2,.6
1977,"B",1.5,.5
1977,"C",1.7,.8
1978,"A",.2,.06
1978,"B",.7,.2
1978,"C",.8,.3
1978,"D",.9,.5
1978,"E",1.4,.9
1979,"C",.2,.15
1979,"D",.14,.05
1979,"E",.5,.15
1979,"F",1.2,.5
1979,"G",3.4,1.9
1979,"H",5.4,2.7
1979,"I",6.4,1.2
The index_col argument to read_csv can take a list of column numbers to turn multiple columns into a
MultiIndex for the index of the returned object:
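The read that produced the MultiIndex below is elided; it would be:
df = pd.read_csv("data/mindex_ex.csv", index_col=[0, 1])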
In [180]: df
Out[180]:
zit xit
year indiv
1977 A 1.20 0.60
B 1.50 0.50
C 1.70 0.80
1978 A 0.20 0.06
B 0.70 0.20
C 0.80 0.30
D 0.90 0.50
E 1.40 0.90
1979 C 0.20 0.15
D 0.14 0.05
E 0.50 0.15
F 1.20 0.50
G 3.40 1.90
H 5.40 2.70
I 6.40 1.20
In [181]: df.loc[1978]
Out[181]:
zit xit
indiv
A 0.2 0.06
B 0.7 0.20
C 0.8 0.30
D 0.9 0.50
E 1.4 0.90
By specifying list of row locations for the header argument, you can read in a MultiIndex for the columns.
Specifying non-consecutive rows will skip the intervening rows.
In [184]: df.to_csv("mi.csv")
In [185]: print(open("mi.csv").read())
C0,,C_l0_g0,C_l0_g1,C_l0_g2
C1,,C_l1_g0,C_l1_g1,C_l1_g2
C2,,C_l2_g0,C_l2_g1,C_l2_g2
C3,,C_l3_g0,C_l3_g1,C_l3_g2
R0,R1,,,
R_l0_g0,R_l1_g0,R0C0,R0C1,R0C2
R_l0_g1,R_l1_g1,R1C0,R1C1,R1C2
R_l0_g2,R_l1_g2,R2C0,R2C1,R2C2
R_l0_g3,R_l1_g3,R3C0,R3C1,R3C2
R_l0_g4,R_l1_g4,R4C0,R4C1,R4C2
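The read of mi.csv (elided) passes row locations for both header and index_col, e.g.:
pd.read_csv("mi.csv", header=[0, 1, 2, 3], index_col=[0, 1])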
In [187]: print(open("mi2.csv").read())
,a,a,a,b,c,c
,q,r,s,t,u,v
one,1,2,3,4,5,6
two,7,8,9,10,11,12
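The corresponding read of mi2.csv (elided) would be:
pd.read_csv("mi2.csv", header=[0, 1], index_col=0)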
Note: If an index_col is not specified (e.g. you don’t have an index, or wrote it with df.to_csv(...,
index=False)), then any names on the columns index will be lost.
read_csv is capable of inferring delimited (not necessarily comma-separated) files, as pandas uses the csv.
Sniffer class of the csv module. For this, you have to specify sep=None.
In [189]: print(open("tmp2.sv").read())
:0:1:2:3
0:0.4691122999071863:-0.2828633443286633:-1.5090585031735124:-1.1356323710171934
1:1.2121120250208506:-0.17321464905330858:0.11920871129693428:-1.0442359662799567
2:-0.8618489633477999:-2.1045692188948086:-0.4949292740687813:1.071803807037338
3:0.7215551622443669:-0.7067711336300845:-1.0395749851146963:0.27185988554282986
4:-0.42497232978883753:0.567020349793672:0.27623201927771873:-1.0874006912859915
5:-0.6736897080883706:0.1136484096888855:-1.4784265524372235:0.5249876671147047
6:0.4047052186802365:0.5770459859204836:-1.7150020161146375:-1.0392684835147725
7:-0.3706468582364464:-1.1578922506419993:-1.344311812731667:0.8448851414248841
8:1.0757697837155533:-0.10904997528022223:1.6435630703622064:-1.4693879595399115
9:0.35702056413309086:-0.6746001037299882:-1.776903716971867:-0.9689138124473498
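The elided read lets the sniffer work out the ':' delimiter (the Python engine is required when sep=None):
pd.read_csv("tmp2.sv", sep=None, engine="python")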
It’s best to use concat() to combine multiple files. See the cookbook for an example.
Suppose you wish to iterate through a (potentially very large) file lazily rather than reading the entire file into memory,
such as the following:
In [191]: print(open("tmp.sv").read())
|0|1|2|3
0|0.4691122999071863|-0.2828633443286633|-1.5090585031735124|-1.1356323710171934
1|1.2121120250208506|-0.17321464905330858|0.11920871129693428|-1.0442359662799567
2|-0.8618489633477999|-2.1045692188948086|-0.4949292740687813|1.071803807037338
3|0.7215551622443669|-0.7067711336300845|-1.0395749851146963|0.27185988554282986
4|-0.42497232978883753|0.567020349793672|0.27623201927771873|-1.0874006912859915
5|-0.6736897080883706|0.1136484096888855|-1.4784265524372235|0.5249876671147047
6|0.4047052186802365|0.5770459859204836|-1.7150020161146375|-1.0392684835147725
7|-0.3706468582364464|-1.1578922506419993|-1.344311812731667|0.8448851414248841
8|1.0757697837155533|-0.10904997528022223|1.6435630703622064|-1.4693879595399115
9|0.35702056413309086|-0.6746001037299882|-1.776903716971867|-0.9689138124473498
In [193]: table
Out[193]:
Unnamed: 0 0 1 2 3
0 0 0.469112 -0.282863 -1.509059 -1.135632
1 1 1.212112 -0.173215 0.119209 -1.044236
2 2 -0.861849 -2.104569 -0.494929 1.071804
3 3 0.721555 -0.706771 -1.039575 0.271860
4 4 -0.424972 0.567020 0.276232 -1.087401
5 5 -0.673690 0.113648 -1.478427 0.524988
6 6 0.404705 0.577046 -1.715002 -1.039268
7 7 -0.370647 -1.157892 -1.344312 0.844885
8 8 1.075770 -0.109050 1.643563 -1.469388
9 9 0.357021 -0.674600 -1.776904 -0.968914
By specifying a chunksize to read_csv, the return value will be an iterable object of type TextFileReader:
In [194]: with pd.read_csv("tmp.sv", sep="|", chunksize=4) as reader:
.....:     reader
.....:     for chunk in reader:
.....:         print(chunk)
.....:
Changed in version 1.2: read_csv/json/sas return a context-manager when iterating through a file.
Specifying iterator=True will also return the TextFileReader object:
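For example, a sketch mirroring the chunksize example above:
with pd.read_csv("tmp.sv", sep="|", iterator=True) as reader:
    reader.get_chunk(5)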
Under the hood pandas uses a fast and efficient parser implemented in C as well as a Python implementation which is
currently more feature-complete. Where possible pandas uses the C parser (specified as engine='c'), but may fall
back to Python if C-unsupported options are specified. Currently, C-unsupported options include:
• sep other than a single character (e.g. regex separators)
• skipfooter
• sep=None with delim_whitespace=False
Specifying any of the above options will produce a ParserWarning unless the python engine is selected explicitly
using engine='python'.
You can pass in a URL to read or write remote files to many of pandas’ IO functions - the following example shows
reading a CSV file:
df = pd.read_csv("https://download.bls.gov/pub/time.series/cu/cu.item", sep="\t")
All URLs which are not local files or HTTP(s) are handled by fsspec, if installed, and its various filesystem implemen-
tations (including Amazon S3, Google Cloud, SSH, FTP, webHDFS. . . ). Some of these implementations will require
additional packages to be installed, for example S3 URLs require the s3fs library:
df = pd.read_json("s3://pandas-test/adatafile.json")
When dealing with remote storage systems, you might need extra configuration with environment variables or config
files in special locations. For example, to access data in your S3 bucket, you will need to define credentials in one
of the several ways listed in the S3Fs documentation. The same is true for several of the storage backends, and you
should follow the links at fsimpl1 for implementations built into fsspec and fsimpl2 for those not included in the
main fsspec distribution.
You can also pass parameters directly to the backend driver. For example, if you do not have S3 credentials, you can
still access public data by specifying an anonymous connection, such as
New in version 1.2.0.
pd.read_csv(
"s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/SaKe2013"
"-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
storage_options={"anon": True},
)
fsspec also allows complex URLs, for accessing data in compressed archives, local caching of files, and more. To
locally cache the above example, you would modify the call to
pd.read_csv(
"simplecache::s3://ncei-wcsd-archive/data/processed/SH1305/18kHz/"
"SaKe2013-D20130523-T080854_to_SaKe2013-D20130523-T085643.csv",
storage_options={"s3": {"anon": True}},
)
where we specify that the “anon” parameter is meant for the “s3” part of the implementation, not to the caching
implementation. Note that this caches to a temporary directory for the duration of the session only, but you can also
specify a permanent store.
The Series and DataFrame objects have an instance method to_csv which allows storing the contents of the
object as a comma-separated-values file. The function takes a number of arguments. Only the first is required.
• path_or_buf: A string path to the file to write or a file object. If a file object, it must be opened with
newline=''
• sep : Field delimiter for the output file (default “,”)
• na_rep: A string representation of a missing value (default ‘’)
• float_format: Format string for floating point numbers
• columns: Columns to write (default None)
• header: Whether to write out the column names (default True)
• index: whether to write row (index) names (default True)
• index_label: Column label(s) for index column(s) if desired. If None (default), and header and index
are True, then the index names are used. (A sequence should be given if the DataFrame uses MultiIndex).
• mode : Python write mode, default ‘w’
• encoding: a string representing the encoding to use if the contents are non-ASCII, for Python versions prior
to 3
• line_terminator: Character sequence denoting line end (default os.linesep)
• quoting: Set quoting rules as in csv module (default csv.QUOTE_MINIMAL). Note that if you have set
a float_format then floats are converted to strings and csv.QUOTE_NONNUMERIC will treat them as
non-numeric
• quotechar: Character used to quote fields (default ‘”’)
• doublequote: Control quoting of quotechar in fields (default True)
• escapechar: Character used to escape sep and quotechar when appropriate (default None)
• chunksize: Number of rows to write at a time
• date_format: Format string for datetime objects
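For example, a typical call combining several of these arguments might be (file name and values are illustrative):
df.to_csv("out.csv", sep=";", na_rep="NA", float_format="%.2f", index=False)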
The DataFrame object has an instance method to_string which allows control over the string representation of
the object. All arguments are optional:
• buf default None, for example a StringIO object
• columns default None, which columns to write
• col_space default None, minimum width of each column.
• na_rep default NaN, representation of NA value
• formatters default None, a dictionary (by column) of functions each of which takes a single argument and
returns a formatted string
• float_format default None, a function which takes a single (float) argument and returns a formatted string;
to be applied to floats in the DataFrame.
• sparsify default True, set to False for a DataFrame with a hierarchical index to print every MultiIndex key
at each row.
• index_names default True, will print the names of the indices
• index default True, will print the index (ie, row labels)
• header default True, will print the column labels
• justify default left, will print column headers left- or right-justified
The Series object also has a to_string method, but with only the buf, na_rep, float_format arguments.
There is also a length argument which, if set to True, will additionally output the length of the Series.
2.4.2 JSON
Writing JSON
A Series or DataFrame can be converted to a valid JSON string. Use to_json with optional parameters:
• path_or_buf : the pathname or buffer to write the output. This can be None, in which case a JSON string is
returned
• orient :
Series:
– default is index
– allowed values are {split, records, index}
DataFrame:
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string
split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, . . . , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array
• date_format : string, type of date conversion, ‘epoch’ for timestamp, ‘iso’ for ISO8601.
• double_precision : The number of decimal places to use when encoding floating point values, default 10.
• force_ascii : force encoded string to be ASCII, default True.
• date_unit : The time unit to encode to, governs timestamp and ISO8601 precision. One of ‘s’, ‘ms’, ‘us’ or
‘ns’ for seconds, milliseconds, microseconds and nanoseconds respectively. Default ‘ms’.
• default_handler : The handler to call if an object cannot otherwise be converted to a suitable format for
JSON. Takes a single argument, which is the object to convert, and returns a serializable object.
• lines : If records orient, then each record will be written as a separate line of JSON.
Note that NaN, NaT and None will be converted to null and datetime objects will be converted based on the
date_format and date_unit parameters.
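The frame serialized in Out[198] below is not constructed in this excerpt; it is presumably a small random frame, e.g.:
dfj = pd.DataFrame(np.random.randn(5, 2), columns=list("AB"))
json = dfj.to_json()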
In [198]: json
Out[198]: '{"A":{"0":-1.2945235903,"1":0.2766617129,"2":-0.0139597524,"3":-0.0061535699,"4":0.8957173022},"B":{"0":0.4137381054,"1":-0.472034511,"2":-0.3625429925,"3":-0.923060654,"4":0.8052440254}}'
Orient options
There are a number of different options for the format of the resulting JSON file / string. Consider the following
DataFrame and Series:
In [200]: dfjo
Out[200]:
A B C
x 1 4 7
y 2 5 8
z 3 6 9
In [202]: sjo
Out[202]:
x 15
y 16
z 17
Name: D, dtype: int64
Column oriented (the default for DataFrame) serializes the data as nested JSON objects with column labels acting
as the primary index:
In [203]: dfjo.to_json(orient="columns")
Out[203]: '{"A":{"x":1,"y":2,"z":3},"B":{"x":4,"y":5,"z":6},"C":{"x":7,"y":8,"z":9}}'
Index oriented (the default for Series) similar to column oriented but the index labels are now primary:
In [204]: dfjo.to_json(orient="index")
Out[204]: '{"x":{"A":1,"B":4,"C":7},"y":{"A":2,"B":5,"C":8},"z":{"A":3,"B":6,"C":9}}'
In [205]: sjo.to_json(orient="index")
Out[205]: '{"x":15,"y":16,"z":17}'
Record oriented serializes the data to a JSON array of column -> value records; index labels are not included. This is
useful for passing DataFrame data to plotting libraries, for example the JavaScript library d3.js:
In [206]: dfjo.to_json(orient="records")
Out[206]: '[{"A":1,"B":4,"C":7},{"A":2,"B":5,"C":8},{"A":3,"B":6,"C":9}]'
In [207]: sjo.to_json(orient="records")
Out[207]: '[15,16,17]'
Value oriented is a bare-bones option which serializes to nested JSON arrays of values only; column and index labels
are not included:
In [208]: dfjo.to_json(orient="values")
Out[208]: '[[1,4,7],[2,5,8],[3,6,9]]'
Split oriented serializes to a JSON object containing separate entries for values, index and columns. Name is also
included for Series:
In [209]: dfjo.to_json(orient="split")
Out[209]: '{"columns":["A","B","C"],"index":["x","y","z"],"data":[[1,4,7],[2,5,8],[3,6,9]]}'
In [210]: sjo.to_json(orient="split")
Out[210]: '{"name":"D","index":["x","y","z"],"data":[15,16,17]}'
Table oriented serializes to the JSON Table Schema, allowing for the preservation of metadata including but not
limited to dtypes and index names.
Note: Any orient option that encodes to a JSON object will not preserve the ordering of index and column la-
bels during round-trip serialization. If you wish to preserve label ordering use the split option as it uses ordered
containers.
Date handling
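The construction of dfd and the three to_json calls are elided; judging by the outputs they were, in order, ISO
format (default millisecond precision), ISO format with microseconds, and epoch seconds:
json = dfd.to_json(date_format="iso")
json = dfd.to_json(date_format="iso", date_unit="us")
json = dfd.to_json(date_format="epoch", date_unit="s")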
In [215]: json
Out[215]: '{"date":{"0":"2013-01-01T00:00:00.000Z","1":"2013-01-01T00:00:00.000Z","2":"2013-01-01T00:00:00.000Z","3":"2013-01-01T00:00:00.000Z","4":"2013-01-01T00:00:00.000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
In [217]: json
Out[217]: '{"date":{"0":"2013-01-01T00:00:00.000000Z","1":"2013-01-01T00:00:00.000000Z","2":"2013-01-01T00:00:00.000000Z","3":"2013-01-01T00:00:00.000000Z","4":"2013-01-01T00:00:00.000000Z"},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
In [219]: json
Out[219]: '{"date":{"0":1356998400,"1":1356998400,"2":1356998400,"3":1356998400,"4":1356998400},"B":{"0":2.5656459463,"1":1.3403088498,"2":-0.2261692849,"3":0.8138502857,"4":-0.8273169356},"A":{"0":-1.2064117817,"1":1.4312559863,"2":-1.1702987971,"3":0.4108345112,"4":0.1320031703}}'
In [225]: dfj2.to_json("test.json")
"1356998400000":0.4137381054,"1357084800000":-0.472034511,"1357171200000":-0.3625429925,"1357257600000":-0.923060654,"1357344000000":0.8052440254},"date":{"1356998400000":1356998400000,"1357084800000":1356998400000,"1357171200000":1356998400000,"1357257600000":1356998400000,"1357344000000":1356998400000},"ints":{"1356998400000":0,"1357084800000":1,"1357171200000":2,"1357257600000":3,"1357344000000":4},"bools":{"1356998400000":true,"1357084800000":true,"1357171200000":true,"1357257600000":true,"1357344000000":true}}
Fallback behavior
If the JSON serializer cannot handle the container contents directly it will fall back in the following manner:
• if the dtype is unsupported (e.g. np.complex_) then the default_handler, if provided, will be called for
each value, otherwise an exception is raised.
• if an object is unsupported it will attempt the following:
– check if the object has defined a toDict method and call it. A toDict method should return a dict
which will then be JSON serialized.
– invoke the default_handler if one was provided.
– convert the object to a dict by traversing its contents. However this will often fail with an
OverflowError or give unexpected results.
In general the best approach for unsupported objects or dtypes is to provide a default_handler. For example:
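For instance, complex values are not supported directly, but a simple default_handler makes them serializable
(a sketch of the kind of workaround meant here):
pd.DataFrame([1.0, 2.0, complex(1.0, 2.0)]).to_json(default_handler=str)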
Reading JSON
Reading a JSON string to pandas object can take a number of parameters. The parser will try to parse a DataFrame
if typ is not supplied or is None. To explicitly force Series parsing, pass typ=series
• filepath_or_buffer : a VALID JSON string or file handle / StringIO. The string could be a URL. Valid
URL schemes include http, ftp, S3, and file. For file URLs, a host is expected. For instance, a local file could be
file://localhost/path/to/table.json
• typ : type of object to recover (series or frame), default ‘frame’
• orient :
Series :
– default is index
– allowed values are {split, records, index}
DataFrame
– default is columns
– allowed values are {split, records, index, columns, values, table}
The format of the JSON string
split dict like {index -> [index], columns -> [columns], data -> [values]}
records list like [{column -> value}, . . . , {column -> value}]
index dict like {index -> {column -> value}}
columns dict like {column -> {index -> value}}
values just the values array
table adhering to the JSON Table Schema
• dtype : if True, infer dtypes, if a dict of column to dtype, then use those, if False, then don’t infer dtypes at
all, default is True, apply only to the data.
• convert_axes : boolean, try to convert the axes to the proper dtypes, default is True
• convert_dates : a list of columns to parse for dates; If True, then try to parse date-like columns, default
is True.
• keep_default_dates : boolean, default True. If parsing dates, then parse the default date-like columns.
• numpy : direct decoding to NumPy arrays. default is False; Supports numeric data only, although labels may
be non-numeric. Also note that the JSON ordering MUST be the same for each term if numpy=True.
• precise_float : boolean, default False. Set to enable usage of higher precision (strtod) function when
decoding string to double values. Default (False) is to use fast but less precise builtin functionality.
• date_unit : string, the timestamp unit to detect if converting dates. Default None. By default the timestamp
precision will be detected, if this is not desired then pass one of ‘s’, ‘ms’, ‘us’ or ‘ns’ to force timestamp
precision to seconds, milliseconds, microseconds or nanoseconds respectively.
• lines : reads file as one json object per line.
• encoding : The encoding to use to decode py3 bytes.
• chunksize : when used in combination with lines=True, return a JsonReader which reads in chunksize
lines per iteration.
The parser will raise one of ValueError/TypeError/AssertionError if the JSON is not parseable.
If a non-default orient was used when encoding to JSON be sure to pass the same option here so that decoding
produces sensible results, see Orient Options for an overview.
Data conversion
The default of convert_axes=True, dtype=True, and convert_dates=True will try to parse the axes, and
all of the data into appropriate types, including dates. If you need to override specific dtypes, pass a dict to dtype.
convert_axes should only be set to False if you need to preserve string-like numbers (e.g. ‘1’, ‘2’) in an axes.
Note: Large integer values may be converted to dates if convert_dates=True and the data and / or column labels
appear ‘date-like’. The exact threshold depends on the date_unit specified. ‘date-like’ means that the column label
meets one of the following criteria:
• it ends with '_at'
• it ends with '_time'
• it begins with 'timestamp'
• it is 'modified'
• it is 'date'
Warning: When reading JSON data, automatic coercing into dtypes has some quirks:
• an index can be reconstructed in a different order from serialization, that is, the returned order is not guaran-
teed to be the same as before serialization
• a column that was float data will be converted to integer if it can be done safely, e.g. a column of 1.
• bool columns will be converted to integer on reconstruction
Thus there are times where you may want to specify specific dtypes via the dtype keyword argument.
In [228]: pd.read_json(json)
Out[228]:
date B A
0 2013-01-01 2.565646 -1.206412
1 2013-01-01 1.340309 1.431256
2 2013-01-01 -0.226169 -1.170299
3 2013-01-01 0.813850 0.410835
4 2013-01-01 -0.827317 0.132003
In [229]: pd.read_json("test.json")
Out[229]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02  0.276662 -0.472035 2013-01-01     1   True
2013-01-03 -0.013960 -0.362543 2013-01-01     2   True
2013-01-04 -0.006154 -0.923061 2013-01-01     3   True
2013-01-05  0.895717  0.805244 2013-01-01     4   True
Don’t convert any data (but still convert axes and dates):
In [230]: pd.read_json("test.json", dtype=object).dtypes
Out[230]:
A object
B object
date object
ints object
bools object
dtype: object
Preserve string indices:
In [232]: si = pd.DataFrame(
.....: np.zeros((4, 4)), columns=range(4), index=[str(i) for i in range(4)]
.....: )
.....:
In [233]: si
Out[233]:
0 1 2 3
0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0
In [234]: si.index
Out[234]: Index(['0', '1', '2', '3'], dtype='object')
In [235]: si.columns
Out[235]: Int64Index([0, 1, 2, 3], dtype='int64')
In [238]: sij
Out[238]:
0 1 2 3
0 0 0 0 0
In [239]: sij.index
Out[239]: Index(['0', '1', '2', '3'], dtype='object')
In [240]: sij.columns
Out[240]: Index(['0', '1', '2', '3'], dtype='object')
In [243]: dfju
Out[243]:
A B date ints bools
1356998400000000000 -1.294524 0.413738 1356998400000000000 0 True
1357084800000000000 0.276662 -0.472035 1356998400000000000 1 True
1357171200000000000 -0.013960 -0.362543 1356998400000000000 2 True
1357257600000000000 -0.006154 -0.923061 1356998400000000000 3 True
1357344000000000000 0.895717 0.805244 1356998400000000000 4 True
In [245]: dfju
Out[245]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02 0.276662 -0.472035 2013-01-01 1 True
2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
2013-01-05 0.895717 0.805244 2013-01-01 4 True
In [247]: dfju
Out[247]:
A B date ints bools
2013-01-01 -1.294524 0.413738 2013-01-01 0 True
2013-01-02 0.276662 -0.472035 2013-01-01 1 True
2013-01-03 -0.013960 -0.362543 2013-01-01 2 True
2013-01-04 -0.006154 -0.923061 2013-01-01 3 True
2013-01-05 0.895717 0.805244 2013-01-01 4 True
Note: The numpy parameter has been deprecated as of version 1.0.0 and will raise a FutureWarning.
This supports numeric data only. Index and columns labels may be non-numeric, e.g. strings, dates etc.
If numpy=True is passed to read_json an attempt will be made to sniff an appropriate dtype during deserialization
and to subsequently decode directly to NumPy arrays, bypassing the need for intermediate Python objects.
This can provide speedups if you are deserialising a large amount of numeric data:
Warning: Direct NumPy decoding makes a number of assumptions and may fail or produce unexpected output if
these assumptions are not satisfied:
• data is numeric.
• data is uniform. The dtype is sniffed from the first value decoded. A ValueError may be raised, or
incorrect output may be produced if this condition is not satisfied.
• labels are ordered. Labels are only read from the first container, it is assumed that each subsequent row /
column has been encoded in the same order. This should be satisfied if the data was encoded using to_json
but may not be the case if the JSON is from another source.
Normalization
pandas provides a utility function to take a dict or list of dicts and normalize this semi-structured data into a flat table.
In [257]: data = [
.....: {"id": 1, "name": {"first": "Coleen", "last": "Volk"}},
.....: {"name": {"given": "Mose", "family": "Regner"}},
.....: {"id": 2, "name": "Faye Raker"},
.....: ]
.....:
In [258]: pd.json_normalize(data)
Out[258]:
id name.first name.last name.given name.family name
0 1.0 Coleen Volk NaN NaN NaN
1 NaN NaN NaN Mose Regner NaN
2 2.0 NaN NaN NaN NaN Faye Raker
In [259]: data = [
.....: {
.....: "state": "Florida",
.....: "shortname": "FL",
.....: "info": {"governor": "Rick Scott"},
.....: "county": [
.....: {"name": "Dade", "population": 12345},
.....: {"name": "Broward", "population": 40000},
.....: {"name": "Palm Beach", "population": 60000},
.....: ],
.....: },
.....: {
.....: "state": "Ohio",
.....: "shortname": "OH",
.....: "info": {"governor": "John Kasich"},
.....: "county": [
.....: {"name": "Summit", "population": 1234},
.....: {"name": "Cuyahoga", "population": 1337},
.....: ],
.....: },
.....: ]
.....:
In [260]: pd.json_normalize(data, "county", ["state", "shortname", ["info", "governor"]])
Out[260]:
name population state shortname info.governor
0 Dade 12345 Florida FL Rick Scott
1 Broward 40000 Florida FL Rick Scott
2 Palm Beach 60000 Florida FL Rick Scott
3 Summit 1234 Ohio OH John Kasich
4 Cuyahoga 1337 Ohio OH John Kasich
The max_level parameter provides more control over the level at which to end normalization. With max_level=1 the
following snippet normalizes only until the first nesting level of the provided dict.
In [261]: data = [
.....: {
.....: "CreatedBy": {"Name": "User001"},
pandas is able to read and write line-delimited json files that are common in data processing pipelines using Hadoop
or Spark.
For line-delimited json files, pandas can also return an iterator which reads in chunksize lines at a time. This can
be useful for large files or to read from a stream.
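A small sketch of writing line-delimited JSON and reading it back, either all at once or in chunks (assuming the small frame df shown below):

from io import StringIO

jsonl = df.to_json(orient="records", lines=True)

# read it back in one go
pd.read_json(StringIO(jsonl), lines=True)

# or as an iterator of chunksize-row frames
reader = pd.read_json(StringIO(jsonl), lines=True, chunksize=1)
for chunk in reader:
    print(chunk)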
In [265]: df
Out[265]:
a b
0 1 2
1 3 4
Table schema
Table Schema is a spec for describing tabular datasets as a JSON object. The JSON includes information on the field
names, types, and other attributes. You can use the orient table to build a JSON string with two fields, schema and
data.
In [268]: df = pd.DataFrame(
.....: {
.....: "A": [1, 2, 3],
.....: "B": ["a", "b", "c"],
.....: "C": pd.date_range("2016-01-01", freq="d", periods=3),
.....: },
.....: index=pd.Index(range(3), name="idx"),
.....: )
.....:
In [269]: df
Out[269]:
A B C
idx
0 1 a 2016-01-01
1 2 b 2016-01-02
2 3 c 2016-01-03
˓→":["idx"],"pandas_version":"0.20.0"},"data":[{"idx":0,"A":1,"B":"a","C":"2016-01-
˓→01T00:00:00.000Z"},{"idx":1,"A":2,"B":"b","C":"2016-01-02T00:00:00.000Z"},{"idx":2,
˓→"A":3,"B":"c","C":"2016-01-03T00:00:00.000Z"}]}'
The schema field contains the fields key, which itself contains a list of column name to type pairs, including the
Index or MultiIndex (see below for a list of types). The schema field also contains a primaryKey field if the
(Multi)index is unique.
The second field, data, contains the serialized data with the records orient. The index is included, and any
datetimes are ISO 8601 formatted, as required by the Table Schema spec.
The full list of types supported are described in the Table Schema spec. This table shows the mapping from pandas
types:
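The build_table_schema function used below can be imported from pandas.io.json; a sketch of the call, assuming s is a Series of datetimes (matching the output shown):

from pandas.io.json import build_table_schema

s = pd.Series(pd.date_range("2016", periods=4))
build_table_schema(s)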
In [273]: build_table_schema(s)
Out[273]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• datetimes with a timezone (before serializing), include an additional field tz with the time zone name (e.g.
'US/Central').
In [274]: s_tz = pd.Series(pd.date_range("2016", periods=12, tz="US/Central"))
In [275]: build_table_schema(s_tz)
Out[275]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values', 'type': 'datetime', 'tz': 'US/Central'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• Periods are converted to timestamps before serialization, and so have the same behavior of being converted to
UTC. In addition, periods will contain an additional field freq with the period's frequency, e.g. 'A-DEC'.
In [276]: s_per = pd.Series(1, index=pd.period_range("2016", freq="A-DEC", periods=4))
In [277]: build_table_schema(s_per)
Out[277]:
{'fields': [{'name': 'index', 'type': 'datetime', 'freq': 'A-DEC'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
• Categoricals use the any type and an enum constraint listing the set of possible values. Additionally, an
ordered field is included:
In [278]: s_cat = pd.Series(pd.Categorical(["a", "b", "a"]))
In [279]: build_table_schema(s_cat)
Out[279]:
{'fields': [{'name': 'index', 'type': 'integer'},
{'name': 'values',
'type': 'any',
'constraints': {'enum': ['a', 'b']},
'ordered': False}],
'primaryKey': ['index'],
'pandas_version': '0.20.0'}
In [281]: build_table_schema(s_dupe)
Out[281]:
• The primaryKey behavior is the same with MultiIndexes, but in this case the primaryKey is an array:
In [283]: build_table_schema(s_multi)
Out[283]:
{'fields': [{'name': 'level_0', 'type': 'string'},
{'name': 'level_1', 'type': 'integer'},
{'name': 'values', 'type': 'integer'}],
'primaryKey': FrozenList(['level_0', 'level_1']),
'pandas_version': '0.20.0'}
In [284]: df = pd.DataFrame(
.....: {
.....: "foo": [1, 2, 3, 4],
.....: "bar": ["a", "b", "c", "d"],
.....: "baz": pd.date_range("2018-01-01", freq="d", periods=4),
.....: "qux": pd.Categorical(["a", "b", "c", "c"]),
.....: },
.....: index=pd.Index(range(4), name="idx"),
.....: )
.....:
In [285]: df
Out[285]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [286]: df.dtypes
Out[286]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
In [289]: new_df
Out[289]:
foo bar baz qux
idx
0 1 a 2018-01-01 a
1 2 b 2018-01-02 b
2 3 c 2018-01-03 c
3 4 d 2018-01-04 c
In [290]: new_df.dtypes
Out[290]:
foo int64
bar object
baz datetime64[ns]
qux category
dtype: object
Please note that the literal string ‘index’ as the name of an Index is not round-trippable, nor are any names begin-
ning with 'level_' within a MultiIndex. These are used by default in DataFrame.to_json() to indicate
missing values and the subsequent read cannot distinguish the intent.
In [294]: print(new_df.index.name)
None
2.4.3 HTML
Warning: We highly encourage you to read the HTML Table Parsing gotchas below regarding the issues sur-
rounding the BeautifulSoup4/html5lib/lxml parsers.
The top-level read_html() function can accept an HTML string/file/URL and will parse HTML tables into a list of
pandas DataFrames. Let's look at a few examples.
Note: read_html returns a list of DataFrame objects, even if there is only a single table contained in the
HTML content.
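The listing below was produced by a call of this form, using the FDIC failed bank list (the "above URL" referred to in the note that follows):

url = "https://www.fdic.gov/bank/individual/failed/banklist.html"
dfs = pd.read_html(url)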
In [297]: dfs
Out[297]:
[ Bank Name  City  ST  CERT  Acquiring Institution  Closing Date
0  Almena State Bank  Almena  KS  15426  Equity Bank  October 23, 2020
1  First City Bank of Florida  Fort Walton Beach  FL  16748  United Fidelity Bank, fsb  October 16, 2020
2  The First State Bank  Barboursville  WV  14361  MVB Bank, Inc.  April 3, 2020
3  Ericson State Bank  Ericson  NE  18265  Farmers and Merchants Bank  February 14, 2020
4  City National Bank of New Jersey  Newark  NJ  21111  Industrial Bank  November 1, 2019
..  ...  ...  ..  ...  ...  ...
558  Superior Bank, FSB  Hinsdale  IL  32646  Superior Federal, FSB  July 27, 2001
559  Malta National Bank  Malta  OH  6629  North Valley Bank  May 3, 2001
560  First Alliance Bank & Trust Co.  Manchester  NH  34264  Southern New Hampshire Bank & Trust  February 2, 2001
561  National State Bank of Metropolis  Metropolis  IL  3815  Banterra Bank of Marion  December 14, 2000
562  Bank of Honolulu  Honolulu  HI  21029  Bank of the Orient  October 13, 2000
Note: The data from the above URL changes every Monday so the resulting data above and the data below may be
slightly different.
Read in the content of the file from the above URL and pass it to read_html as a string:
In [298]: with open(file_path, "r") as f:
.....: dfs = pd.read_html(f.read())
.....:
In [299]: dfs
Out[299]:
[ Bank Name  City  ...  Closing Date  Updated Date
0  Banks of Wisconsin d/b/a Bank of Kenosha  Kenosha  ...  May 31, 2013  May 31, 2013
1  Central Arizona Bank  Scottsdale  ...  May 14, 2013  May 20, 2013
2  Sunrise Bank  Valdosta  ...  May 10, 2013  May 21, 2013
3  Pisgah Community Bank  Asheville  ...  May 10, 2013  May 14, 2013
4  Douglas County Bank  Douglasville  ...  April 26, 2013  May 16, 2013
In [302]: dfs
Out[302]:
[ Bank Name  City  ...  Closing Date  Updated Date
0  Banks of Wisconsin d/b/a Bank of Kenosha  Kenosha  ...  May 31, 2013  May 31, 2013
1  Central Arizona Bank  Scottsdale  ...  May 14, 2013  May 20, 2013
2  Sunrise Bank  Valdosta  ...  May 10, 2013  May 21, 2013
3  Pisgah Community Bank  Asheville  ...  May 10, 2013  May 14, 2013
4  Douglas County Bank  Douglasville  ...  April 26, 2013  May 16, 2013
..  ...  ...  ...  ...  ...
500  Superior Bank, FSB  Hinsdale  ...  July 27, 2001  June 5, 2012
501  Malta National Bank  Malta  ...  May 3, 2001  November 18, 2002
502  First Alliance Bank & Trust Co.  Manchester  ...  February 2, 2001  February 18, 2003
503  National State Bank of Metropolis  Metropolis  ...  December 14, 2000  March 17, 2005
504  Bank of Honolulu  Honolulu  ...  October 13, 2000  March 17, 2005
Note: The following examples are not run by the IPython evaluator because having so many network-accessing
functions slows down the documentation build. If you spot an error or an example that doesn't run, please do not
hesitate to report it on the pandas GitHub issues page.
Specify a header row (by default <th> or <td> elements located within a <thead> are used to form the column
index, if multiple rows are contained within <thead> then a MultiIndex is created); if specified, the header row is
taken from the data minus the parsed header elements (<th> elements).
dfs = pd.read_html(url, header=0)
Specify converters for columns. This is useful for numerical text data that has leading zeros. By default columns that
are numerical are cast to numeric types and the leading zeros are lost. To avoid this, we can convert these columns to
strings.
url_mcc = "https://en.wikipedia.org/wiki/Mobile_country_code"
dfs = pd.read_html(
url_mcc,
match="Telekom Albania",
header=0,
converters={"MNC": str},
)
Read in pandas to_html output (with some loss of floating point precision):
df = pd.DataFrame(np.random.randn(2, 2))
s = df.to_html(float_format="{0:.40g}".format)
dfin = pd.read_html(s, index_col=0)
The lxml backend will raise an error on a failed parse if that is the only parser you provide. If you only have a single
parser you can provide just a string, but it is considered good practice to pass a list with one string if, for example, the
function expects a sequence of strings. You may use:
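For example (the url and match string are placeholders):

dfs = pd.read_html(url, match="Metcalf Bank", index_col=0, flavor=["lxml"])

# or, equivalently, as a bare string
dfs = pd.read_html(url, match="Metcalf Bank", index_col=0, flavor="lxml")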
However, if you have bs4 and html5lib installed and pass None or ['lxml', 'bs4'] then the parse will most
likely succeed. Note that as soon as a parse succeeds, the function will return.
DataFrame objects have an instance method to_html which renders the contents of the DataFrame as an HTML
table. The function arguments are as in the method to_string described above.
Note: Not all of the possible options for DataFrame.to_html are shown here for brevity’s sake. See
to_html() for the full set of options.
In [304]: df
Out[304]:
0 1
0 -0.184744 0.496971
1 -0.856240 1.857977
float_format takes a Python callable to control the precision of floating point values:
In [307]: print(df.to_html(float_format="{0:.10f}".format))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
<th>1</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>-0.1847438576</td>
<td>0.4969711327</td>
</tr>
<tr>
<th>1</th>
<td>-0.8562396763</td>
<td>1.8579766508</td>
</tr>
</tbody>
</table>
bold_rows will make the row labels bold by default, but you can turn that off:
In [308]: print(df.to_html(bold_rows=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>0</th>
The classes argument provides the ability to give the resulting HTML table CSS classes. Note that these classes
are appended to the existing 'dataframe' class.
In [309]: print(df.to_html(classes=["awesome_table_class", "even_more_awesome_class"]))
The render_links argument provides the ability to add hyperlinks to cells that contain URLs.
New in version 0.24.
In [310]: url_df = pd.DataFrame(
.....: {
.....: "name": ["Python", "pandas"],
.....: "url": ["https://www.python.org/", "https://pandas.pydata.org"],
.....: }
.....: )
.....:
In [311]: print(url_df.to_html(render_links=True))
</tr>
<tr>
<th>1</th>
<td>pandas</td>
<td><a href="https://pandas.pydata.org" target="_blank">https://pandas.pydata.
˓→org</a></td>
</tr>
</tbody>
</table>
Finally, the escape argument allows you to control whether the "<", ">" and "&" characters are escaped in the resulting
HTML (by default it is True). So to get the HTML without escaped characters pass escape=False.
In [312]: df = pd.DataFrame({"a": list("&<>"), "b": np.random.randn(3)})
Escaped:
In [313]: print(df.to_html())
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&amp;</td>
<td>-0.474063</td>
</tr>
<tr>
<th>1</th>
<td>&lt;</td>
<td>-0.230305</td>
</tr>
<tr>
<th>2</th>
<td>&gt;</td>
<td>-0.400654</td>
Not escaped:
In [314]: print(df.to_html(escape=False))
<table border="1" class="dataframe">
<thead>
<tr style="text-align: right;">
<th></th>
<th>a</th>
<th>b</th>
</tr>
</thead>
<tbody>
<tr>
<th>0</th>
<td>&</td>
<td>-0.474063</td>
</tr>
<tr>
<th>1</th>
<td><</td>
<td>-0.230305</td>
</tr>
<tr>
<th>2</th>
<td>></td>
<td>-0.400654</td>
</tr>
</tbody>
</table>
Note: Some browsers may not show a difference in the rendering of the previous two HTML tables.
There are some versioning issues surrounding the libraries that are used to parse HTML tables in the top-level pandas
io function read_html.
Issues with lxml
• Benefits
– lxml is very fast.
– lxml requires Cython to install correctly.
• Drawbacks
– lxml does not make any guarantees about the results of its parse unless it is given strictly valid markup.
– In light of the above, we have chosen to allow you, the user, to use the lxml backend, but this backend
will use html5lib if lxml fails to parse
– It is therefore highly recommended that you install both BeautifulSoup4 and html5lib, so that you will
still get a valid result (provided everything else is valid) even if lxml fails.
Issues with BeautifulSoup4 using lxml as a backend
• The above issues hold here as well since BeautifulSoup4 is essentially just a wrapper around a parser backend.
Issues with BeautifulSoup4 using html5lib as a backend
• Benefits
– html5lib is far more lenient than lxml and consequently deals with real-life markup in a much saner way
rather than just, e.g., dropping an element without notifying you.
– html5lib generates valid HTML5 markup from invalid markup automatically. This is extremely important
for parsing HTML tables, since it guarantees a valid document. However, that does NOT mean that it is
“correct”, since the process of fixing markup does not have a single definition.
– html5lib is pure Python and requires no additional build steps beyond its own installation.
• Drawbacks
– The biggest drawback to using html5lib is that it is slow as molasses. However consider the fact that many
tables on the web are not big enough for the parsing algorithm runtime to matter. It is more likely that the
bottleneck will be in the process of reading the raw text from the URL over the web, i.e., IO (input-output).
For very large tables, this might not be true.
The read_excel() method can read Excel 2007+ (.xlsx) files using the openpyxl Python module. Excel 2003
(.xls) files can be read using xlrd. Binary Excel (.xlsb) files can be read using pyxlsb. The to_excel()
instance method is used for saving a DataFrame to Excel. Generally the semantics are similar to working with csv
data. See the cookbook for some advanced strategies.
Warning: The xlwt package for writing old-style .xls excel files is no longer maintained. The xlrd package is
now only for reading old-style .xls files.
Previously, the default argument engine=None to read_excel() would result in using the xlrd engine in
many cases, including new Excel 2007+ (.xlsx) files. If openpyxl is installed, many of these cases will now
default to using the openpyxl engine. See the read_excel() documentation for more details.
Thus, it is strongly encouraged to install openpyxl to read Excel 2007+ (.xlsx) files. Please do not report
issues when using xlrd to read .xlsx files. This is no longer supported; switch to using openpyxl instead.
Attempting to use the xlwt engine will raise a FutureWarning unless the option io.excel.xls.
writer is set to "xlwt". While this option is now deprecated and will also raise a FutureWarning, it can
be globally set and the warning suppressed. Users are recommended to write .xlsx files using the openpyxl
engine instead.
In the most basic use-case, read_excel takes a path to an Excel file, and the sheet_name indicating which sheet
to parse.
# Returns a DataFrame
pd.read_excel("path_to_file.xls", sheet_name="Sheet1")
ExcelFile class
To facilitate working with multiple sheets from the same file, the ExcelFile class can be used to wrap the file and
can be passed into read_excel. There will be a performance benefit for reading multiple sheets as the file is read
into memory only once.
xlsx = pd.ExcelFile("path_to_file.xls")
df = pd.read_excel(xlsx, "Sheet1")
The sheet_names property will generate a list of the sheet names in the file.
The primary use-case for an ExcelFile is parsing multiple sheets with different parameters:
data = {}
# For when Sheet1's format differs from Sheet2
with pd.ExcelFile("path_to_file.xls") as xls:
data["Sheet1"] = pd.read_excel(xls, "Sheet1", index_col=None, na_values=["NA"])
data["Sheet2"] = pd.read_excel(xls, "Sheet2", index_col=1)
Note that if the same parsing parameters are used for all sheets, a list of sheet names can simply be passed to
read_excel with no loss in performance.
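For example, the following reads both sheets in a single pass over the file and returns a dictionary of DataFrames keyed by sheet name (file and sheet names are illustrative):

data = pd.read_excel(
    "path_to_file.xls", ["Sheet1", "Sheet2"], index_col=None, na_values=["NA"]
)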
ExcelFile can also be called with an xlrd.book.Book object as a parameter. This allows the user to control
how the Excel file is read. For example, sheets can be loaded on demand by calling xlrd.open_workbook() with
on_demand=True, as sketched below.
import xlrd
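A sketch of this pattern (file and sheet names are placeholders):

xlrd_book = xlrd.open_workbook("path_to_file.xls", on_demand=True)
with pd.ExcelFile(xlrd_book) as xls:
    df1 = pd.read_excel(xls, "Sheet1")
    df2 = pd.read_excel(xls, "Sheet2")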
Specifying sheets
# Returns a DataFrame
pd.read_excel("path_to_file.xls", "Sheet1", index_col=None, na_values=["NA"])
# Returns a DataFrame
pd.read_excel("path_to_file.xls", 0, index_col=None, na_values=["NA"])
# Returns a DataFrame
pd.read_excel("path_to_file.xls")
read_excel can read more than one sheet, by setting sheet_name to either a list of sheet names, a list of sheet
positions, or None to read all sheets. Sheets can be specified by sheet index or sheet name, using an integer or string,
respectively.
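For example (sheet names and positions are illustrative):

# Returns a dictionary of DataFrames for the requested sheets
pd.read_excel("path_to_file.xls", sheet_name=["Sheet1", 3])

# Returns all sheets as a dictionary of DataFrames
pd.read_excel("path_to_file.xls", sheet_name=None)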
Reading a MultiIndex
read_excel can read a MultiIndex index, by passing a list of columns to index_col and a MultiIndex
column by passing a list of rows to header. If either the index or columns have serialized level names those will
be read in as well by specifying the rows/columns that make up the levels.
For example, to read in a MultiIndex index without names:
In [315]: df = pd.DataFrame(
.....: {"a": [1, 2, 3, 4], "b": [5, 6, 7, 8]},
.....: index=pd.MultiIndex.from_product([["a", "b"], ["c", "d"]]),
.....: )
.....:
In [316]: df.to_excel("path_to_file.xlsx")
In [318]: df
Out[318]:
a b
a c 1 5
d 2 6
b c 3 7
d 4 8
If the index has level names, they will be parsed as well, using the same parameters.
In [319]: df.index = df.index.set_names(["lvl1", "lvl2"])
In [320]: df.to_excel("path_to_file.xlsx")
In [322]: df
Out[322]:
a b
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
If the source file has both MultiIndex index and columns, lists specifying each should be passed to index_col
and header:
In [323]: df.columns = pd.MultiIndex.from_product([["a"], ["b", "d"]], names=["c1", "c2"])
In [324]: df.to_excel("path_to_file.xlsx")
In [326]: df
Out[326]:
c1 a
c2 b d
lvl1 lvl2
It is often the case that users will insert columns to do temporary computations in Excel and you may not want to read
in those columns. read_excel takes a usecols keyword to allow you to specify a subset of columns to parse.
Changed in version 1.0.0.
Passing in an integer for usecols will no longer work. Please pass in a list of ints from 0 to usecols inclusive
instead.
You can specify a comma-delimited set of Excel columns and ranges as a string:
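For example (sheet name and column ranges are illustrative):

pd.read_excel("path_to_file.xls", "Sheet1", usecols="A,C:E")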
If usecols is a list of integers, then it is assumed to be the file column indices to be parsed.
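For example (column positions are illustrative):

pd.read_excel("path_to_file.xls", "Sheet1", usecols=[0, 2, 3])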
Parsing dates
Datetime-like values are normally automatically converted to the appropriate dtype when reading the excel file. But
if you have a column of strings that look like dates (but are not actually formatted as dates in excel), you can use the
parse_dates keyword to parse those strings to datetimes:
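For example (the column name date_strings is a placeholder):

pd.read_excel("path_to_file.xls", "Sheet1", parse_dates=["date_strings"])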
Cell converters
It is possible to transform the contents of Excel cells via the converters option. For instance, to convert a column
to boolean:
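For example (the column name MyBools is a placeholder):

pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyBools": bool})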
This option handles missing values and treats exceptions in the converters as missing data. Transformations are
applied cell by cell rather than to the column as a whole, so the array dtype is not guaranteed. For instance, a column
of integers with missing values cannot be transformed to an array with integer dtype, because NaN is strictly a float.
You can manually mask missing data to recover integer dtype:
def cfun(x):
return int(x) if x else -1
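The converter can then be applied to the column of interest (the column name MyInts is a placeholder):

pd.read_excel("path_to_file.xls", "Sheet1", converters={"MyInts": cfun})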
Dtype specifications
As an alternative to converters, the type for an entire column can be specified using the dtype keyword, which takes
a dictionary mapping column names to types. To interpret data with no type inference, use the type str or object.
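For example (the column names are placeholders):

pd.read_excel("path_to_file.xls", dtype={"MyInts": "int64", "MyText": str})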
To write a DataFrame object to a sheet of an Excel file, you can use the to_excel instance method. The arguments
are largely the same as to_csv described above, the first argument being the name of the excel file, and the optional
second argument the name of the sheet to which the DataFrame should be written. For example:
df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")
Files with a .xls extension will be written using xlwt and those with a .xlsx extension will be written using
xlsxwriter (if available) or openpyxl.
The DataFrame will be written in a way that tries to mimic the REPL output. The index_label will be placed
in the second row instead of the first. You can place it in the first row by setting the merge_cells option in
to_excel() to False:
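For example:

df.to_excel("path_to_file.xlsx", index_label="label", merge_cells=False)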
In order to write separate DataFrames to separate sheets in a single Excel file, one can pass an ExcelWriter.
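For example, a sketch writing two frames (df1 and df2 are assumed to exist) to separate sheets of one file:

with pd.ExcelWriter("path_to_file.xlsx") as writer:
    df1.to_excel(writer, sheet_name="Sheet1")
    df2.to_excel(writer, sheet_name="Sheet2")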
Note: Wringing a little more performance out of read_excel: internally, Excel stores all numeric data as floats.
Because this can produce unexpected behavior when reading in data, pandas defaults to trying to convert integers to
floats if it doesn't lose information (1.0 --> 1). You can pass convert_float=False to disable this behavior,
which may give a slight performance improvement.
pandas supports writing Excel files to buffer-like objects such as StringIO or BytesIO using ExcelWriter.
bio = BytesIO()
# Seek to the beginning and read to copy the workbook to a variable in memory
bio.seek(0)
workbook = bio.read()
Note: engine is optional but recommended. Setting the engine determines the version of workbook produced.
Setting engine='xlwt' will produce an Excel 2003-format workbook (xls). Using either 'openpyxl' or
'xlsxwriter' will produce an Excel 2007-format workbook (xlsx). If omitted, an Excel 2007-formatted workbook
is produced.
Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the xlwt engine will be removed from
a future version of pandas. This is the only engine in pandas that supports writing to .xls files.
pandas chooses an Excel writer via two methods:
1. the engine keyword argument
2. the filename extension (via the default specified in config options)
By default, pandas uses the XlsxWriter for .xlsx, openpyxl for .xlsm, and xlwt for .xls files. If you have multiple
engines installed, you can set the default engine through setting the config options io.excel.xlsx.writer and
io.excel.xls.writer. pandas will fall back on openpyxl for .xlsx files if Xlsxwriter is not available.
To specify which writer you want to use, you can pass an engine keyword argument to to_excel and to
ExcelWriter. The built-in engines are:
• openpyxl: version 2.4 or higher is required
• xlsxwriter
• xlwt
from pandas import options

options.io.excel.xlsx.writer = "xlsxwriter"
df.to_excel("path_to_file.xlsx", sheet_name="Sheet1")
The look and feel of Excel worksheets created from pandas can be modified using the following parameters on the
DataFrame’s to_excel method.
• float_format : Format string for floating point numbers (default None).
• freeze_panes : A tuple of two integers representing the bottommost row and rightmost column to freeze.
Each of these parameters is one-based, so (1, 1) will freeze the first row and first column (default None).
Using the Xlsxwriter engine provides many options for controlling the format of an Excel worksheet created with
the to_excel method. Excellent examples can be found in the Xlsxwriter documentation here: https://xlsxwriter.
readthedocs.io/working_with_pandas.html
# Returns a DataFrame
pd.read_excel("path_to_file.ods", engine="odf")
Note: Currently pandas only supports reading OpenDocument spreadsheets. Writing is not implemented.
# Returns a DataFrame
pd.read_excel("path_to_file.xlsb", engine="pyxlsb")
Note: Currently pandas only supports reading binary Excel files. Writing is not implemented.
2.4.7 Clipboard
A handy way to grab data is to use the read_clipboard() method, which takes the contents of the clipboard
buffer and passes them to the read_csv method. For instance, you can copy the following text to the clipboard
(CTRL-C on many operating systems):
A B C
x 1 4 p
y 2 5 q
z 3 6 r
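And then import the data directly into a DataFrame by calling:

df = pd.read_clipboard()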
The to_clipboard method can be used to write the contents of a DataFrame to the clipboard. Following which
you can paste the clipboard contents into other applications (CTRL-V on many operating systems). Here we illustrate
writing a DataFrame into clipboard and reading it back.
>>> df = pd.DataFrame(
... {"A": [1, 2, 3], "B": [4, 5, 6], "C": ["p", "q", "r"]}, index=["x", "y", "z"]
... )
>>> df
A B C
x 1 4 p
y 2 5 q
z 3 6 r
>>> df.to_clipboard()
>>> pd.read_clipboard()
A B C
x 1 4 p
y 2 5 q
z 3 6 r
We can see that we got the same content back, which we had earlier written to the clipboard.
Note: You may need to install xclip or xsel (with PyQt5, PyQt4 or qtpy) on Linux to use these methods.
2.4.8 Pickling
All pandas objects are equipped with to_pickle methods which use Python’s cPickle module to save data
structures to disk using the pickle format.
In [327]: df
Out[327]:
c1 a
c2 b d
lvl1 lvl2
In [328]: df.to_pickle("foo.pkl")
The read_pickle function in the pandas namespace can be used to load any pickled pandas object (or any other
pickled object) from file:
In [329]: pd.read_pickle("foo.pkl")
Out[329]:
c1 a
c2 b d
lvl1 lvl2
a c 1 5
d 2 6
b c 3 7
d 4 8
Warning: Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html
Warning: read_pickle() is only guaranteed backwards compatible back to pandas version 0.20.3
In [331]: df
Out[331]:
A B C
0 -0.288267 foo 2013-01-01 00:00:00
In [334]: rt
Out[334]:
A B C
0 -0.288267 foo 2013-01-01 00:00:00
1 -0.084905 foo 2013-01-01 00:00:01
2 0.004772 foo 2013-01-01 00:00:02
3 1.382989 foo 2013-01-01 00:00:03
4 0.343635 foo 2013-01-01 00:00:04
.. ... ... ...
995 -0.220893 foo 2013-01-01 00:16:35
996 0.492996 foo 2013-01-01 00:16:36
997 -0.461625 foo 2013-01-01 00:16:37
998 1.361779 foo 2013-01-01 00:16:38
999 -1.197988 foo 2013-01-01 00:16:39
In [337]: rt
Out[337]:
A B C
0 -0.288267 foo 2013-01-01 00:00:00
1 -0.084905 foo 2013-01-01 00:00:01
2 0.004772 foo 2013-01-01 00:00:02
3 1.382989 foo 2013-01-01 00:00:03
4 0.343635 foo 2013-01-01 00:00:04
.. ... ... ...
995 -0.220893 foo 2013-01-01 00:16:35
996 0.492996 foo 2013-01-01 00:16:36
997 -0.461625 foo 2013-01-01 00:16:37
998 1.361779 foo 2013-01-01 00:16:38
999 -1.197988 foo 2013-01-01 00:16:39
In [338]: df.to_pickle("data.pkl.gz")
In [339]: rt = pd.read_pickle("data.pkl.gz")
In [340]: rt
Out[340]:
A B C
0 -0.288267 foo 2013-01-01 00:00:00
1 -0.084905 foo 2013-01-01 00:00:01
2 0.004772 foo 2013-01-01 00:00:02
3 1.382989 foo 2013-01-01 00:00:03
4 0.343635 foo 2013-01-01 00:00:04
.. ... ... ...
995 -0.220893 foo 2013-01-01 00:16:35
996 0.492996 foo 2013-01-01 00:16:36
997 -0.461625 foo 2013-01-01 00:16:37
998 1.361779 foo 2013-01-01 00:16:38
999 -1.197988 foo 2013-01-01 00:16:39
In [341]: df["A"].to_pickle("s1.pkl.bz2")
In [342]: rt = pd.read_pickle("s1.pkl.bz2")
In [343]: rt
Out[343]:
0 -0.288267
1 -0.084905
2 0.004772
3 1.382989
4 0.343635
...
995 -0.220893
996 0.492996
997 -0.461625
998 1.361779
999 -1.197988
Name: A, Length: 1000, dtype: float64
2.4.9 msgpack
pandas support for msgpack has been removed in version 1.0.0. It is recommended to use pyarrow for on-the-wire
transmission of pandas objects.
Example pyarrow usage:
import pandas as pd
import pyarrow as pa
context = pa.default_serialization_context()
df_bytestring = context.serialize(df).to_buffer().to_pybytes()
HDFStore is a dict-like object which reads and writes pandas objects using the high performance HDF5 format via
the excellent PyTables library. See the cookbook for some advanced strategies.
Warning: pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype data
with pickle. Loading pickled data received from untrusted sources can be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
In [345]: store = pd.HDFStore("store.h5")

In [346]: print(store)
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Objects can be written to the file just like adding key-value pairs to a dict:
In [351]: store["df"] = df
In [352]: store
Out[352]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [356]: store
Out[356]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [357]: store.close()
In [358]: store
Out[358]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [359]: store.is_open
Out[359]: False
# Working with, and automatically closing the store using a context manager
In [360]: with pd.HDFStore("store.h5") as store:
.....: store.keys()
.....:
Read/write API
HDFStore supports a top-level API using read_hdf for reading and to_hdf for writing, similar to how
read_csv and to_csv work.
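A minimal sketch of the top-level API (file and key names are illustrative):

df_tl = pd.DataFrame({"A": list(range(5)), "B": list(range(5))})

df_tl.to_hdf("store_tl.h5", "df_tl", format="table", append=True)
pd.read_hdf("store_tl.h5", "df_tl", where=["index > 2"])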
HDFStore will by default not drop rows that are all missing. This behavior can be changed by setting dropna=True.
In [365]: df_with_missing
Out[365]:
col1 col2
0 0.0 1.0
1 NaN NaN
2 2.0 NaN
In [368]: df_with_missing.to_hdf(
.....: "file.h5", "df_with_missing", format="table", mode="w", dropna=True
.....: )
.....:
Fixed format
The examples above show storing using put, which write the HDF5 to PyTables in a fixed array format, called
the fixed format. These types of stores are not appendable once written (though you can simply remove them and
rewrite). Nor are they queryable; they must be retrieved in their entirety. They also do not support dataframes with
non-unique column names. The fixed format stores offer very fast writing and slightly faster reading than table
stores. This format is specified by default when using put or to_hdf or by format='fixed' or format='f'.
Warning: A fixed format will raise a TypeError if you try to retrieve using a where:
>>> pd.DataFrame(np.random.randn(10, 2)).to_hdf("test_fixed.h5", "df")
>>> pd.read_hdf("test_fixed.h5", "df", where="index>5")
TypeError: cannot pass a where specification when reading a fixed format.
this store must be selected in its entirety
Table format
HDFStore supports another PyTables format on disk, the table format. Conceptually a table is shaped very
much like a DataFrame, with rows and columns. A table may be appended to in the same or other sessions.
In addition, delete and query type operations are supported. This format is specified by format='table' or
format='t' to append or put or to_hdf.
This format can also be set as an option, pd.set_option('io.hdf.default_format', 'table'), to
enable put/append/to_hdf to store in the table format by default.
In [370]: store = pd.HDFStore("store.h5")
In [375]: store
Out[375]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
Note: You can also create a table by passing format='table' or format='t' to a put operation.
Hierarchical keys
Keys to a store can be specified as a string. These can be in a hierarchical path-name like format (e.g. foo/bar/
bah), which will generate a hierarchy of sub-stores (or Groups in PyTables parlance). Keys can be specified without
the leading ‘/’ and are always absolute (e.g. ‘foo’ refers to ‘/foo’). Removal operations can remove everything in the
sub-store and below, so be careful.
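For example (a sketch, assuming a DataFrame df):

store.put("foo/bar/bah", df)
store.append("food/orange", df)
store.append("food/apple", df)

# remove everything under the "food" node
store.remove("food")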
In [381]: store
Out[381]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [384]: store
Out[384]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
You can walk through the group hierarchy using the walk method which will yield a tuple for each group key along
with the relative keys of its contents.
New in version 0.24.0.
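For example, a sketch that prints every group and leaf key in the store:

for (path, subgroups, subkeys) in store.walk():
    for subgroup in subgroups:
        print("GROUP: {}/{}".format(path, subgroup))
    for subkey in subkeys:
        key = "/".join([path, subkey])
        print("KEY: {}".format(key))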
Warning: Hierarchical keys cannot be retrieved as dotted (attribute) access as described above for items stored
under the root node.
In [8]: store.foo.bar.bah
AttributeError: 'HDFStore' object has no attribute 'foo'
# you can directly access the actual PyTables node but using the root node
In [9]: store.root.foo.bar.bah
Out[9]:
/foo/bar/bah (Group) ''
children := ['block0_items' (Array), 'block0_values' (Array), 'axis0' (Array), 'axis1' (Array)]
Storing types
Storing mixed-dtype data is supported. Strings are stored as fixed-width columns using the maximum size of the appended
column. Subsequent attempts at appending longer strings will raise a ValueError.
Passing min_itemsize={'values': size} as a parameter to append will set a larger minimum for the string
columns. Storing floats, strings, ints, bools and datetime64 is currently supported. For string
columns, passing nan_rep = 'nan' to append will change the default nan representation on disk (which con-
verts to/from np.nan); this defaults to nan.
In [391]: df_mixed1
Out[391]:
A B C string int bool datetime64
0 -0.116008 0.743946 -0.398501 string 1 True 2001-01-02
1 0.592375 -0.533097 -0.677311 string 1 True 2001-01-02
2 0.476481 -0.140850 -0.874991 string 1 True 2001-01-02
3 NaN NaN -1.167564 NaN 1 True NaT
4 NaN NaN -0.593353 NaN 1 True NaT
5 0.852727 0.463819 0.146262 string 1 True 2001-01-02
6 -1.177365 0.793644 -0.131959 string 1 True 2001-01-02
7 1.236988 0.221252 0.089012 string 1 True 2001-01-02
In [392]: df_mixed1.dtypes.value_counts()
Out[392]:
float64 2
object 1
bool 1
datetime64[ns] 1
float32 1
int64 1
dtype: int64
Storing MultiIndex DataFrames as tables is very similar to storing/selecting from homogeneous index
DataFrames.
In [394]: index = pd.MultiIndex(
.....: levels=[["foo", "bar", "baz", "qux"], ["one", "two", "three"]],
.....: codes=[[0, 0, 0, 1, 1, 2, 2, 3, 3, 3], [0, 1, 2, 0, 1, 1, 2, 0, 1, 2]],
.....: names=["foo", "bar"],
.....: )
.....:
In [396]: df_mi
Out[396]:
A B C
foo bar
foo one 0.667450 0.169405 -1.358046
two -0.105563 0.492195 0.076693
three 0.213685 -0.285283 -1.210529
bar one -1.408386 0.941577 -0.342447
two 0.222031 0.052607 2.093214
baz two 1.064908 1.778161 -0.913867
three -0.030004 -0.399846 -1.234765
qux one 0.081323 -0.268494 0.168016
two -0.898283 -0.218499 1.408028
three -1.267828 -0.689263 0.520995
In [398]: store.select("df_mi")
Out[398]:
A B C
foo bar
foo one 0.667450 0.169405 -1.358046
two -0.105563 0.492195 0.076693
three 0.213685 -0.285283 -1.210529
Note: The index keyword is reserved and cannot be used as a level name.
Querying
Querying a table
select and delete operations have an optional criterion that can be specified to select/delete only a subset of the
data. This allows one to have a very large on-disk table and retrieve only a portion of the data.
A query is specified using the Term class under the hood, as a boolean expression.
• index and columns are supported indexers of DataFrames.
• if data_columns are specified, these can be used as additional indexers.
• level name in a MultiIndex, with default name level_0, level_1, . . . if not provided.
Valid comparison operators are:
=, ==, !=, >, >=, <, <=
Valid boolean expressions are combined with:
• | : or
• & : and
• ( and ) : for grouping
These rules are similar to how boolean expressions are used in pandas for indexing.
Note:
• = will be automatically expanded to the comparison operator ==
• ~ is the not operator, but can only be used in very limited circumstances
• If a list/tuple of expressions is passed they will be combined via &
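For example, a sketch of a typical where expression combining the index and a column (the key and column names are illustrative):

store.select("df", "index >= '20130102' & columns = ['A', 'B']")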
Note: Passing a string to a query by interpolating it into the query expression is not recommended. Simply assign the
string of interest to a variable and use that variable in an expression. For example, do this
string = "HolyMoly'"
store.select("df", "index == string")
instead of this
string = "HolyMoly'"
store.select('df', f'index == {string}')
The latter will not work and will raise a SyntaxError. Note that there's a single quote followed by a double quote
in the string variable.
If you must interpolate, use the '%r' format specifier
store.select("df", "index == %r" % string)
The columns keyword can be supplied to select a list of columns to be returned; this is equivalent to passing
columns=list_of_columns_to_filter:
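For example:

store.select("df", "columns=['A', 'B']")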
start and stop parameters can be specified to limit the total search space. These are in terms of the total number
of rows in a table.
Note: select will raise a ValueError if the query expression has an unknown variable reference. Usually this
means that you are trying to select on a column that is not a data_column.
select will raise a SyntaxError if the query expression is not valid.
Query timedelta64[ns]
You can store and query using the timedelta64[ns] type. Terms can be specified in the format:
<float>(<unit>), where float may be signed (and fractional), and unit can be D,s,ms,us,ns for the timedelta.
Here’s an example:
In [408]: dftd
Out[408]:
A B C
0 2013-01-01 2013-01-01 00:00:10 -1 days +23:59:50
1 2013-01-01 2013-01-02 00:00:10 -2 days +23:59:50
2 2013-01-01 2013-01-03 00:00:10 -3 days +23:59:50
3 2013-01-01 2013-01-04 00:00:10 -4 days +23:59:50
4 2013-01-01 2013-01-05 00:00:10 -5 days +23:59:50
5 2013-01-01 2013-01-06 00:00:10 -6 days +23:59:50
6 2013-01-01 2013-01-07 00:00:10 -7 days +23:59:50
7 2013-01-01 2013-01-08 00:00:10 -8 days +23:59:50
8 2013-01-01 2013-01-09 00:00:10 -9 days +23:59:50
9 2013-01-01 2013-01-10 00:00:10 -10 days +23:59:50
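The frame above can be appended with data columns and then queried on the timedelta column, for example (a sketch):

store.append("dftd", dftd, data_columns=True)
store.select("dftd", "C < '-3.5D'")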
Query MultiIndex
Selecting from a MultiIndex can be achieved by using the name of the level.
In [411]: df_mi.index.names
Out[411]: FrozenList(['foo', 'bar'])
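For example (a sketch using the df_mi frame stored above):

store.append("df_mi", df_mi)
store.select("df_mi", "foo=baz and bar=two")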
If the MultiIndex levels have no names, the levels are automatically made available via the level_n keyword
with n the level of the MultiIndex you want to select from.
In [415]: df_mi_2
Out[415]:
A B C
foo one 0.856838 1.491776 0.001283
two 0.701816 -1.097917 0.102588
three 0.661740 0.443531 0.559313
bar one -0.459055 -1.222598 -0.455304
two -0.781163 0.826204 -0.530057
baz two 0.296135 1.366810 1.073372
three -0.994957 0.755314 2.119746
qux one -2.628174 -0.089460 -0.133636
two 0.337920 -0.634027 0.421107
three 0.604303 1.053434 1.109090
# the levels are automatically included as data columns with keyword level_n
In [417]: store.select("df_mi_2", "level_0=foo and level_1=two")
Out[417]:
A B C
foo two 0.701816 -1.097917 0.102588
Indexing
You can create/modify an index for a table with create_table_index after data is already in the table (after an
append/put operation). Creating a table index is highly encouraged. This will speed your queries a great deal
when you use a select with the indexed dimension as the where.
Note: Indexes are automagically created on the indexables and any data columns you specify. This behavior can be
turned off by passing index=False to append.
In [421]: i = store.root.df.table.cols.index.index
Oftentimes when appending large amounts of data to a store, it is useful to turn off index creation for each append,
then recreate at the end.
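A sketch of this pattern (df_1 and df_2 are assumed to be frames with a column B):

st = pd.HDFStore("appends.h5", mode="w")
st.append("df", df_1, data_columns=["B"], index=False)
st.append("df", df_2, data_columns=["B"], index=False)

# create the index on B once all the data is in
st.create_table_index("df", columns=["B"], optlevel=9, kind="full")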
In [428]: st.get_storer("df").table
Out[428]:
/df/table (Table(20,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": Float64Col(shape=(1,), dflt=0.0, pos=1),
"B": Float64Col(shape=(), dflt=0.0, pos=2)}
byteorder := 'little'
chunkshape := (2730,)
In [430]: st.get_storer("df").table
Out[430]:
/df/table (Table(20,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
In [431]: st.close()
You can designate (and index) certain columns that you want to be able to perform queries on (other than the indexable
columns, which you can always query). For instance, say you want to perform this common operation on-disk, and
return just the frame that matches this query. You can specify data_columns = True to force all columns to be
data_columns.
In [432]: df_dc = df.copy()
In [438]: df_dc
Out[438]:
A B C string string2
2000-01-01 1.334065 0.521036 0.930384 foo cool
2000-01-02 -1.613932 1.000000 1.000000 foo cool
2000-01-03 -0.585314 1.000000 1.000000 foo cool
2000-01-04 0.632369 -1.249657 0.975593 foo cool
2000-01-05 1.060617 -0.143682 0.218423 NaN cool
2000-01-06 3.050329 1.317933 -0.963725 NaN cool
2000-01-07 -0.539452 -0.771133 0.023751 foo cool
2000-01-08 0.649464 -1.736427 0.197288 bar cool
# on-disk operations
In [439]: store.append("df_dc", df_dc, data_columns=["B", "C", "string", "string2"])
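Once the data columns are declared, they can be used directly in where expressions, for example (a sketch):

store.select("df_dc", where="B > 0")

# getting creative
store.select("df_dc", "B > 0 & C > 0 & string == foo")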
There is some performance degradation by making lots of columns into data columns, so it is up to the user to
designate these. In addition, you cannot change data columns (nor indexables) after the first append/put operation (Of
course you can simply read in the data and create a new table!).
Iterator
Note: You can also use the iterator with read_hdf which will open, then automatically close the store when finished
iterating.
Note that the chunksize keyword applies to the source rows. So if you are doing a query, then the chunksize will
subdivide the total rows in the table and the query applied, returning an iterator on potentially unequal sized chunks.
Here is a recipe for generating a query and using it to create equal sized return chunks.
In [446]: dfeq
Out[446]:
number
0 1
1 2
2 3
3 4
4 5
5 6
6 7
7 8
8 9
9 10
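A sketch of the recipe (the store and dfeq are those shown above):

store.append("dfeq", dfeq, data_columns=["number"])

def chunks(l, n):
    return [l[i : i + n] for i in range(0, len(l), n)]

evens = [2, 4, 6, 8, 10]
coordinates = store.select_as_coordinates("dfeq", "number=evens")
for c in chunks(coordinates, 2):
    print(store.select("dfeq", where=c))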
Advanced queries
To retrieve a single indexable or data column, use the method select_column. This will, for example, enable you
to get the index very quickly. These return a Series of the result, indexed by the row number. These do not currently
accept the where selector.
In [452]: store.select_column("df_dc", "index")
Out[452]:
0 2000-01-01
1 2000-01-02
2 2000-01-03
3 2000-01-04
4 2000-01-05
5 2000-01-06
6 2000-01-07
7 2000-01-08
Name: index, dtype: datetime64[ns]
Selecting coordinates
Sometimes you want to get the coordinates (a.k.a the index locations) of your query. This returns an Int64Index
of the resulting locations. These coordinates can also be passed to subsequent where operations.
In [454]: df_coord = pd.DataFrame(
.....: np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
.....: )
.....:
In [457]: c
Out[457]:
Int64Index([732, 733, 734, 735, 736, 737, 738, 739, 740, 741,
...
990, 991, 992, 993, 994, 995, 996, 997, 998, 999],
dtype='int64', length=268)
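The coordinates above come from a query of this form, and they can be fed straight back into a select (a sketch):

store.append("df_coord", df_coord)
c = store.select_as_coordinates("df_coord", "index > 20020101")
store.select("df_coord", where=c)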
Sometimes your query can involve creating a list of rows to select. Usually this mask would be a resulting index
from an indexing operation. This example selects the months of a DatetimeIndex which are 5.
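A sketch of the pattern (frame and key names are illustrative):

df_mask = pd.DataFrame(
    np.random.randn(1000, 2), index=pd.date_range("20000101", periods=1000)
)
store.append("df_mask", df_mask)

c = store.select_column("df_mask", "index")
where = c[pd.DatetimeIndex(c).month == 5].index
store.select("df_mask", where=where)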
Storer object
If you want to inspect the stored object, retrieve via get_storer. You could use this programmatically to say get
the number of rows in an object.
In [464]: store.get_storer("df_dc").nrows
Out[464]: 8
The methods append_to_multiple and select_as_multiple can perform appending/selecting from mul-
tiple tables at once. The idea is to have one table (call it the selector table) that you index most/all of the columns, and
perform your queries. The other table(s) are data tables with an index matching the selector table’s index. You can
then perform a very fast query on the selector table, yet get lots of data back. This method is similar to having a very
wide table, but enables more efficient queries.
The append_to_multiple method splits a given single DataFrame into multiple tables according to d, a dictio-
nary that maps the table names to a list of ‘columns’ you want in that table. If None is used in place of a list, that
table will have the remaining unspecified columns of the given DataFrame. The argument selector defines which
table is the selector table (which you can make queries from). The argument dropna will drop rows from the input
DataFrame to ensure tables are synchronized. This means that if a row for one of the tables being written to is
entirely np.NaN, that row will be dropped from all tables.
If dropna is False, THE USER IS RESPONSIBLE FOR SYNCHRONIZING THE TABLES. Remember that
entirely np.NaN rows are not written to the HDFStore, so if you choose to call dropna=False, some tables may
have more rows than others, and therefore select_as_multiple may not work or it may return unexpected
results.
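For example, a sketch that splits a wide frame into a selector table holding A and B and a data table holding the remaining columns:

df_mt = pd.DataFrame(
    np.random.randn(8, 6),
    index=pd.date_range("1/1/2000", periods=8),
    columns=["A", "B", "C", "D", "E", "F"],
)
df_mt["foo"] = "bar"

store.append_to_multiple(
    {"df1_mt": ["A", "B"], "df2_mt": None}, df_mt, selector="df1_mt"
)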
In [469]: store
Out[469]:
<class 'pandas.io.pytables.HDFStore'>
File path: store.h5
In [471]: store.select("df2_mt")
Out[471]:
C D E F foo
2000-01-01 1.602451 -0.221229 0.712403 0.465927 bar
2000-01-02 -0.525571 0.851566 -0.681308 -0.549386 bar
2000-01-03 -0.044171 1.396628 1.041242 -1.588171 bar
2000-01-04 0.463351 -0.861042 -2.192841 -1.025263 bar
2000-01-05 -1.954845 -1.712882 -0.204377 -1.608953 bar
2000-01-06 1.601542 -0.417884 -2.757922 -0.307713 bar
2000-01-07 -1.935461 1.007668 0.079529 -1.459471 bar
2000-01-08 -1.057072 -0.864360 -1.124870 1.732966 bar
# as a multiple
In [472]: store.select_as_multiple(
.....: ["df1_mt", "df2_mt"],
.....: where=["A>0", "B>0"],
.....: selector="df1_mt",
.....: )
.....:
Out[472]:
A B C D E F foo
2000-01-05 1.043605 1.798494 -1.954845 -1.712882 -0.204377 -1.608953 bar
2000-01-07 0.150568 0.754820 -1.935461 1.007668 0.079529 -1.459471 bar
You can delete from a table selectively by specifying a where. In deleting rows, it is important to understand that
PyTables deletes rows by erasing the rows, then moving the following data. Thus deleting can potentially be a very
expensive operation depending on the orientation of your data. To get optimal performance, it's worthwhile to have
the dimension you are deleting be the first of the indexables.
Data is ordered (on the disk) in terms of the indexables. Here’s a simple use case. You store panel-type data, with
dates in the major_axis and ids in the minor_axis. The data is then interleaved like this:
• date_1
– id_1
– id_2
– .
– id_n
• date_2
– id_1
– .
– id_n
It should be clear that a delete operation on the major_axis will be fairly quick, as one chunk is removed, then the
following data moved. On the other hand a delete operation on the minor_axis will be very expensive. In this case
it would almost certainly be faster to rewrite the table using a where that selects all but the missing data.
Warning: Please note that HDF5 DOES NOT RECLAIM SPACE in the h5 files automatically. Thus, repeatedly
deleting (or removing nodes) and adding again, WILL TEND TO INCREASE THE FILE SIZE.
To repack and clean the file, use ptrepack.
Compression
PyTables allows the stored data to be compressed. This applies to all kinds of stores, not just tables. Two parameters
are used to control compression: complevel and complib.
• complevel specifies if and how hard data is to be compressed. complevel=0 and complevel=None
disable compression, while 0 < complevel < 10 enables compression.
• complib specifies which compression library to use. If nothing is specified the default library zlib is used.
A compression library usually optimizes for either good compression rates or speed and the results will depend
on the type of data. Which type of compression to choose depends on your specific needs and data. The list of
supported compression libraries:
– zlib: The default compression library. A classic in terms of compression, achieves good compression rates
but is somewhat slow.
– lzo: Fast compression and decompression.
– bzip2: Good compression rates.
– blosc: Fast compression and decompression.
Support for alternative blosc compressors:
* blosc:zstd: An extremely well balanced codec; it provides the best compression ratios among the
others above, and at reasonably fast speed.
If complib is defined as something other than the listed libraries a ValueError exception is issued.
Note: If the library specified with the complib option is missing on your platform, compression defaults to zlib
without further ado.
store_compressed = pd.HDFStore(
"store_compressed.h5", complevel=9, complib="blosc:blosclz"
)
Or on-the-fly compression (this only applies to tables) in stores where compression is not enabled:
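For example (a sketch):

store.append("df", df, complib="zlib", complevel=5)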
ptrepack
PyTables offers better write performance when tables are compressed after they are written, as opposed to turning on
compression at the very beginning. You can use the supplied PyTables utility ptrepack. In addition, ptrepack
can change compression levels after the fact.
Furthermore ptrepack in.h5 out.h5 will repack the file to allow you to reuse previously deleted space. Alter-
natively, one can simply remove the file and write again, or use the copy method.
Caveats
Warning: HDFStore is not threadsafe for writing. The underlying PyTables only supports concurrent
reads (via threading or processes). If you need reading and writing at the same time, you need to serialize these
operations in a single thread in a single process. You will corrupt your data otherwise. See (GH2397) for more
information.
• If you use locks to manage write access between multiple processes, you may want to use fsync() before
releasing write locks. For convenience you can use store.flush(fsync=True) to do this for you.
• Once a table is created, its columns (DataFrame) are fixed; only exactly the same columns can be appended
• Be aware that timezones (e.g., pytz.timezone('US/Eastern')) are not necessarily equal across time-
zone versions. So if data is localized to a specific timezone in the HDFStore using one version of a timezone
library and that data is updated with another version, the data will be converted to UTC since these timezones
are not considered equal. Either use the same version of timezone library or use tz_convert with the updated
timezone definition.
Warning: PyTables will show a NaturalNameWarning if a column name cannot be used as an attribute
selector. Natural identifiers contain only letters, numbers, and underscores, and may not begin with a number.
Other identifiers cannot be used in a where clause and are generally a bad idea.
DataTypes
HDFStore will map an object dtype to the PyTables underlying dtype. This means the following types are known
to work:
Categorical data
You can write data that contains category dtypes to a HDFStore. Queries work the same as if it was an object
array. However, the category dtyped data is stored in a more efficient manner.
In [473]: dfcat = pd.DataFrame(
.....: {"A": pd.Series(list("aabbcdba")).astype("category"), "B": np.random.randn(8)}
.....: )
.....:
In [474]: dfcat
Out[474]:
A B
0 a 0.477849
1 a 0.283128
2 b -2.045700
3 b -0.338206
4 c -0.423113
5 d 2.314361
6 b -0.033100
7 a -0.965461
In [475]: dfcat.dtypes
Out[475]:
A category
B float64
dtype: object
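The result below comes from appending with A declared as a data column and then selecting on it, for example (a sketch):

store.append("dfcat", dfcat, data_columns=["A"])
result = store.select("dfcat", "A in ['b', 'c']")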
In [479]: result
Out[479]:
A B
In [480]: result.dtypes
Out[480]:
A category
B float64
dtype: object
String columns
min_itemsize
The underlying implementation of HDFStore uses a fixed column width (itemsize) for string columns. A string
column itemsize is calculated as the maximum of the length of data (for that column) that is passed to the HDFStore
in the first append. If a subsequent append introduces a string for a column larger than the column can hold, an
Exception will be raised (otherwise you could have a silent truncation of these columns, leading to loss of information).
In the future we may relax this and allow a user-specified truncation to occur.
Pass min_itemsize on the first table creation to specify a priori the minimum length of a particular string column.
min_itemsize can be an integer, or a dict mapping a column name to an integer. You can pass values as a key
to allow all indexables or data_columns to have this min_itemsize.
Passing a min_itemsize dict will cause all passed columns to be created as data_columns automatically.
Note: If you are not passing any data_columns, then the min_itemsize will be the maximum of the length of
any string passed
In [482]: dfs
Out[482]:
A B
0 foo bar
1 foo bar
2 foo bar
3 foo bar
4 foo bar
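The tables inspected below were created by appends of this form (a sketch; the "values" key applies the minimum to all value columns, while naming a specific column also turns it into a data column):

# a minimum itemsize for all string value columns
store.append("dfs", dfs, min_itemsize={"values": 30})

# a minimum itemsize for column A only; A also becomes a data column
store.append("dfs2", dfs, min_itemsize={"A": 30})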
In [484]: store.get_storer("dfs").table
Out[484]:
/dfs/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=30, shape=(2,), dflt=b'', pos=1)}
byteorder := 'little'
chunkshape := (963,)
autoindex := True
In [486]: store.get_storer("dfs2").table
Out[486]:
/dfs2/table (Table(5,)) ''
description := {
"index": Int64Col(shape=(), dflt=0, pos=0),
"values_block_0": StringCol(itemsize=3, shape=(1,), dflt=b'', pos=1),
"A": StringCol(itemsize=30, shape=(), dflt=b'', pos=2)}
byteorder := 'little'
chunkshape := (1598,)
autoindex := True
colindexes := {
"index": Index(6, medium, shuffle, zlib(1)).is_csi=False,
"A": Index(6, medium, shuffle, zlib(1)).is_csi=False}
nan_rep
String columns will serialize a np.nan (a missing value) with the nan_rep string representation. This defaults to
the string value nan. You could inadvertently turn an actual nan value into a missing value.
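A sketch of the behavior shown below (key names are illustrative):

dfss = pd.DataFrame({"A": ["foo", "bar", "nan"]})

store.append("dfss", dfss)
store.select("dfss")  # the literal string "nan" comes back as NaN

# use a different representation on disk so the string "nan" survives
store.append("dfss2", dfss, nan_rep="_nan_")
store.select("dfss2")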
In [488]: dfss
Out[488]:
A
0 foo
1 bar
2 nan
In [490]: store.select("dfss")
Out[490]:
A
0 foo
1 bar
2 NaN
In [492]: store.select("dfss2")
Out[492]:
A
0 foo
1 bar
2 nan
External compatibility
HDFStore writes table format objects in specific formats suitable for producing loss-less round trips to pandas
objects. For external compatibility, HDFStore can read native PyTables format tables.
It is possible to write an HDFStore object that can easily be imported into R using the rhdf5 library (Package
website). Create a table format store like this:
In [493]: df_for_r = pd.DataFrame(
.....: {
.....: "first": np.random.rand(100),
.....: "second": np.random.rand(100),
.....: "class": np.random.randint(0, 2, (100,)),
.....: },
.....: index=range(100),
.....: )
.....:
In [494]: df_for_r.head()
Out[494]:
first second class
0 0.864919 0.852910 0
1 0.030579 0.412962 1
2 0.015226 0.978410 0
3 0.498512 0.686761 0
4 0.232163 0.328185 1
In [497]: store_export
Out[497]:
<class 'pandas.io.pytables.HDFStore'>
File path: export.h5
In R this file can be read into a data.frame object using the rhdf5 library. The following example function reads
the corresponding column names and data values from the values and assembles them into a data.frame:
# Load values and column names for all datasets from corresponding nodes and
# insert them into one data.frame object.
library(rhdf5)

loadhdf5data <- function(h5File) {
  # (body elided in this excerpt: read the value and column-name nodes with
  # h5read and bind them together into a single data.frame)
  return(data)
}
Note: The R function lists the entire HDF5 file’s contents and assembles the data.frame object from all matching
nodes, so use this only as a starting point if you have stored multiple DataFrame objects to a single HDF5 file.
Performance
• The tables format comes with a writing performance penalty as compared to fixed stores. The benefit is the
ability to append/delete and query (potentially very large amounts of data). Write times are generally longer as
compared with regular stores. Query times can be quite fast, especially on an indexed axis.
• You can pass chunksize=<int> to append, specifying the write chunksize (default is 50000). This will
significantly lower your memory usage on writing.
• You can pass expectedrows=<int> to the first append, to set the TOTAL number of rows that PyTables
will expect. This will optimize read/write performance.
• Duplicate rows can be written to tables, but are filtered out in selection (with the last items being selected; thus
a table is unique on major, minor pairs)
• A PerformanceWarning will be raised if you are attempting to store types that will be pickled by PyTables
(rather than stored as endemic types). See Here for more information and some solutions.
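A sketch of how the chunksize and expectedrows options mentioned above are passed (the numbers are illustrative):
store.append("df", df, chunksize=100000, expectedrows=1000000)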
2.4.11 Feather
Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data frames
efficient, and to make sharing data across data analysis languages easy.
Feather is designed to faithfully serialize and de-serialize DataFrames, supporting all of the pandas dtypes, including
extension dtypes such as categorical and datetime with tz.
Several caveats:
• The format will NOT write an Index, or MultiIndex for the DataFrame and will raise an error if a non-
default one is provided. You can .reset_index() to store the index or .reset_index(drop=True)
to ignore it.
• Duplicate column names and non-string column names are not supported.
• Actual Python objects in object dtype columns are not supported. These will raise a helpful error message on an
attempt at serialization.
See the Full Documentation.
In [498]: df = pd.DataFrame(
.....: {
.....: "a": list("abc"),
.....: "b": list(range(1, 4)),
.....: "c": np.arange(3, 6).astype("u1"),
.....: "d": np.arange(4.0, 7.0, dtype="float64"),
.....: "e": [True, False, True],
.....: "f": pd.Categorical(list("abc")),
.....: "g": pd.date_range("20130101", periods=3),
.....: "h": pd.date_range("20130101", periods=3, tz="US/Eastern"),
.....: "i": pd.date_range("20130101", periods=3, freq="ns"),
.....: }
.....: )
.....:
In [499]: df
Out[499]:
a b c d e f g h i
0 a 1 3 4.0 True a 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
In [500]: df.dtypes
Out[500]:
a object
b int64
c uint8
d float64
e bool
f category
g datetime64[ns]
h datetime64[ns, US/Eastern]
i datetime64[ns]
dtype: object
In [501]: df.to_feather("example.feather")
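The read-back that produces result below is not shown; it would be:
result = pd.read_feather("example.feather")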
In [503]: result
Out[503]:
a b c d e f g h i
0 a 1 3 4.0 True a 2013-01-01 2013-01-01 00:00:00-05:00 2013-01-01 00:00:00.000000000
# we preserve dtypes
In [504]: result.dtypes
Out[504]:
a object
b int64
c uint8
d float64
e bool
f category
g datetime64[ns]
h datetime64[ns, US/Eastern]
i datetime64[ns]
dtype: object
2.4.12 Parquet
Apache Parquet provides a partitioned binary columnar serialization for data frames. It is designed to make reading and
writing data frames efficient, and to make sharing data across data analysis languages easy. Parquet can use a variety
of compression techniques to shrink the file size as much as possible while still maintaining good read performance.
Parquet is designed to faithfully serialize and de-serialize DataFrame s, supporting all of the pandas dtypes, includ-
ing extension dtypes such as datetime with tz.
Several caveats:
• Duplicate column names and non-string column names are not supported.
• The pyarrow engine always writes the index to the output, but fastparquet only writes non-default in-
dexes. This extra column can cause problems for non-pandas consumers that are not expecting it. You can force
including or omitting indexes with the index argument, regardless of the underlying engine.
• Index level names, if specified, must be strings.
• In the pyarrow engine, categorical dtypes for non-string types can be serialized to parquet, but will de-serialize
as their primitive dtype.
• The pyarrow engine preserves the ordered flag of categorical dtypes with string types. fastparquet
does not preserve the ordered flag.
• Non supported types include Interval and actual Python object types. These will raise a helpful error mes-
sage on an attempt at serialization. Period type is supported with pyarrow >= 0.16.0.
• The pyarrow engine preserves extension data types such as the nullable integer and string data type (requiring
pyarrow >= 0.16.0, and requiring the extension type to implement the needed protocols, see the extension types
documentation).
You can specify an engine to direct the serialization. This can be one of pyarrow, fastparquet, or auto.
If the engine is NOT specified, then the pd.options.io.parquet.engine option is checked; if this is also
auto, then pyarrow is tried, falling back to fastparquet.
See the documentation for pyarrow and fastparquet.
Note: These engines are very similar and should read/write nearly identical parquet format files. Currently pyarrow
does not support timedelta data, and fastparquet>=0.1.4 supports timezone-aware datetimes. The libraries differ
in their underlying dependencies (fastparquet uses numba, while pyarrow uses a C library).
In [505]: df = pd.DataFrame(
.....: {
.....: "a": list("abc"),
.....: "b": list(range(1, 4)),
.....: "c": np.arange(3, 6).astype("u1"),
.....: "d": np.arange(4.0, 7.0, dtype="float64"),
.....: "e": [True, False, True],
.....: "f": pd.date_range("20130101", periods=3),
.....: "g": pd.date_range("20130101", periods=3, tz="US/Eastern"),
.....: "h": pd.Categorical(list("abc")),
.....: "i": pd.Categorical(list("abc"), ordered=True),
.....: }
.....: )
.....:
In [506]: df
Out[506]:
a b c d e f g h i
0 a 1 3 4.0 True 2013-01-01 2013-01-01 00:00:00-05:00 a a
1 b 2 4 5.0 False 2013-01-02 2013-01-02 00:00:00-05:00 b b
2 c 3 5 6.0 True 2013-01-03 2013-01-03 00:00:00-05:00 c c
In [507]: df.dtypes
Out[507]:
a object
b int64
c uint8
d float64
e bool
f datetime64[ns]
g datetime64[ns, US/Eastern]
h category
i category
dtype: object
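The write and read calls that produce result below are not shown in this excerpt; a minimal sketch, with illustrative file names:
df.to_parquet("example_pa.parquet", engine="pyarrow")
df.to_parquet("example_fp.parquet", engine="fastparquet")
result = pd.read_parquet("example_pa.parquet", engine="pyarrow")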
In [512]: result.dtypes
Out[512]:
a object
b int64
c uint8
d float64
e bool
f datetime64[ns]
g datetime64[ns, US/Eastern]
h category
i category
dtype: object
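Only the requested columns are read when a subset is passed; the dtypes shown next come from a read along the lines of:
result = pd.read_parquet("example_pa.parquet", engine="pyarrow", columns=["a", "b"])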
In [515]: result.dtypes
Out[515]:
a object
b int64
dtype: object
Handling indexes
Serializing a DataFrame to parquet may include the implicit index as one or more columns in the output file. Thus,
this code:
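# a minimal sketch; the original snippet is not shown in this excerpt
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df.to_parquet("test.parquet", engine="pyarrow")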
creates a parquet file with three columns if you use pyarrow for serialization: a, b, and __index_level_0__.
If you’re using fastparquet, the index may or may not be written to the file.
This unexpected extra column causes some databases like Amazon Redshift to reject the file, because that column
doesn’t exist in the target table.
If you want to omit a dataframe’s indexes when writing, pass index=False to to_parquet():
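For example (the file name is illustrative):
df.to_parquet("test.parquet", index=False)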
This creates a parquet file with just the two expected columns, a and b. If your DataFrame has a custom index, you
won’t get it back when you load this file into a DataFrame.
Passing index=True will always write the index, even if that’s not the underlying engine’s default behavior.
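Parquet datasets can also be written partitioned; a sketch of the call the next paragraph describes (partitioning on column a, names illustrative):
df = pd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 1, 0, 1]})
df.to_parquet(path="test", engine="pyarrow", partition_cols=["a"], compression=None)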
The path specifies the parent directory to which data will be saved. The partition_cols are the column names
by which the dataset will be partitioned. Columns are partitioned in the order they are given. The partition splits are
determined by the unique values in the partition columns. The above example creates a partitioned dataset that may
look like:
test
a=0
0bac803e32dc42ae83fddfd029cbdebc.parquet
...
a=1
e6ab24a4f45147b49b54a662f0c412a3.parquet
...
2.4.13 ORC
Similar to the parquet format, the ORC format is a binary columnar serialization for data frames. The read_orc() function can read ORC files; it requires the pyarrow library, and writing ORC files is not supported.
2.4.14 SQL queries
The pandas.io.sql module provides a collection of query wrappers to both facilitate data retrieval and to reduce
dependency on DB-specific API. Database abstraction is provided by SQLAlchemy if installed. In addition you will
need a driver library for your database. Examples of such drivers are psycopg2 for PostgreSQL or pymysql for
MySQL. For SQLite this is included in Python’s standard library by default. You can find an overview of supported
drivers for each SQL dialect in the SQLAlchemy docs.
If SQLAlchemy is not installed, a fallback is only provided for sqlite (and for mysql for backwards compatibility,
but this is deprecated and will be removed in a future version). This mode requires a Python database adapter which
respects the Python DB-API.
See also some cookbook examples for some advanced strategies.
The key functions are read_sql_table(), read_sql_query(), read_sql() (a convenience wrapper around the first two), and DataFrame.to_sql().
In the following example, we use the SQLite SQL database engine. You can use a temporary SQLite database where
data are stored in “memory”.
To connect with SQLAlchemy you use the create_engine() function to create an engine object from a database
URI. You only need to create the engine once per database you are connecting to. For more information on
create_engine() and the URI formatting, see the examples below and the SQLAlchemy documentation.
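For the in-memory SQLite case, for example:
from sqlalchemy import create_engine
# Create your engine.
engine = create_engine("sqlite:///:memory:")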
If you want to manage your own connections you can pass one of those instead:
Writing DataFrames
Assuming the following data is in a DataFrame data, we can insert it into the database using to_sql().
In [523]: data
Out[523]:
id Date Col_1 Col_2 Col_3
0 26 2010-10-18 X 27.50 True
1 42 2010-10-19 Y -12.50 False
2 63 2010-10-20 Z 5.73 True
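The write itself is not shown here; it would be along the lines of:
data.to_sql("data", engine)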
With some databases, writing large DataFrames can result in errors due to packet size limitations being exceeded. This
can be avoided by setting the chunksize parameter when calling to_sql. For example, the following writes data
to the database in batches of 1000 rows at a time:
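A sketch of such a chunked write (the table name is illustrative):
data.to_sql("data_chunked", engine, chunksize=1000)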
to_sql() will try to map your data to an appropriate SQL data type based on the dtype of the data. When you have
columns of dtype object, pandas will try to infer the data type.
You can always override the default type by specifying the desired SQL type of any of the columns by using the
dtype argument. This argument needs a dictionary mapping column names to SQLAlchemy types (or strings for the
sqlite3 fallback mode). For example, specifying to use the sqlalchemy String type instead of the default Text type
for string columns:
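A sketch of such an override (the table name is illustrative):
from sqlalchemy.types import String
data.to_sql("data_dtype", engine, dtype={"Col_1": String})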
Note: Due to the limited support for timedeltas in the different database flavors, columns with type timedelta64
will be written as integer values (nanoseconds) to the database, and a warning will be raised.
Note: Columns of category dtype will be converted to the dense representation as you would get with np.
asarray(categorical) (e.g. for string categories this gives an array of strings). Because of this, reading the
database table back in does not generate a categorical.
Using SQLAlchemy, to_sql() is capable of writing datetime data that is timezone naive or timezone aware. How-
ever, the resulting data stored in the database ultimately depends on the supported data type for datetime data of the
database system being used.
The following table lists supported data types for datetime data for some common databases. Other database dialects
may have different data types for datetime data.
When writing timezone aware data to databases that do not support timezones, the data will be written as timezone
naive timestamps that are in local time with respect to the timezone.
read_sql_table() is also capable of reading datetime data that is timezone aware or naive. When reading
TIMESTAMP WITH TIME ZONE types, pandas will convert the data to UTC.
Insertion method
The to_sql() parameter method controls the SQL insertion clause used. Possible values are None (one INSERT per row, the default), "multi" (pass multiple values in a single INSERT clause), or a callable with signature (pd_table, conn, keys, data_iter). For example, a callable implementing the PostgreSQL COPY FROM protocol could start like this (the end of the body is elided in this excerpt; it would build the COPY statement and execute it with cur.copy_expert):

import csv
from io import StringIO

def insert_with_copy(table, conn, keys, data_iter):
    """
    Execute a COPY-based insert.

    Parameters
    ----------
    table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)
        # ... build "COPY <table> (<columns>) FROM STDIN WITH CSV" and
        # execute it with cur.copy_expert(sql, s_buf)
Reading tables
read_sql_table() will read a database table given the table name and optionally a subset of columns to read.
Note: In order to use read_sql_table(), you must have the SQLAlchemy optional dependency installed.
Note: Note that pandas infers column dtypes from query outputs, and not by looking up data types in the physical
database schema. For example, assume userid is an integer column in a table. Then, intuitively, select userid
... will return integer-valued series, while select cast(userid as text) ... will return object-valued
(str) series. Accordingly, if the query output is empty, then all resulting columns will be returned as object-valued
(since they are most general). If you foresee that your query will sometimes generate an empty result, you may want
to explicitly typecast afterwards to ensure dtype integrity.
You can also specify the name of the column as the DataFrame index, and specify a subset of columns to be read.
If needed you can explicitly specify a format string, or a dict of arguments to pass to pandas.to_datetime():
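Sketches of these reads, reusing the data table from the writing example above:
pd.read_sql_table("data", engine)
pd.read_sql_table("data", engine, index_col="id")
pd.read_sql_table("data", engine, columns=["Col_1", "Col_2"])
pd.read_sql_table("data", engine, parse_dates=["Date"])
pd.read_sql_table("data", engine, parse_dates={"Date": {"format": "%Y-%m-%d %H:%M:%S"}})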
Schema support
Reading from and writing to different schema’s is supported through the schema keyword in the
read_sql_table() and to_sql() functions. Note however that this depends on the database flavor (sqlite
does not have schema’s). For example:
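A sketch (the schema name is illustrative):
df.to_sql("table", engine, schema="other_schema")
pd.read_sql_table("table", engine, schema="other_schema")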
Querying
You can query using raw SQL in the read_sql_query() function. In this case you must use the SQL variant
appropriate for your database. When using SQLAlchemy, you can also pass SQLAlchemy Expression language
constructs, which are database-agnostic.
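A sketch of the query that produces the output below:
pd.read_sql_query("SELECT id, Col_1, Col_2 FROM data WHERE id = 42;", engine)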
Out[533]:
id Col_1 Col_2
0 42 Y -12.5
The read_sql_query() function supports a chunksize argument. Specifying this will return an iterator through
chunks of the query result:
A sketch of the chunked read that produces the output below (the 20-row frame and the table name are illustrative):
df = pd.DataFrame(np.random.randn(20, 3), columns=list("abc"))
df.to_sql("data_chunks", engine, index=False)
for chunk in pd.read_sql_query("SELECT * FROM data_chunks", engine, chunksize=5):
    print(chunk)
a b c
0 0.092961 -0.674003 1.104153
1 -0.092732 -0.156246 -0.585167
2 -0.358119 -0.862331 -1.672907
3 0.550313 -1.507513 -0.617232
4 0.650576 1.033221 0.492464
a b c
0 -1.627786 -0.692062 1.039548
1 -1.802313 -0.890905 -0.881794
2 0.630492 0.016739 0.014500
3 -0.438358 0.647275 -0.052075
You can also run a plain query without creating a DataFrame with execute(). This is useful for queries that don’t
return values, such as INSERT. This is functionally equivalent to calling execute on the SQLAlchemy engine or db
connection object. Again, you must use the SQL syntax variant appropriate for your database.
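A sketch using the execute() helper from pandas.io.sql (table name and values are illustrative):
from pandas.io import sql
sql.execute("SELECT * FROM table_name", engine)
sql.execute("INSERT INTO table_name VALUES(?, ?, ?)", engine, params=[(1, 12.2, True)])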
To connect with SQLAlchemy you use the create_engine() function to create an engine object from database
URI. You only need to create the engine once per database you are connecting to.
engine = create_engine("postgresql://scott:tiger@localhost:5432/mydatabase")
engine = create_engine("mysql+mysqldb://scott:tiger@localhost/foo")
engine = create_engine("oracle://scott:tiger@127.0.0.1:1521/sidname")
engine = create_engine("mssql+pyodbc://mydsn")
# sqlite://<nohostname>/<path>
# where <path> is relative:
engine = create_engine("sqlite:///foo.db")
In [538]: pd.read_sql(
   .....:     sa.text("SELECT * FROM data where Col_1=:col1"), engine, params={"col1": "X"}
   .....: )
   .....:
Out[538]:
index id Date Col_1 Col_2 Col_3
0 0 26 2010-10-18 00:00:00.000000 X 27.5 1
If you have an SQLAlchemy description of your database you can express where conditions using SQLAlchemy
expressions:
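A sketch of such a description and query, with column types following the example table above (note that the is True comparison is evaluated by Python, which is why the result below is empty):
import sqlalchemy as sa
metadata = sa.MetaData()
data_table = sa.Table(
    "data",
    metadata,
    sa.Column("index", sa.Integer),
    sa.Column("Date", sa.DateTime),
    sa.Column("Col_1", sa.String),
    sa.Column("Col_2", sa.Float),
    sa.Column("Col_3", sa.Boolean),
)
pd.read_sql(sa.select([data_table]).where(data_table.c.Col_3 is True), engine)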
Out[541]:
Empty DataFrame
Columns: [index, Date, Col_1, Col_2, Col_3]
Index: []
You can combine SQLAlchemy expressions with parameters passed to read_sql() using sqlalchemy.
bindparam()
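A sketch, reusing the data_table description from above:
import datetime as dt
expr = sa.select([data_table]).where(data_table.c.Date > sa.bindparam("date"))
pd.read_sql(expr, engine, params={"date": dt.datetime(2010, 10, 18)})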
Sqlite fallback
The use of sqlite is supported without using SQLAlchemy. This mode requires a Python database adapter which
respects the Python DB-API.
You can create connections like so:
import sqlite3
con = sqlite3.connect(":memory:")
data.to_sql("data", con)
pd.read_sql_query("SELECT * FROM data", con)
2.4.15 Google BigQuery
Warning: Starting in 0.20.0, pandas has split off Google BigQuery support into the separate package
pandas-gbq. You can pip install pandas-gbq to get it.
2.4.16 Stata format
The method to_stata() will write a DataFrame into a .dta file. The format version of this file is always 115 (Stata
12).
In [546]: df.to_stata("stata.dta")
Stata data files have limited data type support; only strings with 244 or fewer characters, int8, int16, int32,
float32 and float64 can be stored in .dta files. Additionally, Stata reserves certain values to represent missing
data. Exporting a non-missing value that is outside of the permitted range in Stata for a particular data type will retype
the variable to the next larger size. For example, int8 values are restricted to lie between -127 and 100 in Stata, and
so variables with values above 100 will trigger a conversion to int16. nan values in floating point data types are
stored as the basic missing data type (. in Stata).
Note: It is not possible to export missing data values for integer data types.
The Stata writer gracefully handles other data types including int64, bool, uint8, uint16, uint32 by casting
to the smallest supported type that can represent the data. For example, data with a type of uint8 will be cast to
int8 if all values are less than 100 (the upper bound for non-missing int8 data in Stata), or, if values are outside of
this range, the variable is cast to int16.
Warning: Conversion from int64 to float64 may result in a loss of precision if int64 values are larger than
2**53.
Warning: StataWriter and to_stata() only support fixed width strings containing up to 244 characters,
a limitation imposed by the version 115 dta file format. Attempting to write Stata dta files with strings longer than
244 characters raises a ValueError.
The top-level function read_stata will read a dta file and return either a DataFrame or a StataReader that
can be used to read the file incrementally.
In [547]: pd.read_stata("stata.dta")
Out[547]:
index A B
0 0 0.608228 1.064810
1 1 -0.780506 -2.736887
2 2 0.143539 1.170191
3 3 -1.573076 0.075792
4 4 -1.722223 -0.774650
5 5 0.803627 0.221665
6 6 0.584637 0.147264
7 7 1.057825 -0.284136
8 8 0.912395 1.552808
9 9 0.189376 -0.109830
Specifying a chunksize yields a StataReader instance that can be used to read chunksize lines from the file
at a time. The StataReader object can be used as an iterator.
For more fine-grained control, use iterator=True and specify chunksize with each call to read().
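Sketches of both reading styles (the chunk sizes are illustrative):
reader = pd.read_stata("stata.dta", chunksize=3)
for chunk in reader:
    print(chunk.shape)

reader = pd.read_stata("stata.dta", iterator=True)
chunk1 = reader.read(5)
chunk2 = reader.read(5)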
The parameter convert_missing indicates whether missing value representations in Stata should be preserved.
If False (the default), missing values are represented as np.nan. If True, missing values are represented using
StataMissingValue objects, and columns containing missing values will have object data type.
Note: read_stata() and StataReader support .dta formats 113-115 (Stata 10-12), 117 (Stata 13), and 118
(Stata 14).
Note: Setting preserve_dtypes=False will upcast to the standard pandas data types: int64 for all integer
types and float64 for floating point data. By default, the Stata data types are preserved when importing.
Categorical data
Categorical data can be exported to Stata data files as value labeled data. The exported data consists of the
underlying category codes as integer data values and the categories as value labels. Stata does not have an explicit
equivalent to a Categorical and information about whether the variable is ordered is lost when exporting.
Warning: Stata only supports string value labels, and so str is called on the categories when exporting data.
Exporting Categorical variables with non-string categories produces a warning, and can result in a loss of
information if the str representations of the categories are not unique.
Labeled data can similarly be imported from Stata data files as Categorical variables using the keyword argu-
ment convert_categoricals (True by default). The keyword argument order_categoricals (True by
default) determines whether imported Categorical variables are ordered.
Note: When importing categorical data, the values of the variables in the Stata data file are not preserved
since Categorical variables always use integer data types between -1 and n-1 where n is the number
of categories. If the original values in the Stata data file are required, these can be imported by setting
convert_categoricals=False, which will import original data (but not the variable labels). The original
values can be matched to the imported categorical data since there is a simple mapping between the original Stata
data values and the category codes of imported Categorical variables: missing values are assigned code -1, and the
smallest original value is assigned 0, the second smallest is assigned 1 and so on until the largest original value is
assigned the code n-1.
Note: Stata supports partially labeled series. These series have value labels for some but not all data values. Importing
a partially labeled series will produce a Categorical with string categories for the values that are labeled and
numeric categories for values with no label.
2.4.17 SAS formats
The top-level function read_sas() can read (but not write) SAS XPORT (.xpt) and (since v0.18.0) SAS7BDAT
(.sas7bdat) format files.
SAS files only contain two value types: ASCII text and floating point values (usually 8 bytes but sometimes truncated).
For xport files, there is no automatic type conversion to integers, dates, or categoricals. For SAS7BDAT files, the
format codes may allow date variables to be automatically converted to dates. By default the whole file is read and
returned as a DataFrame.
Specify a chunksize or use iterator=True to obtain reader objects (XportReader or SAS7BDATReader)
for incrementally reading the file. The reader objects also have attributes that contain additional information about the
file and its variables.
Read a SAS7BDAT file:
df = pd.read_sas("sas_data.sas7bdat")
A sketch of reading the file in chunks with a callback (the chunk size is illustrative):

def do_something(chunk):
    pass

rdr = pd.read_sas("sas_data.sas7bdat", chunksize=100000)
for chunk in rdr:
    do_something(chunk)
The specification for the xport file format is available from the SAS web site.
No official documentation is available for the SAS7BDAT format.
2.4.18 SPSS formats
The top-level function read_spss() can read (but not write) SPSS SAV (.sav) and ZSAV (.zsav) format files.
Read an SPSS file:
df = pd.read_spss("spss_data.sav")
Extract a subset of columns contained in usecols from an SPSS file and avoid converting categorical columns into
pd.Categorical:
df = pd.read_spss(
"spss_data.sav",
usecols=["foo", "bar"],
convert_categoricals=False,
)
More information about the SAV and ZSAV file formats is available here.
2.4.19 Other file formats
pandas itself only supports IO with a limited set of file formats that map cleanly to its tabular data model. For reading
and writing other file formats into and from pandas, we recommend these packages from the broader community.
netCDF
xarray provides data structures inspired by the pandas DataFrame for working with multi-dimensional datasets, with
a focus on the netCDF file format and easy conversion to and from pandas.
This is an informal comparison of various IO methods, using pandas 0.24.2. Timings are machine dependent and small
differences should be ignored.
In [1]: sz = 1000000
In [2]: df = pd.DataFrame({'A': np.random.randn(sz), 'B': [1] * sz})
In [3]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 2 columns):
A 1000000 non-null float64
B 1000000 non-null int64
dtypes: float64(1), int64(1)
memory usage: 15.3 MB
The following test functions will be used below to compare the performance of several IO methods:
import os
import sqlite3

import numpy as np
import pandas as pd

sz = 1000000
np.random.seed(42)
df = pd.DataFrame({"A": np.random.randn(sz), "B": [1] * sz})
def test_sql_write(df):
if os.path.exists("test.sql"):
os.remove("test.sql")
sql_db = sqlite3.connect("test.sql")
df.to_sql(name="test_table", con=sql_db)
sql_db.close()
def test_sql_read():
sql_db = sqlite3.connect("test.sql")
pd.read_sql_query("select * from test_table", sql_db)
sql_db.close()
def test_hdf_fixed_write(df):
    df.to_hdf("test_fixed.hdf", "test", mode="w")

def test_hdf_fixed_read():
    pd.read_hdf("test_fixed.hdf", "test")
def test_hdf_fixed_write_compress(df):
df.to_hdf("test_fixed_compress.hdf", "test", mode="w", complib="blosc")
def test_hdf_fixed_read_compress():
pd.read_hdf("test_fixed_compress.hdf", "test")
def test_hdf_table_write(df):
df.to_hdf("test_table.hdf", "test", mode="w", format="table")
def test_hdf_table_read():
pd.read_hdf("test_table.hdf", "test")
def test_hdf_table_write_compress(df):
df.to_hdf(
"test_table_compress.hdf", "test", mode="w", complib="blosc", format="table"
)
def test_hdf_table_read_compress():
pd.read_hdf("test_table_compress.hdf", "test")
def test_csv_write(df):
df.to_csv("test.csv", mode="w")
def test_csv_read():
pd.read_csv("test.csv", index_col=0)
def test_feather_write(df):
df.to_feather("test.feather")
def test_feather_read():
pd.read_feather("test.feather")
def test_pickle_write(df):
df.to_pickle("test.pkl")
def test_pickle_read():
pd.read_pickle("test.pkl")
def test_pickle_write_compress(df):
df.to_pickle("test.pkl.compress", compression="xz")
def test_pickle_read_compress():
pd.read_pickle("test.pkl.compress", compression="xz")
def test_parquet_write(df):
df.to_parquet("test.parquet")
def test_parquet_read():
pd.read_parquet("test.parquet")
When writing, the top three functions in terms of speed are test_feather_write, test_hdf_fixed_write
and test_hdf_fixed_write_compress.
When reading, the top three functions in terms of speed are test_feather_read, test_pickle_read and
test_hdf_fixed_read.
The files test.pkl.compress, test.parquet and test.feather took the least space on disk (in bytes).
Note: The Python and NumPy indexing operators [] and attribute operator . provide quick and easy access to pandas
data structures across a wide range of use cases. This makes interactive work intuitive, as there’s little new to learn if
you already know how to deal with Python dictionaries and NumPy arrays. However, since the type of the data to be
accessed isn’t known in advance, directly using standard operators has some optimization limits. For production code,
we recommend that you take advantage of the optimized pandas data access methods exposed in this chapter.
Warning: Whether a copy or a reference is returned for a setting operation, may depend on the context. This is
sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
See the MultiIndex / Advanced Indexing for MultiIndex and more advanced indexing documentation.
See the cookbook for some advanced strategies.
Object selection has had a number of user-requested additions in order to support more explicit location based index-
ing. pandas now supports three types of multi-axis indexing.
• .loc is primarily label based, but may also be used with a boolean array. .loc will raise KeyError when
the items are not found. Allowed inputs are:
– A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an integer
position along the index.).
– A list or array of labels ['a', 'b', 'c'].
– A slice object with labels 'a':'f' (Note that contrary to usual Python slices, both the start and the stop
are included, when present in the index! See Slicing with labels and Endpoints are inclusive.)
– A boolean array (any NA values will be treated as False).
– A callable function with one argument (the calling Series or DataFrame) and that returns valid output
for indexing (one of the above).
See more at Selection by Label.
• .iloc is primarily integer position based (from 0 to length-1 of the axis), but may also be used with a
boolean array. .iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers
which allow out-of-bounds indexing. (this conforms with Python/NumPy slice semantics). Allowed inputs are:
– An integer e.g. 5.
– A list or array of integers [4, 3, 0].
– A slice object with ints 1:7.
– A boolean array (any NA values will be treated as False).
– A callable function with one argument (the calling Series or DataFrame) and that returns valid output
for indexing (one of the above).
See more at Selection by Position, Advanced Indexing and Advanced Hierarchical.
• .loc, .iloc, and also [] indexing can accept a callable as indexer. See more at Selection By Callable.
Getting values from an object with multi-axes selection uses the following notation (using .loc as an example, but
the following applies to .iloc as well). Any of the axes accessors may be the null slice :. Axes left out of the
specification are assumed to be :, e.g. p.loc['a'] is equivalent to p.loc['a', :, :].
2.5.2 Basics
As mentioned when introducing the data structures in the last section, the primary function of indexing with [] (a.k.a.
__getitem__ for those familiar with implementing class behavior in Python) is selecting out lower-dimensional
slices. The following table shows return type values when indexing pandas objects with []:
Object Type: Series — Selection: series[label] — Return Value Type: scalar value
Object Type: DataFrame — Selection: frame[colname] — Return Value Type: Series corresponding to colname
Here we construct a simple time series data set to use for illustrating the indexing functionality:
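The construction itself is not shown in this excerpt; it is along the lines of:
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C', 'D'])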
In [3]: df
Out[3]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
Note: None of the indexing functionality is time series specific unless specifically stated.
Thus, as per above, we have the most basic indexing using []:
In [4]: s = df['A']
In [5]: s[dates[5]]
Out[5]: -0.6736897080883706
You can pass a list of columns to [] to select columns in that order. If a column is not contained in the DataFrame, an
exception will be raised. Multiple columns can also be set in this manner:
In [6]: df
Out[6]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
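The assignment that swaps the two columns in the frame shown next would look like:
df[['B', 'A']] = df[['A', 'B']]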
In [8]: df
Out[8]:
A B C D
2000-01-01 -0.282863 0.469112 -1.509059 -1.135632
2000-01-02 -0.173215 1.212112 0.119209 -1.044236
2000-01-03 -2.104569 -0.861849 -0.494929 1.071804
2000-01-04 -0.706771 0.721555 -1.039575 0.271860
2000-01-05 0.567020 -0.424972 0.276232 -1.087401
2000-01-06 0.113648 -0.673690 -1.478427 0.524988
2000-01-07 0.577046 0.404705 -1.715002 -1.039268
2000-01-08 -1.157892 -0.370647 -1.344312 0.844885
You may find this useful for applying a transform (in-place) to a subset of the columns.
Warning: pandas aligns all AXES when setting Series and DataFrame from .loc, and .iloc.
This will not modify df because the column alignment is before value assignment.
In [9]: df[['A', 'B']]
Out[9]:
A B
2000-01-01 -0.282863 0.469112
2000-01-02 -0.173215 1.212112
2000-01-03 -2.104569 -0.861849
2000-01-04 -0.706771 0.721555
2000-01-05 0.567020 -0.424972
2000-01-06 0.113648 -0.673690
2000-01-07 0.577046 0.404705
2000-01-08 -1.157892 -0.370647
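The correct way to swap column values in spite of this alignment is to assign raw values, for example:
df.loc[:, ['B', 'A']] = df[['A', 'B']].to_numpy()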
In [16]: sa.b
Out[16]: 2
In [17]: dfa.A
Out[17]:
2000-01-01 0.469112
2000-01-02 1.212112
2000-01-03 -0.861849
2000-01-04 0.721555
2000-01-05 -0.424972
2000-01-06 -0.673690
2000-01-07 0.404705
2000-01-08 -0.370647
Freq: D, Name: A, dtype: float64
In [18]: sa.a = 5
In [19]: sa
Out[19]:
a 5
b 2
c 3
dtype: int64
In [21]: dfa
Out[21]:
A B C D
2000-01-01 0 -0.282863 -1.509059 -1.135632
2000-01-02 1 -0.173215 0.119209 -1.044236
2000-01-03 2 -2.104569 -0.494929 1.071804
In [23]: dfa
Out[23]:
A B C D
2000-01-01 0 -0.282863 -1.509059 -1.135632
2000-01-02 1 -0.173215 0.119209 -1.044236
2000-01-03 2 -2.104569 -0.494929 1.071804
2000-01-04 3 -0.706771 -1.039575 0.271860
2000-01-05 4 0.567020 0.276232 -1.087401
2000-01-06 5 0.113648 -1.478427 0.524988
2000-01-07 6 0.577046 -1.715002 -1.039268
2000-01-08 7 -1.157892 -1.344312 0.844885
Warning:
• You can use this access only if the index element is a valid Python identifier, e.g. s.1 is not allowed. See
here for an explanation of valid identifiers.
• The attribute will not be available if it conflicts with an existing method name, e.g. s.min is not allowed,
but s['min'] is possible.
• Similarly, the attribute will not be available if it conflicts with any of the following list: index,
major_axis, minor_axis, items.
• In any of these cases, standard indexing will still work, e.g. s['1'], s['min'], and s['index'] will
access the corresponding element or column.
If you are using the IPython environment, you may also use tab-completion to see these accessible attributes.
You can also assign a dict to a row of a DataFrame:
In [24]: x = pd.DataFrame({'x': [1, 2, 3], 'y': [3, 4, 5]})
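The assignment that produces row 1 below uses a dict, for example:
x.iloc[1] = {'x': 9, 'y': 99}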
In [26]: x
Out[26]:
x y
0 1 3
1 9 99
2 3 5
You can use attribute access to modify an existing element of a Series or column of a DataFrame, but be careful; if
you try to use attribute access to create a new column, it creates a new attribute rather than a new column. In 0.21.0
and later, this will raise a UserWarning:
In [1]: df = pd.DataFrame({'one': [1., 2., 3.]})
In [2]: df.two = [4, 5, 6]
In [3]: df
Out[3]:
one
0 1.0
1 2.0
2 3.0
The most robust and consistent way of slicing ranges along arbitrary axes is described in the Selection by Position
section detailing the .iloc method. For now, we explain the semantics of slicing using the [] operator.
With Series, the syntax works exactly as with an ndarray, returning a slice of the values and the corresponding labels:
In [27]: s[:5]
Out[27]:
2000-01-01 0.469112
2000-01-02 1.212112
2000-01-03 -0.861849
2000-01-04 0.721555
2000-01-05 -0.424972
Freq: D, Name: A, dtype: float64
In [28]: s[::2]
Out[28]:
2000-01-01 0.469112
2000-01-03 -0.861849
2000-01-05 -0.424972
2000-01-07 0.404705
Freq: 2D, Name: A, dtype: float64
In [29]: s[::-1]
Out[29]:
2000-01-08 -0.370647
2000-01-07 0.404705
2000-01-06 -0.673690
2000-01-05 -0.424972
2000-01-04 0.721555
2000-01-03 -0.861849
2000-01-02 1.212112
2000-01-01 0.469112
Freq: -1D, Name: A, dtype: float64
In [30]: s2 = s.copy()
In [31]: s2[:5] = 0
In [32]: s2
Out[32]:
2000-01-01 0.000000
2000-01-02 0.000000
With DataFrame, slicing inside of [] slices the rows. This is provided largely as a convenience since it is such a
common operation.
In [33]: df[:3]
Out[33]:
A B C D
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
In [34]: df[::-1]
Out[34]:
A B C D
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885
2000-01-07 0.404705 0.577046 -1.715002 -1.039268
2000-01-06 -0.673690 0.113648 -1.478427 0.524988
2000-01-05 -0.424972 0.567020 0.276232 -1.087401
2000-01-04 0.721555 -0.706771 -1.039575 0.271860
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804
2000-01-02 1.212112 -0.173215 0.119209 -1.044236
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632
Warning: Whether a copy or a reference is returned for a setting operation, may depend on the context. This is
sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
Warning:
.loc is strict when you present slicers that are not compatible (or convertible) with the index type.
For example using integers in a DatetimeIndex. These will raise a TypeError.
In [35]: dfl = pd.DataFrame(np.random.randn(5, 4),
....: columns=list('ABCD'),
....: index=pd.date_range('20130101', periods=5))
....:
In [36]: dfl
Out[36]:
A B C D
2013-01-01 1.075770 -0.109050 1.643563 -1.469388
2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524 0.413738 0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
2013-01-05 0.895717 0.805244 -1.206412 2.565646
In [4]: dfl.loc[2:3]
TypeError: cannot do slice indexing on <class 'pandas.tseries.index.DatetimeIndex'> with these indexers [2] of <type 'int'>
String likes in slicing can be converted to the type of the index, leading to natural slicing.
In [37]: dfl.loc['20130102':'20130104']
Out[37]:
A B C D
2013-01-02 0.357021 -0.674600 -1.776904 -0.968914
2013-01-03 -1.294524 0.413738 0.276662 -0.472035
2013-01-04 -0.013960 -0.362543 -0.006154 -0.923061
pandas provides a suite of methods in order to have purely label based indexing. This is a strict inclusion based
protocol. Every label asked for must be in the index, or a KeyError will be raised. When slicing, both the start
bound AND the stop bound are included, if present in the index. Integers are valid labels, but they refer to the label
and not the position.
The .loc attribute is the primary access method. The following are valid inputs:
• A single label, e.g. 5 or 'a' (Note that 5 is interpreted as a label of the index. This use is not an integer
position along the index.).
• A list or array of labels ['a', 'b', 'c'].
• A slice object with labels 'a':'f' (Note that contrary to usual Python slices, both the start and the stop are
included, when present in the index! See Slicing with labels.
• A boolean array.
• A callable, see Selection By Callable.
In [39]: s1
Out[39]:
a 1.431256
b 1.340309
c -1.170299
d -0.226169
e 0.410835
f 0.813850
dtype: float64
In [40]: s1.loc['c':]
Out[40]:
c -1.170299
d -0.226169
e 0.410835
f 0.813850
dtype: float64
In [41]: s1.loc['b']
Out[41]: 1.3403088497993827
In [42]: s1.loc['c':] = 0
In [43]: s1
Out[43]:
a 1.431256
b 1.340309
c 0.000000
d 0.000000
e 0.000000
f 0.000000
dtype: float64
With a DataFrame:
In [45]: df1
Out[45]:
A B C D
a 0.132003 -0.827317 -0.076467 -1.187678
b 1.130127 -1.436737 -1.413681 1.607920
c 1.024180 0.569605 0.875906 -2.211372
d 0.974466 -2.006747 -0.410001 -0.078638
e 0.545952 -1.219217 -1.226825 0.769804
f -1.281247 -0.727707 -0.121306 -0.097883
In [48]: df1.loc['a']
Out[48]:
A 0.132003
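The boolean mask used below is a nullable boolean array; its construction (not shown) would be:
mask = pd.array([True, False, True, False, pd.NA, False], dtype="boolean")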
In [52]: mask
Out[52]:
<BooleanArray>
[True, False, True, False, <NA>, False]
Length: 6, dtype: boolean
In [53]: df1[mask]
Out[53]:
A B C D
a 0.132003 -0.827317 -0.076467 -1.187678
c 1.024180 0.569605 0.875906 -2.211372
When using .loc with slices, if both the start and the stop labels are present in the index, then elements located
between the two (including them) are returned:
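The Series used in these slicing examples has string values and integer labels in non-monotonic order, for example:
s = pd.Series(list('abcde'), index=[0, 3, 2, 5, 4])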
In [56]: s.loc[3:5]
Out[56]:
3 b
2 c
5 d
dtype: object
If at least one of the two is absent, but the index is sorted, and can be compared against start and stop labels, then
slicing will still work as expected, by selecting labels which rank between the two:
In [57]: s.sort_index()
Out[57]:
0 a
2 c
3 b
4 e
5 d
dtype: object
In [58]: s.sort_index().loc[1:6]
Out[58]:
2 c
3 b
4 e
5 d
dtype: object
However, if at least one of the two is absent and the index is not sorted, an error will be raised (since doing otherwise
would be computationally expensive, as well as potentially ambiguous for mixed type indexes). For instance, in the
above example, s.loc[1:6] would raise KeyError.
For the rationale behind this behavior, see Endpoints are inclusive.
In [60]: s.loc[3:5]
Out[60]:
3 b
2 c
5 d
dtype: object
Also, if the index has duplicate labels and either the start or the stop label is duplicated, an error will be raised. For
instance, in the above example, s.loc[2:5] would raise a KeyError.
For more information about duplicate labels, see Duplicate Labels.
Warning: Whether a copy or a reference is returned for a setting operation, may depend on the context. This is
sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
pandas provides a suite of methods in order to get purely integer based indexing. The semantics follow closely
Python and NumPy slicing. These are 0-based indexing. When slicing, the start bound is included, while the upper
bound is excluded. Trying to use a non-integer, even a valid label will raise an IndexError.
The .iloc attribute is the primary access method. The following are valid inputs:
• An integer e.g. 5.
• A list or array of integers [4, 3, 0].
• A slice object with ints 1:7.
• A boolean array.
• A callable, see Selection By Callable.
In [62]: s1
Out[62]:
0 0.695775
2 0.341734
4 0.959726
6 -1.110336
8 -0.619976
dtype: float64
In [63]: s1.iloc[:3]
Out[63]:
0 0.695775
2 0.341734
4 0.959726
dtype: float64
In [64]: s1.iloc[3]
Out[64]: -1.110336102891167
In [65]: s1.iloc[:3] = 0
In [66]: s1
Out[66]:
0 0.000000
2 0.000000
4 0.000000
6 -1.110336
8 -0.619976
dtype: float64
With a DataFrame:
In [68]: df1
Out[68]:
0 2 4 6
0 0.149748 -0.732339 0.687738 0.176444
2 0.403310 -0.154951 0.301624 -2.179861
4 -1.369849 -0.954208 1.462696 -1.743161
6 -0.826591 -0.345352 1.314232 0.690579
8 0.995761 2.396780 0.014871 3.357427
10 -0.317441 -1.236269 0.896171 -0.487602
In [69]: df1.iloc[:3]
Out[69]:
0 2 4 6
0 0.149748 -0.732339 0.687738 0.176444
2 0.403310 -0.154951 0.301624 -2.179861
4 -1.369849 -0.954208 1.462696 -1.743161
In [72]: df1.iloc[1:3, :]
Out[72]:
0 2 4 6
2 0.403310 -0.154951 0.301624 -2.179861
4 -1.369849 -0.954208 1.462696 -1.743161
In [75]: df1.iloc[1]
Out[75]:
0 0.403310
2 -0.154951
4 0.301624
6 -2.179861
Name: 2, dtype: float64
In [77]: x
Out[77]: ['a', 'b', 'c', 'd', 'e', 'f']
In [78]: x[4:10]
Out[78]: ['e', 'f']
In [79]: x[8:10]
Out[79]: []
In [80]: s = pd.Series(x)
In [81]: s
Out[81]:
0 a
1 b
2 c
3 d
4 e
5 f
dtype: object
In [82]: s.iloc[4:10]
Out[82]:
4 e
5 f
dtype: object
In [83]: s.iloc[8:10]
Out[83]: Series([], dtype: object)
Note that using slices that go out of bounds can result in an empty axis (e.g. an empty DataFrame being returned).
In [85]: dfl
Out[85]:
A B
0 -0.082240 -2.182937
In [88]: dfl.iloc[4:6]
Out[88]:
A B
4 0.27423 0.132885
A single indexer that is out of bounds will raise an IndexError. A list of indexers where any element is out of
bounds will raise an IndexError.
>>> dfl.iloc[[4, 5, 6]]
IndexError: positional indexers are out-of-bounds
>>> dfl.iloc[:, 4]
IndexError: single positional indexer is out-of-bounds
.loc, .iloc, and also [] indexing can accept a callable as indexer. The callable must be a function with
one argument (the calling Series or DataFrame) that returns valid output for indexing.
In [89]: df1 = pd.DataFrame(np.random.randn(6, 4),
....: index=list('abcdef'),
....: columns=list('ABCD'))
....:
In [90]: df1
Out[90]:
A B C D
a -0.023688 2.410179 1.450520 0.206053
b -0.251905 -2.213588 1.063327 1.266143
c 0.299368 -0.863838 0.408204 -1.048089
d -0.025747 -0.988387 0.094055 1.262731
e 1.289997 0.082423 -0.055758 0.536580
f -0.489682 0.369374 -0.034571 -2.484478
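Sketches of callable-based selection on this frame:
df1.loc[lambda df: df['A'] > 0, :]
df1.loc[:, lambda df: ['A', 'B']]
df1.iloc[:, lambda df: [0, 1]]
df1[lambda df: df.columns[0]]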
Using these methods / indexers, you can chain data selection operations without using a temporary variable.
In [96]: bb = pd.read_csv('data/baseball.csv', index_col='id')
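A sketch of such a chained selection on this data (column names follow the baseball dataset):
(bb.groupby(['year', 'team']).sum().loc[lambda df: df['r'] > 100])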
If you wish to get the 0th and the 2nd elements from the index in the ‘A’ column, you can do:
In [99]: dfd
Out[99]:
A B
a 1 4
b 2 5
c 3 6
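For example:
dfd.loc[dfd.index[[0, 2]], 'A']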
This can also be expressed using .iloc, by explicitly getting locations on the indexers, and using positional indexing
to select things.
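A sketch of the positional equivalent:
dfd.iloc[[0, 2], dfd.columns.get_loc('A')]
# for multiple columns, use get_indexer:
dfd.iloc[[0, 2], dfd.columns.get_indexer(['A', 'B'])]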
In prior versions, using .loc[list-of-labels] would work as long as at least 1 of the keys was found (other-
wise it would raise a KeyError). This behavior was changed and will now raise a KeyError if at least one label is
missing. The recommended alternative is to use .reindex().
For example.
In [104]: s
Out[104]:
0 1
1 2
2 3
dtype: int64
Previous behavior
In [4]: s.loc[[1, 2, 3]]
Out[4]:
1 2.0
2 3.0
3 NaN
dtype: float64
Current behavior
In [4]: s.loc[[1, 2, 3]]
KeyError: "Passing list-likes to .loc or [] with any missing labels is no longer supported. ..."
Reindexing
The idiomatic way to achieve selecting potentially not-found elements is via .reindex(). See also the section on
reindexing.
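For example, with the labels used below:
labels = [1, 2, 3]
s.reindex(labels)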
Alternatively, if you want to select only valid keys, the following is idiomatic and efficient; it is guaranteed to preserve
the dtype of the selection.
In [108]: s.loc[s.index.intersection(labels)]
Out[108]:
1 2
2 3
dtype: int64
In [17]: s.reindex(labels)
ValueError: cannot reindex from a duplicate axis
Generally, you can intersect the desired labels with the current axis, and then reindex.
In [111]: s.loc[s.index.intersection(labels)].reindex(labels)
Out[111]:
c 3.0
d NaN
dtype: float64
In [42]: s.loc[s.index.intersection(labels)].reindex(labels)
ValueError: cannot reindex from a duplicate axis
A random selection of rows or columns from a Series or DataFrame can be obtained with the sample() method. The method will
sample rows by default, and accepts a specific number of rows/columns to return, or a fraction of rows.
By default, sample will return each row at most once, but one can also sample with replacement using the replace
option:
# With replacement:
In [118]: s.sample(n=6, replace=True)
Out[118]:
0 0
4 4
3 3
2 2
4 4
4 4
dtype: int64
By default, each row has an equal probability of being selected, but if you want rows to have different probabilities,
you can pass the sample function sampling weights as weights. These weights can be a list, a NumPy array, or a
Series, but they must be of the same length as the object you are sampling. Missing values will be treated as a weight
of zero, and inf values are not allowed. If weights do not sum to 1, they will be re-normalized by dividing all weights
by the sum of the weights. For example:
In [119]: s = pd.Series([0, 1, 2, 3, 4, 5])
When applied to a DataFrame, you can use a column of the DataFrame as sampling weights (provided you are sampling
rows and not columns) by simply passing the name of the column as a string.
In [124]: df2 = pd.DataFrame({'col1': [9, 8, 7, 6],
.....: 'weight_column': [0.5, 0.4, 0.1, 0]})
.....:
sample also allows users to sample columns instead of rows using the axis argument.
In [126]: df3 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
Finally, one can also set a seed for sample’s random number generator using the random_state argument, which
will accept either an integer (as a seed) or a NumPy RandomState object.
In [128]: df4 = pd.DataFrame({'col1': [1, 2, 3], 'col2': [2, 3, 4]})
# With a given seed, the sample will always draw the same rows.
In [129]: df4.sample(n=2, random_state=2)
Out[129]:
col1 col2
2 3 4
The .loc/[] operations can perform enlargement when setting a non-existent key for that axis.
In the Series case this is effectively an appending operation.
In [132]: se
Out[132]:
0 1
1 2
2 3
dtype: int64
In [133]: se[5] = 5.
In [134]: se
Out[134]:
0 1.0
1 2.0
2 3.0
5 5.0
dtype: float64
In [136]: dfi
Out[136]:
A B
0 0 1
1 2 3
2 4 5
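The enlargement that adds column C in the frame shown next is, for example:
dfi.loc[:, 'C'] = dfi.loc[:, 'A']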
In [138]: dfi
Out[138]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
In [139]: dfi.loc[3] = 5
In [140]: dfi
Out[140]:
A B C
0 0 1 0
1 2 3 2
2 4 5 4
3 5 5 5
Since indexing with [] must handle a lot of cases (single-label access, slicing, boolean indexing, etc.), it has a bit of
overhead in order to figure out what you’re asking for. If you only want to access a scalar value, the fastest way is to
use the at and iat methods, which are implemented on all of the data structures.
Similarly to loc, at provides label based scalar lookups, while iat provides integer based lookups analogously to
iloc.
In [141]: s.iat[5]
Out[141]: 5
In [143]: df.iat[3, 0]
Out[143]: 0.7215551622443669
In [145]: df.iat[3, 0] = 7
In [147]: df
Out[147]:
A B C D E 0
2000-01-01 0.469112 -0.282863 -1.509059 -1.135632 NaN NaN
2000-01-02 1.212112 -0.173215 0.119209 -1.044236 NaN NaN
2000-01-03 -0.861849 -2.104569 -0.494929 1.071804 NaN NaN
2000-01-04 7.000000 -0.706771 -1.039575 0.271860 NaN NaN
2000-01-05 -0.424972 0.567020 0.276232 -1.087401 NaN NaN
2000-01-06 -0.673690 0.113648 -1.478427 0.524988 7.0 NaN
2000-01-07 0.404705 0.577046 -1.715002 -1.039268 NaN NaN
2000-01-08 -0.370647 -1.157892 -1.344312 0.844885 NaN NaN
2000-01-09 NaN NaN NaN NaN NaN 7.0
Another common operation is the use of boolean vectors to filter the data. The operators are: | for or, & for and, and
~ for not. These must be grouped by using parentheses, since by default Python will evaluate an expression such as
df['A'] > 2 & df['B'] < 3 as df['A'] > (2 & df['B']) < 3, while the desired evaluation order
is (df['A'] > 2) & (df['B'] < 3).
Using a boolean vector to index a Series works exactly as in a NumPy ndarray:
In [149]: s
Out[149]:
0 -3
1 -2
2 -1
3 0
4 1
5 2
6 3
dtype: int64
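For example (sketches):
s[s > 0]
s[(s < -1) | (s > 0.5)]
s[~(s < 0)]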
You may select rows from a DataFrame using a boolean vector the same length as the DataFrame’s index (for example,
something derived from one of the columns of the DataFrame):
List comprehensions and the map method of Series can also be used to produce more complex criteria:
In [154]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
   .....:                     'b': ['x', 'y', 'y', 'x', 'y', 'x', 'x'],
   .....:                     'c': np.random.randn(7)})
   .....:

In [155]: criterion = df2['a'].map(lambda x: x.startswith('t'))
In [156]: df2[criterion]
Out[156]:
a b c
2 two y 0.041290
3 three x 0.361719
4 two y -0.238075
# Multiple criteria
In [158]: df2[criterion & (df2['b'] == 'x')]
Out[158]:
a b c
3 three x 0.361719
With the choice methods Selection by Label, Selection by Position, and Advanced Indexing you may select along more
than one axis using boolean vectors combined with other indexing expressions.
In [159]: df2.loc[criterion & (df2['b'] == 'x'), 'b':'c']
Out[159]:
b c
3 x 0.361719
Warning: iloc supports two kinds of boolean indexing. If the indexer is a boolean Series, an error will be
raised. For instance, in the following example, df.iloc[s.values, 1] is ok. The boolean indexer is an
array. But df.iloc[s, 1] would raise ValueError.
In [160]: df = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
.....: index=list('abc'),
.....: columns=['A', 'B'])
.....:
In [162]: s
Out[162]:
a False
b True
c True
Name: A, dtype: bool
Out[163]:
b 4
c 6
Name: B, dtype: int64
In [164]: df.iloc[s.values, 1]
Out[164]:
b 4
c 6
Name: B, dtype: int64
Consider the isin() method of Series, which returns a boolean vector that is true wherever the Series elements
exist in the passed list. This allows you to select rows where one or more columns have values you want:
In [165]: s = pd.Series(np.arange(5), index=np.arange(5)[::-1], dtype='int64')
In [166]: s
Out[166]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
The same method is available for Index objects and is useful for the cases when you don’t know which of the sought
labels are in fact present:
In [169]: s[s.index.isin([2, 4, 6])]
Out[169]:
4 0
2 2
dtype: int64
In addition to that, MultiIndex allows selecting a separate level to use in the membership check:
.....:
In [172]: s_mi
Out[172]:
0 a 0
b 1
c 2
1 a 3
b 4
c 5
dtype: int64
DataFrame also has an isin() method. When calling isin, pass a set of values as either an array or dict. If values is
an array, isin returns a DataFrame of booleans that is the same shape as the original DataFrame, with True wherever
the element is in the sequence of values.
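The frame and values used below are not shown; a sketch consistent with the output is:
df = pd.DataFrame({'vals': [1, 2, 3, 4], 'ids': ['a', 'b', 'f', 'n'],
                   'ids2': ['a', 'n', 'c', 'n']})
values = ['a', 'b', 1, 3]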
In [177]: df.isin(values)
Out[177]:
vals ids ids2
0 True True True
1 False True False
2 True False False
3 False False False
Oftentimes you’ll want to match certain values with certain columns. Just make values a dict where the key is the
column, and the value is a list of items you want to check for.
In [179]: df.isin(values)
Out[179]:
vals ids ids2
0 True True False
1 False True False
2 True False False
3 False False False
Combine DataFrame’s isin with the any() and all() methods to quickly select subsets of your data that meet a
given criteria. To select a row where each column meets its own criterion:
In [180]: values = {'ids': ['a', 'b'], 'ids2': ['a', 'c'], 'vals': [1, 3]}
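The row mask used below would be built as, for example:
row_mask = df.isin(values).all(1)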
In [182]: df[row_mask]
Out[182]:
vals ids ids2
0 1 a a
Selecting values from a Series with a boolean vector generally returns a subset of the data. To guarantee that selection
output has the same shape as the original data, you can use the where method in Series and DataFrame.
To return only the selected rows:
In [183]: s[s > 0]
Out[183]:
3 1
2 2
1 3
0 4
dtype: int64
Selecting values from a DataFrame with a boolean criterion now also preserves input data shape. where is used under
the hood as the implementation. The code below is equivalent to df.where(df < 0).
In [185]: df[df < 0]
Out[185]:
A B C D
2000-01-01 -2.104139 -1.309525 NaN NaN
2000-01-02 -0.352480 NaN -1.192319 NaN
In addition, where takes an optional other argument for replacement of values where the condition is False, in the
returned copy.
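For example:
df.where(df < 0, -df)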
You may wish to set values based on some boolean criteria. This can be done intuitively like so:
In [187]: s2 = s.copy()
In [189]: s2
Out[189]:
4 0
3 1
2 2
1 3
0 4
dtype: int64
In [192]: df2
Out[192]:
A B C D
2000-01-01 0.000000 0.000000 0.485855 0.245166
2000-01-02 0.000000 0.390389 0.000000 1.655824
2000-01-03 0.000000 0.299674 0.000000 0.281059
2000-01-04 0.846958 0.000000 0.600705 0.000000
2000-01-05 0.669692 0.000000 0.000000 0.342416
2000-01-06 0.868584 0.000000 2.297780 0.000000
2000-01-07 0.000000 0.000000 0.168904 0.000000
2000-01-08 0.801196 1.392071 0.000000 0.000000
By default, where returns a modified copy of the data. There is an optional parameter inplace so that the original
data can be modified without creating a copy:
In [195]: df_orig
Out[195]:
A B C D
2000-01-01 2.104139 1.309525 0.485855 0.245166
2000-01-02 0.352480 0.390389 1.192319 1.655824
2000-01-03 0.864883 0.299674 0.227870 0.281059
2000-01-04 0.846958 1.222082 0.600705 1.233203
2000-01-05 0.669692 0.605656 1.169184 0.342416
2000-01-06 0.868584 0.948458 2.297780 0.684718
2000-01-07 2.670153 0.114722 0.168904 0.048048
2000-01-08 0.801196 1.392071 0.048788 0.808838
Note: The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m,
df2) is equivalent to np.where(m, df1, df2).
Alignment
Furthermore, where aligns the input boolean condition (ndarray or DataFrame), such that partial selection with setting
is possible. This is analogous to partial setting via .loc (but on the contents rather than the axis labels).
In [199]: df2
Out[199]:
A B C D
2000-01-01 -2.104139 -1.309525 0.485855 0.245166
2000-01-02 -0.352480 3.000000 -1.192319 3.000000
2000-01-03 -0.864883 3.000000 -0.227870 3.000000
2000-01-04 3.000000 -1.222082 3.000000 -1.233203
2000-01-05 0.669692 -0.605656 -1.169184 0.342416
2000-01-06 0.868584 -0.948458 2.297780 -0.684718
2000-01-07 -2.670153 -0.114722 0.168904 -0.048048
2000-01-08 0.801196 1.392071 -0.048788 -0.808838
Where can also accept axis and level parameters to align the input when performing the where.
where can accept a callable as condition and other arguments. The callable must be a function with one argument (the calling
Series or DataFrame) that returns valid output as the condition and other argument.
In [204]: df3 = pd.DataFrame({'A': [1, 2, 3],
.....: 'B': [4, 5, 6],
.....: 'C': [7, 8, 9]})
.....:
Mask
mask() is the inverse boolean operation of where().
Setting with enlargement conditionally using numpy()
An alternative to where() is to use numpy.where(). Combined with setting a new column, you can use it to
enlarge a DataFrame where the values are determined conditionally.
Consider you have two choices to choose from in the following dataframe. And you want to set a new column color
to ‘green’ when the second column has ‘Z’. You can do the following:
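A sketch of the frame construction and the numpy.where() assignment that produce the result below:
df = pd.DataFrame({'col1': list('ABBC'), 'col2': list('ZZXY')})
df['color'] = np.where(df['col2'] == 'Z', 'green', 'red')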
In [210]: df
Out[210]:
col1 col2 color
0 A Z green
1 B Z green
2 B X red
3 C Y red
If you have multiple conditions, you can use numpy.select() to achieve that. Say corresponding to three conditions
there are three choices of colors, with a fourth color as a fallback, you can do the following.
In [211]: conditions = [
.....: (df['col2'] == 'Z') & (df['col1'] == 'A'),
.....: (df['col2'] == 'Z') & (df['col1'] == 'B'),
.....: (df['col1'] == 'B')
.....: ]
.....:
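The matching choices and the numpy.select() call (a sketch consistent with the output below):
choices = ['yellow', 'blue', 'purple']
df['color'] = np.select(conditions, choices, default='black')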
In [214]: df
Out[214]:
col1 col2 color
0 A Z yellow
1 B Z blue
2 B X purple
3 C Y black
DataFrame objects have a query() method that allows selection using an expression.
You can get the value of the frame where column b has values between the values of columns a and c. For example:
In [215]: n = 10
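The frame of random values used below would be constructed along the lines of:
df = pd.DataFrame(np.random.rand(n, 3), columns=list('abc'))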
In [217]: df
Out[217]:
a b c
0 0.438921 0.118680 0.863670
1 0.138138 0.577363 0.686602
2 0.595307 0.564592 0.520630
3 0.913052 0.926075 0.616184
4 0.078718 0.854477 0.898725
5 0.076404 0.523211 0.591538
6 0.792342 0.216974 0.564056
7 0.397890 0.454131 0.915716
8 0.074315 0.437913 0.019794
9 0.559209 0.502065 0.026437
# pure python
In [218]: df[(df['a'] < df['b']) & (df['b'] < df['c'])]
Out[218]:
a b c
1 0.138138 0.577363 0.686602
4 0.078718 0.854477 0.898725
5 0.076404 0.523211 0.591538
7 0.397890 0.454131 0.915716
# query
In [219]: df.query('(a < b) & (b < c)')
Out[219]:
a b c
1 0.138138 0.577363 0.686602
4 0.078718 0.854477 0.898725
5 0.076404 0.523211 0.591538
7 0.397890 0.454131 0.915716
Do the same thing but fall back on a named index if there is no column with the name a.
In [222]: df
Out[222]:
b c
a
0 0 4
1 0 1
2 3 4
3 4 3
4 1 4
5 0 3
If instead you don’t want to or cannot name your index, you can use the name index in your query expression:
In [225]: df
Out[225]:
b c
0 3 1
1 3 0
2 5 6
3 5 2
4 7 4
5 0 1
6 2 5
7 0 1
8 6 0
9 7 9
Note: If the name of your index overlaps with a column name, the column name is given precedence. For example,
In [229]: df.query('a > 2') # uses the column 'a', not the index
Out[229]:
a
a
1 3
3 3
You can still use the index in a query expression by using the special identifier ‘index’:
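For example, a sketch:
df.query('index > 2')   # 'index' refers to the index, even though its name 'a' is shadowed by the column 'a'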
If for some reason you have a column named index, then you can refer to the index as ilevel_0 as well, but at
this point you should consider renaming your columns to something less ambiguous.
You can also use the levels of a DataFrame with a MultiIndex as if they were columns in the frame:
In [231]: n = 10
In [234]: colors
Out[234]:
array(['red', 'red', 'red', 'green', 'green', 'green', 'green', 'green',
'green', 'green'], dtype='<U5')
In [235]: foods
Out[235]:
array(['ham', 'ham', 'eggs', 'eggs', 'eggs', 'ham', 'ham', 'eggs', 'eggs',
'eggs'], dtype='<U4')
In [238]: df
Out[238]:
0 1
color food
red ham 0.194889 -0.381994
ham 0.318587 2.089075
eggs -0.728293 -0.090255
green eggs -0.748199 1.318931
eggs -2.029766 0.792652
ham 0.461007 -0.542749
ham -0.305384 -0.479195
eggs 0.095031 -0.270099
eggs -0.707140 -0.773882
eggs 0.229453 0.304418
If the levels of the MultiIndex are unnamed, you can refer to them using special names:
In [240]: df.index.names = [None, None]
In [241]: df
Out[241]:
The convention is ilevel_0, which means “index level 0” for the 0th level of the index.
A use case for query() is when you have a collection of DataFrame objects that have a subset of column names
(or index levels/names) in common. You can pass the same query to both frames without having to specify which
frame you’re interested in querying
In [244]: df
Out[244]:
a b c
0 0.224283 0.736107 0.139168
1 0.302827 0.657803 0.713897
2 0.611185 0.136624 0.984960
3 0.195246 0.123436 0.627712
4 0.618673 0.371660 0.047902
5 0.480088 0.062993 0.185760
6 0.568018 0.483467 0.445289
7 0.309040 0.274580 0.587101
8 0.258993 0.477769 0.370255
9 0.550459 0.840870 0.304611
In [246]: df2
Out[246]:
a b c
0 0.357579 0.229800 0.596001
1 0.309059 0.957923 0.965663
2 0.123102 0.336914 0.318616
3 0.526506 0.323321 0.860813
4 0.518736 0.486514 0.384724
5 0.190804 0.505723 0.614533
6 0.891939 0.623977 0.676639
In [250]: df
Out[250]:
a b c
0 7 8 9
1 1 0 7
2 2 7 2
3 6 2 2
4 2 6 3
5 3 8 2
6 1 7 2
7 5 1 5
8 9 8 0
9 1 5 0
Slightly nicer by removing the parentheses (comparison operators bind tighter than & and |).
query() also supports special use of Python’s in and not in comparison operators, providing a succinct syntax
for calling the isin method of a Series or DataFrame.
# get all rows where columns "a" and "b" have overlapping values
In [256]: df = pd.DataFrame({'a': list('aabbccddeeff'), 'b': list('aaaabbbbcccc'),
.....: 'c': np.random.randint(5, size=12),
.....: 'd': np.random.randint(9, size=12)})
.....:
In [257]: df
Out[257]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
# pure Python
In [261]: df[~df['a'].isin(df['b'])]
Out[261]:
a b c d
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
You can combine this with other expressions for very succinct queries:
# pure Python
In [263]: df[df['b'].isin(df['a']) & (df['c'] < df['d'])]
Out[263]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
4 c b 3 6
5 c b 0 2
10 f c 0 6
11 f c 1 2
Note: in and not in are evaluated in Python, since numexpr has no equivalent of this operation.
However, only the in/not in expression itself is evaluated in vanilla Python. For example, in the expression
df.query('a in b + c + d')
(b + c + d) is evaluated by numexpr and then the in operation is evaluated in plain Python. In general, any
operations that can be evaluated using numexpr will be.
Comparing a list of values to a column using ==/!= works similarly to in/not in.
In [264]: df.query('b == ["a", "b", "c"]')
Out[264]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
# pure Python
In [265]: df[df['b'].isin(["a", "b", "c"])]
Out[265]:
a b c d
0 a a 2 6
1 a a 4 7
2 b a 1 6
3 b a 2 1
4 c b 3 6
5 c b 0 2
6 d b 3 3
7 d b 2 1
8 e c 4 3
9 e c 2 0
10 f c 0 6
11 f c 1 2
# using in/not in
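The query()-based spelling of the same selection is roughly:
df.query('[1, 2] in c')       # rows where column c contains 1 or 2
df.query('[1, 2] not in c')   # the complement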
# pure Python
In [270]: df[df['c'].isin([1, 2])]
Out[270]:
a b c d
0 a a 2 6
2 b a 1 6
3 b a 2 1
7 d b 2 1
9 e c 2 0
11 f c 1 2
Boolean operators
You can negate boolean expressions with the word not or the ~ operator.
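As a sketch (the construction of the bools column here is an assumption), both spellings are equivalent to boolean indexing with ~:
df = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
df['bools'] = df['c'] > 0.5     # assumed; any boolean column behaves the same way
df[~df['bools']]                # pure-Python equivalent of df.query('~bools') / df.query('not bools')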
In [273]: df.query('~bools')
Out[273]:
a b c bools
2 0.697753 0.212799 0.329209 False
7 0.275396 0.691034 0.826619 False
8 0.190649 0.558748 0.262467 False
In [278]: shorter
Out[278]:
a b c bools
7 0.275396 0.691034 0.826619 False
In [279]: longer
Out[279]:
a b c bools
7 0.275396 0.691034 0.826619 False
Performance of query()
DataFrame.query() using numexpr is slightly faster than Python for large frames.
Note: You will only see the performance benefits of using the numexpr engine with DataFrame.query() if
your frame has more than approximately 200,000 rows.
This plot was created using a DataFrame with 3 columns each containing floating point values generated using
numpy.random.randn().
If you want to identify and remove duplicate rows in a DataFrame, there are two methods that will help: duplicated
and drop_duplicates. Each takes as an argument the columns to use to identify duplicated rows.
• duplicated returns a boolean vector whose length is the number of rows, and which indicates whether a row
is duplicated.
• drop_duplicates removes duplicate rows.
By default, the first observed row of a duplicate set is considered unique, but each method has a keep parameter to
specify targets to be kept.
• keep='first' (default): mark / drop duplicates except for the first occurrence.
• keep='last': mark / drop duplicates except for the last occurrence.
• keep=False: mark / drop all duplicates.
In [281]: df2 = pd.DataFrame({'a': ['one', 'one', 'two', 'two', 'two', 'three', 'four'],
In [282]: df2
Out[282]:
a b c
0 one x -1.067137
In [283]: df2.duplicated('a')
Out[283]:
0 False
1 True
2 False
3 True
4 True
5 False
6 False
dtype: bool
In [286]: df2.drop_duplicates('a')
Out[286]:
a b c
0 one x -1.067137
2 two x -0.211056
5 three x -1.964475
6 four x 1.298329
To drop duplicates by index value, use Index.duplicated then perform slicing. The same set of options are
available for the keep parameter.
In [292]: df3
Out[292]:
a b
a 0 1.440455
a 1 2.456086
b 2 1.038402
c 3 -0.894409
b 4 0.683536
a 5 3.082764
In [293]: df3.index.duplicated()
Out[293]: array([False, True, False, False, True, True])
In [294]: df3[~df3.index.duplicated()]
Out[294]:
a b
a 0 1.440455
b 2 1.038402
c 3 -0.894409
In [296]: df3[~df3.index.duplicated(keep=False)]
Out[296]:
a b
c 3 -0.894409
Each of Series and DataFrame has a get method which can return a default value.
Sometimes you want to extract a set of values given a sequence of row labels and column labels. This can be achieved
by DataFrame.melt combined with filtering the corresponding rows with DataFrame.loc. For instance:
In [301]: df
Out[301]:
col A B
0 A 80.0 80
1 A 23.0 55
2 B NaN 76
3 B 22.0 67
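The intermediate steps, as a sketch (the frame above could be built with np.nan for the missing entry):
df = pd.DataFrame({'col': list('AABB'), 'A': [80, 23, np.nan, 22], 'B': [80, 55, 76, 67]})
melt = df.melt('col')                                        # long format: col, variable, value
melt = melt.loc[melt['col'] == melt['variable'], 'value']    # keep rows whose label matches the column
melt.reset_index(drop=True)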
In [304]: melt.reset_index(drop=True)
Out[304]:
0 80.0
1 23.0
2 76.0
3 67.0
Name: value, dtype: float64
Formerly this could be achieved with the dedicated DataFrame.lookup method which was deprecated in version
1.2.0.
The pandas Index class and its subclasses can be viewed as implementing an ordered multiset. Duplicates are
allowed. However, if you try to convert an Index object with duplicate entries into a set, an exception will be
raised.
Index also provides the infrastructure necessary for lookups, data alignment, and reindexing. The easiest way to
create an Index directly is to pass a list or other sequence to Index:
In [306]: index
Out[306]: Index(['e', 'd', 'a', 'b'], dtype='object')
In [309]: index.name
Out[309]: 'something'
In [313]: df
Out[313]:
cols A B C
rows
0 1.295989 -1.051694 1.340429
1 -2.366110 0.428241 0.387275
2 0.433306 0.929548 0.278094
3 2.154730 -0.315628 0.264223
4 1.126818 1.132290 -0.353310
In [314]: df['A']
Out[314]:
rows
0 1.295989
1 -2.366110
2 0.433306
3 2.154730
4 1.126818
Name: A, dtype: float64
Setting metadata
Indexes are “mostly immutable”, but it is possible to set and change their name attribute. You can use the rename and
set_names methods to set these attributes directly; they default to returning a copy.
See Advanced Indexing for usage of MultiIndexes.
In [316]: ind.rename("apple")
Out[316]: Int64Index([1, 2, 3], dtype='int64', name='apple')
In [317]: ind
Out[317]: Int64Index([1, 2, 3], dtype='int64')
In [320]: ind
Out[320]: Int64Index([1, 2, 3], dtype='int64', name='bob')
In [322]: index
Out[322]:
MultiIndex([(0, 'one'),
(0, 'two'),
(1, 'one'),
(1, 'two'),
(2, 'one'),
(2, 'two')],
names=['first', 'second'])
In [323]: index.levels[1]
Out[323]: Index(['one', 'two'], dtype='object', name='second')
The two main operations are union and intersection. Difference is provided via the .difference()
method.
In [327]: a.difference(b)
Out[327]: Index(['a', 'b'], dtype='object')
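For example, a and b above might be built as follows; union and intersection work the same way:
a = pd.Index(['c', 'b', 'a'])
b = pd.Index(['c', 'e', 'd'])
a.union(b)         # Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
a.intersection(b)  # Index(['c'], dtype='object')
a.difference(b)    # Index(['a', 'b'], dtype='object')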
Also available is the symmetric_difference operation, which returns elements that appear in either idx1 or
idx2, but not in both. This is equivalent to the Index created by idx1.difference(idx2).union(idx2.difference(idx1)),
with duplicates dropped.
In [330]: idx1.symmetric_difference(idx2)
Out[330]: Int64Index([1, 5], dtype='int64')
Note: The resulting index from a set operation will be sorted in ascending order.
When performing Index.union() between indexes with different dtypes, the indexes must be cast to a common
dtype. Typically, though not always, this is object dtype. The exception is when performing a union between integer
and float data. In this case, the integer values are converted to float:
In [333]: idx1.union(idx2)
Out[333]: Float64Index([0.0, 0.5, 1.0, 1.5, 2.0], dtype='float64')
Missing values
Important: Even though Index can hold missing values (NaN), it should be avoided if you do not want any
unexpected results. For example, some operations exclude missing values implicitly.
In [335]: idx1
Out[335]: Float64Index([1.0, nan, 3.0, 4.0], dtype='float64')
In [336]: idx1.fillna(2)
Out[336]: Float64Index([1.0, 2.0, 3.0, 4.0], dtype='float64')
In [338]: idx2
Out[338]: DatetimeIndex(['2011-01-01', 'NaT', '2011-01-03'], dtype='datetime64[ns]', freq=None)
In [339]: idx2.fillna(pd.Timestamp('2011-01-02'))
Out[339]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq=None)
Occasionally you will load or create a data set into a DataFrame and want to add an index after you’ve already done
so. There are a couple of different ways.
Set an index
DataFrame has a set_index() method which takes a column name (for a regular Index) or a list of column names
(for a MultiIndex). To create a new, re-indexed DataFrame:
In [340]: data
Out[340]:
a b c d
0 bar one z 1.0
1 bar two y 2.0
2 foo one x 3.0
3 foo two w 4.0
In [342]: indexed1
Out[342]:
a b d
c
z bar one 1.0
y bar two 2.0
x foo one 3.0
w foo two 4.0
In [344]: indexed2
Out[344]:
c d
a b
bar one z 1.0
two y 2.0
foo one x 3.0
two w 4.0
The append keyword option allows you to keep the existing index and append the given columns to a MultiIndex:
In [347]: frame
Out[347]:
c d
c a b
z bar one z 1.0
y bar two y 2.0
x foo one x 3.0
w foo two w 4.0
Other options in set_index allow you not to drop the index columns or to add the index in-place (without creating a
new object):
In [350]: data
Out[350]:
c d
a b
bar one z 1.0
two y 2.0
foo one x 3.0
two w 4.0
As a convenience, DataFrame has a reset_index() method which transfers the index values into the DataFrame’s
columns and sets a simple integer index. This is the inverse operation of set_index().
In [351]: data
Out[351]:
c d
a b
bar one z 1.0
two y 2.0
foo one x 3.0
two w 4.0
In [352]: data.reset_index()
Out[352]:
a b c d
0 bar one z 1.0
1 bar two y 2.0
The output is more similar to a SQL table or a record array. The names for the columns derived from the index are the
ones stored in the names attribute.
You can use the level keyword to remove only a portion of the index:
In [353]: frame
Out[353]:
c d
c a b
z bar one z 1.0
y bar two y 2.0
x foo one x 3.0
w foo two w 4.0
In [354]: frame.reset_index(level=1)
Out[354]:
a c d
c b
z one bar z 1.0
y two bar y 2.0
x one foo x 3.0
w two foo w 4.0
reset_index takes an optional parameter drop which if true simply discards the index, instead of putting index
values in the DataFrame’s columns.
If you create an index yourself, you can just assign it to the index field:
data.index = index
When setting values in a pandas object, care must be taken to avoid what is called chained indexing. Here is an
example.
In [355]: dfmi = pd.DataFrame([list('abcd'),
.....: list('efgh'),
.....: list('ijkl'),
.....: list('mnop')],
.....: columns=pd.MultiIndex.from_product([['one', 'two'],
.....: ['first', 'second']]))
.....:
In [356]: dfmi
Out[356]:
one two
first second first second
0 a b c d
In [357]: dfmi['one']['second']
Out[357]:
0 b
1 f
2 j
3 n
Name: second, dtype: object
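The .loc-based spelling (method 2, discussed below) of the same selection is:
dfmi.loc[:, ('one', 'second')]   # a single indexing call on both axes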
These both yield the same results, so which should you use? It is instructive to understand the order of operations on
these and why method 2 (.loc) is much preferred over method 1 (chained []).
dfmi['one'] selects the first level of the columns and returns a DataFrame that is singly-indexed. Then another
Python operation dfmi_with_one['second'] selects the series indexed by 'second'. This is indicated by the
variable dfmi_with_one because pandas sees these operations as separate events, e.g. separate calls to
__getitem__, so it has to treat them as linear operations that happen one after another.
Contrast this to df.loc[:, ('one', 'second')], which passes a nested tuple of (slice(None), ('one', 'second'))
to a single call to __getitem__. This allows pandas to deal with this as a single entity. Furthermore
this order of operations can be significantly faster, and allows one to index both axes if so desired.
The problem in the previous section is just a performance issue. What’s up with the SettingWithCopy warning?
We don’t usually throw warnings around when you do something that might cost a few extra milliseconds!
But it turns out that assigning to the product of chained indexing has inherently unpredictable results. To see this,
think about how the Python interpreter executes this code:
dfmi['one']['second'] = value
# becomes
dfmi.__getitem__('one').__setitem__('second', value)
See that __getitem__ in there? Outside of simple cases, it’s very hard to predict whether it will return a view or a
copy (it depends on the memory layout of the array, about which pandas makes no guarantees), and therefore whether
the __setitem__ will modify dfmi or a temporary object that gets thrown out immediately afterward. That’s what
SettingWithCopy is warning you about!
Note: You may be wondering whether we should be concerned about the loc property in the first example. But
dfmi.loc is guaranteed to be dfmi itself with modified indexing behavior, so dfmi.loc.__getitem__ /
dfmi.loc.__setitem__ operate on dfmi directly. Of course, dfmi.loc.__getitem__(idx) may be
a view or a copy of dfmi.
Sometimes a SettingWithCopy warning will arise at times when there’s no obvious chained indexing going on.
These are the bugs that SettingWithCopy is designed to catch! pandas is probably trying to warn you that you’ve
done this:
def do_something(df):
foo = df[['bar', 'baz']] # Is foo a view? A copy? Nobody knows!
# ... many lines here ...
# We don't know whether this will modify df or not!
foo['quux'] = value
return foo
Yikes!
When you use chained indexing, the order and type of the indexing operation partially determine whether the result is
a slice into the original object, or a copy of the slice.
pandas has the SettingWithCopyWarning because assigning to a copy of a slice is frequently not intentional,
but a mistake caused by chained indexing returning a copy where a slice was expected.
If you would like pandas to be more or less trusting about assignment to a chained indexing expression, you can set
the option mode.chained_assignment to one of these values:
• 'warn', the default, means a SettingWithCopyWarning is printed.
• 'raise' means pandas will raise a SettingWithCopyException you have to deal with.
• None will suppress the warnings entirely.
>>> pd.set_option('mode.chained_assignment','warn')
>>> dfb[dfb['a'].str.startswith('o')]['c'] = 42
Traceback (most recent call last)
...
SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
The following is the recommended access method using .loc for multiple items (using mask) and a single item
using a fixed index:
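As a sketch (the exact frame contents are inferred from the outputs that follow and are assumptions):
dfd = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})
# multiple items, selected with a boolean mask
mask = dfd['a'].str.startswith('o')
dfd.loc[mask, 'c'] = 42

# a single item, using a fixed index (on a fresh copy of the frame)
dfd = pd.DataFrame({'a': ['one', 'one', 'two', 'three', 'two', 'one', 'six'],
                    'c': np.arange(7)})
dfd.loc[2, 'a'] = 11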
In [365]: dfd
Out[365]:
a c
0 one 42
1 one 42
2 two 2
3 three 3
4 two 4
5 one 42
6 six 6
In [368]: dfd
Out[368]:
a c
0 one 0
1 one 1
2 11 2
3 three 3
4 two 4
5 one 5
6 six 6
The following can work at times, but it is not guaranteed to, and therefore should be avoided:
In [371]: dfd
Out[371]:
a c
0 one 0
1 one 1
2 111 2
Last, the subsequent example will not work at all, and so should be avoided:
>>> pd.set_option('mode.chained_assignment','raise')
>>> dfd.loc[0]['a'] = 1111
Traceback (most recent call last)
...
SettingWithCopyException:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_index,col_indexer] = value instead
Warning: The chained assignment warnings / exceptions are aiming to inform the user of a possibly invalid
assignment. There may be false positives; situations where a chained assignment is inadvertently reported.
This section covers indexing with a MultiIndex and other advanced indexing features.
See the Indexing and Selecting Data for general indexing documentation.
Warning: Whether a copy or a reference is returned for a setting operation may depend on the context. This is
sometimes called chained assignment and should be avoided. See Returning a View versus Copy.
Hierarchical / Multi-level indexing is very exciting as it opens the door to some quite sophisticated data analysis and
manipulation, especially for working with higher dimensional data. In essence, it enables you to store and manipulate
data with an arbitrary number of dimensions in lower dimensional data structures like Series (1d) and DataFrame
(2d).
In this section, we will show what exactly we mean by “hierarchical” indexing and how it integrates with all of the
pandas indexing functionality described above and in prior sections. Later, when discussing group by and pivoting and
reshaping data, we’ll show non-trivial applications to illustrate how it aids in structuring data for analysis.
See the cookbook for some advanced strategies.
Changed in version 0.24.0: MultiIndex.labels has been renamed to MultiIndex.codes and
MultiIndex.set_labels to MultiIndex.set_codes.
The MultiIndex object is the hierarchical analogue of the standard Index object which typically stores the axis
labels in pandas objects. You can think of MultiIndex as an array of tuples where each tuple is unique. A
MultiIndex can be created from a list of arrays (using MultiIndex.from_arrays()), an array of tuples
(using MultiIndex.from_tuples()), a crossed set of iterables (using MultiIndex.from_product()),
or a DataFrame (using MultiIndex.from_frame()). The Index constructor will attempt to return a
MultiIndex when it is passed a list of tuples. The following examples demonstrate different ways to initialize
MultiIndexes.
In [1]: arrays = [
...: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
...: ["one", "two", "one", "two", "one", "two", "one", "two"],
...: ]
...:
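The remaining steps are roughly:
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
s = pd.Series(np.random.randn(8), index=index)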
In [3]: tuples
Out[3]:
[('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')]
In [5]: index
Out[5]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('baz', 'one'),
('baz', 'two'),
('foo', 'one'),
('foo', 'two'),
('qux', 'one'),
('qux', 'two')],
names=['first', 'second'])
In [7]: s
Out[7]:
first second
bar one 0.469112
two -0.282863
baz one -1.509059
two -1.135632
foo one 1.212112
two -0.173215
qux one 0.119209
two -1.044236
dtype: float64
When you want every pairing of the elements in two iterables, it can be easier to use the MultiIndex.
from_product() method:
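For example, a sketch:
iterables = [['bar', 'baz', 'foo', 'qux'], ['one', 'two']]
pd.MultiIndex.from_product(iterables, names=['first', 'second'])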
You can also construct a MultiIndex from a DataFrame directly, using the method MultiIndex.
from_frame(). This is a complementary method to MultiIndex.to_frame().
New in version 0.24.0.
In [10]: df = pd.DataFrame(
....: [["bar", "one"], ["bar", "two"], ["foo", "one"], ["foo", "two"]],
....: columns=["first", "second"],
....: )
....:
In [11]: pd.MultiIndex.from_frame(df)
Out[11]:
MultiIndex([('bar', 'one'),
('bar', 'two'),
('foo', 'one'),
('foo', 'two')],
names=['first', 'second'])
As a convenience, you can pass a list of arrays directly into Series or DataFrame to construct a MultiIndex
automatically:
In [12]: arrays = [
....: np.array(["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"]),
....: np.array(["one", "two", "one", "two", "one", "two", "one", "two"]),
....: ]
....:
In [14]: s
Out[14]:
bar one -0.861849
two -2.104569
baz one -0.494929
two 1.071804
foo one 0.721555
two -0.706771
qux one -1.039575
two 0.271860
dtype: float64
In [16]: df
Out[16]:
0 1 2 3
bar one -0.424972 0.567020 0.276232 -1.087401
two -0.673690 0.113648 -1.478427 0.524988
baz one 0.404705 0.577046 -1.715002 -1.039268
two -0.370647 -1.157892 -1.344312 0.844885
foo one 1.075770 -0.109050 1.643563 -1.469388
two 0.357021 -0.674600 -1.776904 -0.968914
qux one -1.294524 0.413738 0.276662 -0.472035
two -0.013960 -0.362543 -0.006154 -0.923061
All of the MultiIndex constructors accept a names argument which stores string names for the levels themselves.
If no names are provided, None will be assigned:
In [17]: df.index.names
Out[17]: FrozenList([None, None])
This index can back any axis of a pandas object, and the number of levels of the index is up to you:
In [19]: df
Out[19]:
first bar baz foo qux
second one two one two one two one two
A 0.895717 0.805244 -1.206412 2.565646 1.431256 1.340309 -1.170299 -0.226169
B 0.410835 0.813850 0.132003 -0.827317 -0.076467 -1.187678 1.130127 -1.436737
C -1.413681 1.607920 1.024180 0.569605 0.875906 -2.211372 0.974466 -2.006747
We’ve “sparsified” the higher levels of the indexes to make the console output a bit easier on the eyes. Note that how
the index is displayed can be controlled using the multi_sparse option in pandas.set_option():
It’s worth keeping in mind that there’s nothing preventing you from using tuples as atomic labels on an axis:
The reason that the MultiIndex matters is that it can allow you to do grouping, selection, and reshaping operations
as we will describe below and in subsequent areas of the documentation. As you will see in later sections, you can find
yourself working with hierarchically-indexed data without creating a MultiIndex explicitly yourself. However,
when loading data from a file, you may wish to generate your own MultiIndex when preparing the data set.
The method get_level_values() will return a vector of the labels for each location at a particular level:
In [23]: index.get_level_values(0)
Out[23]: Index(['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'], dtype='object', name='first')
In [24]: index.get_level_values("second")
Out[24]: Index(['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two'], dtype='object', name='second')
One of the important features of hierarchical indexing is that you can select data by a “partial” label identifying a
subgroup in the data. Partial selection “drops” levels of the hierarchical index in the result in a completely analogous
way to selecting a column in a regular DataFrame:
In [25]: df["bar"]
Out[25]:
second one two
A 0.895717 0.805244
B 0.410835 0.813850
C -1.413681 1.607920
In [27]: df["bar"]["one"]
Out[27]:
A 0.895717
B 0.410835
C -1.413681
In [28]: s["qux"]
Out[28]:
one -1.039575
two 0.271860
dtype: float64
See Cross-section with hierarchical index for how to select on a deeper level.
Defined levels
The MultiIndex keeps all the defined levels of an index, even if they are not actually used. When slicing an index,
you may notice this. For example:
In [29]: df.columns.levels # original MultiIndex
Out[29]: FrozenList([['bar', 'baz', 'foo', 'qux'], ['one', 'two']])
This is done to avoid a recomputation of the levels in order to make slicing highly performant. If you want to see only
the used levels, you can use the get_level_values() method.
In [31]: df[["foo", "qux"]].columns.to_numpy()
Out[31]:
array([('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')],
dtype=object)
To reconstruct the MultiIndex with only the used levels, the remove_unused_levels() method may be used.
In [33]: new_mi = df[["foo", "qux"]].columns.remove_unused_levels()
In [34]: new_mi.levels
Out[34]: FrozenList([['foo', 'qux'], ['one', 'two']])
Operations between differently-indexed objects having MultiIndex on the axes will work as you expect; data
alignment will work the same as an Index of tuples:
In [35]: s + s[:-2]
Out[35]:
bar one -1.723698
two -4.209138
baz one -0.989859
two 2.143608
foo one 1.443110
two -1.413542
qux one NaN
In [36]: s + s[::2]
Out[36]:
bar one -1.723698
two NaN
baz one -0.989859
two NaN
foo one 1.443110
two NaN
qux one -2.079150
two NaN
dtype: float64
The reindex() method of Series/DataFrames can be called with another MultiIndex, or even a list or array
of tuples:
In [37]: s.reindex(index[:3])
Out[37]:
first second
bar one -0.861849
two -2.104569
baz one -0.494929
dtype: float64
Syntactically integrating MultiIndex in advanced indexing with .loc is a bit challenging, but we’ve made every
effort to do so. In general, MultiIndex keys take the form of tuples. For example, the following works as you would
expect:
In [39]: df = df.T
In [40]: df
Out[40]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
qux one -1.170299 1.130127 0.974466
two -0.226169 -1.436737 -2.006747
Note that df.loc['bar', 'two'] would also work in this example, but this shorthand notation can lead to
ambiguity in general.
If you also want to index a specific column with .loc, you must use a tuple like this:
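As a sketch:
df.loc[('bar', 'two')]        # a complete key selects a single row, returned as a Series
df.loc[('bar', 'two'), 'A']   # adding a column label selects a single value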
You don’t have to specify all levels of the MultiIndex by passing only the first elements of the tuple. For example,
you can use “partial” indexing to get all elements with bar in the first level as follows:
In [43]: df.loc["bar"]
Out[43]:
A B C
second
one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
This is a shortcut for the slightly more verbose notation df.loc[('bar',),] (equivalent to df.loc['bar',]
in this example).
“Partial” slicing also works quite nicely.
In [44]: df.loc["baz":"foo"]
Out[44]:
A B C
first second
baz one -1.206412 0.132003 1.024180
two 2.565646 -0.827317 0.569605
foo one 1.431256 -0.076467 0.875906
two 1.340309 -1.187678 -2.211372
Note: It is important to note that tuples and lists are not treated identically in pandas when it comes to indexing.
Whereas a tuple is interpreted as one multi-level key, a list is used to specify several keys. Or in other words, tuples
go horizontally (traversing levels), lists go vertically (scanning levels).
Importantly, a list of tuples indexes several complete MultiIndex keys, whereas a tuple of lists refer to several
values within a level:
In [48]: s = pd.Series(
....: [1, 2, 3, 4, 5, 6],
....: index=pd.MultiIndex.from_product([["A", "B"], ["c", "d", "e"]]),
....: )
....:
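With this Series, the two forms select differently (a sketch):
s.loc[[('A', 'c'), ('B', 'd')]]   # list of tuples: exactly the keys ('A', 'c') and ('B', 'd')
s.loc[(['A', 'B'], ['c', 'd'])]   # tuple of lists: the cross-product of the listed values per level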
Using slicers
Warning: You should specify all axes in the .loc specifier, meaning the indexer for the index and for the
columns. There are some ambiguous cases where the passed indexer could be mis-interpreted as indexing both
axes, rather than into say the MultiIndex for the rows.
You should do this:
df.loc[(slice("A1", "A3"), ...), :] # noqa: E999
In [54]: dfmi = (
....: pd.DataFrame(
....: np.arange(len(miindex) * len(micolumns)).reshape(
....: (len(miindex), len(micolumns))
....: ),
....: index=miindex,
....: columns=micolumns,
....: )
....: .sort_index()
....: .sort_index(axis=1)
....: )
....:
In [55]: dfmi
Out[55]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9 8 11 10
D1 13 12 15 14
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237 236 239 238
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249 248 251 250
D1 253 252 255 254
You can use pandas.IndexSlice to facilitate a more natural syntax using :, rather than using slice(None).
In [57]: idx = pd.IndexSlice
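A sketch of the IndexSlice syntax on the frame above:
dfmi.loc[idx[:, :, ['C1', 'C3']], idx[:, 'foo']]   # 'C1'/'C3' on the rows, 'foo' on the columns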
It is possible to perform quite complicated selections using this method on multiple axes at the same time.
In [59]: dfmi.loc["A1", (slice(None), "foo")]
Out[59]:
lvl0 a b
lvl1 foo foo
B0 C0 D0 64 66
D1 68 70
C1 D0 72 74
D1 76 78
C2 D0 80 82
... ... ...
B1 C1 D1 108 110
C2 D0 112 114
D1 116 118
C3 D0 120 122
D1 124 126
Using a boolean indexer you can provide selection related to the values.
In [61]: mask = dfmi[("a", "foo")] > 200
You can also specify the axis argument to .loc to interpret the passed slicers on a single axis.
In [63]: dfmi.loc(axis=0)[:, :, ["C1", "C3"]]
Out[63]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C1 D0 9 8 11 10
D1 13 12 15 14
C3 D0 25 24 27 26
D1 29 28 31 30
B1 C1 D0 41 40 43 42
... ... ... ... ...
A3 B0 C3 D1 221 220 223 222
B1 C1 D0 233 232 235 234
D1 237 236 239 238
C3 D0 249 248 251 250
D1 253 252 255 254
Furthermore, you can set the values using the following methods.
In [64]: df2 = dfmi.copy()
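The assignment producing the first output below is roughly (the second output further down uses the same selection with df2 * 1000 on the right-hand side):
df2.loc(axis=0)[:, :, ['C1', 'C3']] = -10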
In [66]: df2
Out[66]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 -10 -10 -10 -10
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 -10 -10 -10 -10
D1 -10 -10 -10 -10
In [69]: df2
Out[69]:
lvl0 a b
lvl1 bar foo bah foo
A0 B0 C0 D0 1 0 3 2
D1 5 4 7 6
C1 D0 9000 8000 11000 10000
D1 13000 12000 15000 14000
C2 D0 17 16 19 18
... ... ... ... ...
A3 B1 C1 D1 237000 236000 239000 238000
C2 D0 241 240 243 242
D1 245 244 247 246
C3 D0 249000 248000 251000 250000
D1 253000 252000 255000 254000
Cross-section
The xs() method of DataFrame additionally takes a level argument to make selecting data at a particular level of a
MultiIndex easier.
In [70]: df
Out[70]:
A B C
first second
bar one 0.895717 0.410835 -1.413681
two 0.805244 0.813850 1.607920
You can also select on the columns with xs, by providing the axis argument.
In [73]: df = df.T
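For example, a sketch:
df.xs('one', level='second', axis=1)   # select 'one' from the 'second' level of the columns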
You can pass drop_level=False to xs to retain the level that was selected.
In [78]: df.xs("one", level="second", axis=1, drop_level=False)
Out[78]:
first bar baz foo qux
second one one one one
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Compare the above with the result using drop_level=True (the default value).
In [79]: df.xs("one", level="second", axis=1, drop_level=True)
Out[79]:
first bar baz foo qux
A 0.895717 -1.206412 1.431256 -1.170299
B 0.410835 0.132003 -0.076467 1.130127
C -1.413681 1.024180 0.875906 0.974466
Using the parameter level in the reindex() and align() methods of pandas objects is useful to broadcast
values across a level. For instance:
In [80]: midx = pd.MultiIndex(
....: levels=[["zero", "one"], ["x", "y"]], codes=[[1, 1, 0, 0], [1, 0, 1, 0]]
....: )
....:
In [82]: df
Out[82]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [84]: df2
Out[84]:
0 1
one 1.060074 -0.109716
zero 1.271532 0.713416
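Here df2 is the level-0 aggregate of df; the broadcast itself is roughly:
df2 = df.groupby(level=0).mean()   # assumed construction of df2
df2.reindex(df.index, level=0)     # repeat each aggregate across level 0 of df's index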
# aligning
In [86]: df_aligned, df2_aligned = df.align(df2, level=0)
In [87]: df_aligned
Out[87]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
In [88]: df2_aligned
Out[88]:
0 1
one y 1.060074 -0.109716
x 1.060074 -0.109716
zero y 1.271532 0.713416
x 1.271532 0.713416
In [89]: df[:5]
Out[89]:
0 1
one y 1.519970 -0.493662
x 0.600178 0.274230
zero y 0.132885 -0.023688
x 2.410179 1.450520
The reorder_levels() method generalizes the swaplevel method, allowing you to permute the hierarchical
index levels in one step:
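A sketch of both methods on the frame above:
df[:5].swaplevel(0, 1, axis=0)          # swap two levels
df[:5].reorder_levels([1, 0], axis=0)   # permute any number of levels at once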
The rename() method is used to rename the labels of a MultiIndex, and is typically used to rename the columns
of a DataFrame. The columns argument of rename allows a dictionary to be specified that includes only the
columns you wish to rename.
This method can also be used to rename specific labels of the main index of the DataFrame.
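For example, a sketch:
df.rename(columns={0: 'col0', 1: 'col1'})    # rename columns
df.rename(index={'one': 'two', 'y': 'z'})    # rename specific index labels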
The rename_axis() method is used to rename the name of a Index or MultiIndex. In particular, the names of
the levels of a MultiIndex can be specified, which is useful if reset_index() is later used to move the values
from the MultiIndex to a column.
Note that the columns of a DataFrame are an index, so that using rename_axis with the columns argument
will change the name of that index.
In [95]: df.rename_axis(columns="Cols").columns
Out[95]: RangeIndex(start=0, stop=2, step=1, name='Cols')
Both rename and rename_axis support specifying a dictionary, Series or a mapping function to map la-
bels/names to new values.
When working with an Index object directly, rather than via a DataFrame, Index.set_names() can be used
to change the names.
In [96]: mi = pd.MultiIndex.from_product([[1, 2], ["a", "b"]], names=["x", "y"])
In [97]: mi.names
Out[97]: FrozenList(['x', 'y'])
In [99]: mi2
Out[99]:
MultiIndex([(1, 'a'),
(1, 'b'),
(2, 'a'),
(2, 'b')],
names=['new name', 'y'])
For MultiIndex-ed objects to be indexed and sliced effectively, they need to be sorted. As with any index, you can
use sort_index().
In [101]: import random
In [102]: random.shuffle(tuples)
In [104]: s
Out[104]:
bar two 0.206053
one -0.251905
In [105]: s.sort_index()
Out[105]:
bar one -0.251905
two 0.206053
baz one 0.299368
two 1.063327
foo one 1.266143
two -0.863838
qux one 0.408204
two -2.213588
dtype: float64
In [106]: s.sort_index(level=0)
Out[106]:
bar one -0.251905
two 0.206053
baz one 0.299368
two 1.063327
foo one 1.266143
two -0.863838
qux one 0.408204
two -2.213588
dtype: float64
In [107]: s.sort_index(level=1)
Out[107]:
bar one -0.251905
baz one 0.299368
foo one 1.266143
qux one 0.408204
bar two 0.206053
baz two 1.063327
foo two -0.863838
qux two -2.213588
dtype: float64
You may also pass a level name to sort_index if the MultiIndex levels are named.
In [108]: s.index.set_names(["L1", "L2"], inplace=True)
In [109]: s.sort_index(level="L1")
Out[109]:
L1 L2
bar one -0.251905
two 0.206053
baz one 0.299368
two 1.063327
foo one 1.266143
two -0.863838
In [110]: s.sort_index(level="L2")
Out[110]:
L1 L2
bar one -0.251905
baz one 0.299368
foo one 1.266143
qux one 0.408204
bar two 0.206053
baz two 1.063327
foo two -0.863838
qux two -2.213588
dtype: float64
On higher dimensional objects, you can sort any of the other axes by level if they have a MultiIndex:
Indexing will work even if the data are not sorted, but will be rather inefficient (and show a PerformanceWarning).
It will also return a copy of the data rather than a view:
In [114]: dfm
Out[114]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 z 0.537020
y 0.110968
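The output below presumably comes from a selection like the following, which works on the unsorted index but emits a PerformanceWarning:
dfm.loc[(1, 'z')]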
Out[4]:
jolie
jim joe
1 z 0.64094
Furthermore, if you try to index something that is not fully lexsorted, this can raise:
The is_lexsorted() method on a MultiIndex shows if the index is sorted, and the lexsort_depth prop-
erty returns the sort depth:
In [115]: dfm.index.is_lexsorted()
Out[115]: False
In [116]: dfm.index.lexsort_depth
Out[116]: 1
In [118]: dfm
Out[118]:
jolie
jim joe
0 x 0.490671
x 0.120248
1 y 0.110968
z 0.537020
In [119]: dfm.index.is_lexsorted()
Out[119]: True
In [120]: dfm.index.lexsort_depth
Out[120]: 2
Similar to NumPy ndarrays, pandas Index, Series, and DataFrame also provides the take() method that
retrieves elements along a given axis at the given indices. The given indices must be either a list or an ndarray of
integer index positions. take will also accept negative integers as relative positions to the end of the object.
In [122]: index = pd.Index(np.random.randint(0, 1000, 10))
In [123]: index
Out[123]: Int64Index([214, 502, 712, 567, 786, 175, 993, 133, 758, 329], dtype='int64')
In [125]: index[positions]
Out[125]: Int64Index([214, 329, 567], dtype='int64')
In [128]: ser.iloc[positions]
Out[128]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
In [129]: ser.take(positions)
Out[129]:
0 -0.179666
9 1.824375
3 0.392149
dtype: float64
For DataFrames, the given indices should be a 1d list or ndarray that specifies row or column positions.
It is important to note that the take method on pandas objects is not intended to work on boolean indices and may
return unexpected results.
Finally, as a small note on performance, because the take method handles a narrower range of inputs, it can offer
performance that is a good deal faster than fancy indexing.
In [141]: random.shuffle(indexer)
We have discussed MultiIndex in the previous sections pretty extensively. Documentation about
DatetimeIndex and PeriodIndex are shown here, and documentation about TimedeltaIndex is found
here.
In the following sub-sections we will highlight some other index types.
CategoricalIndex
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container
around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated
elements.
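As a sketch, a frame backed by a CategoricalIndex might be built like this (mirroring the outputs below; the exact data are assumptions):
from pandas.api.types import CategoricalDtype

df = pd.DataFrame({'A': np.arange(6), 'B': list('aabbca')})
df['B'] = df['B'].astype(CategoricalDtype(list('cab')))   # categories in the order 'c', 'a', 'b'
df2 = df.set_index('B')                                   # the index is now a CategoricalIndex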
In [149]: df.dtypes
Out[149]:
A int64
B category
dtype: object
In [150]: df["B"].cat.categories
Out[150]: Index(['c', 'a', 'b'], dtype='object')
In [152]: df2.index
Out[152]: CategoricalIndex(['a', 'a', 'b', 'b', 'c', 'a'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Indexing with __getitem__/.iloc/.loc works similarly to an Index with duplicates. The indexers must be
in the category or the operation will raise a KeyError.
In [153]: df2.loc["a"]
Out[153]:
A
B
a 0
a 1
a 5
Sorting the index will sort by the order of the categories (recall that we created the index with
CategoricalDtype(list('cab')), so the sorted order is cab).
In [155]: df2.sort_index()
Out[155]:
A
B
c 4
a 0
a 1
a 5
b 2
b 3
Groupby operations on the index will preserve the index nature as well.
In [156]: df2.groupby(level=0).sum()
Out[156]:
A
B
c 4
a 6
b 5
In [157]: df2.groupby(level=0).sum().index
Out[157]: CategoricalIndex(['c', 'a', 'b'], categories=['c', 'a', 'b'], ordered=False, name='B', dtype='category')
Reindexing operations will return a resulting index based on the type of the passed indexer. Passing a list will return
a plain-old Index; indexing with a Categorical will return a CategoricalIndex, indexed according to the
categories of the passed Categorical dtype. This allows one to arbitrarily index these even with values not in the
categories, similarly to how you can reindex any pandas index.
In [160]: df3
Out[160]:
A
B
a 0
b 1
c 2
Warning: Reshaping and Comparison operations on a CategoricalIndex must have the same categories or
a TypeError will be raised.
In [165]: df4 = pd.DataFrame({"A": np.arange(2), "B": list("ba")})
In [168]: df4.index
Out[168]: CategoricalIndex(['b', 'a'], categories=['a', 'b'], ordered=False, name='B', dtype='category')
In [172]: df5.index
Out[172]: CategoricalIndex(['b', 'c'], categories=['b', 'c'], ordered=False, name='B', dtype='category')
Int64Index is a fundamental index type in pandas. It is an immutable array implementing an ordered, sliceable
set.
RangeIndex is a sub-class of Int64Index that provides the default index for all NDFrame objects.
RangeIndex is an optimized version of Int64Index that can represent a monotonic ordered set. These are
analogous to Python range types.
Float64Index
By default a Float64Index will be automatically created when passing floating, or mixed-integer-floating values
in index creation. This enables a pure label-based slicing paradigm that makes [],ix,loc for scalar indexing and
slicing work exactly the same.
In [174]: indexf
Out[174]: Float64Index([1.5, 2.0, 3.0, 4.5, 5.0], dtype='float64')
In [176]: sf
Out[176]:
1.5 0
2.0 1
3.0 2
4.5 3
Scalar selection for [],.loc will always be label based. An integer will match an equal float index (e.g. 3 is
equivalent to 3.0).
In [177]: sf[3]
Out[177]: 2
In [178]: sf[3.0]
Out[178]: 2
In [179]: sf.loc[3]
Out[179]: 2
In [180]: sf.loc[3.0]
Out[180]: 2
In [181]: sf.iloc[3]
Out[181]: 3
A scalar index that is not found will raise a KeyError. Slicing is primarily on the values of the index when using
[],ix,loc, and always positional when using iloc. The exception is when the slice is boolean, in which case it
will always be positional.
In [182]: sf[2:4]
Out[182]:
2.0 1
3.0 2
dtype: int64
In [183]: sf.loc[2:4]
Out[183]:
2.0 1
3.0 2
dtype: int64
In [184]: sf.iloc[2:4]
Out[184]:
3.0 2
4.5 3
dtype: int64
In [185]: sf[2.1:4.6]
Out[185]:
3.0 2
4.5 3
dtype: int64
In [186]: sf.loc[2.1:4.6]
Out[186]:
3.0 2
In [1]: pd.Series(range(5))[3.5]
TypeError: the label [3.5] is not a proper indexer for this index type (Int64Index)
In [1]: pd.Series(range(5))[3.5:4.5]
TypeError: the slice start [3.5] is not a proper indexer for this index type (Int64Index)
Here is a typical use-case for using this type of indexing. Imagine that you have a somewhat irregular timedelta-like
indexing scheme, but the data is recorded as floats. This could, for example, be millisecond offsets.
In [187]: dfir = pd.concat(
.....:     [
.....:         pd.DataFrame(
.....:             np.random.randn(5, 2),
.....:             index=np.arange(5) * 250.0,
.....:             columns=list("AB"),
.....:         ),
.....:         pd.DataFrame(
.....:             np.random.randn(6, 2),
.....:             index=np.arange(4, 10) * 250.1,
.....:             columns=list("AB"),
.....:         ),
.....:     ]
.....: )
.....:
In [188]: dfir
Out[188]:
A B
0.0 -0.435772 -1.188928
250.0 -0.808286 -0.284634
500.0 -1.815703 1.347213
750.0 -0.243487 0.514704
1000.0 1.162969 -0.287725
1000.4 -0.179734 0.993962
1250.5 -0.212673 0.909872
1500.6 -0.733333 -0.349893
1750.7 0.456434 -0.306735
2000.8 0.553396 0.166221
2250.9 -0.101684 -0.734907
Selection operations then will always work on a value basis, for all selection operators.
In [189]: dfir[0:1000.4]
Out[189]:
A B
0.0 -0.435772 -1.188928
250.0 -0.808286 -0.284634
500.0 -1.815703 1.347213
750.0 -0.243487 0.514704
1000.0 1.162969 -0.287725
1000.4 -0.179734 0.993962
In [191]: dfir.loc[1000.4]
Out[191]:
A -0.179734
B 0.993962
Name: 1000.4, dtype: float64
You could retrieve the first 1 second (1000 ms) of data as such:
In [192]: dfir[0:1000]
Out[192]:
A B
0.0 -0.435772 -1.188928
250.0 -0.808286 -0.284634
500.0 -1.815703 1.347213
750.0 -0.243487 0.514704
1000.0 1.162969 -0.287725
In [193]: dfir.iloc[0:5]
Out[193]:
A B
0.0 -0.435772 -1.188928
250.0 -0.808286 -0.284634
500.0 -1.815703 1.347213
750.0 -0.243487 0.514704
1000.0 1.162969 -0.287725
IntervalIndex
IntervalIndex together with its own dtype, IntervalDtype as well as the Interval scalar type, allow
first-class support in pandas for interval notation.
The IntervalIndex allows some unique indexing and is also used as a return type for the categories in cut()
and qcut().
In [194]: df = pd.DataFrame(
.....: {"A": [1, 2, 3, 4]}, index=pd.IntervalIndex.from_breaks([0, 1, 2, 3, 4])
.....: )
.....:
In [195]: df
Out[195]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
(3, 4] 4
Label based indexing via .loc along the edges of an interval works as you would expect, selecting that particular
interval.
In [196]: df.loc[2]
Out[196]:
A 2
Name: (1, 2], dtype: int64
If you select a label contained within an interval, this will also select the interval.
In [198]: df.loc[2.5]
Out[198]:
A 3
Name: (2, 3], dtype: int64
Selecting using an Interval will only return exact matches (starting from pandas 0.25.0).
Trying to select an Interval that is not exactly contained in the IntervalIndex will raise a KeyError.
Selecting all Intervals that overlap a given Interval can be performed using the overlaps() method to
create a boolean indexer.
In [202]: idxr
Out[202]: array([ True, True, True, False])
In [203]: df[idxr]
Out[203]:
A
(0, 1] 1
(1, 2] 2
(2, 3] 3
cut() and qcut() both return a Categorical object, and the bins they create are stored as an IntervalIndex
in its .categories attribute.
In [205]: c
Out[205]:
[(-0.003, 1.5], (-0.003, 1.5], (1.5, 3.0], (1.5, 3.0]]
Categories (2, interval[float64]): [(-0.003, 1.5] < (1.5, 3.0]]
In [206]: c.categories
Out[206]:
IntervalIndex([(-0.003, 1.5], (1.5, 3.0]],
closed='right',
dtype='interval[float64]')
cut() also accepts an IntervalIndex for its bins argument, which enables a useful pandas idiom. First, we
call cut() with some data and bins set to a fixed number, to generate the bins. Then, we pass the values of
.categories as the bins argument in subsequent calls to cut(), supplying new data which will be binned into
the same bins.
Any value which falls outside all bins will be assigned a NaN value.
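A sketch of the idiom:
c = pd.cut(range(4), bins=2)               # generate the bins once
pd.cut([0, 3, 5, 1], bins=c.categories)    # bin new data into the same bins; 5 falls outside -> NaN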
If we need intervals on a regular frequency, we can use the interval_range() function to create an
IntervalIndex using various combinations of start, end, and periods. The default frequency for
interval_range is a 1 for numeric intervals, and calendar day for datetime-like intervals:
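For example, sketches for numeric, datetime-like, and timedelta-like intervals:
pd.interval_range(start=0, end=5)
pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4)
pd.interval_range(end=pd.Timedelta('3 days'), periods=3)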
The freq parameter can be used to specify non-default frequencies, and can utilize a variety of frequency aliases with
datetime-like intervals:
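For example:
pd.interval_range(start=0, periods=5, freq=1.5)
pd.interval_range(start=pd.Timestamp('2017-01-01'), periods=4, freq='W')
pd.interval_range(start=pd.Timedelta('0 days'), periods=3, freq='9H')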
Additionally, the closed parameter can be used to specify which side(s) the intervals are closed on. Intervals are
closed on the right side by default.
Specifying start, end, and periods will generate a range of evenly spaced intervals from start to end inclu-
sively, with periods number of elements in the resulting IntervalIndex:
Out[217]:
IntervalIndex([(2018-01-01, 2018-01-20 08:00:00], (2018-01-20 08:00:00, 2018-02-08 16:00:00], (2018-02-08 16:00:00, 2018-02-28]],
              closed='right',
              dtype='interval[datetime64[ns]]')
Integer indexing
Label-based indexing with integer axis labels is a thorny topic. It has been discussed heavily on mailing lists and
among various members of the scientific Python community. In pandas, our general viewpoint is that labels matter
more than integer locations. Therefore, with an integer axis index only label-based indexing is possible with the
standard tools like .loc. The following code will generate exceptions:
In [218]: s = pd.Series(range(5))
In [219]: s[-1]
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/pandas/pandas/core/indexes/range.py in get_loc(self, key, method, tolerance)
350 try:
--> 351 return self._range.index(new_key)
352 except ValueError as err:
The above exception was the direct cause of the following exception:
KeyError: -1
In [221]: df
Out[221]:
0 1 2 3
0 -0.130121 -0.476046 0.759104 0.213379
1 -0.082641 0.448008 0.656420 -1.051443
2 0.594956 -0.151360 -0.069303 1.221431
3 -0.182832 0.791235 0.042745 2.069775
4 1.446552 0.019814 -1.389212 -0.702312
In [222]: df.loc[-2:]
Out[222]:
0 1 2 3
0 -0.130121 -0.476046 0.759104 0.213379
1 -0.082641 0.448008 0.656420 -1.051443
2 0.594956 -0.151360 -0.069303 1.221431
3 -0.182832 0.791235 0.042745 2.069775
4 1.446552 0.019814 -1.389212 -0.702312
This deliberate decision was made to prevent ambiguities and subtle bugs (many users reported finding bugs when the
API change was made to stop “falling back” on position-based indexing).
If the index of a Series or DataFrame is monotonically increasing or decreasing, then the bounds of a label-based
slice can be outside the range of the index, much like slice indexing a normal Python list. Monotonicity of an index
can be tested with the is_monotonic_increasing() and is_monotonic_decreasing() attributes.
In [224]: df.index.is_monotonic_increasing
Out[224]: True
On the other hand, if the index is not monotonic, then both slice bounds must be unique members of the index.
In [228]: df.index.is_monotonic_increasing
Out[228]: False
In [231]: weakly_monotonic
Out[231]: Index(['a', 'b', 'c', 'c'], dtype='object')
In [232]: weakly_monotonic.is_monotonic_increasing
Out[232]: True
Compared with standard Python sequence slicing in which the slice endpoint is not inclusive, label-based slicing in
pandas is inclusive. The primary reason for this is that it is often not possible to easily determine the “successor” or
next element after a particular label in an index. For example, consider the following Series:
In [235]: s
Out[235]:
a 0.301379
b 1.240445
c -0.846068
d -0.043312
e -1.658747
f -0.819549
dtype: float64
Suppose we wished to slice from c to e, using integers this would be accomplished as such:
In [236]: s[2:5]
Out[236]:
c -0.846068
d -0.043312
e -1.658747
dtype: float64
However, if you only had c and e, determining the next element in the index can be somewhat complicated. For
example, the following does not work:
s.loc['c':'e' + 1]
A very common use case is to limit a time series to start and end at two specific dates. To enable this, we made the
design choice to make label-based slicing include both endpoints:
In [237]: s.loc["c":"e"]
Out[237]:
c -0.846068
d -0.043312
e -1.658747
dtype: float64
This is most definitely a “practicality beats purity” sort of thing, but it is something to watch out for if you expect
label-based slicing to behave exactly in the way that standard Python integer slicing works.
The different indexing operations can potentially change the dtype of a Series.
In [239]: series1.dtype
Out[239]: dtype('int64')
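As a sketch of the missing construction (series1 holds integers; reindexing introduces a missing label):
series1 = pd.Series([1, 2, 3])
res = series1.reindex([0, 4])   # label 4 does not exist, so NaN is inserted and the dtype becomes float64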
In [242]: res
Out[242]:
0 1.0
4 NaN
dtype: float64
In [244]: series2.dtype
Out[244]: dtype('bool')
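Here series2 is assumed to be a boolean Series, e.g. pd.Series([True]); reindexing it against a longer index is roughly:
res = series2.reindex_like(series1)   # NaN cannot be held in a bool Series, so the dtype becomes object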
In [246]: res.dtype
Out[246]: dtype('O')
In [247]: res
Out[247]:
0 True
1 NaN
2 NaN
dtype: object
This is because the (re)indexing operations above silently insert NaNs and the dtype changes accordingly. This can
cause some issues when using numpy ufuncs such as numpy.logical_and.
See this old issue for a more detailed discussion.
pandas provides various facilities for easily combining together Series or DataFrame with various kinds of set logic
for the indexes and relational algebra functionality in the case of join / merge-type operations.
In addition, pandas also provides utilities to compare two Series or DataFrame and summarize their differences.
The concat() function (in the main pandas namespace) does all of the heavy lifting of performing concatenation
operations along an axis while performing optional set logic (union or intersection) of the indexes (if any) on the other
axes. Note that I say “if any” because there is only a single possible axis of concatenation for Series.
Before diving into all of the details of concat and what it can do, here is a simple example:
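As a sketch (three small frames with consecutive integer indexes; the exact contents are assumptions chosen to match the outputs further below):
df1 = pd.DataFrame({'A': ['A0', 'A1', 'A2', 'A3'], 'B': ['B0', 'B1', 'B2', 'B3'],
                    'C': ['C0', 'C1', 'C2', 'C3'], 'D': ['D0', 'D1', 'D2', 'D3']},
                   index=[0, 1, 2, 3])
df2 = pd.DataFrame({'A': ['A4', 'A5', 'A6', 'A7'], 'B': ['B4', 'B5', 'B6', 'B7'],
                    'C': ['C4', 'C5', 'C6', 'C7'], 'D': ['D4', 'D5', 'D6', 'D7']},
                   index=[4, 5, 6, 7])
df3 = pd.DataFrame({'A': ['A8', 'A9', 'A10', 'A11'], 'B': ['B8', 'B9', 'B10', 'B11'],
                    'C': ['C8', 'C9', 'C10', 'C11'], 'D': ['D8', 'D9', 'D10', 'D11']},
                   index=[8, 9, 10, 11])
frames = [df1, df2, df3]
result = pd.concat(frames)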
Like its sibling function on ndarrays, numpy.concatenate, pandas.concat takes a list or dict of
homogeneously-typed objects and concatenates them with some configurable handling of “what to do with the other
axes”:
pd.concat(
objs,
axis=0,
join="outer",
ignore_index=False,
keys=None,
levels=None,
names=None,
verify_integrity=False,
copy=True,
)
• objs : a sequence or mapping of Series or DataFrame objects. If a dict is passed, the sorted keys will be used
as the keys argument, unless it is passed, in which case the values will be selected (see below). Any None
objects will be dropped silently unless they are all None in which case a ValueError will be raised.
• axis : {0, 1, ...}, default 0. The axis to concatenate along.
• join : {‘inner’, ‘outer’}, default ‘outer’. How to handle indexes on other axis(es). Outer for union and inner
for intersection.
• ignore_index : boolean, default False. If True, do not use the index values on the concatenation axis. The
resulting axis will be labeled 0, ..., n - 1. This is useful if you are concatenating objects where the concatenation
axis does not have meaningful indexing information. Note the index values on the other axes are still respected
in the join.
• keys : sequence, default None. Construct hierarchical index using the passed keys as the outermost level. If
multiple levels passed, should contain tuples.
• levels : list of sequences, default None. Specific levels (unique values) to use for constructing a MultiIndex.
Otherwise they will be inferred from the keys.
• names : list, default None. Names for the levels in the resulting hierarchical index.
• verify_integrity : boolean, default False. Check whether the new concatenated axis contains duplicates.
This can be very expensive relative to the actual data concatenation.
• copy : boolean, default True. If False, do not copy data unnecessarily.
Without a little bit of context many of these arguments don’t make much sense. Let’s revisit the above example.
Suppose we wanted to associate specific keys with each of the pieces of the chopped up DataFrame. We can do this
using the keys argument:
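Continuing the sketch above:
result = pd.concat(frames, keys=['x', 'y', 'z'])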
As you can see (if you’ve read the rest of the documentation), the resulting object’s index has a hierarchical index.
This means that we can now select out each chunk by key:
In [7]: result.loc["y"]
Out[7]:
A B C D
4 A4 B4 C4 D4
5 A5 B5 C5 D5
6 A6 B6 C6 D6
7 A7 B7 C7 D7
It’s not a stretch to see how this can be very useful. More detail on this functionality below.
Note: It is worth noting that concat() (and therefore append()) makes a full copy of the data, and that constantly
reusing this function can create a significant performance hit. If you need to use the operation over several datasets,
use a list comprehension.
Note: When concatenating DataFrames with named axes, pandas will attempt to preserve these index/column names
whenever possible. In the case where all inputs share a common name, this name will be assigned to the result. When
the input names do not all agree, the result will be unnamed. The same is true for MultiIndex, but the logic is
applied separately on a level-by-level basis.
When gluing together multiple DataFrames, you have a choice of how to handle the other axes (other than the one
being concatenated). This can be done in the following two ways:
• Take the union of them all, join='outer'. This is the default option as it results in zero information loss.
• Take the intersection, join='inner'.
Here is an example of each of these methods. First, the default join='outer' behavior:
Lastly, suppose we just wanted to reuse the exact index from the original DataFrame:
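A sketch of all three (df4 is assumed to share only part of df1's index and columns):
df4 = pd.DataFrame({'B': ['B2', 'B3', 'B6', 'B7'], 'D': ['D2', 'D3', 'D6', 'D7'],
                    'F': ['F2', 'F3', 'F6', 'F7']},
                   index=[2, 3, 6, 7])
pd.concat([df1, df4], axis=1)                      # join='outer' (default): union of the indexes
pd.concat([df1, df4], axis=1, join='inner')        # intersection of the indexes
pd.concat([df1, df4], axis=1).reindex(df1.index)   # reuse the exact index from df1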
A useful shortcut to concat() are the append() instance methods on Series and DataFrame. These methods
actually predated concat. They concatenate along axis=0, namely the index:
In the case of DataFrame, the indexes must be disjoint but the columns do not need to be:
Note: Unlike Python’s list.append() method, which appends in place and returns None, append() here does
not modify df1 and returns its copy with df2 appended.
For DataFrame objects which don’t have a meaningful index, you may wish to append them and ignore the fact that
they may have overlapping indexes. To do this, use the ignore_index argument:
You can concatenate a mix of Series and DataFrame objects. The Series will be transformed to DataFrame
with the column name as the name of the Series.
Note: Since we’re concatenating a Series to a DataFrame, we could have achieved the same result with
DataFrame.assign(). To concatenate an arbitrary number of pandas objects (DataFrame or Series), use
concat.
A fairly common use of the keys argument is to override the column names when creating a new DataFrame based
on existing Series. Notice how the default behaviour consists of letting the resulting DataFrame inherit the parent
Series’ name, when these existed.
Through the keys argument we can override the existing column names.
You can also pass a dict to concat in which case the dict keys will be used for the keys argument (unless other keys
are specified):
The MultiIndex created has levels that are constructed from the passed keys and the index of the DataFrame pieces:
In [32]: result.index.levels
Out[32]: FrozenList([['z', 'y'], [4, 5, 6, 7, 8, 9, 10, 11]])
If you wish to specify other levels (as will occasionally be the case), you can do so using the levels argument:
....: )
....:
In [34]: result.index.levels
Out[34]: FrozenList([['z', 'y', 'x', 'w'], [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]])
This is fairly esoteric, but it is actually necessary for implementing things like GroupBy where the order of a categorical
variable is meaningful.
While not especially efficient (since a new object must be created), you can append a single row to a DataFrame by
passing a Series or dict to append, which returns a new DataFrame as above.
You should use ignore_index with this method to instruct DataFrame to discard its index. If you wish to preserve
the index, you should construct an appropriately-indexed DataFrame and append or concatenate those objects.
You can also pass a list of dicts or Series:
In [37]: dicts = [{"A": 1, "B": 2, "C": 3, "X": 4}, {"A": 5, "B": 6, "C": 7, "Y": 8}]
pandas has full-featured, high performance in-memory join operations idiomatically very similar to relational
databases like SQL. These methods perform significantly better (in some cases well over an order of magnitude better)
than other open source implementations (like base::merge.data.frame in R). The reason for this is careful
algorithmic design and the internal layout of the data in DataFrame.
See the cookbook for some advanced strategies.
Users who are familiar with SQL but new to pandas might be interested in a comparison with SQL.
pandas provides a single function, merge(), as the entry point for all standard database join operations between
DataFrame or named Series objects:
pd.merge(
left,
right,
how="inner",
on=None,
left_on=None,
right_on=None,
left_index=False,
right_index=False,
sort=True,
suffixes=("_x", "_y"),
copy=True,
indicator=False,
validate=None,
)
Note: Support for specifying index levels as the on, left_on, and right_on parameters was added in version
0.23.0. Support for merging named Series objects was added in version 0.24.0.
The return type will be the same as left. If left is a DataFrame or named Series and right is a subclass of
DataFrame, the return type will still be DataFrame.
merge is a function in the pandas namespace, and it is also available as a DataFrame instance method merge(),
with the calling DataFrame being implicitly considered the left object in the join.
The related join() method uses merge internally for the index-on-index (by default) and column(s)-on-index join.
If you are joining on index only, you may wish to use DataFrame.join to save yourself some typing.
Experienced users of relational databases like SQL will be familiar with the terminology used to describe join oper-
ations between two SQL-table like structures (DataFrame objects). There are several cases to consider which are
very important to understand:
• one-to-one joins: for example when joining two DataFrame objects on their indexes (which must contain
unique values).
• many-to-one joins: for example when joining an index (unique) to one or more columns in a different
DataFrame.
• many-to-many joins: joining columns on columns.
Note: When joining columns on columns (potentially a many-to-many join), any indexes on the passed DataFrame
objects will be discarded.
It is worth spending some time understanding the result of the many-to-many join case. In SQL / standard relational
algebra, if a key combination appears more than once in both tables, the resulting table will have the Cartesian
product of the associated data. Here is a very basic example with one unique key combination:
Here is a more complicated example with multiple join keys. Only the keys appearing in left and right are present
(the intersection), since how='inner' by default.
The how argument to merge specifies how to determine which keys are to be included in the resulting table. If a
key combination does not appear in either the left or right tables, the values in the joined table will be NA. Here is a
summary of the how options and their SQL equivalent names:
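The summary table itself is not reproduced here; as a sketch with two small illustrative frames, the four options correspond to the standard SQL joins:

import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1"], "lval": [1, 2]})
right = pd.DataFrame({"key": ["K1", "K2"], "rval": [3, 4]})

pd.merge(left, right, on="key", how="left")   # SQL LEFT OUTER JOIN: keys from the left frame only
pd.merge(left, right, on="key", how="right")  # SQL RIGHT OUTER JOIN: keys from the right frame only
pd.merge(left, right, on="key", how="outer")  # SQL FULL OUTER JOIN: union of both key sets
pd.merge(left, right, on="key", how="inner")  # SQL INNER JOIN: intersection of both key sets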
You can merge a multi-indexed Series and a DataFrame, if the names of the MultiIndex correspond to the columns
from the DataFrame. Transform the Series to a DataFrame using Series.reset_index() before merging, as
shown in the following example.
In [49]: df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})
In [50]: df
Out[50]:
Let Num
0 A 1
1 B 2
2 C 3
In [52]: ser
Out[52]:
Let Num
A 1 a
B 2 b
C 3 c
A 4 d
B 5 e
C 6 f
dtype: object
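The merge call itself is not shown above; a sketch of the step the text describes, with ser reconstructed from the output shown:

import pandas as pd

df = pd.DataFrame({"Let": ["A", "B", "C"], "Num": [1, 2, 3]})
ser = pd.Series(
    ["a", "b", "c", "d", "e", "f"],
    index=pd.MultiIndex.from_arrays(
        [["A", "B", "C", "A", "B", "C"], [1, 2, 3, 4, 5, 6]], names=["Let", "Num"]
    ),
)

# reset_index() turns the MultiIndex levels into ordinary columns, so a plain column merge works.
pd.merge(df, ser.reset_index(), on=["Let", "Num"])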
Warning: Joining / merging on duplicate keys can return a frame whose number of rows is the product of the row
counts of the inputs, which may result in memory overflow. It is the user’s responsibility to manage duplicate values in
keys before joining large DataFrames.
Users can use the validate argument to automatically check whether there are unexpected duplicates in their merge
keys. Key uniqueness is checked before merge operations and so should protect against memory overflows. Checking
key uniqueness is also a good way to ensure user data structures are as expected.
In the following example, there are duplicate values of B in the right DataFrame. As this is not a one-to-one merge
– as specified in the validate argument – an exception will be raised.
If the user is aware of the duplicates in the right DataFrame but wants to ensure there are no duplicates in the left
DataFrame, one can use the validate='one_to_many' argument instead, which will not raise an exception.
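A sketch of the check described above, with small illustrative frames along the lines of the (elided) example, where the duplicate B values live in the right frame:

import pandas as pd

left = pd.DataFrame({"A": [1, 2], "B": [1, 2]})
right = pd.DataFrame({"A": [4, 5, 6], "B": [2, 2, 2]})

# Raises MergeError: B is not unique in the right frame.
# pd.merge(left, right, on="B", how="outer", validate="one_to_one")

# Succeeds: duplicates are permitted on the right-hand side only.
pd.merge(left, right, on="B", how="outer", validate="one_to_many")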
merge() accepts the argument indicator. If True, a Categorical-type column called _merge will be added to
the output object. It takes the value "left_only" when the merge key appears only in the left frame, "right_only"
when it appears only in the right frame, and "both" when it appears in both frames.
The indicator argument will also accept string arguments, in which case the indicator function will use the value
of the passed string as the name for the indicator column.
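A sketch of both forms (illustrative frames; the name passed as the string argument is arbitrary):

import pandas as pd

df1 = pd.DataFrame({"col1": [0, 1], "col_left": ["a", "b"]})
df2 = pd.DataFrame({"col1": [1, 2, 2], "col_right": [2, 2, 2]})

pd.merge(df1, df2, on="col1", how="outer", indicator=True)                # adds the _merge column
pd.merge(df1, df2, on="col1", how="outer", indicator="indicator_column")  # same, with a custom column name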
Merge dtypes
In [65]: left
Out[65]:
key v1
0 1 10
In [67]: right
Out[67]:
key v1
0 1 20
1 2 30
Of course, if missing values are introduced by the merge, the resulting dtype will be upcast.
In [70]: pd.merge(left, right, how="outer", on="key")
Out[70]:
key v1_x v1_y
0 1 10.0 20
1 2 NaN 30
Merging will preserve the category dtype of the merge operands. See also the section on categoricals.
The left frame:
In [72]: from pandas.api.types import CategoricalDtype
In [76]: left
Out[76]:
X Y
0 bar one
1 foo one
2 foo three
3 bar three
4 foo one
5 bar one
6 bar three
7 bar three
8 bar three
9 foo three
In [77]: left.dtypes
Out[77]:
X    category
Y      object
dtype: object
In [79]: right
Out[79]:
X Z
0 foo 1
1 bar 2
In [80]: right.dtypes
Out[80]:
X category
Z int64
dtype: object
In [82]: result
Out[82]:
X Y Z
0 bar one 2
1 bar three 2
2 bar one 2
3 bar three 2
4 bar three 2
5 bar three 2
6 foo one 1
7 foo three 1
8 foo one 1
9 foo three 1
In [83]: result.dtypes
Out[83]:
X category
Y object
Z int64
dtype: object
Note: The category dtypes must be exactly the same, meaning the same categories and the ordered attribute. Other-
wise the result will coerce to the categories’ dtype.
Note: Merging on category dtypes that are the same can be quite performant compared to object dtype merging.
Joining on index
DataFrame.join() is a convenient method for combining the columns of two potentially differently-indexed
DataFrames into a single result DataFrame. Here is a very basic example:
The data alignment here is on the indexes (row labels). This same behavior can be achieved using merge plus
additional arguments instructing it to use the indexes:
In [89]: result = pd.merge(left, right, left_index=True, right_index=True, how="outer")
join() takes an optional on argument which may be a column or multiple column names, which specifies that the
passed DataFrame is to be aligned on that column in the DataFrame. These two function calls are completely
equivalent:
left.join(right, on=key_or_keys)
pd.merge(
left, right, left_on=key_or_keys, right_index=True, how="left", sort=False
)
Obviously you can choose whichever form you find more convenient. For many-to-one joins (where one of the
DataFrame’s is already indexed by the join key), using join may be more convenient. Here is a simple example:
In [91]: left = pd.DataFrame(
....: {
....: "A": ["A0", "A1", "A2", "A3"],
....: "B": ["B0", "B1", "B2", "B3"],
   ....:         "key": ["K0", "K1", "K0", "K1"],
   ....:     }
   ....: )
   ....:
In [92]: right = pd.DataFrame({"C": ["C0", "C1"], "D": ["D0", "D1"]}, index=["K0", "K1"])
Now this can be joined by passing the two key column names:
The default for DataFrame.join is to perform a left join (essentially a “VLOOKUP” operation, for Excel users),
which uses only the keys found in the calling DataFrame. Other join types, for example inner join, can be just as easily
performed:
As you can see, this drops any rows where there was no match.
You can join a singly-indexed DataFrame with a level of a MultiIndexed DataFrame. The level will match on the
name of the index of the singly-indexed frame against a level name of the MultiIndexed frame.
This is equivalent to, but less verbose and more memory efficient / faster than, the reset_index / merge alternative shown below.
In [104]: result = pd.merge(
.....: left.reset_index(), right.reset_index(), on=["key"], how="inner"
.....: ).set_index(["key","Y"])
.....:
This is supported in a limited way, provided that the index for the right argument is completely used in the join, and is
a subset of the indices in the left argument, as in this example:
In [105]: leftindex = pd.MultiIndex.from_product(
.....: [list("abc"), list("xy"), [1, 2]], names=["abc", "xy", "num"]
.....: )
.....:
In [107]: left
Out[107]:
v1
abc xy num
a x 1 0
In [110]: right
Out[110]:
v2
abc xy
a x 100
y 200
b x 300
y 400
c x 500
y 600
If that condition is not satisfied, a join with two multi-indexes can be done using the following code.
In [112]: leftindex = pd.MultiIndex.from_tuples(
.....: [("K0", "X0"), ("K0", "X1"), ("K1", "X2")], names=["key", "X"]
.....: )
.....:
Strings passed as the on, left_on, and right_on parameters may refer to either column names or index level
names. This enables merging DataFrame instances on a combination of index levels and columns without resetting
indexes.
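A sketch of merging on a mix of an index level and a column (illustrative frames in which "key1" is an index level and "key2" is a column of both inputs):

import pandas as pd

left_index = pd.Index(["K0", "K0", "K1", "K2"], name="key1")
left = pd.DataFrame(
    {"A": ["A0", "A1", "A2", "A3"], "key2": ["K0", "K1", "K0", "K1"]}, index=left_index
)

right_index = pd.Index(["K0", "K1", "K2", "K2"], name="key1")
right = pd.DataFrame(
    {"C": ["C0", "C1", "C2", "C3"], "key2": ["K0", "K0", "K0", "K1"]}, index=right_index
)

# "key1" refers to the index level and "key2" to the column; no reset_index is needed.
pd.merge(left, right, on=["key1", "key2"])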
Note: When DataFrames are merged on a string that matches an index level in both frames, the index level is
preserved as an index level in the resulting DataFrame.
Note: When DataFrames are merged using only some of the levels of a MultiIndex, the extra levels will be
dropped from the resulting merge. In order to preserve those levels, use reset_index on those level names to move
those levels to columns prior to doing the merge.
Note: If a string matches both a column name and an index level name, then a warning is issued and the column takes
precedence. This will result in an ambiguity error in a future version.
The merge suffixes argument takes a tuple or list of strings to append to overlapping column names in the input
DataFrames to disambiguate the result columns:
A list or tuple of DataFrames can also be passed to join() to join them together on their indexes.
Another fairly common situation is to have two like-indexed (or similarly indexed) Series or DataFrame objects
and wanting to “patch” values in one object from values for matching indices in the other. Here is an example:
In [132]: df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])
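The patching call itself was elided above; as a sketch (this df1 is a hypothetical frame with a few NaN holes, df2 as just defined):

import numpy as np
import pandas as pd

df1 = pd.DataFrame([[np.nan, 3.0, 5.0], [-4.6, np.nan, np.nan], [np.nan, 7.0, np.nan]])
df2 = pd.DataFrame([[-42.6, np.nan, -8.2], [-5.0, 1.6, 4]], index=[1, 2])

# Holes in df1 are filled from df2 wherever the row and column labels align.
result = df1.combine_first(df2)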
Note that this method, combine_first(), only takes values from the right DataFrame if they are missing in the left DataFrame. A
related method, update(), alters non-NA values in place:
In [134]: df1.update(df2)
A merge_ordered() function allows combining time series and other ordered data. In particular it has an optional
fill_method keyword to fill/interpolate missing data:
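The frames for this example were elided; a sketch with small illustrative data:

import pandas as pd

left = pd.DataFrame(
    {"k": ["K0", "K1", "K1", "K2"], "lv": [1, 2, 3, 4], "s": ["a", "b", "c", "d"]}
)
right = pd.DataFrame({"k": ["K1", "K2", "K4"], "rv": [1, 2, 3]})

# Order on "k", merge group-wise on "s", and forward-fill the gaps the ordering introduces.
pd.merge_ordered(left, right, fill_method="ffill", left_by="s")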
Merging asof
A merge_asof() is similar to an ordered left-join except that we match on nearest key rather than equal keys. For
each row in the left DataFrame, we select the last row in the right DataFrame whose on key is less than the
left’s key. Both DataFrames must be sorted by the key.
Optionally an asof merge can perform a group-wise merge. This matches the by key equally, in addition to the nearest
match on the on key.
For example, we might have trades and quotes, and we want to asof-merge them.
.....: "bid": [720.50, 51.95, 51.97, 51.99, 720.50, 97.99, 720.50, 52.01],
.....: "ask": [720.93, 51.96, 51.98, 52.00, 720.93, 98.01, 720.88, 52.03],
.....: },
.....: columns=["time", "ticker", "bid", "ask"],
.....: )
.....:
In [140]: trades
Out[140]:
time ticker price quantity
0 2016-05-25 13:30:00.023 MSFT 51.95 75
1 2016-05-25 13:30:00.038 MSFT 51.95 155
2 2016-05-25 13:30:00.048 GOOG 720.77 100
3 2016-05-25 13:30:00.048 GOOG 720.92 100
4 2016-05-25 13:30:00.048 AAPL 98.00 100
In [141]: quotes
Out[141]:
time ticker bid ask
0 2016-05-25 13:30:00.023 GOOG 720.50 720.93
1 2016-05-25 13:30:00.023 MSFT 51.95 51.96
2 2016-05-25 13:30:00.030 MSFT 51.97 51.98
3 2016-05-25 13:30:00.041 MSFT 51.99 52.00
4 2016-05-25 13:30:00.048 GOOG 720.50 720.93
5 2016-05-25 13:30:00.049 AAPL 97.99 98.01
6 2016-05-25 13:30:00.072 GOOG 720.50 720.88
7 2016-05-25 13:30:00.075 MSFT 52.01 52.03
We only asof within 2ms between the quote time and the trade time.
In [143]: pd.merge_asof(trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms"))
Out[143]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
We only asof within 10ms between the quote time and the trade time and we exclude exact matches on time. Note
that though we exclude the exact matches (of the quotes), prior quotes do propagate to that point in time.
In [144]: pd.merge_asof(
.....: trades,
.....: quotes,
.....: on="time",
.....: by="ticker",
.....: tolerance=pd.Timedelta("10ms"),
.....: allow_exact_matches=False,
.....: )
.....:
Out[144]:
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 NaN NaN
1 2016-05-25 13:30:00.038 MSFT 51.95 155 51.97 51.98
2 2016-05-25 13:30:00.048 GOOG 720.77 100 NaN NaN
3 2016-05-25 13:30:00.048 GOOG 720.92 100 NaN NaN
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
The DataFrame.compare() and Series.compare() methods allow you to compare two DataFrame or Series, respectively, and summarize
their differences.
New in version 1.1.0.
For example, you might want to compare two DataFrame and stack their differences side by side.
In [145]: df = pd.DataFrame(
.....: {
.....: "col1": ["a", "a", "b", "b", "a"],
.....: "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
.....: "col3": [1.0, 2.0, 3.0, 4.0, 5.0],
.....: },
.....: columns=["col1", "col2", "col3"],
.....: )
.....:
In [146]: df
Out[146]:
col1 col2 col3
0 a 1.0 1.0
1 a 2.0 2.0
2 b 3.0 3.0
3 b NaN 4.0
4 a 5.0 5.0
In [150]: df2
Out[150]:
col1 col2 col3
0 c 1.0 1.0
1 a 2.0 2.0
2 b 3.0 4.0
3 b NaN 4.0
4 a 5.0 5.0
In [151]: df.compare(df2)
Out[151]:
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
By default, if two corresponding values are equal, they will be shown as NaN. Furthermore, if all values in an entire
row / column are equal, the row / column will be omitted from the result. The remaining differences will be aligned on columns.
If you wish, you may choose to stack the differences on rows.
If you wish to keep all original rows and columns, set keep_shape argument to True.
You may also keep all the original values even if they are equal.
In [1]: df
Out[1]:
date variable value
0 2000-01-03 A 0.469112
1 2000-01-04 A -0.282863
2 2000-01-05 A -1.509059
3 2000-01-03 B -1.135632
4 2000-01-04 B 1.212112
5 2000-01-05 B -0.173215
6 2000-01-03 C 0.119209
7 2000-01-04 C -1.044236
8 2000-01-05 C -0.861849
9 2000-01-03 D -2.104569
10 2000-01-04 D -0.494929
11 2000-01-05 D 1.071804
For the curious here is how the above DataFrame was created:
import pandas._testing as tm

def unpivot(frame):
    N, K = frame.shape
    data = {
        "value": frame.to_numpy().ravel("F"),
        "variable": np.asarray(frame.columns).repeat(N),
        "date": np.tile(np.asarray(frame.index), K),
    }
    return pd.DataFrame(data, columns=["date", "variable", "value"])

df = unpivot(tm.makeTimeDataFrame(3))
But suppose we wish to do time series operations with the variables. A better representation would be where the
columns are the unique variables and an index of dates identifies individual observations. To reshape the data into
this form, we use the DataFrame.pivot() method (also implemented as a top level function pivot()):
If the values argument is omitted, and the input DataFrame has more than one column of values which are not
used as column or index inputs to pivot, then the resulting “pivoted” DataFrame will have hierarchical columns
whose topmost level indicates the respective value column:
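The pivot calls themselves were elided above; as a sketch using the df constructed earlier in this section (value2, equal to twice value, is consistent with the output shown next):

# Pivot a single value column: dates index the rows, variables become the columns.
df.pivot(index="date", columns="variable", values="value")

# With a second value column present and values omitted, the result has hierarchical
# columns whose topmost level names the value column.
df["value2"] = df["value"] * 2
pivoted = df.pivot(index="date", columns="variable")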
In [6]: pivoted
Out[6]:
               value                                  value2
variable           A         B         C         D         A         B         C         D
date
2000-01-03  0.469112 -1.135632  0.119209 -2.104569  0.938225 -2.271265  0.238417 -4.209138
2000-01-04 -0.282863  1.212112 -1.044236 -0.494929 -0.565727  2.424224 -2.088472 -0.989859
2000-01-05 -1.509059 -0.173215 -0.861849  1.071804 -3.018117 -0.346429 -1.723698  2.143608
In [7]: pivoted["value2"]
Out[7]:
variable A B C D
date
2000-01-03 0.938225 -2.271265 0.238417 -4.209138
2000-01-04 -0.565727 2.424224 -2.088472 -0.989859
2000-01-05 -3.018117 -0.346429 -1.723698 2.143608
Note that this returns a view on the underlying data in the case where the data are homogeneously-typed.
Note: pivot() will error with a ValueError: Index contains duplicate entries, cannot
reshape if the index/column pair is not unique. In this case, consider using pivot_table() which is a gen-
eralization of pivot that can handle duplicate values for one index/column pair.
Closely related to the pivot() method are the related stack() and unstack() methods available on Series
and DataFrame. These methods are designed to work together with MultiIndex objects (see the section on
hierarchical indexing). Here are essentially what these methods do:
• stack: “pivot” a level of the (possibly hierarchical) column labels, returning a DataFrame with an index
with a new inner-most level of row labels.
• unstack: (inverse operation of stack) “pivot” a level of the (possibly hierarchical) row index to the column
axis, producing a reshaped DataFrame with a new inner-most level of column labels.
The clearest way to explain is by example. Let’s take a prior example data set from the hierarchical indexing section:
In [12]: df2
Out[12]:
A B
first second
bar one 0.721555 -0.706771
two -1.039575 0.271860
baz one -0.424972 0.567020
two 0.276232 -1.087401
The stack function “compresses” a level in the DataFrame’s columns to produce either:
• A Series, in the case of a simple column Index.
• A DataFrame, in the case of a MultiIndex in the columns.
If the columns have a MultiIndex, you can choose which level to stack. The stacked level becomes the new lowest level in a MultiIndex on the columns:
In [14]: stacked
Out[14]:
first second
bar one A 0.721555
B -0.706771
two A -1.039575
B 0.271860
baz one A -0.424972
B 0.567020
two A 0.276232
B -1.087401
dtype: float64
With a “stacked” DataFrame or Series (having a MultiIndex as the index), the inverse operation of stack
is unstack, which by default unstacks the last level:
In [15]: stacked.unstack()
Out[15]:
A B
first second
bar one 0.721555 -0.706771
two -1.039575 0.271860
baz one -0.424972 0.567020
two 0.276232 -1.087401
In [16]: stacked.unstack(1)
Out[16]:
second one two
first
bar A 0.721555 -1.039575
B -0.706771 0.271860
baz A -0.424972 0.276232
B 0.567020 -1.087401
In [17]: stacked.unstack(0)
Out[17]:
first bar baz
second
one A 0.721555 -0.424972
B -0.706771 0.567020
two A -1.039575 0.276232
B 0.271860 -1.087401
If the indexes have names, you can use the level names instead of specifying the level numbers:
In [18]: stacked.unstack("second")
Out[18]:
second one two
first
bar A 0.721555 -1.039575
B -0.706771 0.271860
baz A -0.424972 0.276232
B 0.567020 -1.087401
Notice that the stack and unstack methods implicitly sort the index levels involved. Hence a call to stack and
then unstack, or vice versa, will result in a sorted copy of the original DataFrame or Series:
In [19]: index = pd.MultiIndex.from_product([[2, 1], ["a", "b"]])
In [21]: df
Out[21]:
A
2 a -0.370647
b -1.157892
1 a -1.344312
b 0.844885
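The comparison the following note refers to was elided; it was along these lines, with df being the frame just shown:

# Round-tripping through unstack/stack produces a sorted copy, so it matches the
# sorted original rather than the unsorted df itself.
all(df.unstack().stack() == df.sort_index())  # True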
The above code will raise a TypeError if the call to sort_index is removed.
Multiple levels
You may also stack or unstack more than one level at a time by passing a list of levels, in which case the end result is
as if each level in the list were processed individually.
In [23]: columns = pd.MultiIndex.from_tuples(
....: [
....: ("A", "cat", "long"),
....: ("B", "cat", "long"),
....: ("A", "dog", "short"),
....: ("B", "dog", "short"),
....: ],
   ....:     names=["exp", "animal", "hair_length"],
   ....: )
   ....:
In [25]: df
Out[25]:
exp A B A B
animal cat cat dog dog
hair_length long long short short
0 1.075770 -0.109050 1.643563 -1.469388
1 0.357021 -0.674600 -1.776904 -0.968914
2 -1.294524 0.413738 0.276662 -0.472035
3 -0.013960 -0.362543 -0.006154 -0.923061
The list of levels can contain either level names or level numbers (but not a mixture of the two).
# df.stack(level=['animal', 'hair_length'])
# from above is equivalent to:
In [27]: df.stack(level=[1, 2])
Out[27]:
exp A B
animal hair_length
0 cat long 1.075770 -0.109050
dog short 1.643563 -1.469388
1 cat long 0.357021 -0.674600
dog short -1.776904 -0.968914
2 cat long -1.294524 0.413738
dog short 0.276662 -0.472035
3 cat long -0.013960 -0.362543
dog short -0.006154 -0.923061
Missing data
These functions are intelligent about handling missing data and do not expect each subgroup within the hierarchical
index to have the same set of labels. They also can handle the index being unsorted (but you can make it sorted by
calling sort_index, of course). Here is a more complex example:
In [28]: columns = pd.MultiIndex.from_tuples(
....: [
....: ("A", "cat"),
....: ("B", "dog"),
....: ("B", "cat"),
....: ("A", "dog"),
....: ],
....: names=["exp", "animal"],
....: )
....:
In [32]: df2
Out[32]:
exp A B A
animal cat dog cat dog
first second
bar one 0.895717 0.805244 -1.206412 2.565646
two 1.431256 1.340309 -1.170299 -0.226169
baz one 0.410835 0.813850 0.132003 -0.827317
foo one -1.413681 1.607920 1.024180 0.569605
two 0.875906 -2.211372 0.974466 -2.006747
qux two -1.226825 0.769804 -1.281247 -0.727707
As mentioned above, stack can be called with a level argument to select which level in the columns to stack:
In [33]: df2.stack("exp")
Out[33]:
animal cat dog
first second exp
bar one A 0.895717 2.565646
B -1.206412 0.805244
two A 1.431256 -0.226169
B -1.170299 1.340309
baz one A 0.410835 -0.827317
B 0.132003 0.813850
foo one A -1.413681 0.569605
B 1.024180 1.607920
two A 0.875906 -2.006747
B 0.974466 -2.211372
qux two A -1.226825 -0.727707
B -1.281247 0.769804
In [34]: df2.stack("animal")
Unstacking can result in missing values if subgroups do not have the same set of labels. By default, missing values
will be replaced with the default fill value for that data type, NaN for float, NaT for datetimelike, etc. For integer types,
by default data will be converted to float and missing values will be set to NaN.
In [36]: df3
Out[36]:
exp B
animal dog cat
first second
bar one 0.805244 -1.206412
two 1.340309 -1.170299
foo one 1.607920 1.024180
qux two 0.769804 -1.281247
In [37]: df3.unstack()
Out[37]:
exp B
animal dog cat
second one two one two
first
bar 0.805244 1.340309 -1.206412 -1.170299
foo 1.607920 NaN 1.024180 NaN
qux NaN 0.769804 NaN -1.281247
Alternatively, unstack takes an optional fill_value argument, for specifying the value of missing data.
In [38]: df3.unstack(fill_value=-1e9)
Out[38]:
exp B
animal dog cat
second one two one two
first
bar 8.052440e-01 1.340309e+00 -1.206412e+00 -1.170299e+00
foo 1.607920e+00 -1.000000e+09 1.024180e+00 -1.000000e+09
qux -1.000000e+09 7.698036e-01 -1.000000e+09 -1.281247e+00
With a MultiIndex
Unstacking when the columns are a MultiIndex is also careful about doing the right thing:
In [39]: df[:3].unstack(0)
Out[39]:
exp A B A
animal cat dog cat dog
first bar baz bar baz bar baz bar baz
second
one 0.895717 0.410835 0.805244 0.81385 -1.206412 0.132003 2.565646 -0.827317
two 1.431256 NaN 1.340309 NaN -1.170299 NaN -0.226169 NaN
In [40]: df2.unstack(1)
Out[40]:
exp A B A
animal cat dog cat dog
second one two one two one two one two
first
bar 0.895717 1.431256 0.805244 1.340309 -1.206412 -1.170299 2.565646 -0.226169
baz 0.410835 NaN 0.813850 NaN 0.132003 NaN -0.827317 NaN
foo -1.413681 0.875906 1.607920 -2.211372 1.024180 0.974466 0.569605 -2.006747
qux NaN -1.226825 NaN 0.769804 NaN -1.281247 NaN -0.727707
The top-level melt() function and the corresponding DataFrame.melt() are useful to massage a DataFrame
into a format where one or more columns are identifier variables, while all other columns, considered measured
variables, are “unpivoted” to the row axis, leaving just two non-identifier columns, “variable” and “value”. The names
of those columns can be customized by supplying the var_name and value_name parameters.
For instance,
In [42]: cheese
Out[42]:
first last height weight
0 John Doe 5.5 130
1 Mary Bo 6.0 150
When transforming a DataFrame using melt(), the index will be ignored. The original index values can be kept
around by setting the ignore_index parameter to False (default is True). This will however duplicate them.
New in version 1.1.0.
In [45]: index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
In [47]: cheese
Out[47]:
first last height weight
person A John Doe 5.5 130
B Mary Bo 6.0 150
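The melt() calls themselves are not reproduced above; a self-contained sketch of the variants described:

import pandas as pd

cheese = pd.DataFrame(
    {"first": ["John", "Mary"], "last": ["Doe", "Bo"], "height": [5.5, 6.0], "weight": [130, 150]}
)

# height and weight are unpivoted into "variable"/"value" rows.
cheese.melt(id_vars=["first", "last"])

# var_name renames the generated "variable" column.
cheese.melt(id_vars=["first", "last"], var_name="quantity")

# With a MultiIndex such as the one shown above, ignore_index=False keeps (and duplicates) the index.
cheese.index = pd.MultiIndex.from_tuples([("person", "A"), ("person", "B")])
cheese.melt(id_vars=["first", "last"], ignore_index=False)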
Another way to transform is to use the wide_to_long() panel data convenience function. It is less flexible than
melt(), but more user-friendly.
In [52]: dft
Out[52]:
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -0.121306 0
1 b e 1.2 1.3 -0.097883 1
2 c f 0.7 0.1 0.695775 2
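The call itself was elided; a sketch that rebuilds dft as shown above (the X column is random, so its values are only illustrative):

import numpy as np
import pandas as pd

dft = pd.DataFrame(
    {
        "A1970": ["a", "b", "c"],
        "A1980": ["d", "e", "f"],
        "B1970": [2.5, 1.2, 0.7],
        "B1980": [3.2, 1.3, 0.1],
        "X": np.random.randn(3),
    }
)
dft["id"] = dft.index

# The "A" and "B" stubs are gathered into long form; the year suffix becomes the "year" index level.
pd.wide_to_long(dft, stubnames=["A", "B"], i="id", j="year")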
It should be no shock that combining pivot / stack / unstack with GroupBy and the basic Series and DataFrame
statistical functions can produce some very expressive and fast data manipulations.
In [54]: df
Out[54]:
exp A B A
animal cat dog cat dog
first second
bar one 0.895717 0.805244 -1.206412 2.565646
two 1.431256 1.340309 -1.170299 -0.226169
baz one 0.410835 0.813850 0.132003 -0.827317
two -0.076467 -1.187678 1.130127 -1.436737
foo one -1.413681 1.607920 1.024180 0.569605
two 0.875906 -2.211372 0.974466 -2.006747
qux one -0.410001 -0.078638 0.545952 -1.219217
two -1.226825 0.769804 -1.281247 -0.727707
In [55]: df.stack().mean(1).unstack()
Out[55]:
animal cat dog
first second
bar one -0.155347 1.685445
two 0.130479 0.557070
baz one 0.271419 -0.006733
two 0.526830 -1.312207
foo one -0.194750 1.088763
two 0.925186 -2.109060
qux one 0.067976 -0.648927
two -1.254036 0.021048
In [57]: df.stack().groupby(level=1).mean()
Out[57]:
exp A B
second
one 0.071448 0.455513
two -0.424186 -0.204486
In [58]: df.mean().unstack(0)
Out[58]:
exp A B
animal
cat 0.060843 0.018596
While pivot() provides general purpose pivoting with various data types (strings, numerics, etc.), pandas also
provides pivot_table() for pivoting with aggregation of numeric data.
The function pivot_table() can be used to create spreadsheet-style pivot tables. See the cookbook for some
advanced strategies.
It takes a number of arguments:
• data: a DataFrame object.
• values: a column or a list of columns to aggregate.
• index: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on the
pivot table index. If an array is passed, it is used in the same manner as column values.
• columns: a column, Grouper, array which has the same length as data, or list of them. Keys to group by on
the pivot table column. If an array is passed, it is used in the same manner as column values.
• aggfunc: function to use for aggregation, defaulting to numpy.mean.
Consider a data set like this:
In [60]: df = pd.DataFrame(
....: {
....: "A": ["one", "one", "two", "three"] * 6,
....: "B": ["A", "B", "C"] * 8,
....: "C": ["foo", "foo", "foo", "bar", "bar", "bar"] * 4,
....: "D": np.random.randn(24),
....: "E": np.random.randn(24),
....: "F": [datetime.datetime(2013, i, 1) for i in range(1, 13)]
....: + [datetime.datetime(2013, i, 15) for i in range(1, 13)],
....: }
....: )
....:
In [61]: df
Out[61]:
A B C D E F
0 one A foo 0.341734 -0.317441 2013-01-01
1 one B foo 0.959726 -1.236269 2013-02-01
2 two C foo -1.110336 0.896171 2013-03-01
3 three A bar -0.619976 -0.487602 2013-04-01
4 one B bar 0.149748 -0.082240 2013-05-01
.. ... .. ... ... ... ...
19 three B foo 0.690579 -2.213588 2013-08-15
20 one C foo 0.995761 1.063327 2013-09-15
21 one A bar 2.396780 1.266143 2013-10-15
22 two B bar 0.014871 0.299368 2013-11-15
23 three C bar 3.357427 -0.863838 2013-12-15
In [63]: pd.pivot_table(df, values="D", index=["B"], columns=["A", "C"], aggfunc=np.sum)
Out[63]:
A one three two
C bar foo bar foo bar foo
B
A 2.241830 -1.028115 -2.363137 NaN NaN 2.001971
B -0.676843 0.005518 NaN 0.867024 0.316495 NaN
C -1.077692 1.399070 1.177566 NaN NaN 0.352360
In [64]: pd.pivot_table(
....: df, values=["D", "E"],
....: index=["B"],
....: columns=["A", "C"],
....: aggfunc=np.sum,
....: )
....:
Out[64]:
(wide table with hierarchical columns for the sums of D and E, omitted here)
The result object is a DataFrame having potentially hierarchical indexes on the rows and columns. If the values
column name is not given, the pivot table will include all of the data that can be aggregated in an additional level of
hierarchy in the columns:
In [65]: pd.pivot_table(df, index=["A", "B"], columns=["C"])
Out[65]:
D E
C bar foo bar foo
A B
Also, you can use Grouper for index and columns keywords. For detail of Grouper, see Grouping with a
Grouper specification.
In [66]: pd.pivot_table(df, values="D", index=pd.Grouper(freq="M", key="F"), columns="C")
Out[66]:
C bar foo
F
2013-01-31 NaN -0.514058
2013-02-28 NaN 0.002759
2013-03-31 NaN 0.176180
2013-04-30 -1.181568 NaN
2013-05-31 -0.338421 NaN
2013-06-30 -0.538846 NaN
2013-07-31 NaN 1.000985
2013-08-31 NaN 0.433512
2013-09-30 NaN 0.699535
2013-10-31 1.120915 NaN
2013-11-30 0.158248 NaN
2013-12-31 0.588783 NaN
You can render a nice output of the table omitting the missing values by calling to_string if you wish:
In [68]: print(table.to_string(na_rep=""))
D E
C bar foo bar foo
A B
one A 1.120915 -0.514058 1.393057 -0.021605
B -0.338421 0.002759 0.684140 -0.551692
C -0.538846 0.699535 -0.988442 0.747859
three A -1.181568 0.961289
B 0.433512 -1.064372
C 0.588783 -0.131830
two A 1.000985 0.064245
B 0.158248 -0.097147
C 0.176180 0.436241
Note that pivot_table is also available as an instance method on DataFrame, i.e. DataFrame.pivot_table().
Adding margins
If you pass margins=True to pivot_table, special All columns and rows will be added with partial group
aggregates across the categories on the rows and columns:
In [69]: df.pivot_table(index=["A", "B"], columns="C", margins=True, aggfunc=np.std)
Out[69]:
D E
C bar foo All bar foo All
A B
one A 1.804346 1.210272 1.569879 0.179483 0.418374 0.858005
B 0.690376 1.353355 0.898998 1.083825 0.968138 1.101401
C 0.273641 0.418926 0.771139 1.689271 0.446140 1.422136
three A 0.794212 NaN 0.794212 2.049040 NaN 2.049040
B NaN 0.363548 0.363548 NaN 1.625237 1.625237
C 3.915454 NaN 3.915454 1.035215 NaN 1.035215
two A NaN 0.442998 0.442998 NaN 0.447104 0.447104
B 0.202765 NaN 0.202765 0.560757 NaN 0.560757
C NaN 1.819408 1.819408 NaN 0.650439 0.650439
All 1.556686 0.952552 1.246608 1.250924 0.899904 1.059389
Use crosstab() to compute a cross-tabulation of two (or more) factors. By default crosstab computes a fre-
quency table of the factors unless an array of values and an aggregation function are passed.
It takes a number of arguments
• index: array-like, values to group by in the rows.
• columns: array-like, values to group by in the columns.
• values: array-like, optional, array of values to aggregate according to the factors.
• aggfunc: function, optional, If no values array is passed, computes a frequency table.
• rownames: sequence, default None, must match number of row arrays passed.
• colnames: sequence, default None, if passed, must match number of column arrays passed.
• margins: boolean, default False, Add row/column margins (subtotals)
• normalize: boolean, {‘all’, ‘index’, ‘columns’}, or {0,1}, default False. Normalize by dividing all values
by the sum of values.
Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are
specified.
For example:
In [70]: foo, bar, dull, shiny, one, two = "foo", "bar", "dull", "shiny", "one", "two"
In [75]: df = pd.DataFrame(
....: {"A": [1, 2, 2, 2, 2], "B": [3, 3, 4, 4, 4], "C": [1, 1, np.nan, 1, 1]}
....: )
....:
In [76]: df
Out[76]:
A B C
0 1 3 1.0
1 2 3 1.0
2 2 4 NaN
3 2 4 1.0
4 2 4 1.0
If you want to include all of the data categories even if the actual data does not contain any instances of a particular
category, you should set dropna=False.
For example:
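The example itself was elided; a sketch with two small Categoricals that carry unused categories:

import pandas as pd

foo = pd.Categorical(["a", "b"], categories=["a", "b", "c"])
bar = pd.Categorical(["d", "e"], categories=["d", "e", "f"])

pd.crosstab(foo, bar)                # the unused categories "c" and "f" are dropped
pd.crosstab(foo, bar, dropna=False)  # rows/columns for "c" and "f" are kept, filled with 0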
Normalization
Frequency tables can also be normalized to show percentages rather than counts using the normalize argument:
normalize can also normalize values within each row or within each column:
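The normalized tables were elided; a sketch using the df defined just above:

pd.crosstab(df["A"], df["B"], normalize=True)       # every cell is divided by the grand total
pd.crosstab(df["A"], df["B"], normalize="columns")  # each column sums to 1
pd.crosstab(df["A"], df["B"], normalize="index")    # each row sums to 1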
crosstab can also be passed a third Series and an aggregation function (aggfunc) that will be applied to the
values of the third Series within each group defined by the first two Series:
Adding margins
In [85]: pd.crosstab(
....: df["A"], df["B"], values=df["C"], aggfunc=np.sum, normalize=True, margins=True
....: )
....:
Out[85]:
B 3 4 All
A
1 0.25 0.0 0.25
2 0.25 0.5 0.75
All 0.50 0.5 1.00
2.8.7 Tiling
The cut() function computes groupings for the values of the input array and is often used to transform continuous
variables to discrete or categorical variables:
In [86]: ages = np.array([10, 15, 13, 12, 23, 25, 28, 59, 60])
In [87]: pd.cut(ages, bins=3)
Out[87]:
[(9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (9.95, 26.667], (26.667, 43.333], (43.333, 60.0], (43.333, 60.0]]
Categories (3, interval[float64]): [(9.95, 26.667] < (26.667, 43.333] < (43.333, 60.0]]
If the bins keyword is an integer, then equal-width bins are formed. Alternatively we can specify custom bin-edges:
In [88]: c = pd.cut(ages, bins=[0, 18, 35, 70])
In [89]: c
Out[89]:
[(0, 18], (0, 18], (0, 18], (0, 18], (18, 35], (18, 35], (18, 35], (35, 70], (35, 70]]
Categories (3, interval[int64]): [(0, 18] < (18, 35] < (35, 70]]
If the bins keyword is an IntervalIndex, then these will be used to bin the passed data.:
pd.cut([25, 20, 50], bins=c.categories)
To convert a categorical variable into a “dummy” or “indicator” DataFrame, for example a column in a DataFrame
(a Series) which has k distinct values, you can derive a DataFrame containing k columns of 1s and 0s using
get_dummies():
In [90]: df = pd.DataFrame({"key": list("bbacab"), "data1": range(6)})
In [91]: pd.get_dummies(df["key"])
Out[91]:
a b c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
Sometimes it’s useful to prefix the column names, for example when merging the result with the original DataFrame:
In [92]: dummies = pd.get_dummies(df["key"], prefix="key")
In [93]: dummies
Out[93]:
key_a key_b key_c
0 0 1 0
1 0 1 0
2 1 0 0
3 0 0 1
4 1 0 0
5 0 1 0
In [94]: df[["data1"]].join(dummies)
Out[94]:
data1 key_a key_b key_c
0 0 0 1 0
1 1 0 1 0
2 2 1 0 0
3 3 0 0 1
4 4 1 0 0
5 5 0 1 0
This function is often used along with discretization functions like cut:
In [96]: values
Out[96]:
array([ 0.4082, -1.0481, -0.0257, -0.9884, 0.0941, 1.2627, 1.29 ,
0.0824, -0.0558, 0.5366])
In [99]: df = pd.DataFrame({"A": ["a", "b", "a"], "B": ["c", "c", "b"], "C": [1, 2, 3]})
In [100]: pd.get_dummies(df)
Out[100]:
C A_a A_b B_b B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
All non-object columns are included untouched in the output. You can control the columns that are encoded with the
columns keyword.
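For example, encoding only column A (a sketch, with df as defined just above):

pd.get_dummies(df, columns=["A"])  # only A is expanded into A_a / A_b; B and C pass through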
Notice that the B column is then still included in the output; it just hasn’t been encoded. You can drop B before calling
get_dummies if you don’t want to include it in the output.
As with the Series version, you can pass values for the prefix and prefix_sep. By default the column name
is used as the prefix, and ‘_’ as the prefix separator. You can specify prefix and prefix_sep in 3 ways:
• string: Use the same value for prefix or prefix_sep for each column to be encoded.
• list: Must be the same length as the number of columns being encoded.
• dict: Mapping column name to prefix.
In [103]: simple
Out[103]:
C new_prefix_a new_prefix_b new_prefix_b new_prefix_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [105]: from_list
Out[105]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
In [107]: from_dict
Out[107]:
C from_A_a from_A_b from_B_b from_B_c
0 1 1 0 0 1
1 2 0 1 0 1
2 3 1 0 1 0
Sometimes it will be useful to only keep k-1 levels of a categorical variable to avoid collinearity when feeding the
result to statistical models. You can switch to this mode by turning on drop_first.
In [108]: s = pd.Series(list("abcaa"))
In [109]: pd.get_dummies(s)
Out[109]:
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
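The drop_first variant was elided; a sketch with the same s:

pd.get_dummies(s, drop_first=True)  # the first level ("a") is dropped, leaving only the b and c columns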
When a column contains only one level, it will be omitted in the result.
In [111]: df = pd.DataFrame({"A": list("aaaaa"), "B": list("ababc")})
In [112]: pd.get_dummies(df)
Out[112]:
A_a B_a B_b B_c
0 1 1 0 0
1 1 0 1 0
2 1 1 0 0
3 1 0 1 0
4 1 0 0 1
By default new columns will have np.uint8 dtype. To choose another dtype, use the dtype argument:
In [114]: df = pd.DataFrame({"A": list("abc"), "B": [1.1, 2.2, 3.3]})
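The call using the dtype argument was elided; a sketch with the df just defined:

pd.get_dummies(df, dtype=bool).dtypes  # the generated indicator columns are bool instead of np.uint8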
In [117]: x
Out[117]:
0 A
1 A
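The factorize() call that produced the labels and uniques shown below was elided; a sketch, with x reconstructed from those outputs:

import numpy as np
import pandas as pd

x = pd.Series(["A", "A", np.nan, "B", 3.14, np.inf])
labels, uniques = pd.factorize(x)  # NaN is encoded as -1 and is excluded from uniques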
In [119]: labels
Out[119]: array([ 0, 0, -1, 1, 2, 3])
In [120]: uniques
Out[120]: Index(['A', 'B', 3.14, inf], dtype='object')
Note that factorize is similar to numpy.unique, but differs in its handling of NaN:
Note: A corresponding numpy.unique call on this data will fail under Python 3 with a TypeError because of an ordering bug. See
also here.
Note: If you just want to handle one column as a categorical variable (like R’s factor), you can use df["cat_col"]
= pd.Categorical(df["col"]) or df["cat_col"] = df["col"].astype("category"). For
full docs on Categorical, see the Categorical introduction and the API documentation.
2.8.10 Examples
In this section, we will review frequently asked questions and examples. The column names and relevant column
values are named to correspond with how this DataFrame will be pivoted in the answers below.
In [121]: np.random.seed([3, 1415])
In [122]: n = 20
In [127]: df
Out[127]:
key row item col val0 val1
0 key0 row3 item1 col3 0.81 0.04
1 key1 row2 item1 col2 0.44 0.07
2 key1 row0 item1 col0 0.77 0.01
3 key0 row4 item0 col2 0.15 0.59
4 key1 row0 item2 col1 0.81 0.64
.. ... ... ... ... ... ...
15 key0 row3 item1 col1 0.31 0.23
16 key0 row0 item2 col3 0.86 0.01
17 key0 row4 item0 col3 0.64 0.21
18 key2 row2 item2 col0 0.13 0.45
19 key0 row2 item0 col4 0.37 0.70
Suppose we wanted to pivot df such that the col values are columns, row values are the index, and the mean of
val0 are the values. In particular, the resulting DataFrame should look like:
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
This solution uses pivot_table(). Also note that aggfunc='mean' is the default. It is included here to be
explicit.
In [128]: df.pivot_table(values="val0", index="row", columns="col", aggfunc="mean")
Out[128]:
col col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65
row2 0.13 NaN 0.395 0.500 0.25
row3 NaN 0.310 NaN 0.545 NaN
row4 NaN 0.100 0.395 0.760 0.24
Note that we can also replace the missing values by using the fill_value parameter.
In [129]: df.pivot_table(
.....: values="val0",
.....: index="row",
.....: columns="col",
.....: aggfunc="mean",
.....: fill_value=0,
.....: )
.....:
Out[129]:
col col0 col1 col2 col3 col4
row
row0  0.77  0.605  0.000  0.860  0.65
row2  0.13  0.000  0.395  0.500  0.25
row3  0.00  0.310  0.000  0.545  0.00
row4  0.00  0.100  0.395  0.760  0.24
We can also pass in other aggregation functions, for example sum.
In [130]: df.pivot_table(
.....: values="val0",
.....: index="row",
.....: columns="col",
.....: aggfunc="sum",
.....: fill_value=0,
.....: )
.....:
Out[130]:
col col0 col1 col2 col3 col4
row
row0 0.77 1.21 0.00 0.86 0.65
row2 0.13 0.00 0.79 0.50 0.50
row3 0.00 0.31 0.00 1.09 0.00
row4 0.00 0.10 0.79 1.52 0.24
Another aggregation we can do is to calculate the frequency with which the columns and rows occur together, a.k.a. a “cross
tabulation”. To do this, we can pass size to the aggfunc parameter, as sketched below.
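A sketch, reusing the df shown at the start of this section (aggfunc="size" needs no values column):

df.pivot_table(index="row", columns="col", fill_value=0, aggfunc="size")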
We can also perform multiple aggregations. For example, to perform both a sum and mean, we can pass in a list to
the aggfunc argument.
In [132]: df.pivot_table(
.....: values="val0",
.....: index="row",
.....: columns="col",
.....: aggfunc=["mean", "sum"],
.....: )
.....:
Out[132]:
mean sum
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65 0.77 1.21 NaN 0.86 0.65
row2 0.13 NaN 0.395 0.500 0.25 0.13 NaN 0.79 0.50 0.50
row3 NaN 0.310 NaN 0.545 NaN NaN 0.31 NaN 1.09 NaN
row4 NaN 0.100 0.395 0.760 0.24 NaN 0.10 0.79 1.52 0.24
Note that to aggregate over multiple value columns, we can pass in a list to the values parameter.
In [133]: df.pivot_table(
.....: values=["val0", "val1"],
.....: index="row",
.....: columns="col",
.....: aggfunc=["mean"],
.....: )
.....:
Out[133]:
mean
val0 val1
col col0 col1 col2 col3 col4 col0 col1 col2 col3 col4
row
row0 0.77 0.605 NaN 0.860 0.65 0.01 0.745 NaN 0.010 0.02
row2 0.13 NaN 0.395 0.500 0.25 0.45 NaN 0.34 0.440 0.79
row3 NaN 0.310 NaN 0.545 NaN NaN 0.230 NaN 0.075 NaN
row4 NaN 0.100 0.395 0.760 0.24 NaN 0.070 0.42 0.300 0.46
Note that to subdivide over multiple columns we can pass in a list to the columns parameter.
In [134]: df.pivot_table(
.....: values=["val0"],
.....: index="row",
.....: columns=["item", "col"],
.....: aggfunc=["mean"],
.....: )
.....:
Out[134]:
mean
val0
item item0 item1 item2
col col2 col3 col4 col0 col1 col2 col3 col4 col0 col1 col3 col4
row
row0 NaN NaN NaN 0.77 NaN NaN NaN NaN NaN 0.605 0.86 0.65
row2 0.35 NaN 0.37 NaN NaN 0.44 NaN NaN 0.13 NaN 0.50 0.13
row3 NaN NaN NaN NaN 0.31 NaN 0.81 NaN NaN NaN 0.28 NaN
row4 0.15 0.64 NaN NaN 0.10 0.64 0.88 0.24 NaN NaN NaN NaN
In [138]: df
Out[138]:
keys values
0 panda1 [eats, shoots]
1 panda2 [shoots, leaves]
2 panda3 [eats, leaves]
We can ‘explode’ the values column, transforming each list-like to a separate row, by using explode(). This will
replicate the index values from the original row:
In [139]: df["values"].explode()
Out[139]:
0 eats
0 shoots
1 shoots
1 leaves
2 eats
2 leaves
Name: values, dtype: object
In [140]: df.explode("values")
Out[140]:
keys values
0 panda1 eats
0 panda1 shoots
1 panda2 shoots
1 panda2 leaves
2 panda3 eats
2 panda3 leaves
Series.explode() will replace empty lists with np.nan and preserve scalar entries. The dtype of the resulting
Series is always object.
In [142]: s
Out[142]:
0 [1, 2, 3]
1 foo
2 []
3 [a, b]
dtype: object
In [143]: s.explode()
Out[143]:
0 1
0 2
0 3
1 foo
2 NaN
3 a
3 b
dtype: object
Here is a typical use case. You have comma-separated strings in a column and want to expand this.
In [145]: df
Out[145]:
var1 var2
0  a,b,c     1
1  d,e,f     2
Creating a long form DataFrame is now straightforward using explode and chained operations:
In [146]: df.assign(var1=df.var1.str.split(",")).explode("var1")
Out[146]:
var1 var2
0 a 1
0 b 1
0 c 1
1 d 2
1 e 2
1 f 2
Warning: StringArray is currently considered experimental. The implementation and parts of the API may
change without warning.
For backwards-compatibility, object dtype remains the default type we infer a list of strings to:
In [5]: s
Out[5]:
0 a
1 b
2 c
dtype: object
In [6]: s.astype("string")
Out[6]:
0 a
1 b
2 c
dtype: string
In [8]: s
Out[8]:
0 a
1 2
2 <NA>
dtype: string
In [9]: type(s[1])
Out[9]: str
In [11]: s1
Out[11]:
0 1
1       2
2    <NA>
dtype: Int64
In [12]: s2 = s1.astype("string")
In [13]: s2
Out[13]:
0 1
1 2
2 <NA>
dtype: string
In [14]: type(s2[0])
Out[14]: str
Behavior differences
These are places where the behavior of StringDtype objects differ from object dtype
1. For StringDtype, string accessor methods that return numeric output will always return a nullable integer
dtype, rather than either int or float dtype, depending on the presence of NA values. Methods returning boolean
output will return a nullable boolean dtype.
In [16]: s
Out[16]:
0 a
1 <NA>
2 b
dtype: string
In [17]: s.str.count("a")
Out[17]:
0 1
1 <NA>
2 0
dtype: Int64
In [18]: s.dropna().str.count("a")
Out[18]:
0 1
2 0
dtype: Int64
In [20]: s2.str.count("a")
Out[20]:
0 1.0
1 NaN
2 0.0
dtype: float64
In [21]: s2.dropna().str.count("a")
Out[21]:
0 1
2 0
dtype: int64
When NA values are present, the output dtype is float64. Similarly for methods returning boolean values.
In [22]: s.str.isdigit()
Out[22]:
0 False
1 <NA>
2 False
dtype: boolean
In [23]: s.str.match("a")
Out[23]:
0 True
1 <NA>
2 False
dtype: boolean
2. Some string methods, like Series.str.decode() are not available on StringArray because
StringArray only holds strings, not bytes.
3. In comparison operations, arrays.StringArray and Series backed by a StringArray will return
an object with BooleanDtype, rather than a bool dtype object. Missing values in a StringArray will
propagate in comparison operations, rather than always comparing unequal like numpy.nan.
Everything else that follows in the rest of this document applies equally to string and object dtype.
Series and Index are equipped with a set of string processing methods that make it easy to operate on each element of
the array. Perhaps most importantly, these methods exclude missing/NA values automatically. These are accessed via
the str attribute and generally have names matching the equivalent (scalar) built-in string methods:
In [24]: s = pd.Series(
....: ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype=
˓→"string"
....: )
....:
In [25]: s.str.lower()
Out[25]:
0 a
1 b
2 c
3 aaba
4 baca
5 <NA>
6 caba
7 dog
8     cat
dtype: string
In [26]: s.str.upper()
Out[26]:
0 A
1 B
2 C
3 AABA
4 BACA
5 <NA>
6 CABA
7 DOG
8 CAT
dtype: string
In [27]: s.str.len()
Out[27]:
0 1
1 1
2 1
3 4
4 4
5 <NA>
6 4
7 3
8 3
dtype: Int64
In [28]: idx = pd.Index([" jack", "jill ", " jesse ", "frank"])
In [29]: idx.str.strip()
Out[29]: Index(['jack', 'jill', 'jesse', 'frank'], dtype='object')
In [30]: idx.str.lstrip()
Out[30]: Index(['jack', 'jill ', 'jesse ', 'frank'], dtype='object')
In [31]: idx.str.rstrip()
Out[31]: Index([' jack', 'jill', ' jesse', 'frank'], dtype='object')
The string methods on Index are especially useful for cleaning up or transforming DataFrame columns. For instance,
you may have columns with leading or trailing whitespace:
In [32]: df = pd.DataFrame(
....: np.random.randn(3, 2), columns=[" Column A ", " Column B "], index=range(3)
....: )
....:
In [33]: df
Out[33]:
Column A Column B
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
In [34]: df.columns.str.strip()
Out[34]: Index(['Column A', 'Column B'], dtype='object')
In [35]: df.columns.str.lower()
Out[35]: Index([' column a ', ' column b '], dtype='object')
These string methods can then be used to clean up the columns as needed. Here we are removing leading and trailing
whitespaces, lower casing all names, and replacing any remaining whitespaces with underscores:
In [36]: df.columns = df.columns.str.strip().str.lower().str.replace(" ", "_")
In [37]: df
Out[37]:
column_a column_b
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
Note: If you have a Series where lots of elements are repeated (i.e. the number of unique elements in the
Series is a lot smaller than the length of the Series), it can be faster to convert the original Series to one of
type category and then use .str.<method> or .dt.<property> on that. The performance difference comes
from the fact that, for Series of type category, the string operations are done on the .categories and not on
each element of the Series.
Please note that a Series of type category with string .categories has some limitations in comparison to
Series of type string (e.g. you can’t add strings to each other: s + " " + s won’t work if s is a Series of type
category). Also, .str methods which operate on elements of type list are not available on such a Series.
Warning: Before v.0.25.0, the .str-accessor did only the most rudimentary type checks. Starting with v.0.25.0,
the type of the Series is inferred and the allowed types (i.e. strings) are enforced more rigorously.
Generally speaking, the .str accessor is intended to work only on strings. With very few exceptions, other uses
are not supported, and may be disabled at a later point.
In [39]: s2.str.split("_")
Out[39]:
0 [a, b, c]
1 [c, d, e]
2 <NA>
3 [f, g, h]
dtype: object
In [41]: s2.str.split("_").str[1]
Out[41]:
0 b
1 d
2 <NA>
3 g
dtype: object
When original Series has StringDtype, the output columns will all be StringDtype as well.
It is also possible to limit the number of splits:
In [43]: s2.str.split("_", expand=True, n=1)
Out[43]:
0 1
0 a b_c
1 c d_e
2 <NA> <NA>
3 f g_h
rsplit is similar to split except it works in the reverse direction, i.e., from the end of the string to the beginning
of the string:
In [44]: s2.str.rsplit("_", expand=True, n=1)
Out[44]:
0 1
0 a_b c
1 c_d e
2 <NA> <NA>
3 f_g h
In [46]: s3
Out[46]:
Warning: Some caution must be taken when dealing with regular expressions! The current behavior is to treat
single character patterns as literal strings, even when regex is set to True. This behavior is deprecated and will
be removed in a future version so that the regex keyword is always respected.
The replace method can also take a callable as replacement. It is called on every pat using re.sub(). The
callable should expect one positional argument (a regex object) and return a string.
The replace method also accepts a compiled regular expression object from re.compile() as a pattern. All
flags should be included in the compiled regular expression object.
In [57]: import re
Including a flags argument when calling replace with a compiled regular expression object will raise a
ValueError.
2.9.4 Concatenation
There are several ways to concatenate a Series or Index, either with itself or others, all based on cat(), resp.
Index.str.cat.
In [62]: s.str.cat(sep=",")
Out[62]: 'a,b,c,d'
If not specified, the keyword sep for the separator defaults to the empty string, sep='':
In [63]: s.str.cat()
Out[63]: 'abcd'
By default, missing values are ignored. Using na_rep, they can be given a representation:
In [64]: t = pd.Series(["a", "b", np.nan, "d"], dtype="string")
In [65]: t.str.cat(sep=",")
Out[65]: 'a,b,d'
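The na_rep variant was elided; a sketch with the same t:

t.str.cat(sep=",", na_rep="-")  # 'a,b,-,d' instead of silently dropping the missing value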
The first argument to cat() can be a list-like object, provided that it matches the length of the calling Series (or
Index).
In [67]: s.str.cat(["A", "B", "C", "D"])
Out[67]:
0 aA
1 bB
2 cC
3 dD
dtype: string
Missing values on either side will result in missing values in the result as well, unless na_rep is specified:
In [68]: s.str.cat(t)
Out[68]:
0 aa
1 bb
2 <NA>
3      dd
dtype: string
The parameter others can also be two-dimensional. In this case, the number of rows must match the lengths of the
calling Series (or Index).
In [71]: s
Out[71]:
0 a
1 b
2 c
3 d
dtype: string
In [72]: d
Out[72]:
0 1
0 a a
1 b b
2 <NA> c
3 d d
For concatenation with a Series or DataFrame, it is possible to align the indexes before concatenation by setting
the join-keyword.
In [75]: s
Out[75]:
0 a
1 b
2    c
3    d
dtype: string
In [76]: u
Out[76]:
1 b
3 d
0 a
2 c
dtype: string
In [77]: s.str.cat(u)
Out[77]:
0 aa
1 bb
2 cc
3 dd
dtype: string
Warning: If the join keyword is not passed, the method cat() will currently fall back to the behavior before
version 0.23.0 (i.e. no alignment), but a FutureWarning will be raised if any of the involved indexes differ,
since this default will change to join='left' in a future version.
The usual options are available for join (one of 'left', 'outer', 'inner', 'right'). In particular,
alignment also means that the different lengths do not need to coincide anymore.
In [80]: s
Out[80]:
0 a
1 b
2 c
3 d
dtype: string
In [81]: v
Out[81]:
-1 z
0 a
1 b
3 d
4 e
dtype: string
In [85]: s
Out[85]:
0 a
1 b
2 c
3 d
dtype: string
In [86]: f
Out[86]:
0 1
3 d d
2 <NA> c
1 b b
0 a a
Several array-like items (specifically: Series, Index, and 1-dimensional variants of np.ndarray) can be com-
bined in a list-like container (including iterators, dict-views, etc.).
In [88]: s
Out[88]:
0 a
1 b
2 c
3 d
dtype: string
In [89]: u
Out[89]:
1 b
3 d
0 a
2 c
dtype: string
All elements without an index (e.g. np.ndarray) within the passed list-like must match in length to the calling
Series (or Index), but Series and Index may have arbitrary length (as long as alignment is not disabled with
join=None):
In [91]: v
Out[91]:
-1 z
0 a
1 b
3 d
4 e
dtype: string
If using join='right' on a list-like of others that contains different indexes, the union of these indexes will be
used as the basis for the final concatenation:
In [93]: u.loc[[3]]
Out[93]:
3    d
dtype: string
You can use [] notation to directly index by position locations. If you index past the end of the string, the result will
be a NaN.
In [96]: s = pd.Series(
....: ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype=
˓→"string"
....: )
....:
In [97]: s.str[0]
Out[97]:
0 A
1 B
2 C
3 A
4 B
5 <NA>
6 C
7 d
8 c
dtype: string
In [98]: s.str[1]
Out[98]:
0 <NA>
1 <NA>
2 <NA>
3 a
4 a
5 <NA>
6 A
7 o
8 a
dtype: string
Warning: Before version 0.23, argument expand of the extract method defaulted to False. When
expand=False, expand returns a Series, Index, or DataFrame, depending on the subject and regu-
lar expression pattern. When expand=True, it always returns a DataFrame, which is more consistent and less
confusing from the perspective of a user. expand=True has been the default since version 0.23.0.
The extract method accepts a regular expression with at least one capture group.
Extracting a regular expression with more than one group returns a DataFrame with one column per group.
In [99]: pd.Series(
....: ["a1", "b2", "c3"],
....: dtype="string",
....: ).str.extract(r"([ab])(\d)", expand=False)
....:
Out[99]:
0 1
0 a 1
1 b 2
2 <NA> <NA>
Elements that do not match return a row filled with NaN. Thus, a Series of messy strings can be “converted” into a
like-indexed Series or DataFrame of cleaned-up or more useful strings, without necessitating get() to access tuples
or re.match objects. The dtype of the result is always object, even if no match is found and the result only contains
NaN.
Named groups (e.g. (?P<letter>[ab])) and optional groups like
In [101]: pd.Series(
.....: ["a1", "b2", "3"],
.....: dtype="string",
.....: ).str.extract(r"([ab])?(\d)", expand=False)
.....:
Out[101]:
0 1
0 a 1
1 b 2
2 <NA> 3
can also be used. Note that any capture group names in the regular expression will be used for column names;
otherwise capture group numbers will be used.
Extracting a regular expression with one group returns a DataFrame with one column if expand=True; with expand=False it returns a Series:
Out[102]:
0
0 1
1 2
2 <NA>
Out[103]:
0 1
1 2
2 <NA>
dtype: string
Calling on an Index with a regex with exactly one capture group returns a DataFrame with one column if
expand=True.
In [105]: s
Out[105]:
A11 a1
B22 b2
C33 c3
dtype: string
Calling on an Index with a regex with more than one capture group returns a DataFrame if expand=True.
The table below summarizes the behavior of extract(expand=False) (input subject in first column, number of
groups in regex in first row)
In [110]: s
Out[110]:
A a1a2
B b1
C c1
dtype: string
Unlike extract (which returns only the first match), the extractall method returns every match. The result of extractall is always a DataFrame with a
MultiIndex on its rows. The last level of the MultiIndex is named match and indicates the order in the
subject.
In [113]: s.str.extractall(two_groups)
Out[113]:
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 c 1
When each subject string in the Series has exactly one match, extractall(pat).xs(0, level="match") gives the same result as extract(pat) (apart from the extra match level):
In [115]: s
Out[115]:
0 a3
1 b3
2 c2
dtype: string
In [117]: extract_result
Out[117]:
letter digit
0 a 3
1 b 3
2 c 2
In [119]: extractall_result
Out[119]:
letter digit
match
0 0 a 3
1 0 b 3
2 0 c 2
Index also supports .str.extractall. It returns a DataFrame which has the same result as a Series.str.extractall with a default index (starts from 0).
In [124]: pd.Series(
.....: ["1", "2", "3a", "3b", "03c", "4dx"],
.....: dtype="string",
.....: ).str.contains(pattern)
.....:
Out[124]:
0 False
1 False
2 True
3 True
4 True
5 True
dtype: boolean
In [125]: pd.Series(
.....: ["1", "2", "3a", "3b", "03c", "4dx"],
.....: dtype="string",
.....: ).str.match(pattern)
.....:
Out[125]:
0 False
1 False
2 True
3 True
4 False
5 True
dtype: boolean
In [126]: pd.Series(
.....: ["1", "2", "3a", "3b", "03c", "4dx"],
.....: dtype="string",
.....: ).str.fullmatch(pattern)
.....:
Out[126]:
0 False
1 False
2 True
3 True
4 False
5 False
dtype: boolean
Note: The distinction between match, fullmatch, and contains is strictness: fullmatch tests whether
the entire string matches the regular expression; match tests whether there is a match of the regular expression that
begins at the first character of the string; and contains tests whether there is a match of the regular expression at
any position within the string.
The corresponding functions in the re package for these three match modes are re.fullmatch, re.match, and re.search,
respectively.
Methods like match, fullmatch, contains, startswith, and endswith take an extra na argument so
missing values can be considered True or False:
In [127]: s4 = pd.Series(
.....: ["A", "B", "C", "Aaba", "Baca", np.nan, "CABA", "dog", "cat"], dtype=
˓→"string"
.....: )
.....:
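The calls using na were elided; a sketch with s4 as defined above:

s4.str.contains("A", na=False)  # the missing entry becomes False rather than <NA>
s4.str.contains("A", na=True)   # ... or True, if that is the more useful default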
You can extract dummy variables from string columns. For example, if they are separated by a '|':
In [130]: s.str.get_dummies(sep="|")
Out[130]:
a b c
0 1 0 0
1 1 1 0
2 0 0 0
3 1 0 1
In [132]: idx.str.get_dummies(sep="|")
Out[132]:
MultiIndex([(1, 0, 0),
(1, 1, 0),
(0, 0, 0),
(1, 0, 1)],
names=['a', 'b', 'c'])
Method Description
cat() Concatenate strings
split() Split strings on delimiter
rsplit() Split strings on delimiter working from the end of the string
get() Index into each element (retrieve i-th element)
join() Join strings in each element of the Series with passed separator
get_dummies() Split strings on the delimiter returning DataFrame of dummy variables
contains() Return boolean array if each string contains pattern/regex
replace() Replace occurrences of pattern/regex/string with some other string or the return value of a
callable given the occurrence
repeat() Duplicate values (s.str.repeat(3) equivalent to x * 3)
pad() Add whitespace to left, right, or both sides of strings
center() Equivalent to str.center
ljust() Equivalent to str.ljust
rjust() Equivalent to str.rjust
zfill() Equivalent to str.zfill
wrap() Split long strings into lines with length less than a given width
slice() Slice each string in the Series
slice_replace() Replace slice in each string with passed value
count() Count occurrences of pattern
startswith() Equivalent to str.startswith(pat) for each element
endswith() Equivalent to str.endswith(pat) for each element
findall() Compute list of all occurrences of pattern/regex for each string
match() Determine whether each string begins with a match of the regular expression (calls re.match on each element)
extract() Call re.search on each element, returning DataFrame with one row for each element
and one column for each regex capture group
extractall() Call re.findall on each element, returning DataFrame with one row for each match
and one column for each regex capture group
len() Compute string lengths
strip() Equivalent to str.strip
rstrip() Equivalent to str.rstrip
lstrip() Equivalent to str.lstrip
partition() Equivalent to str.partition
rpartition() Equivalent to str.rpartition
lower() Equivalent to str.lower
casefold() Equivalent to str.casefold
upper() Equivalent to str.upper
find() Equivalent to str.find
rfind() Equivalent to str.rfind
index() Equivalent to str.index
rindex() Equivalent to str.rindex
capitalize() Equivalent to str.capitalize
swapcase() Equivalent to str.swapcase
normalize() Return Unicode normal form. Equivalent to unicodedata.normalize
translate() Equivalent to str.translate
isalnum() Equivalent to str.isalnum
isalpha() Equivalent to str.isalpha
isdigit() Equivalent to str.isdigit
isspace() Equivalent to str.isspace
In this section, we will discuss missing (also referred to as NA) values in pandas.
Note: The choice of using NaN internally to denote missing data was largely for simplicity and performance reasons.
Starting from pandas 1.0, some optional data types start experimenting with a native NA scalar using a mask-based
approach. See here for more.
As data comes in many shapes and forms, pandas aims to be flexible with regard to handling missing data. While
NaN is the default missing value marker for reasons of computational speed and convenience, we need to be able to
easily detect this value with data of different types: floating point, integer, boolean, and general object. In many cases,
however, the Python None will arise and we wish to also consider that “missing” or “not available” or “NA”.
Note: If you want to consider inf and -inf to be “NA” in computations, you can set pandas.options.mode.use_inf_as_na = True.
In [1]: df = pd.DataFrame(
...: np.random.randn(5, 3),
...: index=["a", "c", "e", "f", "h"],
...: columns=["one", "two", "three"],
...: )
...:
In [4]: df
Out[4]:
one two three four five
a 0.469112 -0.282863 -1.509059 bar True
c -1.135632 1.212112 -0.173215 bar False
e 0.119209 -1.044236 -0.861849 bar True
f -2.104569 -0.494929 1.071804 bar False
h 0.721555 -0.706771 -1.039575 bar True
In [5]: df2 = df.reindex(["a", "b", "c", "d", "e", "f", "g", "h"])
To make detecting missing values easier (and across different array dtypes), pandas provides the isna() and
notna() functions, which are also methods on Series and DataFrame objects:
In [7]: df2["one"]
Out[7]:
a 0.469112
b NaN
c -1.135632
d NaN
e 0.119209
f -2.104569
g NaN
h 0.721555
Name: one, dtype: float64
In [8]: pd.isna(df2["one"])
Out[8]:
a False
b True
c False
d True
e False
f False
g True
h False
Name: one, dtype: bool
In [9]: df2["four"].notna()
Out[9]:
a True
b False
c True
d False
e True
f True
g False
h True
Name: four, dtype: bool
In [10]: df2.isna()
Out[10]:
one two three four five
a False False False False False
b True True True True True
Warning: One has to be mindful that in Python (and NumPy), the nan's don’t compare equal, but None's do.
Note that pandas/NumPy uses the fact that np.nan != np.nan, and treats None like np.nan.
In [11]: None == None # noqa: E711
Out[11]: True
So as compared to above, a scalar equality comparison versus a None/np.nan doesn’t provide useful information.
In [13]: df2["one"] == np.nan
Out[13]:
a False
b False
c False
d False
e False
f False
g False
h False
Name: one, dtype: bool
Because NaN is a float, a column of integers with even one missing value is cast to floating-point dtype (see Support
for integer NA for more). pandas provides a nullable integer array, which can be used by explicitly requesting the
dtype:
Alternatively, the string alias dtype='Int64' (note the capital "I") can be used.
See Nullable integer data type for more.
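For example, a minimal sketch (the values are illustrative):
>>> pd.Series([1, 2, np.nan, 4], dtype="Int64")   # integers are preserved; the missing entry becomes <NA>
>>> pd.Series([1, 2, np.nan, 4]).dtype            # without the nullable dtype this is float64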
Datetimes
For datetime64[ns] types, NaT represents missing values. This is a pseudo-native sentinel value that can be represented
by NumPy in a singular dtype (datetime64[ns]). pandas objects provide compatibility between NaT and NaN.
In [17]: df2
Out[17]:
one two three four five timestamp
a 0.469112 -0.282863 -1.509059 bar True 2012-01-01
c -1.135632 1.212112 -0.173215 bar False 2012-01-01
e 0.119209 -1.044236 -0.861849 bar True 2012-01-01
f -2.104569 -0.494929 1.071804 bar False 2012-01-01
h 0.721555 -0.706771 -1.039575 bar True 2012-01-01
In [19]: df2
Out[19]:
one two three four five timestamp
a NaN -0.282863 -1.509059 bar True NaT
c NaN 1.212112 -0.173215 bar False NaT
e 0.119209 -1.044236 -0.861849 bar True 2012-01-01
f -2.104569 -0.494929 1.071804 bar False 2012-01-01
h NaN -0.706771 -1.039575 bar True NaT
In [20]: df2.dtypes.value_counts()
Out[20]:
float64 3
object 1
bool 1
datetime64[ns] 1
dtype: int64
You can insert missing values by simply assigning to containers. The actual missing value used will be chosen based
on the dtype.
For example, numeric containers will always use NaN regardless of the missing value type chosen:
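A sketch of what such assignments look like, consistent with the outputs shown below:
>>> s = pd.Series([1, 2, 3])
>>> s.loc[0] = None        # stored as NaN; the dtype becomes float64
>>> s2 = pd.Series(["a", "b", "c"])
>>> s2.loc[0] = None       # object containers keep None as-is
>>> s2.loc[1] = np.nan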
In [23]: s
Out[23]:
0 NaN
1 2.0
2 3.0
dtype: float64
In [27]: s
Out[27]:
0 None
1 NaN
2 c
dtype: object
Missing values propagate naturally through arithmetic operations between pandas objects.
In [28]: a
Out[28]:
one two
a NaN -0.282863
c NaN 1.212112
e 0.119209 -1.044236
f -2.104569 -0.494929
h -2.104569 -0.706771
In [29]: b
Out[29]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h NaN -0.706771 -1.039575
In [30]: a + b
Out[30]:
one three two
a NaN NaN -0.565727
c NaN NaN 2.424224
e 0.238417 NaN -2.088472
f -4.209138 NaN -0.989859
h NaN NaN -1.413542
The descriptive statistics and computational methods discussed in the data structure overview (and listed here and
here) are all written to account for missing data. For example:
• When summing data, NA (missing) values will be treated as zero.
• If the data are all NA, the result will be 0.
• Cumulative methods like cumsum() and cumprod() ignore NA values by default, but preserve them in the
resulting arrays. To override this behaviour and include NA values, use skipna=False.
In [31]: df
Out[31]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h NaN -0.706771 -1.039575
In [32]: df["one"].sum()
Out[32]: -1.9853605075978744
In [33]: df.mean(1)
Out[33]:
a -0.895961
c 0.519449
e -0.595625
f -0.509232
h -0.873173
dtype: float64
In [34]: df.cumsum()
Out[34]:
one two three
a NaN -0.282863 -1.509059
c NaN 0.929249 -1.682273
e 0.119209 -0.114987 -2.544122
f -1.985361 -0.609917 -1.472318
h NaN -1.316688 -2.511893
In [35]: df.cumsum(skipna=False)
Out[35]:
one two three
a NaN -0.282863 -1.509059
c NaN 0.929249 -1.682273
e NaN -0.114987 -2.544122
f NaN -0.609917 -1.472318
h NaN -1.316688 -2.511893
Warning: This behavior is now standard as of v0.22.0 and is consistent with the default in numpy; previously
sum/prod of all-NA or empty Series/DataFrames would return NaN. See v0.22.0 whatsnew for more.
In [36]: pd.Series([np.nan]).sum()
Out[36]: 0.0
In [38]: pd.Series([np.nan]).prod()
Out[38]: 1.0
NA groups in GroupBy are automatically excluded. This behavior is consistent with R, for example:
In [40]: df
Out[40]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h NaN -0.706771 -1.039575
In [41]: df.groupby("one").mean()
Out[41]:
two three
one
-2.104569 -0.494929 1.071804
0.119209 -1.044236 -0.861849
pandas objects are equipped with various data manipulation methods for dealing with missing data.
fillna() can “fill in” NA values with non-NA data in a couple of ways, which we illustrate:
Replace NA with a scalar value
In [42]: df2
Out[42]:
one two three four five timestamp
a NaN -0.282863 -1.509059 bar True NaT
c NaN 1.212112 -0.173215 bar False NaT
e 0.119209 -1.044236 -0.861849 bar True 2012-01-01
f -2.104569 -0.494929 1.071804 bar False 2012-01-01
h NaN -0.706771 -1.039575 bar True NaT
In [43]: df2.fillna(0)
Out[43]:
one two three four five timestamp
a 0.000000 -0.282863 -1.509059 bar True 0
c 0.000000 1.212112 -0.173215 bar False 0
e 0.119209 -1.044236 -0.861849 bar True 2012-01-01 00:00:00
f -2.104569 -0.494929 1.071804 bar False 2012-01-01 00:00:00
In [44]: df2["one"].fillna("missing")
Out[44]:
a missing
c missing
e 0.119209
f -2.104569
h missing
Name: one, dtype: object
In [45]: df
Out[45]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h NaN -0.706771 -1.039575
In [46]: df.fillna(method="pad")
Out[46]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e 0.119209 -1.044236 -0.861849
f -2.104569 -0.494929 1.071804
h -2.104569 -0.706771 -1.039575
In [47]: df
Out[47]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e NaN NaN NaN
f NaN NaN NaN
h NaN -0.706771 -1.039575
Method Action
pad / ffill Fill values forward
bfill / backfill Fill values backward
With time series data, using pad/ffill is extremely common so that the “last known value” is available at every time
point.
ffill() is equivalent to fillna(method='ffill') and bfill() is equivalent to
fillna(method='bfill')
You can also fillna using a dict or Series that is alignable. The labels of the dict or index of the Series must match the
columns of the frame you wish to fill. The use case of this is to fill a DataFrame with the mean of that column.
In [53]: dff
Out[53]:
A B C
0 0.271860 -0.424972 0.567020
1 0.276232 -1.087401 -0.673690
2 0.113648 -1.478427 0.524988
3 NaN 0.577046 -1.715002
4 NaN NaN -1.157892
5 -1.344312 NaN NaN
6 -0.109050 1.643563 NaN
7 0.357021 -0.674600 NaN
8 -0.968914 -1.294524 0.413738
9 0.276662 -0.472035 -0.013960
In [54]: dff.fillna(dff.mean())
Out[54]:
A B C
0 0.271860 -0.424972 0.567020
1 0.276232 -1.087401 -0.673690
2 0.113648 -1.478427 0.524988
3 -0.140857 0.577046 -1.715002
4 -0.140857 -0.401419 -1.157892
5 -1.344312 -0.401419 -0.293543
6 -0.109050 1.643563 -0.293543
7 0.357021 -0.674600 -0.293543
8 -0.968914 -1.294524 0.413738
9 0.276662 -0.472035 -0.013960
In [55]: dff.fillna(dff.mean()["B":"C"])
Out[55]:
A B C
0 0.271860 -0.424972 0.567020
Same result as above, but this time aligning the ‘fill’ value, which is a Series in this case.
You may wish to simply exclude labels from a data set which refer to missing data. To do this, use dropna():
In [57]: df
Out[57]:
one two three
a NaN -0.282863 -1.509059
c NaN 1.212112 -0.173215
e NaN 0.000000 0.000000
f NaN 0.000000 0.000000
h NaN -0.706771 -1.039575
In [58]: df.dropna(axis=0)
Out[58]:
Empty DataFrame
Columns: [one, two, three]
Index: []
In [59]: df.dropna(axis=1)
Out[59]:
two three
a -0.282863 -1.509059
c 1.212112 -0.173215
e 0.000000 0.000000
f 0.000000 0.000000
h -0.706771 -1.039575
An equivalent dropna() is available for Series. DataFrame.dropna has considerably more options than Series.dropna, which can be examined in the API.
2.10.9 Interpolation
Both Series and DataFrame objects have interpolate() that, by default, performs linear interpolation at missing
data points.
In [61]: ts
Out[61]:
2000-01-31 0.469112
2000-02-29 NaN
2000-03-31 NaN
2000-04-28 NaN
2000-05-31 NaN
...
2007-12-31 -6.950267
2008-01-31 -7.904475
2008-02-29 -6.441779
2008-03-31 -8.184940
2008-04-30 -9.011531
Freq: BM, Length: 100, dtype: float64
In [62]: ts.count()
Out[62]: 66
In [63]: ts.plot()
Out[63]: <AxesSubplot:>
In [64]: ts.interpolate()
Out[64]:
2000-01-31 0.469112
2000-02-29 0.434469
2000-03-31 0.399826
2000-04-28 0.365184
2000-05-31 0.330541
...
2007-12-31 -6.950267
2008-01-31 -7.904475
2008-02-29 -6.441779
2008-03-31 -8.184940
2008-04-30 -9.011531
Freq: BM, Length: 100, dtype: float64
In [65]: ts.interpolate().count()
Out[65]: 100
In [66]: ts.interpolate().plot()
Out[66]: <AxesSubplot:>
In [67]: ts2
Out[67]:
2000-01-31 0.469112
2000-02-29 NaN
2002-07-31 -5.785037
2005-01-31 NaN
2008-04-30 -9.011531
dtype: float64
In [68]: ts2.interpolate()
Out[68]:
2000-01-31 0.469112
2000-02-29 -2.657962
2002-07-31 -5.785037
2005-01-31 -7.398284
2008-04-30 -9.011531
dtype: float64
In [69]: ts2.interpolate(method="time")
Out[69]:
2000-01-31 0.469112
2000-02-29 0.270241
2002-07-31 -5.785037
In [70]: ser
Out[70]:
0.0 0.0
1.0 NaN
10.0 10.0
dtype: float64
In [71]: ser.interpolate()
Out[71]:
0.0 0.0
1.0 5.0
10.0 10.0
dtype: float64
In [72]: ser.interpolate(method="values")
Out[72]:
0.0 0.0
1.0 1.0
10.0 10.0
dtype: float64
In [73]: df = pd.DataFrame(
....: {
....: "A": [1, 2.1, np.nan, 4.7, 5.6, 6.8],
....: "B": [0.25, np.nan, np.nan, 4, 12.2, 14.4],
....: }
....: )
....:
In [74]: df
Out[74]:
A B
0 1.0 0.25
1 2.1 NaN
2 NaN NaN
3 4.7 4.00
4 5.6 12.20
5 6.8 14.40
In [75]: df.interpolate()
Out[75]:
A B
0 1.0 0.25
1 2.1 1.50
2 3.4 2.75
3 4.7 4.00
4 5.6 12.20
5 6.8 14.40
The method argument gives access to fancier interpolation methods. If you have scipy installed, you can pass the
name of a 1-d interpolation routine to method. You’ll want to consult the full scipy interpolation documentation and
reference guide for details. The appropriate interpolation method will depend on the type of data you are working
with.
• If you are dealing with a time series that is growing at an increasing rate, method='quadratic' may be
appropriate.
• If you have values approximating a cumulative distribution function, then method='pchip' should work
well.
• To fill missing values with goal of smooth plotting, consider method='akima'.
In [76]: df.interpolate(method="barycentric")
Out[76]:
A B
0 1.00 0.250
1 2.10 -7.660
2 3.53 -4.515
3 4.70 4.000
4 5.60 12.200
5 6.80 14.400
In [77]: df.interpolate(method="pchip")
Out[77]:
A B
0 1.00000 0.250000
1 2.10000 0.672808
2 3.43454 1.928950
3 4.70000 4.000000
4 5.60000 12.200000
5 6.80000 14.400000
In [78]: df.interpolate(method="akima")
Out[78]:
A B
0 1.000000 0.250000
1 2.100000 -0.873316
2 3.406667 0.320034
3 4.700000 4.000000
4 5.600000 12.200000
5 6.800000 14.400000
When interpolating via a polynomial or spline approximation, you must also specify the degree or order of the approximation:
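For example, assuming scipy is installed (a sketch; the order value is illustrative):
>>> df.interpolate(method="polynomial", order=2)   # fit a degree-2 polynomial through the known points
>>> df.interpolate(method="spline", order=2)       # spline of order 2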
In [81]: np.random.seed(2)
In [83]: missing = np.array([4, 13, 14, 15, 16, 17, 18, 20, 29])
In [87]: df.plot()
Out[87]: <AxesSubplot:>
Another use case is interpolation at new values. Suppose you have 100 observations from some distribution. And let’s
suppose that you’re particularly interested in what’s happening around the middle. You can mix pandas’ reindex
and interpolate methods to interpolate at the new values.
# interpolate at new_index
In [89]: new_index = ser.index.union(pd.Index([49.25, 49.5, 49.75, 50.25, 50.5, 50.75]))
In [91]: interp_s[49:51]
Out[91]:
49.00 0.471410
49.25 0.476841
49.50 0.481780
49.75 0.485998
50.00 0.489266
50.25 0.491814
50.50 0.493995
50.75 0.495763
51.00 0.497074
dtype: float64
Interpolation limits
Like other pandas fill methods, interpolate() accepts a limit keyword argument. Use this argument to limit
the number of consecutive NaN values filled since the last valid observation:
In [92]: ser = pd.Series([np.nan, np.nan, 5, np.nan, np.nan, np.nan, 13, np.nan, np.nan])
In [93]: ser
Out[93]:
0 NaN
1 NaN
2 5.0
3 NaN
4 NaN
5 NaN
6 13.0
7 NaN
8 NaN
dtype: float64
By default, NaN values are filled in a forward direction. Use limit_direction parameter to fill backward or
from both directions.
# fill one consecutive value backwards
In [96]: ser.interpolate(limit=1, limit_direction="backward")
Out[96]:
0 NaN
1 5.0
2 5.0
By default, NaN values are filled whether they are inside (surrounded by) existing valid values, or outside existing
valid values. The limit_area parameter restricts filling to either inside or outside values.
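A short sketch combining these keywords, using the ser Series defined above:
>>> ser.interpolate(limit_direction="both", limit=1)             # fill at most one NaN on each side of a valid value
>>> ser.interpolate(limit_direction="both", limit_area="inside")  # only fill NaNs surrounded by valid values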
In [103]: ser.replace(0, 5)
Out[103]:
0 5.0
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
Instead of replacing with specified values, you can treat all given values as missing and interpolate over them:
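One way to do this is the method argument of replace() (a sketch, using the ser Series shown above):
>>> ser = pd.Series([0.0, 1.0, 2.0, 3.0, 4.0])
>>> ser.replace([1, 2, 3], method="pad")   # 1, 2 and 3 are treated as missing and forward filled from 0.0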
Note: Python strings prefixed with the r character such as r'hello world' are so-called “raw” strings. They
have different semantics regarding backslashes than strings without this prefix. In a raw string a backslash is kept as
a literal backslash, e.g., r'\a' == '\\a'. You should read about them if this is unclear.
In [109]: d = {"a": list(range(4)), "b": list("ab.."), "c": ["a", "b", np.nan, "d"]}
In [110]: df = pd.DataFrame(d)
Now do it with a regular expression that removes surrounding whitespace (regex -> regex):
Same as the previous example, but use a regular expression for searching instead (dict of regex -> dict):
You can pass nested dictionaries of regular expressions that use regex=True:
You can also use the group of a regular expression match when replacing (dict of regex -> dict of regex), this works
for lists as well.
You can pass a list of regular expressions, of which those that match will be replaced with a scalar (list of regex ->
regex).
All of the regular expression examples can also be passed with the to_replace argument as the regex argument.
In this case the value argument must be passed explicitly by name or regex must be a nested dictionary. The
previous example, in this case, would then be:
This can be convenient if you do not want to pass regex=True every time you want to use a regular expression.
Note: Anywhere in the above replace examples that you see a regular expression a compiled regular expression is
valid as well.
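A few sketches of the patterns described above, using the df built from the dict d (the exact regex is illustrative):
>>> df.replace(r"\s*\.\s*", np.nan, regex=True)                  # regex -> value
>>> df.replace({"b": r"\s*\.\s*"}, {"b": np.nan}, regex=True)    # dict of regex -> dict of values
>>> df.replace(regex=r"\s*\.\s*", value=np.nan)                  # pass the pattern via the regex argument
>>> import re
>>> df.replace(re.compile(r"\s*\.\s*"), np.nan)                  # a compiled regular expression also works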
In [127]: df[1].dtype
Out[127]: dtype('float64')
While pandas supports storing arrays of integer and boolean type, these types are not capable of storing missing data.
Until we can switch to using a native NA type in NumPy, we’ve established some “casting rules”. When a reindexing
operation introduces missing data, the Series will be cast according to the rules introduced in the table below.
For example:
In [130]: s > 0
Out[130]:
0 True
2 True
4 True
6 True
7 True
dtype: bool
In [133]: crit
Out[133]:
0 True
1 NaN
2 True
3 NaN
4 True
5 NaN
6 True
7 True
dtype: object
In [134]: crit.dtype
Out[134]: dtype('O')
Ordinarily NumPy will complain if you try to use an object array (even if it contains boolean values) instead of a
boolean array to get or set values from an ndarray (e.g. selecting values based on some criteria). If a boolean vector
contains NAs, an exception will be generated:
In [136]: reindexed[crit]
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-136-0dac417a4890> in <module>
----> 1 reindexed[crit]

/pandas/pandas/core/common.py in is_bool_indexer(key)
    112             # Don't raise on e.g. ["A", "B", np.nan], see
    113             # test_loc_getitem_list_of_labels_categoricalindex_with_na
However, these can be filled in using fillna() and it will work fine:
In [137]: reindexed[crit.fillna(False)]
Out[137]:
0 0.126504
2 0.696198
4 0.697416
6 0.601516
7 0.003659
dtype: float64
In [138]: reindexed[crit.fillna(True)]
Out[138]:
0 0.126504
1 0.000000
2 0.696198
3 0.000000
4 0.697416
5 0.000000
6 0.601516
7 0.003659
dtype: float64
pandas provides a nullable integer dtype, but you must explicitly request it when creating the series or column. Notice
that we use a capital “I” in the dtype="Int64".
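A construction consistent with the output shown below (a sketch):
>>> s = pd.Series([0, 1, None, 3, 4], dtype="Int64")   # note the capital "I"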
In [140]: s
Out[140]:
0 0
1 1
2 <NA>
3 3
4 4
dtype: Int64
Warning: Experimental: the behaviour of pd.NA can still change without warning.
In [142]: s
Out[142]:
0 1
1 2
2 <NA>
dtype: Int64
In [143]: s[2]
Out[143]: <NA>
Currently, pandas does not yet use those data types by default (when creating a DataFrame or Series, or when reading
in data), so you need to specify the dtype explicitly. An easy way to convert to those dtypes is explained here.
In general, missing values propagate in operations involving pd.NA. When one of the operands is unknown, the
outcome of the operation is also unknown.
For example, pd.NA propagates in arithmetic operations, similarly to np.nan:
In [145]: pd.NA + 1
Out[145]: <NA>
There are a few special cases when the result is known, even when one of the operands is NA.
In [147]: pd.NA ** 0
Out[147]: 1
In [148]: 1 ** pd.NA
Out[148]: 1
In equality and comparison operations, pd.NA also propagates. This deviates from the behaviour of np.nan, where
comparisons with np.nan always return False.
In [149]: pd.NA == 1
Out[149]: <NA>
An exception to this basic propagation rule is reductions (such as the mean or the minimum), where pandas defaults
to skipping missing values. See above for more.
Logical operations
For logical operations, pd.NA follows the rules of the three-valued logic (or Kleene logic, similarly to R, SQL and
Julia). This logic means to only propagate missing values when it is logically required.
For example, for the logical “or” operation (|), if one of the operands is True, we already know the result will be
True regardless of the other value (so regardless of whether the missing value is True or False). In this case, pd.NA
does not propagate:
In [153]: True | False
Out[153]: True
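With pd.NA as the other operand the same reasoning applies (a sketch):
>>> pd.NA | True    # returns True: the result is known regardless of the NA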
On the other hand, if one of the operands is False, the result depends on the value of the other operand. Therefore,
in this case pd.NA propagates:
In [156]: False | True
Out[156]: True
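And with pd.NA (a sketch):
>>> pd.NA | False   # returns <NA>: the result depends on the unknown value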
The behaviour of the logical “and” operation (&) can be derived using similar logic (where now pd.NA will not
propagate if one of the operands is already False):
In [159]: False & True
Out[159]: False
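The corresponding pd.NA cases (a sketch):
>>> pd.NA & False   # returns False: the result is known regardless of the NA
>>> pd.NA & True    # returns <NA>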
NA in a boolean context
Since the actual value of an NA is unknown, it is ambiguous to convert NA to a boolean value. The following raises
an error:
In [165]: bool(pd.NA)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-165-5477a57d5abb> in <module>
----> 1 bool(pd.NA)
/pandas/pandas/_libs/missing.pyx in pandas._libs.missing.NAType.__bool__()
This also means that pd.NA cannot be used in a context where it is evaluated to a boolean, such as if condition:
... where condition can potentially be pd.NA. In such cases, isna() can be used to check for pd.NA or
condition being pd.NA can be avoided, for example by filling missing values beforehand.
A similar situation occurs when using Series or DataFrame objects in if statements, see Using if/truth statements with
pandas.
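A minimal sketch of checking explicitly instead of relying on truthiness (the variable name is illustrative):
>>> value = pd.NA
>>> if pd.isna(value):      # bool(value) would raise TypeError
...     value = False       # e.g. fill the missing value before branching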
NumPy ufuncs
pandas.NA implements NumPy’s __array_ufunc__ protocol. Most ufuncs work with NA, and generally return
NA:
In [166]: np.log(pd.NA)
Out[166]: <NA>
In [167]: np.add(pd.NA, 1)
Out[167]: <NA>
Warning: Currently, ufuncs involving an ndarray and NA will return an object-dtype filled with NA values.
In [168]: a = np.array([1, 2, 3])
The return type here may change to return a different array type in the future.
Conversion
If you have a DataFrame or Series using traditional types that have missing data represented using np.nan, there are
convenience methods convert_dtypes() in Series and convert_dtypes() in DataFrame that can convert
data to use the newer dtypes for integers, strings and booleans listed here. This is especially helpful after reading in
data sets when letting the readers such as read_csv() and read_excel() infer default dtypes.
In this example, while the dtypes of all columns are changed, we show the results for the first 10 columns.
In [171]: bb[bb.columns[:10]].dtypes
Out[171]:
player object
year int64
stint int64
team object
lg object
g int64
ab int64
r int64
h int64
X2b int64
dtype: object
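The bbn frame shown next can be obtained with a single call (a sketch; bb is the DataFrame whose dtypes are listed above):
>>> bbn = bb.convert_dtypes()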
In [173]: bbn[bbn.columns[:10]].dtypes
Out[173]:
player string
year Int64
stint Int64
team string
lg string
g Int64
ab Int64
r Int64
h Int64
X2b Int64
dtype: object
Index objects are not required to be unique; you can have duplicate row or column labels. This may be a bit confusing
at first. If you’re familiar with SQL, you know that row labels are similar to a primary key on a table, and you would
never want duplicates in a SQL table. But one of pandas’ roles is to clean messy, real-world data before it goes to
some downstream system. And real-world data has duplicates, even in fields that are supposed to be unique.
This section describes how duplicate labels change the behavior of certain operations, how to prevent duplicates
from arising during operations, and how to detect them if they do.
Some pandas methods (Series.reindex() for example) just don’t work with duplicates present. The output can’t
be determined, and so pandas raises.
In [3]: s1 = pd.Series([0, 1, 2], index=["a", "b", "b"])
Calling reindex() on s1 raises a ValueError, because the duplicate label "b" makes the result ambiguous (traceback abbreviated).
Other methods, like indexing, can give very surprising results. Typically indexing with a scalar will reduce dimensionality.
Slicing a DataFrame with a scalar will return a Series. Slicing a Series with a scalar will return a scalar.
But with duplicates, this isn’t the case.
In [6]: df1
Out[6]:
A A B
0 0 1 2
1 3 4 5
In [10]: df2
Out[10]:
A
a 0
a 1
b 2
You can check whether an Index (storing the row or column labels) is unique with Index.is_unique:
In [13]: df2
Out[13]:
A
a 0
a 1
b 2
In [14]: df2.index.is_unique
Out[14]: False
In [15]: df2.columns.is_unique
Out[15]: True
Note: Checking whether an index is unique is somewhat expensive for large datasets. pandas does cache this result,
so re-checking on the same index is very fast.
In [16]: df2.index.duplicated()
Out[16]: array([False, True, False])
In [17]: df2.loc[~df2.index.duplicated(), :]
Out[17]:
A
a 0
b 2
If you need additional logic to handle duplicate labels, rather than just dropping the repeats, using groupby() on the
index is a common trick. For example, we’ll resolve duplicates by taking the average of all rows with the same label.
In [18]: df2.groupby(level=0).mean()
Out[18]:
A
a 0.5
b 2.0
---------------------------------------------------------------------------
DuplicateLabelError Traceback (most recent call last)
<ipython-input-19-11af4ee9738e> in <module>
----> 1 pd.Series([0, 1, 2], index=["a", "b", "b"]).set_flags(allows_duplicate_labels=False)
/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
469 msg += f"\n{duplicates}"
470
--> 471 raise DuplicateLabelError(msg)
472
473 @final
This attribute can be checked or set with allows_duplicate_labels, which indicates whether that object can
have duplicate labels.
In [22]: df
Out[22]:
A
x 0
y 1
X 2
Y 3
In [23]: df.flags.allows_duplicate_labels
Out[23]: False
In [25]: df2.flags.allows_duplicate_labels
Out[25]: True
The new DataFrame returned is a view on the same data as the old DataFrame. Or the property can just be set
directly on the same object
In [26]: df2.flags.allows_duplicate_labels = False
In [27]: df2.flags.allows_duplicate_labels
Out[27]: False
When processing raw, messy data you might initially read in the messy data (which potentially has duplicate labels),
deduplicate, and then disallow duplicates going forward, to ensure that your data pipeline doesn’t introduce duplicates.
>>> raw = pd.read_csv("...")
>>> deduplicated = raw.groupby(level=0).first() # remove duplicates
>>> deduplicated.flags.allows_duplicate_labels = False # disallow going forward
Performing an operation that introduces duplicate labels on an object that disallows them raises a DuplicateLabelError (traceback abbreviated):

/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    469             msg += f"\n{duplicates}"
    470
--> 471             raise DuplicateLabelError(msg)
    472
    473     @final
This error message contains the labels that are duplicated, and the numeric positions of all the duplicates (including
the “original”) in the Series or DataFrame
In [30]: s1
Out[30]:
a 0
b 0
dtype: int64
Renaming this Series in a way that would introduce duplicate labels raises the same error (traceback abbreviated):

   4272         if callable(index) or is_dict_like(index):
-> 4273             return super().rename(
   4274                 index, copy=copy, inplace=inplace, level=level, errors=errors
   4275             )

/pandas/pandas/core/indexes/base.py in _maybe_check_unique(self)
    469             msg += f"\n{duplicates}"
    470
--> 471             raise DuplicateLabelError(msg)
    472
    473     @final
Warning: This is an experimental feature. Currently, many methods fail to propagate the
allows_duplicate_labels value. In future versions it is expected that every method taking or returning
one or more DataFrame or Series objects will propagate allows_duplicate_labels.
This is an introduction to pandas categorical data type, including a short comparison with R’s factor.
Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable
takes on a limited, and usually fixed, number of possible values (categories; levels in R). Examples are gender,
social class, blood type, country affiliation, observation time or rating via Likert scales.
In contrast to statistical categorical variables, categorical data might have an order (e.g. ‘strongly agree’ vs ‘agree’ or
‘first observation’ vs. ‘second observation’), but numerical operations (additions, divisions, . . . ) are not possible.
All values of categorical data are either in categories or np.nan. Order is defined by the order of categories,
not lexical order of the values. Internally, the data structure consists of a categories array and an integer array of
codes which point to the real value in the categories array.
The categorical data type is useful in the following cases:
• A string variable consisting of only a few different values. Converting such a string variable to a categorical
variable will save some memory, see here.
• The lexical order of a variable is not the same as the logical order (“one”, “two”, “three”). By converting to a
categorical and specifying an order on the categories, sorting and min/max will use the logical order instead of
the lexical order, see here.
• As a signal to other Python libraries that this column should be treated as a categorical variable (e.g. to use
suitable statistical methods or plot types).
See also the API docs on categoricals.
Series creation
In [2]: s
Out[2]:
0 a
1 b
2 c
3 a
In [5]: df
Out[5]:
A B
0 a a
1 b b
2 c c
3 a a
By using special functions, such as cut(), which groups data into discrete bins. See the example on tiling in the
docs.
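One way to produce bins like the group column shown below (a sketch; the data, bin edges and labels are illustrative):
>>> df = pd.DataFrame({"value": np.random.randint(0, 100, 20)})
>>> labels = ["{0} - {1}".format(i, i + 9) for i in range(0, 100, 10)]
>>> df["group"] = pd.cut(df.value, range(0, 105, 10), right=False, labels=labels)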
In [9]: df.head(10)
Out[9]:
value group
0 65 60 - 69
1 49 40 - 49
2 56 50 - 59
3 43 40 - 49
4 43 40 - 49
5 91 90 - 99
6 32 30 - 39
7 87 80 - 89
8 36 30 - 39
9 8 0 - 9
In [11]: s = pd.Series(raw_cat)
In [12]: s
Out[12]:
0 NaN
1 b
2 c
3 NaN
dtype: category
In [15]: df
Out[15]:
A B
0 a NaN
1 b b
2 c c
3 a NaN
In [16]: df.dtypes
Out[16]:
A object
B category
dtype: object
DataFrame creation
Similar to the previous section where a single column was converted to categorical, all columns in a DataFrame can
be batch converted to categorical either during or after construction.
This can be done during construction by specifying dtype="category" in the DataFrame constructor:
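A sketch consistent with the column contents shown below:
>>> df = pd.DataFrame({"A": list("abca"), "B": list("bccd")}, dtype="category")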
In [18]: df.dtypes
Out[18]:
A category
B category
dtype: object
Note that the categories present in each column differ; the conversion is done column by column, so only labels present
in a given column are categories:
In [19]: df["A"]
Out[19]:
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']
In [20]: df["B"]
Out[20]:
0 b
1 c
2 c
3 d
Analogously, all columns in an existing DataFrame can be batch converted using DataFrame.astype():
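For example (a sketch):
>>> df_cat = df.astype("category")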
In [23]: df_cat.dtypes
Out[23]:
A category
B category
dtype: object
In [24]: df_cat["A"]
Out[24]:
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (3, object): ['a', 'b', 'c']
In [25]: df_cat["B"]
Out[25]:
0 b
1 c
2 c
3 d
Name: B, dtype: category
Categories (3, object): ['b', 'c', 'd']
Controlling behavior
In the examples above where we passed dtype='category', we used the default behavior:
1. Categories are inferred from the data.
2. Categories are unordered.
To control those behaviors, instead of passing 'category', use an instance of CategoricalDtype.
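A sketch of constructing and applying such a dtype to the Series s used above (the listed categories are consistent with the NaN in the output below, since "a" is not among them):
>>> from pandas.api.types import CategoricalDtype
>>> cat_type = CategoricalDtype(categories=["b", "c", "d"], ordered=True)
>>> s_cat = s.astype(cat_type)     # values not listed in the categories become NaN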
In [30]: s_cat
Out[30]:
0 NaN
Similarly, a CategoricalDtype can be used with a DataFrame to ensure that categories are consistent among
all columns.
In [35]: df_cat["A"]
Out[35]:
0 a
1 b
2 c
3 a
Name: A, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
In [36]: df_cat["B"]
Out[36]:
0 b
1 c
2 c
3 d
Name: B, dtype: category
Categories (4, object): ['a' < 'b' < 'c' < 'd']
Note: To perform table-wise conversion, where all labels in the entire DataFrame are used as categories for each
column, the categories parameter can be determined programmatically by categories = pd.unique(df.to_numpy().ravel()).
If you already have codes and categories, you can use the from_codes() constructor to save the factorize
step during normal constructor mode:
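For example (a sketch; the codes and categories are illustrative):
>>> pd.Categorical.from_codes(codes=[0, 1, 0, 2], categories=["a", "b", "c"])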
To get back to the original Series or NumPy array, use Series.astype(original_dtype) or np.asarray(categorical):
In [40]: s
Out[40]:
0 a
1 b
2 c
3 a
dtype: object
In [41]: s2 = s.astype("category")
In [42]: s2
Out[42]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
In [43]: s2.astype(str)
Out[43]:
0 a
1 b
2 c
3 a
dtype: object
In [44]: np.asarray(s2)
Out[44]: array(['a', 'b', 'c', 'a'], dtype=object)
Note: In contrast to R’s factor function, categorical data does not convert input values to strings; categories will
end up the same data type as the original values.
Note: In contrast to R’s factor function, there is currently no way to assign/change labels at creation time. Use
categories to change the categories after creation time.
2.12.2 CategoricalDtype
In [48]: CategoricalDtype()
Out[48]: CategoricalDtype(categories=None, ordered=False)
A CategoricalDtype can be used in any place pandas expects a dtype. For example pandas.read_csv(),
pandas.DataFrame.astype(), or in the Series constructor.
Note: As a convenience, you can use the string 'category' in place of a CategoricalDtype when you want
the default behavior of the categories being unordered, and equal to the set values present in the array. In other words,
dtype='category' is equivalent to dtype=CategoricalDtype().
Equality semantics
Two instances of CategoricalDtype compare equal whenever they have the same categories and order. When
comparing two unordered categoricals, the order of the categories is not considered.
In [52]: c1 == "category"
Out[52]: True
2.12.3 Description
Using describe() on categorical data will produce similar output to a Series or DataFrame of type string.
In [53]: cat = pd.Categorical(["a", "c", "c", np.nan], categories=["b", "a", "c"])
In [55]: df.describe()
Out[55]:
cat s
count 3 3
unique 2 2
top c c
freq 2 2
In [56]: df["cat"].describe()
Out[56]:
count 3
unique 2
top c
freq 2
Name: cat, dtype: object
Categorical data has a categories and an ordered property, which list their possible values and whether the
ordering matters or not. These properties are exposed as s.cat.categories and s.cat.ordered. If you
don’t manually specify categories and ordering, they are inferred from the passed arguments.
In [57]: s = pd.Series(["a", "b", "c", "a"], dtype="category")
In [58]: s.cat.categories
Out[58]: Index(['a', 'b', 'c'], dtype='object')
In [59]: s.cat.ordered
Out[59]: False
In [61]: s.cat.categories
Out[61]: Index(['c', 'b', 'a'], dtype='object')
In [62]: s.cat.ordered
Out[62]: False
Note: New categorical data are not automatically ordered. You must explicitly pass ordered=True to indicate an
ordered Categorical.
Note: The result of unique() is not always the same as Series.cat.categories, because Series.
unique() has a couple of guarantees, namely that it returns categories in the order of appearance, and it only
includes values that are actually present:
In [63]: s = pd.Series(list("babc")).astype(CategoricalDtype(list("abcd")))
In [64]: s
Out[64]:
0 b
1 a
2 b
3 c
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
# categories
In [65]: s.cat.categories
Out[65]: Index(['a', 'b', 'c', 'd'], dtype='object')
# uniques
In [66]: s.unique()
Out[66]:
['b', 'a', 'c']
Categories (3, object): ['b', 'a', 'c']
Renaming categories
Renaming categories is done by assigning new values to the Series.cat.categories property or by using the
rename_categories() method:
In [68]: s
Out[68]:
0 a
1 b
2 c
3 a
dtype: category
Categories (3, object): ['a', 'b', 'c']
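One rename consistent with the output shown below (a sketch):
>>> s = s.cat.rename_categories(["Group %s" % g for g in s.cat.categories])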
In [70]: s
Out[70]:
0 Group a
1 Group b
2 Group c
3 Group a
dtype: category
Categories (3, object): ['Group a', 'Group b', 'Group c']
In [72]: s
Out[72]:
0 1
In [74]: s
Out[74]:
0 x
1 y
2 z
3 x
dtype: category
Categories (3, object): ['x', 'y', 'z']
Note: In contrast to R’s factor, categorical data can have categories of other types than string.
Note: Be aware that assigning new categories is an inplace operation, while most other operations under Series.
cat by default return a new Series of dtype category.
In [75]: try:
....: s.cat.categories = [1, 1, 1]
....: except ValueError as e:
....: print("ValueError:", str(e))
....:
ValueError: Categorical categories must be unique
In [76]: try:
....: s.cat.categories = [1, 2, np.nan]
....: except ValueError as e:
....: print("ValueError:", str(e))
....:
ValueError: Categorical categories cannot be null
In [77]: s = s.cat.add_categories([4])
In [78]: s.cat.categories
Out[78]: Index(['x', 'y', 'z', 4], dtype='object')
In [79]: s
Removing categories
Removing categories can be done by using the remove_categories() method. Values which are removed are
replaced by np.nan.:
In [80]: s = s.cat.remove_categories([4])
In [81]: s
Out[81]:
0 x
1 y
2 z
3 x
dtype: category
Categories (3, object): ['x', 'y', 'z']
In [83]: s
Out[83]:
0 a
1 b
2 a
dtype: category
Categories (4, object): ['a', 'b', 'c', 'd']
In [84]: s.cat.remove_unused_categories()
Out[84]:
0 a
1 b
2 a
dtype: category
Categories (2, object): ['a', 'b']
Setting categories
If you want to remove and add new categories in one step (which has some speed advantage), or simply set the
categories to a predefined scale, use set_categories().
In [86]: s
Out[86]:
0 one
1 two
2 four
3 -
dtype: category
Categories (4, object): ['-', 'four', 'one', 'two']
In [88]: s
Out[88]:
0 one
1 two
2 four
3 NaN
dtype: category
Categories (4, object): ['one', 'two', 'three', 'four']
Note: Be aware that Categorical.set_categories() cannot know whether some category is omitted intentionally
or because it is misspelled or (under Python 3) due to a type difference (e.g., NumPy S1 dtype and Python
strings). This can result in surprising behaviour!
If categorical data is ordered (s.cat.ordered == True), then the order of the categories has a meaning and
certain operations are possible. If the categorical is unordered, .min()/.max() will raise a TypeError.
In [90]: s.sort_values(inplace=True)
In [92]: s.sort_values(inplace=True)
In [93]: s
Out[93]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
You can set categorical data to be ordered by using as_ordered() or unordered by using as_unordered().
These will by default return a new object.
In [95]: s.cat.as_ordered()
Out[95]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
In [96]: s.cat.as_unordered()
Out[96]:
0 a
3 a
1 b
2 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
Sorting will use the order defined by categories, not any lexical order present on the data type. This is even true for
strings and numeric data:
In [99]: s
Out[99]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [100]: s.sort_values(inplace=True)
In [101]: s
Out[101]:
1 2
2 3
0 1
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
Reordering
In [105]: s
Out[105]:
0 1
1 2
2 3
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
In [106]: s.sort_values(inplace=True)
In [107]: s
Out[107]:
1 2
2 3
0 1
3 1
dtype: category
Categories (3, int64): [2 < 3 < 1]
Note: Note the difference between assigning new categories and reordering the categories: the first renames categories
and therefore the individual values in the Series, but if the first position was sorted last, the renamed value will still
be sorted last. Reordering means that the way values are sorted is different afterwards, but not that individual values
in the Series are changed.
Note: If the Categorical is not ordered, Series.min() and Series.max() will raise TypeError. Numeric
operations like +, -, *, / and operations based on them (e.g. Series.median(), which would need to
compute the mean between two values if the length of an array is even) do not work and raise a TypeError.
A categorical dtyped column will participate in a multi-column sort in a similar manner to other columns. The ordering
of the categorical is determined by the categories of that column.
2.12.6 Comparisons
All other comparisons, especially “non-equality” comparisons of two categoricals with different categories or a
categorical with any list-like object, will raise a TypeError.
Note: Any “non-equality” comparisons of categorical data with a Series, np.array, list or categorical data
with different categories or ordering will raise a TypeError because custom categories ordering could be interpreted
in two ways: one with taking into account the ordering and one without.
In [116]: cat
Out[116]:
0 1
1 2
2 3
dtype: category
Categories (3, int64): [3 < 2 < 1]
In [117]: cat_base
Out[117]:
0 2
1 2
2 2
dtype: category
Categories (3, int64): [3 < 2 < 1]
In [118]: cat_base2
Out[118]:
0 2
1 2
2 2
dtype: category
Categories (1, int64): [2]
Comparing to a categorical with the same categories and ordering or to a scalar works:
Equality comparisons work with any list-like object of the same length and with scalars:
In [123]: cat == 2
Out[123]:
0 False
1 True
2 False
dtype: bool
This doesn’t work because the categories are not the same:
In [124]: try:
.....: cat > cat_base2
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: Categoricals can only be compared if 'categories' are the same.
If you want to do a “non-equality” comparison of a categorical series with a list-like object which is not categorical
data, you need to be explicit and convert the categorical data back to the original values:
In [126]: try:
.....: cat > base
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: Cannot compare a Categorical for op __gt__ with type <class 'numpy.ndarray'>.
When you compare two unordered categoricals with the same categories, the order is not considered:
In [130]: c1 == c2
Out[130]: array([ True, True])
2.12.7 Operations
Apart from Series.min(), Series.max() and Series.mode(), the following operations are possible with
categorical data:
Series methods like Series.value_counts() will use all categories, even if some categories are not present
in the data:
In [131]: s = pd.Series(pd.Categorical(["a", "b", "c", "c"], categories=["c", "a", "b", "d"]))
In [132]: s.value_counts()
Out[132]:
c 2
a 1
b 1
d 0
dtype: int64
In [134]: df = pd.DataFrame(
.....: data=[[1, 2, 3], [4, 5, 6]],
.....: columns=pd.MultiIndex.from_arrays([["A", "B", "B"], columns]),
.....: )
.....:
In [138]: df.groupby("cats").mean()
Out[138]:
values
cats
a 1.0
b 2.0
c 4.0
d NaN
Pivot tables:
In [142]: raw_cat = pd.Categorical(["a", "a", "b", "b"], categories=["a", "b", "c"])
In [143]: df = pd.DataFrame({"A": raw_cat, "B": ["c", "d", "c", "d"], "values": [1, 2, 3, 4]})
The optimized pandas data access methods .loc, .iloc, .at, and .iat work as normal. The only difference is
the return type (for getting) and that only values already in categories can be assigned.
Getting
If the slicing operation returns either a DataFrame or a column of type Series, the category dtype is preserved.
In [145]: idx = pd.Index(["h", "i", "j", "k", "l", "m", "n"])
In [146]: cats = pd.Series(["a", "b", "b", "b", "c", "c", "c"], dtype="category", index=idx)
An example where the category type is not preserved is if you take one single row: the resulting Series is of dtype
object:
Returning a single item from categorical data will also return the value, not a categorical of length “1”.
In [154]: df.iat[0, 0]
Out[154]: 'a'
Note: This is in contrast to R’s factor function, where factor(c(1,2,3))[1] returns a single value factor.
To get a single value Series of type category, you pass in a list with a single value:
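For example (a sketch, using the df with index idx and column cats from above):
>>> df.loc[["h"], "cats"]      # a length-1 Series that keeps the category dtype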
The accessors .dt and .str will work if the s.cat.categories are of an appropriate type:
In [158]: str_s = pd.Series(list("aabb"))
In [160]: str_cat
Out[160]:
0 a
1 a
2 b
3 b
dtype: category
Categories (2, object): ['a', 'b']
In [161]: str_cat.str.contains("a")
Out[161]:
0 True
1 True
2 False
3 False
dtype: bool
In [164]: date_cat
Out[164]:
0 2015-01-01
1 2015-01-02
2 2015-01-03
3 2015-01-04
4 2015-01-05
dtype: category
Categories (5, datetime64[ns]): [2015-01-01, 2015-01-02, 2015-01-03, 2015-01-04, 2015-01-05]
In [165]: date_cat.dt.day
Out[165]:
0 1
1 2
2 3
3 4
4 5
dtype: int64
Note: The returned Series (or DataFrame) is of the same type as if you used the .str.<method> / .dt.<method>
on a Series of that type (and not of type category!).
That means that the values returned from accessor methods and properties on a Series, and those returned from the
same accessors on that Series transformed to type category, will be equal:
Note: The work is done on the categories and then a new Series is constructed. This has some performance
implication if you have a Series of type string, where lots of elements are repeated (i.e. the number of unique
elements in the Series is a lot smaller than the length of the Series). In this case it can be faster to convert the
original Series to one of type category and use .str.<method> or .dt.<property> on that.
Setting
Setting values in a categorical column (or Series) works as long as the value is included in the categories:
In [171]: cats = pd.Categorical(["a", "a", "a", "a", "a", "a", "a"], categories=["a", "b"])
In [175]: df
Out[175]:
cats values
h a 1
i a 1
j b 2
k b 2
l a 1
m a 1
n a 1
In [176]: try:
.....: df.iloc[2:4, :] = [["c", 3], ["c", 3]]
.....: except ValueError as e:
.....: print("ValueError:", str(e))
Setting values by assigning categorical data will also check that the categories match:
In [178]: df
Out[178]:
cats values
h a 1
i a 1
j a 2
k a 2
l a 1
m a 1
n a 1
In [179]: try:
   .....:     df.loc["j":"k", "cats"] = pd.Categorical(["b", "b"], categories=["a", "b", "c"])
   .....: except ValueError as e:
   .....:     print("ValueError:", str(e))
   .....:
ValueError: Cannot setitem on a Categorical with a new category, set the categories first
Assigning a Categorical to parts of a column of other types will use the values:
In [180]: df = pd.DataFrame({"a": [1, 1, 1, 1, 1], "b": ["a", "a", "a", "a", "a"]})
In [183]: df
Out[183]:
a b
0 1 a
1 b a
2 b b
3 1 b
4 1 a
In [184]: df.dtypes
Out[184]:
a object
b object
dtype: object
Merging / concatenation
By default, combining Series or DataFrames which contain the same categories results in category dtype,
otherwise results will depend on the dtype of the underlying categories. Merges that result in non-categorical dtypes
will likely have higher memory usage. Use .astype or union_categoricals to ensure category results.
In [185]: from pandas.api.types import union_categoricals
# same categories
In [186]: s1 = pd.Series(["a", "b"], dtype="category")
# different categories
In [189]: s3 = pd.Series(["b", "c"], dtype="category")
See also the section on merge dtypes for notes about preserving merge dtypes and performance.
Unioning
If you want to combine categoricals that do not necessarily have the same categories, the union_categoricals()
function will combine a list-like of categoricals. The new categories will be the union of the categories being combined.
By default, the resulting categories will be ordered as they appear in the data. If you want the categories to be lexsorted,
use the sort_categories=True argument.
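A minimal sketch (consistent with the c1/c2 categoricals shown further below):
>>> from pandas.api.types import union_categoricals
>>> a = pd.Categorical(["b", "c"])
>>> b = pd.Categorical(["a", "b"])
>>> union_categoricals([a, b])                        # categories appear in order of appearance: ['b', 'c', 'a']
>>> union_categoricals([a, b], sort_categories=True)  # categories lexsorted: ['a', 'b', 'c']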
union_categoricals also works with the “easy” case of combining two categoricals that have the same categories and
order information (i.e. categoricals that you could also append together).
The below raises TypeError because the categories are ordered and not identical.
Ordered categoricals with different categories or orderings can be combined by using the ignore_ordered=True
argument.
union_categoricals() also works with a CategoricalIndex, or Series containing categorical data, but
note that the resulting array will always be a plain Categorical:
Note: union_categoricals may recode the integer codes for categories when combining categoricals. This is
likely what you want, but if you are relying on the exact numbering of the categories, be aware.
In [212]: c1
Out[212]:
['b', 'c']
Categories (2, object): ['b', 'c']
# "b" is coded to 0
In [213]: c1.codes
Out[213]: array([0, 1], dtype=int8)
In [214]: c2
Out[214]:
['a', 'b']
Categories (2, object): ['a', 'b']
# "b" is coded to 1
In [215]: c2.codes
Out[215]: array([0, 1], dtype=int8)
In [217]: c
You can write data that contains category dtypes to a HDFStore. See here for an example and caveats.
It is also possible to write data to and read data from Stata format files. See here for an example and caveats.
Writing to a CSV file will convert the data, effectively removing any information about the categorical (categories and
ordering). So if you read back the CSV file you have to convert the relevant columns back to category and assign
the right categories and categories ordering.
In [219]: import io
In [225]: df.to_csv(csv)
In [227]: df2.dtypes
Out[227]:
Unnamed: 0 int64
cats object
vals int64
dtype: object
In [228]: df2["cats"]
Out[228]:
0 very good
1 good
2 good
3 very good
4 very good
5 bad
Name: cats, dtype: object
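The column can be converted back to category before reassigning the categories (a sketch):
>>> df2["cats"] = df2["cats"].astype("category")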
In [230]: df2["cats"].cat.set_categories(
.....: ["very bad", "bad", "medium", "good", "very good"], inplace=True
.....: )
.....:
In [231]: df2.dtypes
Out[231]:
Unnamed: 0 int64
cats category
vals int64
dtype: object
In [232]: df2["cats"]
Out[232]:
0 very good
1 good
2 good
3 very good
4 very good
5 bad
Name: cats, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']
pandas primarily uses the value np.nan to represent missing data. It is by default not included in computations. See
the Missing Data section.
Missing values should not be included in the Categorical’s categories, only in the values. Instead, it is understood
that NaN is different, and is always a possibility. When working with the Categorical’s codes, missing values
will always have a code of -1.
In [235]: s.cat.codes
Out[235]:
0 0
1 1
2 -1
3 0
dtype: int8
Methods for working with missing data, e.g. isna(), fillna(), dropna(), all work normally:
In [237]: s
Out[237]:
0 a
1 b
2 NaN
dtype: category
Categories (2, object): ['a', 'b']
In [238]: pd.isna(s)
Out[238]:
0 False
1 False
2 True
dtype: bool
In [239]: s.fillna("a")
Out[239]:
0 a
1 b
2 a
dtype: category
Categories (2, object): ['a', 'b']
2.12.12 Gotchas
Memory usage
The memory usage of a Categorical is proportional to the number of categories plus the length of the data. In
contrast, an object dtype is a constant times the length of the data.
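A construction consistent with the nbytes figures below (a sketch: 2000 object elements at 8 bytes each gives 16000 bytes):
>>> s = pd.Series(["foo", "bar"] * 1000)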
# object dtype
In [241]: s.nbytes
Out[241]: 16000
# category dtype
In [242]: s.astype("category").nbytes
Out[242]: 2016
Note: If the number of categories approaches the length of the data, the Categorical will use nearly the same or
more memory than an equivalent object dtype representation.
# object dtype
In [244]: s.nbytes
Out[244]: 16000
# category dtype
In [245]: s.astype("category").nbytes
Out[245]: 20000
Currently, categorical data and the underlying Categorical is implemented as a Python object and not as a low-
level NumPy array dtype. This leads to some problems.
NumPy itself doesn’t know about the new dtype:
In [246]: try:
.....: np.dtype("category")
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: data type 'category' not understood
In [248]: try:
.....: np.dtype(dtype)
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: Cannot interpret 'CategoricalDtype(categories=['a'], ordered=False)' as a data type
Using NumPy functions on a Series of type category should not work as Categoricals are not numeric data
(even in the case that .categories is numeric).
In [254]: try:
.....: np.sum(s)
.....: except TypeError as e:
.....: print("TypeError:", str(e))
.....:
TypeError: 'Categorical' does not implement reduction 'sum'
dtype in apply
pandas currently does not preserve the dtype in apply functions: If you apply along rows you get a Series of
object dtype (same as getting a row -> getting one element will return a basic type) and applying along columns
will also convert to object. NaN values are unaffected. You can use fillna to handle missing values before applying
a function.
In [255]: df = pd.DataFrame(
.....: {
.....: "a": [1, 2, 3, 4],
.....: "b": ["a", "b", "c", "d"],
.....: "cats": pd.Categorical([1, 2, 3, 2]),
.....: }
.....: )
.....:
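A sketch of the behaviour described above, using this df:
>>> df.apply(lambda row: type(row["cats"]), axis=1)   # each row is an object-dtype Series, so the entries come back as plain ints
>>> df.apply(lambda col: col.dtype, axis=0)           # applying along columns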
Categorical index
CategoricalIndex is a type of index that is useful for supporting indexing with duplicates. This is a container
around a Categorical and allows efficient indexing and storage of an index with a large number of duplicated
elements. See the advanced indexing docs for a more detailed explanation.
Setting the index will create a CategoricalIndex:
In [262]: df.index
Out[262]: CategoricalIndex([1, 2, 3, 4], categories=[4, 2, 3, 1], ordered=False, dtype='category')
Side effects
Constructing a Series from a Categorical will not copy the input Categorical. This means that changes to
the Series will in most cases change the original Categorical:
In [266]: cat
Out[266]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [267]: s.iloc[0:2] = 10
In [268]: cat
Out[268]:
[10, 10, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [269]: df = pd.DataFrame(s)
In [271]: cat
Out[271]:
In [274]: cat
Out[274]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
In [275]: s.iloc[0:2] = 10
In [276]: cat
Out[276]:
[1, 2, 3, 10]
Categories (5, int64): [1, 2, 3, 4, 10]
Note: This also happens in some cases when you supply a NumPy array instead of a Categorical: using an int
array (e.g. np.array([1, 2, 3, 4])) will exhibit the same behavior, while using a string array (e.g.
np.array(["a", "b", "c", "a"])) will not.
Note: IntegerArray is currently experimental. Its API or implementation may change without warning.
Changed in version 1.0.0: Now uses pandas.NA as the missing value rather than numpy.nan.
In Working with missing data, we saw that pandas primarily uses NaN to represent missing data. Because NaN is a
float, this forces an array of integers with any missing values to become floating point. In some cases, this may not
matter much. But if your integer column is, say, an identifier, casting to float can be problematic. Some integers cannot
even be represented as floating point numbers.
2.13.1 Construction
pandas can represent integer data with possibly missing values using arrays.IntegerArray. This is an extension
type implemented within pandas.
In [1]: arr = pd.array([1, 2, None], dtype=pd.Int64Dtype())
In [2]: arr
Out[2]:
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64
Or the string alias "Int64" (note the capital "I", to differentiate from NumPy’s 'int64' dtype) can be used:
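For example (a sketch):
>>> pd.array([1, 2, np.nan], dtype="Int64")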
This array can be stored in a DataFrame or Series like any NumPy array.
In [5]: pd.Series(arr)
Out[5]:
0 1
1 2
2 <NA>
dtype: Int64
You can also pass the list-like object to the Series constructor with the dtype.
Warning: Currently pandas.array() and pandas.Series() use different rules for dtype inference.
pandas.array() will infer a nullable-integer dtype
In [6]: pd.array([1, None])
Out[6]:
<IntegerArray>
[1, <NA>]
Length: 2, dtype: Int64
In the future, we may provide an option for Series to infer a nullable-integer dtype.
2.13.2 Operations
Operations involving an integer array will behave similarly to NumPy arrays. Missing values will be propagated, and
the data will be coerced to another dtype if needed.
In [12]: s = pd.Series([1, 2, None], dtype="Int64")
# arithmetic
In [13]: s + 1
Out[13]:
0 2
1 3
2 <NA>
dtype: Int64
# comparison
In [14]: s == 1
Out[14]:
0 True
1 False
2 <NA>
dtype: boolean
# indexing
In [15]: s.iloc[1:3]
Out[15]:
1 2
2 <NA>
dtype: Int64
In [19]: df
Out[19]:
A B C
0 1 1 a
1 2 1 a
2 <NA> 3 b
In [20]: df.dtypes
Out[20]:
A Int64
B int64
C object
dtype: object
In [22]: df["A"].astype(float)
Out[22]:
0 1.0
1 2.0
2 NaN
Name: A, dtype: float64
In [23]: df.sum()
Out[23]:
A 3
B 5
C aab
dtype: object
In [24]: df.groupby("B").A.sum()
Out[24]:
B
1 3
3 0
Name: A, dtype: Int64
arrays.IntegerArray uses pandas.NA as its scalar missing value. Slicing a single element that’s missing will
return pandas.NA
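Here a is assumed to be a small nullable-integer array, e.g.:
a = pd.array([1, None], dtype="Int64")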
In [26]: a[1]
Out[26]: <NA>
2.14 Nullable Boolean data type
pandas allows indexing with NA values in a boolean array, which are treated as False.
Changed in version 1.0.2.
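The s and mask objects used below are not defined in this excerpt; a setup consistent with the outputs would be:
s = pd.Series([1, 2, 3])
mask = pd.array([True, False, None], dtype="boolean")   # boolean mask containing NA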
In [3]: s[mask]
Out[3]:
0 1
dtype: int64
If you would prefer to keep the NA values you can manually fill them with fillna(True).
In [4]: s[mask.fillna(True)]
Out[4]:
0 1
2 3
dtype: int64
arrays.BooleanArray implements Kleene Logic (sometimes called three-value logic) for logical operations like
& (and), | (or) and ^ (exclusive-or).
This table demonstrates the results for every combination. These operations are symmetrical, so flipping the left- and
right-hand side makes no difference in the result.
Expression Result
True & True True
True & False False
True & NA NA
False & False False
False & NA False
NA & NA NA
True | True True
True | False True
True | NA True
False | False False
False | NA NA
NA | NA NA
True ^ True False
True ^ False True
True ^ NA NA
False ^ False False
False ^ NA NA
NA ^ NA NA
When an NA is present in an operation, the output value is NA only if the result cannot be determined solely based on
the other input. For example, True | NA is True, because both True | True and True | False are True.
In that case, we don’t actually need to consider the value of the NA.
On the other hand, True & NA is NA. The result depends on whether the NA really is True or False, since True
& True is True, but True & False is False, so we can’t determine the output.
This differs from how np.nan behaves in logical operations: pandas treats np.nan as always False in the output,
both in or and in and operations.
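A rough illustration of the difference (expected results noted in comments):
s_obj = pd.Series([True, False, np.nan], dtype="object")
s_bool = pd.Series([True, False, np.nan], dtype="boolean")
s_obj | True    # np.nan treated as False -> True, True, False
s_bool | True   # Kleene logic            -> True, True, True
s_obj & True    # -> True, False, False
s_bool & True   # -> True, False, <NA>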
2.15 Visualization
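The plotting examples in this chapter assume the conventional imports, roughly:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt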
In [2]: plt.close("all")
We provide the basics in pandas to easily create decent looking plots. See the ecosystem section for visualization
libraries that go beyond the basics documented here.
We will demonstrate the basics; see the cookbook for some advanced strategies.
The plot method on Series and DataFrame is just a simple wrapper around plt.plot():
In [3]: ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
In [4]: ts = ts.cumsum()
In [5]: ts.plot();
If the index consists of dates, it calls gcf().autofmt_xdate() to try to format the x-axis nicely as per above.
On DataFrame, plot() is a convenience to plot all of the columns with labels:
In [6]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))
In [7]: df = df.cumsum()
In [8]: plt.figure();
In [9]: df.plot();
You can plot one column versus another using the x and y keywords in plot():
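For example (a sketch; the column names are only illustrative):
df3 = pd.DataFrame(np.random.randn(1000, 2), columns=["B", "C"]).cumsum()
df3["A"] = pd.Series(list(range(len(df3))))
df3.plot(x="A", y="B");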
Note: For more formatting and styling options, see formatting below.
Plotting methods allow for a handful of plot styles other than the default line plot. These methods can be provided as
the kind keyword argument to plot(), and include:
• ‘bar’ or ‘barh’ for bar plots
• ‘hist’ for histogram
• ‘box’ for boxplot
• ‘kde’ or ‘density’ for density plots
• ‘area’ for area plots
• ‘scatter’ for scatter plots
• ‘hexbin’ for hexagonal bin plots
• ‘pie’ for pie plots
For example, a bar plot can be created the following way:
In [13]: plt.figure();
In [14]: df.iloc[5].plot(kind="bar");
You can also create these other plots using the methods DataFrame.plot.<kind> instead of providing the kind
keyword argument. This makes it easier to discover plot methods and the specific arguments they use:
In [15]: df = pd.DataFrame()
In [16]: df.plot.<TAB>  # tab completion lists the available plot methods
In addition to these kinds, there are the DataFrame.hist() and DataFrame.boxplot() methods, which use a separate
interface.
Finally, there are several plotting functions in pandas.plotting that take a Series or DataFrame as an argu-
ment. These include:
• Scatter Matrix
• Andrews Curves
• Parallel Coordinates
• Lag Plot
• Autocorrelation Plot
• Bootstrap Plot
• RadViz
Plots may also be adorned with errorbars or tables.
Bar plots
For labeled, non-time series data, you may wish to produce a bar plot:
In [17]: plt.figure();
In [18]: df.iloc[5].plot.bar();
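The df2 frame used in the next calls is not defined in this excerpt; a plausible setup is:
df2 = pd.DataFrame(np.random.rand(10, 4), columns=["a", "b", "c", "d"])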
In [21]: df2.plot.bar();
In [22]: df2.plot.bar(stacked=True);
In [23]: df2.plot.barh(stacked=True);
Histograms
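The df4 frame plotted below is not constructed in this excerpt; a plausible setup (three overlapping normal distributions) is:
df4 = pd.DataFrame(
    {
        "a": np.random.randn(1000) + 1,
        "b": np.random.randn(1000),
        "c": np.random.randn(1000) - 1,
    },
    columns=["a", "b", "c"],
)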
In [26]: df4.plot.hist(alpha=0.5);
A histogram can be stacked using stacked=True. Bin size can be changed using the bins keyword.
In [27]: plt.figure();
You can pass other keywords supported by matplotlib hist. For example, horizontal and cumulative histograms can
be drawn by orientation='horizontal' and cumulative=True.
In [29]: plt.figure();
See the hist method and the matplotlib hist documentation for more.
The existing interface DataFrame.hist to plot histograms can still be used.
In [31]: plt.figure();
In [32]: df["A"].diff().hist();
In [33]: plt.figure();
Box plots
In [38]: df.plot.box();
Boxplots can be colorized by passing the color keyword. You can pass a dict whose keys are boxes, whiskers,
medians and caps. If some keys are missing in the dict, default colors are used for the corresponding artists.
Also, boxplot has a sym keyword to specify the flier style.
When you pass other types of arguments via the color keyword, they will be passed directly to matplotlib for all the
boxes, whiskers, medians and caps.
The colors are applied to every box to be drawn. If you want more complicated colorization, you can get each drawn
artist by passing return_type.
In [39]: color = {
....: "boxes": "DarkGreen",
....: "whiskers": "DarkOrange",
....: "medians": "DarkBlue",
....: "caps": "Gray",
....: }
....:
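The dict can then be handed to color, and sym sets the flier style (a sketch):
df.plot.box(color=color, sym="r+");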
Also, you can pass other keywords supported by matplotlib boxplot. For example, horizontal and custom-positioned
boxplot can be drawn by vert=False and positions keywords.
See the boxplot method and the matplotlib boxplot documentation for more.
The existing interface DataFrame.boxplot to plot boxplots can still be used.
In [43]: plt.figure();
In [44]: bp = df.boxplot()
You can create a stratified boxplot using the by keyword argument to create groupings. For instance,
In [46]: df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
In [47]: plt.figure();
In [48]: bp = df.boxplot(by="X")
You can also pass a subset of columns to plot, as well as group by multiple columns:
In [50]: df["X"] = pd.Series(["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"])
In [51]: df["Y"] = pd.Series(["A", "B", "A", "B", "A", "B", "A", "B", "A", "B"])
In [52]: plt.figure();
In boxplot, the return type can be controlled by the return_type keyword. The valid choices are {"axes",
"dict", "both", None}. Faceting, created by DataFrame.boxplot with the by keyword, will affect the
output type as well:
In [54]: np.random.seed(1234)
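The df_box frame grouped below is not defined in this excerpt; a plausible setup is:
df_box = pd.DataFrame(np.random.randn(50, 2))
df_box["g"] = np.random.choice(["A", "B"], size=50)
df_box.loc[df_box["g"] == "B", 1] += 3   # shift group B so the groups differ visibly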
In [58]: bp = df_box.boxplot(by="g")
The subplots above are split by the numeric columns first, then the value of the g column. Below the subplots are first
split by the value of g, then by the numeric columns.
In [59]: bp = df_box.groupby("g").boxplot()
Area plot
You can create area plots with Series.plot.area() and DataFrame.plot.area(). Area plots are stacked
by default. To produce a stacked area plot, each column must contain either all positive or all negative values.
When input data contains NaN, it will be automatically filled with 0. If you want to drop or fill by different values,
use dataframe.dropna() or dataframe.fillna() before calling plot.
In [61]: df.plot.area();
To produce an unstacked plot, pass stacked=False. Alpha value is set to 0.5 unless otherwise specified:
In [62]: df.plot.area(stacked=False);
Scatter plot
Scatter plot can be drawn by using the DataFrame.plot.scatter() method. Scatter plot requires numeric
columns for the x and y axes. These can be specified by the x and y keywords.
To plot multiple column groups in a single axes, repeat the plot method specifying the target ax. It is recommended
to specify color and label keywords to distinguish each group.
The keyword c may be given as the name of a column to provide colors for each point:
You can pass other keywords supported by matplotlib scatter. The example below shows a bubble chart using a
column of the DataFrame as the bubble size.
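These variants can be sketched as follows, assuming a DataFrame (here called df_sc, purely illustrative) with numeric columns "a", "b" and "c":
df_sc = pd.DataFrame(np.random.rand(50, 3), columns=["a", "b", "c"])
df_sc.plot.scatter(x="a", y="b");                     # basic scatter plot
df_sc.plot.scatter(x="a", y="b", c="c", s=50);        # color each point by column "c"
df_sc.plot.scatter(x="a", y="b", s=df_sc["c"] * 200); # bubble chart: sizes from column "c"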
See the scatter method and the matplotlib scatter documentation for more.
You can create hexagonal bin plots with DataFrame.plot.hexbin(). Hexbin plots can be a useful alternative
to scatter plots if your data are too dense to plot each point individually.
A useful keyword argument is gridsize; it controls the number of hexagons in the x-direction, and defaults to 100.
A larger gridsize means more, smaller bins.
By default, a histogram of the counts around each (x, y) point is computed. You can specify alternative aggregations
by passing values to the C and reduce_C_function arguments. C specifies the value at each (x, y) point and
reduce_C_function is a function of one argument that reduces all the values in a bin to a single number (e.g.
mean, max, sum, std). In this example the positions are given by columns a and b, while the value is given by
column z. The bins are aggregated with NumPy’s max function.
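A sketch of such an aggregated hexbin plot, assuming a DataFrame (dfh, illustrative) with the columns named in the text:
dfh = pd.DataFrame(np.random.randn(1000, 2), columns=["a", "b"])
dfh["z"] = np.random.uniform(0, 3, 1000)
dfh.plot.hexbin(x="a", y="b", C="z", reduce_C_function=np.max, gridsize=25);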
See the hexbin method and the matplotlib hexbin documentation for more.
Pie plot
You can create a pie plot with DataFrame.plot.pie() or Series.plot.pie(). If your data includes any
NaN, they will be automatically filled with 0. A ValueError will be raised if there are any negative values in your
data.
In [76]: series = pd.Series(3 * np.random.rand(4), index=["a", "b", "c", "d"], name="series")
For pie plots it’s best to use square figures, i.e. a figure aspect ratio 1. You can create the figure with equal width and
height, or force the aspect ratio to be equal after plotting by calling ax.set_aspect('equal') on the returned
axes object.
Note that pie plot with DataFrame requires that you either specify a target column by the y argument or
subplots=True. When y is specified, a pie plot of the selected column will be drawn. If subplots=True is
specified, pie plots for each column are drawn as subplots. A legend will be drawn in each pie plot by default; specify
legend=False to hide it.
In [78]: df = pd.DataFrame(
....: 3 * np.random.rand(4, 2), index=["a", "b", "c", "d"], columns=["x", "y"]
....: )
....:
You can use the labels and colors keywords to specify the labels and colors of each wedge.
Warning: Most pandas plots use the label and color arguments (note the lack of “s” on those). To be
consistent with matplotlib.pyplot.pie() you must use labels and colors.
If you want to hide wedge labels, specify labels=None. If fontsize is specified, the value will be applied to
wedge labels. Also, other keywords supported by matplotlib.pyplot.pie() can be used.
In [80]: series.plot.pie(
....: labels=["AA", "BB", "CC", "DD"],
....: colors=["r", "g", "b", "c"],
....: autopct="%.2f",
....: fontsize=20,
....: figsize=(6, 6),
....: );
If you pass values whose sum total is less than 1.0, matplotlib draws a semicircle.
pandas tries to be pragmatic about plotting DataFrames or Series that contain missing data. Missing values are
dropped, left out, or filled depending on the plot type.
If any of these defaults are not what you want, or if you want to be explicit about how missing values are handled,
consider using fillna() or dropna() before plotting.
These functions can be imported from pandas.plotting and take a Series or DataFrame as an argument.
You can create a scatter plot matrix using the scatter_matrix method in pandas.plotting:
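For example (a minimal sketch):
from pandas.plotting import scatter_matrix
df_sm = pd.DataFrame(np.random.randn(1000, 4), columns=["a", "b", "c", "d"])
scatter_matrix(df_sm, alpha=0.2, figsize=(6, 6), diagonal="kde");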
Density plot
You can create density plots using the Series.plot.kde() and DataFrame.plot.kde() methods.
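The ser plotted below is assumed to be a plain numeric Series, e.g.:
ser = pd.Series(np.random.randn(1000))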
In [87]: ser.plot.kde();
Andrews curves
Andrews curves allow one to plot multivariate data as a large number of curves that are created using the attributes
of samples as coefficients for Fourier series, see the Wikipedia entry for more information. By coloring these curves
differently for each class it is possible to visualize data clustering. Curves belonging to samples of the same class will
usually be closer together and form larger structures.
Note: The “Iris” dataset is available here.
In [90]: plt.figure();
Parallel coordinates
Parallel coordinates is a plotting technique for plotting multivariate data, see the Wikipedia entry for an introduction.
Parallel coordinates allows one to see clusters in data and to estimate other statistics visually. Using parallel coordinates
points are represented as connected line segments. Each vertical line represents one attribute. One set of connected
line segments represents one data point. Points that tend to cluster will appear closer together.
In [94]: plt.figure();
Lag plot
Lag plots are used to check if a data set or time series is random. Random data should not exhibit any structure in the
lag plot. Non-random structure implies that the underlying data are not random. The lag argument may be passed,
and when lag=1 the plot is essentially data[:-1] vs. data[1:].
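The data series and the lag_plot function used below are not set up in this excerpt; a minimal sketch would be:
from pandas.plotting import lag_plot
spacing = np.linspace(-99 * np.pi, 99 * np.pi, num=1000)
data = pd.Series(0.1 * np.random.rand(1000) + 0.9 * np.sin(spacing))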
In [97]: plt.figure();
In [100]: lag_plot(data);
Autocorrelation plot
Autocorrelation plots are often used for checking randomness in time series. This is done by computing autocorrela-
tions for data values at varying time lags. If time series is random, such autocorrelations should be near zero for any
and all time-lag separations. If time series is non-random then one or more of the autocorrelations will be significantly
non-zero. The horizontal lines displayed in the plot correspond to 95% and 99% confidence bands. The dashed line is
99% confidence band. See the Wikipedia entry for more about autocorrelation plots.
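As above, the data series and the autocorrelation_plot function are assumed; a minimal sketch:
from pandas.plotting import autocorrelation_plot
spacing = np.linspace(-9 * np.pi, 9 * np.pi, num=1000)
data = pd.Series(0.7 * np.random.rand(1000) + 0.3 * np.sin(spacing))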
In [102]: plt.figure();
In [105]: autocorrelation_plot(data);
Bootstrap plot
Bootstrap plots are used to visually assess the uncertainty of a statistic, such as mean, median, midrange, etc. A
random subset of a specified size is selected from a data set, the statistic in question is computed for this subset and
the process is repeated a specified number of times. Resulting plots and histograms are what constitutes the bootstrap
plot.
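For example (a sketch; size is the subset size and samples the number of repetitions):
from pandas.plotting import bootstrap_plot
data = pd.Series(np.random.rand(1000))
bootstrap_plot(data, size=50, samples=500, color="grey");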
RadViz
RadViz is a way of visualizing multi-variate data. It is based on a simple spring tension minimization algorithm.
Basically you set up a bunch of points in a plane. In our case they are equally spaced on a unit circle. Each point
represents a single attribute. You then pretend that each sample in the data set is attached to each of these points
by a spring, the stiffness of which is proportional to the numerical value of that attribute (they are normalized to
unit interval). The point in the plane, where our sample settles to (where the forces acting on our sample are at an
equilibrium) is where a dot representing our sample will be drawn. Depending on which class that sample belongs it
will be colored differently. See the R package Radviz for more information.
Note: The “Iris” dataset is available here.
In [111]: plt.figure();
From version 1.5 and up, matplotlib offers a range of pre-configured plotting styles. Setting the style can be
used to easily give plots the general look that you want. Setting the style is as easy as calling
matplotlib.style.use(my_plot_style) before creating your plot. For example you could write
matplotlib.style.use('ggplot') for ggplot-style plots.
You can see the various available style names at matplotlib.style.available and it's very easy to try
them out.
Most plotting methods have a set of keyword arguments that control the layout and formatting of the returned plot:
In [113]: plt.figure();
For each kind of plot (e.g. line, bar, scatter) any additional keyword arguments are passed along to the
corresponding matplotlib function (ax.plot(), ax.bar(), ax.scatter()). These can be used to control
additional styling, beyond what pandas provides.
You may set the legend argument to False to hide the legend, which is shown by default.
In [115]: df = pd.DataFrame(np.random.randn(1000, 4), index=ts.index, columns=list("ABCD"))
In [116]: df = df.cumsum()
In [117]: df.plot(legend=False);
In [118]: df.plot();
Scales
In [120]: ts = pd.Series(np.random.randn(1000), index=pd.date_range("1/1/2000", periods=1000))
In [121]: ts = np.exp(ts.cumsum())
In [122]: ts.plot(logy=True);
In [123]: df["A"].plot();
To plot some columns in a DataFrame, give the column names to the secondary_y keyword:
In [125]: plt.figure();
Note that the columns plotted on the secondary y-axis are automatically marked with "(right)" in the legend. To turn
off the automatic marking, use the mark_right=False keyword:
In [129]: plt.figure();
pandas includes automatic tick resolution adjustment for regular frequency time-series data. For limited cases where
pandas cannot infer the frequency information (e.g., in an externally created twinx), you can choose to suppress this
behavior for alignment purposes.
Here is the default behavior, notice how the x-axis tick labeling is performed:
In [131]: plt.figure();
In [132]: df["A"].plot();
In [133]: plt.figure();
In [134]: df["A"].plot(x_compat=True);
If you have more than one plot that needs to be suppressed, the use method in pandas.plotting.plot_params
can be used in a with statement:
In [135]: plt.figure();
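A sketch of the with-statement form:
with pd.plotting.plot_params.use("x_compat", True):
    df["A"].plot(color="r")
    df["B"].plot(color="g")
    df["C"].plot(color="b")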
TimedeltaIndex now uses the native matplotlib tick locator methods. It is useful to call the automatic date tick
adjustment from matplotlib for figures whose ticklabels overlap.
See the autofmt_xdate method and the matplotlib documentation for more.
Subplots
Each Series in a DataFrame can be plotted on a different axis with the subplots keyword:
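A sketch of the subplots keyword, together with the layout keyword described just below:
df.plot(subplots=True, figsize=(6, 6));                  # one axis per column
df.plot(subplots=True, layout=(2, 3), figsize=(6, 6));   # arrange the axes on a 2x3 grid
df.plot(subplots=True, layout=(2, -1), figsize=(6, 6));  # -1: infer the number of columns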
The layout of subplots can be specified by the layout keyword. It can accept (rows, columns). The layout
keyword can be used in hist and boxplot also. If the input is invalid, a ValueError will be raised.
The number of axes which can be contained by rows x columns specified by layout must be larger than the number
of required subplots. If layout can contain more axes than required, blank axes are not drawn. Similar to a NumPy
array’s reshape method, you can use -1 for one dimension to automatically calculate the number of rows or columns
needed, given the other.
With layout=(2, -1), for example, the required number of columns is inferred from the number of series to plot
and the given number of rows (2).
You can pass multiple axes created beforehand as list-like via ax keyword. This allows more complicated layouts.
The passed axes must be the same number as the subplots being drawn.
When multiple axes are passed via the ax keyword, the layout, sharex and sharey keywords don't affect the
output. You should explicitly pass sharex=False and sharey=False, otherwise you will see a warning.
Plotting with error bars
Plotting with error bars is supported in DataFrame.plot() and Series.plot(). Error values can be provided,
among other ways:
• As raw values (list, tuple, or np.ndarray). Must be the same length as the plotting
DataFrame/Series.
Asymmetrical error bars are also supported, however raw error values must be provided in this case. For a N
length Series, a 2xN array should be provided indicating lower and upper (or left and right) errors. For a MxN
DataFrame, asymmetrical errors should be in a Mx2xN array.
Here is an example of one way to easily plot group means with standard deviations from the raw data.
# Group by index labels and take the means and standard deviations
# for each group
In [158]: gp3 = df3.groupby(level=("letter", "word"))
In [159]: means = gp3.mean()
In [160]: errors = gp3.std()
In [161]: means
In [162]: errors
# Plot
In [163]: fig, ax = plt.subplots()
In [164]: means.plot.bar(yerr=errors, ax=ax, capsize=4, rot=0);
Plotting tables
Plotting with matplotlib table is now supported in DataFrame.plot() and Series.plot() with a table
keyword. The table keyword can accept bool, DataFrame or Series. The simple way to draw a table is to
specify table=True. Data will be transposed to meet matplotlib’s default layout.
Also, you can pass a different DataFrame or Series to the table keyword. The data will be drawn as displayed
in print method (not transposed automatically). If required, it should be transposed manually as seen in the example
below.
There is also a helper function pandas.plotting.table, which creates a table from a DataFrame or
Series, and adds it to a matplotlib.Axes instance. This function can accept keywords which the matplotlib
table has.
Note: You can get table instances on the axes using axes.tables property for further decorations. See the mat-
plotlib table documentation for more.
Colormaps
A potential issue when plotting a large number of columns is that it can be difficult to distinguish some series due to
repetition in the default colors. To remedy this, DataFrame plotting supports the use of the colormap argument,
which accepts either a Matplotlib colormap or a string that is a name of a colormap registered with Matplotlib. A
visualization of the default matplotlib colormaps is available here.
As matplotlib does not directly support colormaps for line-based plots, the colors are selected based on an even spacing
determined by the number of columns in the DataFrame. There is no consideration made for background color, so
some colormaps will produce lines that are not easily visible.
To use the cubehelix colormap, we can pass colormap='cubehelix'.
In [178]: plt.figure();
In [179]: df.plot(colormap="cubehelix");
Alternatively, we can pass the colormap itself, here using matplotlib's cm module:
In [180]: from matplotlib import cm
In [181]: plt.figure();
In [182]: df.plot(colormap=cm.cubehelix);
Colormaps can also be used in other plot types, like bar charts:
In [183]: dd = pd.DataFrame(np.random.randn(10, 10)).applymap(abs)
In [184]: dd = dd.cumsum()
In [185]: plt.figure();
In [186]: dd.plot.bar(colormap="Greens");
In [187]: plt.figure();
In [189]: plt.figure();
In some situations it may still be preferable or necessary to prepare plots directly with matplotlib, for instance when a
certain type of plot or customization is not (yet) supported by pandas. Series and DataFrame objects behave like
arrays and can therefore be passed directly to matplotlib functions without explicit casts.
pandas also automatically registers formatters and locators that recognize date indices, thereby extending date and
time support to practically all plot types available in matplotlib. Although this formatting does not provide the same
level of refinement you would get when plotting via pandas, it can be faster when plotting a large number of points.
In [191]: price = pd.Series(
.....: np.random.randn(150).cumsum(),
.....: index=pd.date_range("2000-1-1", periods=150, freq="B"),
.....: )
.....:
In [192]: ma = price.rolling(20).mean()
In [194]: plt.figure();
Starting in version 0.25, pandas can be extended with third-party plotting backends. The main idea is letting users
select a plotting backend other than the default one based on Matplotlib.
This can be done by passing 'backend.module' as the backend argument in the plot function, or by setting the
option globally so you don't need to specify the keyword in each plot call. For example:
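A sketch, using a placeholder backend module name:
pd.Series([1, 2, 3]).plot(backend="backend.module")   # per call
pd.set_option("plotting.backend", "backend.module")   # globally, via the option system
pd.options.plotting.backend = "backend.module"        # equivalent attribute form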
The backend module can then use other visualization tools (Bokeh, Altair, hvplot, . . . ) to generate the plots. Some
libraries implementing a backend for pandas are listed on the ecosystem visualization page.
A developer guide can be found at https://pandas.pydata.org/docs/dev/development/extending.html#plotting-backends
2.16 Computational tools
2.16.1 Statistical functions
Percent change
Series and DataFrame have a method pct_change() to compute the percent change over a given number of
periods (using fill_method to fill NA/null values before computing the percent change).
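The ser and df objects used below are not constructed in this excerpt; a plausible setup is:
ser = pd.Series(np.random.randn(8))
df = pd.DataFrame(np.random.randn(10, 4))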
In [2]: ser.pct_change()
Out[2]:
0 NaN
1 -1.602976
2 4.334938
3 -0.247456
4 -2.067345
5 -1.142903
6 -1.688214
7 -9.759729
dtype: float64
In [4]: df.pct_change(periods=3)
Out[4]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 -0.218320 -1.054001 1.987147 -0.510183
4 -0.439121 -1.816454 0.649715 -4.822809
5 -0.127833 -3.042065 -5.866604 -1.776977
6 -2.596833 -1.959538 -2.111697 -3.798900
7 -0.117826 -2.169058 0.036094 -0.067696
8 2.492606 -1.357320 -1.205802 -1.558697
9 -1.012977 2.324558 -1.003744 -0.371806
Covariance
Series.cov() can be used to compute covariance between series (excluding missing values).
In [5]: s1 = pd.Series(np.random.randn(1000))
In [6]: s2 = pd.Series(np.random.randn(1000))
In [7]: s1.cov(s2)
Out[7]: 0.0006801088174310875
Analogously, DataFrame.cov() computes pairwise covariances among the series in the DataFrame, also excluding
NA/null values.
Note: Assuming the missing data are missing at random this results in an estimate for the covariance matrix which
is unbiased. However, for many applications this estimate may not be acceptable because the estimated covariance
matrix is not guaranteed to be positive semi-definite. This could lead to estimated correlations having absolute values
which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices for more
details.
In [9]: frame.cov()
Out[9]:
a b c d e
a 1.000882 -0.003177 -0.002698 -0.006889 0.031912
b -0.003177 1.024721 0.000191 0.009212 0.000857
c -0.002698 0.000191 0.950735 -0.031743 -0.005087
d -0.006889 0.009212 -0.031743 1.002983 -0.047952
e 0.031912 0.000857 -0.005087 -0.047952 1.042487
DataFrame.cov also supports an optional min_periods keyword that specifies the required minimum number
of observations for each column pair in order to have a valid result.
In [13]: frame.cov()
Out[13]:
a b c
a 1.123670 -0.412851 0.018169
b -0.412851 1.154141 0.305260
c 0.018169 0.305260 1.301149
In [14]: frame.cov(min_periods=12)
Out[14]:
a b c
a 1.123670 NaN 0.018169
b NaN 1.154141 0.305260
c 0.018169 0.305260 1.301149
Correlation
Correlation may be computed using the corr() method. Using the method parameter, several methods for
computing correlations are provided:
• pearson (default): standard correlation coefficient
• kendall: Kendall Tau correlation coefficient
• spearman: Spearman rank correlation coefficient
• callable: a callable taking two 1d ndarrays and returning a float
All of these are currently computed using pairwise complete observations. Wikipedia has articles covering the above
correlation coefficients:
• Pearson correlation coefficient
• Kendall rank correlation coefficient
• Spearman’s rank correlation coefficient
Note: Please see the caveats associated with this method of calculating correlation matrices in the covariance section.
Note that non-numeric columns will be automatically excluded from the correlation calculation.
Like cov, corr also supports the optional min_periods keyword:
In [20]: frame = pd.DataFrame(np.random.randn(20, 3), columns=["a", "b", "c"])
In [23]: frame.corr()
Out[23]:
In [24]: frame.corr(min_periods=12)
Out[24]:
a b c
a 1.000000 NaN 0.069544
b NaN 1.000000 0.051742
c 0.069544 0.051742 1.000000
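The histogram_intersection callable used below is not defined in this excerpt; it is presumably along these lines:
def histogram_intersection(a, b):
    return np.minimum(np.true_divide(a, a.sum()), np.true_divide(b, b.sum())).sum()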
In [26]: frame.corr(method=histogram_intersection)
Out[26]:
a b c
a 1.000000 -6.404882 -2.058431
b -6.404882 1.000000 -19.255743
c -2.058431 -19.255743 1.000000
A related method corrwith() is implemented on DataFrame to compute the correlation between like-labeled Series
contained in different DataFrame objects.
In [27]: index = ["a", "b", "c", "d", "e"]
In [31]: df1.corrwith(df2)
Out[31]:
one -0.125501
two -0.493244
three 0.344056
four 0.004183
dtype: float64
Data ranking
The rank() method produces a data ranking with ties being assigned the mean of the ranks (by default) for the group:
In [33]: s = pd.Series(np.random.randn(5), index=list("abcde"))
In [35]: s.rank()
Out[35]:
a 5.0
b 2.5
c 1.0
d 2.5
e 4.0
dtype: float64
rank() is also a DataFrame method and can rank either the rows (axis=0) or the columns (axis=1). NaN values
are excluded from the ranking.
In [36]: df = pd.DataFrame(np.random.randn(10, 6))
# make column 4 a partial copy of column 2, producing ties and missing values
In [37]: df[4] = df[2][:5]
In [38]: df
Out[38]:
0 1 2 3 4 5
0 -0.904948 -1.163537 -1.457187 0.135463 -1.457187 0.294650
1 -0.976288 -0.244652 -0.748406 -0.999601 -0.748406 -0.800809
2 0.401965 1.460840 1.256057 1.308127 1.256057 0.876004
3 0.205954 0.369552 -0.669304 0.038378 -0.669304 1.140296
4 -0.477586 -0.730705 -1.129149 -0.601463 -1.129149 -0.211196
5 -1.092970 -0.689246 0.908114 0.204848 NaN 0.463347
6 0.376892 0.959292 0.095572 -0.593740 NaN -0.069180
7 -1.002601 1.957794 -0.120708 0.094214 NaN -1.467422
8 -0.547231 0.664402 -0.519424 -0.073254 NaN -1.263544
9 -0.250277 -0.237428 -1.056443 0.419477 NaN 1.375064
In [39]: df.rank(1)
Out[39]:
0 1 2 3 4 5
0 4.0 3.0 1.5 5.0 1.5 6.0
1 2.0 6.0 4.5 1.0 4.5 3.0
2 1.0 6.0 3.5 5.0 3.5 2.0
3 4.0 5.0 1.5 3.0 1.5 6.0
4 5.0 3.0 1.5 4.0 1.5 6.0
5 1.0 2.0 5.0 3.0 NaN 4.0
6 4.0 5.0 3.0 1.0 NaN 2.0
7 2.0 5.0 3.0 4.0 NaN 1.0
8 2.0 5.0 3.0 4.0 NaN 1.0
9 2.0 3.0 1.0 4.0 NaN 5.0
rank optionally takes a parameter ascending which by default is true; when false, data is reverse-ranked, with
larger values assigned a smaller rank.
rank supports different tie-breaking methods, specified with the method parameter:
• average : average rank of tied group
• min : lowest rank in the group
• max : highest rank in the group
• first : ranks assigned in the order they appear in the array
Windowing functions
See the window operations user guide for an overview of windowing functions.
By “group by” we are referring to a process involving one or more of the following steps:
• Splitting the data into groups based on some criteria.
• Applying a function to each group independently.
• Combining the results into a data structure.
Out of these, the split step is the most straightforward. In fact, in many situations we may wish to split the data set into
groups and do something with those groups. In the apply step, we might wish to do one of the following:
• Aggregation: compute a summary statistic (or statistics) for each group. Some examples:
– Compute group sums or means.
– Compute group sizes / counts.
• Transformation: perform some group-specific computations and return a like-indexed object. Some examples:
– Standardize data (zscore) within a group.
– Filling NAs within groups with a value derived from each group.
• Filtration: discard some groups, according to a group-wise computation that evaluates True or False. Some
examples:
– Discard data that belongs to groups with only a few members.
– Filter out data based on the group sum or mean.
• Some combination of the above: GroupBy will examine the results of the apply step and try to return a sensibly
combined result if it doesn’t fit into either of the above two categories.
Since the set of object instance methods on pandas data structures are generally rich and expressive, we often simply
want to invoke, say, a DataFrame function on each group. The name GroupBy should be quite familiar to those who
have used a SQL-based tool (or itertools), in which you can write code like:
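The SQL snippet referenced here is not reproduced in this excerpt; a rough pandas equivalent (table and column names purely illustrative) would be:
# roughly: SELECT Column1, Column2, mean(Column3), sum(Column4)
#          FROM SomeTable GROUP BY Column1, Column2
some_table.groupby(["Column1", "Column2"]).agg({"Column3": "mean", "Column4": "sum"})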
We aim to make operations like this natural and easy to express using pandas. We’ll address each area of GroupBy
functionality then provide some non-trivial examples / use cases.
See the cookbook for some advanced strategies.
pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels
to group names. To create a GroupBy object (more on what the GroupBy object is later), you may do the following:
In [1]: df = pd.DataFrame(
...: [
...: ("bird", "Falconiformes", 389.0),
...: ("bird", "Psittaciformes", 24.0),
...: ("mammal", "Carnivora", 80.2),
...: ("mammal", "Primates", np.nan),
...: ("mammal", "Carnivora", 58),
...: ],
...: index=["falcon", "parrot", "lion", "monkey", "leopard"],
...: columns=("class", "order", "max_speed"),
...: )
...:
In [2]: df
Out[2]:
class order max_speed
falcon bird Falconiformes 389.0
parrot bird Psittaciformes 24.0
lion mammal Carnivora 80.2
monkey mammal Primates NaN
leopard mammal Carnivora 58.0
# default is axis=0
In [3]: grouped = df.groupby("class")
Note: A string passed to groupby may refer to either a column or an index level. If a string matches both a column
name and an index level name, a ValueError will be raised.
In [6]: df = pd.DataFrame(
   ...:     {
   ...:         "A": ["foo", "bar", "foo", "bar", "foo", "bar", "foo", "foo"],
   ...:         "B": ["one", "one", "two", "three", "two", "two", "one", "three"],
   ...:         "C": np.random.randn(8),
   ...:         "D": np.random.randn(8),
   ...:     }
   ...: )
In [7]: df
Out[7]:
A B C D
0 foo one 0.469112 -0.861849
1 bar one -0.282863 -2.104569
2 foo two -1.509059 -0.494929
3 bar three -1.135632 1.071804
4 foo two 1.212112 0.721555
5 bar two -0.173215 -0.706771
6 foo one 0.119209 -1.039575
7 foo three -1.044236 0.271860
On a DataFrame, we obtain a GroupBy object by calling groupby(). We could naturally group by either the A or B
columns, or both:
In [8]: grouped = df.groupby("A")
In [12]: grouped.sum()
Out[12]:
C D
A
bar -1.591710 -1.739537
foo -0.752861 -1.402938
These will split the DataFrame on its index (rows). We could also split by the columns:
In [13]: def get_letter_type(letter):
....: if letter.lower() in 'aeiou':
....: return 'vowel'
....: else:
....: return 'consonant'
....:
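For example, the columns could be grouped by whether their name is a vowel or a consonant (a sketch):
grouped = df.groupby(get_letter_type, axis=1)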
pandas Index objects support duplicate values. If a non-unique index is used as the group key in a groupby operation,
all values for the same index value will be considered to be in one group and thus the output of aggregation functions
will only contain unique index values:
In [15]: lst = [1, 2, 3, 1, 2, 3]
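The series and grouping that the following outputs assume are not shown in this excerpt; a consistent setup would be:
s = pd.Series([1, 2, 3, 10, 20, 30], lst)   # non-unique index 1, 2, 3, 1, 2, 3
grouped = s.groupby(level=0)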
In [18]: grouped.first()
Out[18]:
1 1
2 2
3 3
dtype: int64
In [19]: grouped.last()
Out[19]:
1 10
2 20
3 30
dtype: int64
In [20]: grouped.sum()
Out[20]:
1 11
2 22
3 33
dtype: int64
Note that no splitting occurs until it’s needed. Creating the GroupBy object only verifies that you’ve passed a valid
mapping.
Note: Many kinds of complicated data manipulations can be expressed in terms of GroupBy operations (though can’t
be guaranteed to be the most efficient). You can get quite creative with the label mapping functions.
GroupBy sorting
By default the group keys are sorted during the groupby operation. You may however pass sort=False for
potential speedups:
In [21]: df2 = pd.DataFrame({"X": ["B", "B", "A", "A"], "Y": [1, 2, 3, 4]})
In [22]: df2.groupby(["X"]).sum()
Out[22]:
Y
X
A 7
B 3
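The sort=False variant is not shown in this excerpt; it would look roughly like this, with the group order following the data:
df2.groupby(["X"], sort=False).sum()   # group B first, then A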
Note that groupby will preserve the order in which observations are sorted within each group. For example, the
groups created by groupby() below are in the order they appeared in the original DataFrame:
In [24]: df3 = pd.DataFrame({"X": ["A", "B", "A", "B"], "Y": [1, 4, 3, 2]})
In [25]: df3.groupby(["X"]).get_group("A")
Out[25]:
X Y
0 A 1
2 A 3
In [26]: df3.groupby(["X"]).get_group("B")
Out[26]:
X Y
1 B 4
3 B 2
GroupBy dropna
By default NA values are excluded from group keys during the groupby operation. However, in case you want to
include NA values in group keys, you could pass dropna=False to achieve it.
In [27]: df_list = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
In [28]: df_dropna = pd.DataFrame(df_list, columns=["a", "b", "c"])
In [29]: df_dropna
Out[29]:
a b c
0 1 2.0 3
1 1 NaN 4
2 2 1.0 3
3 1 2.0 2
The default setting of dropna argument is True which means NA are not included in group keys.
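A sketch of the comparison, using the df_dropna frame built above:
df_dropna.groupby(by=["b"], dropna=True).sum()    # default: rows with NaN in the key are dropped
df_dropna.groupby(by=["b"], dropna=False).sum()   # NaN forms its own group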
The groups attribute is a dict whose keys are the computed unique groups and corresponding values being the axis
labels belonging to each group. In the above example we have:
In [32]: df.groupby("A").groups
Out[32]: {'bar': [1, 3, 5], 'foo': [0, 2, 4, 6, 7]}
Calling the standard Python len function on the GroupBy object just returns the length of the groups dict, so it is
largely just a convenience:
In [35]: grouped.groups
Out[35]: {('bar', 'one'): [1], ('bar', 'three'): [3], ('bar', 'two'): [5], ('foo', 'one'): [0, 6], ('foo', 'three'): [7], ('foo', 'two'): [2, 4]}
In [36]: len(grouped)
Out[36]: 6
In [37]: df
Out[37]:
height weight gender
2000-01-01 42.849980 157.500553 male
2000-01-02 49.607315 177.340407 male
2000-01-03 56.293531 171.524640 male
2000-01-04 48.421077 144.251986 female
2000-01-05 46.556882 152.526206 male
2000-01-06 68.448851 168.272968 female
2000-01-07 70.757698 136.431469 male
2000-01-08 58.909500 176.499753 female
2000-01-09 76.435631 174.094104 female
2000-01-10 45.306120 177.540920 male
In [38]: gb = df.groupby("gender")
With hierarchically-indexed data, it’s quite natural to group by one of the levels of the hierarchy.
Let’s create a Series with a two-level MultiIndex.
In [40]: arrays = [
....: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
....: ["one", "two", "one", "two", "one", "two", "one", "two"],
....: ]
....:
In [43]: s
Out[43]:
first second
bar one -0.919854
two -0.042379
baz one 1.247642
two -0.009920
foo one 0.290213
two 0.495767
qux one 0.362949
two 1.548106
dtype: float64
In [45]: grouped.sum()
Out[45]:
first
bar -0.962232
baz 1.237723
foo 0.785980
qux 1.911055
dtype: float64
If the MultiIndex has names specified, these can be passed instead of the level number:
In [46]: s.groupby(level="second").sum()
Out[46]:
second
one 0.980950
two 1.991575
dtype: float64
The aggregation functions such as sum will take the level parameter directly. Additionally, the resulting index will be
named according to the chosen level:
In [47]: s.sum(level="second")
Out[47]:
second
one 0.980950
In [48]: s
Out[48]:
first second third
bar doo one -1.131345
two -0.089329
baz bee one 0.337863
two -0.945867
foo bop one -0.932132
two 1.956030
qux bop one 0.017587
two -0.016692
dtype: float64
A DataFrame may be grouped by a combination of columns and index levels by specifying the column names as
strings and the index levels as pd.Grouper objects.
In [51]: arrays = [
....: ["bar", "bar", "baz", "baz", "foo", "foo", "qux", "qux"],
....: ["one", "two", "one", "two", "one", "two", "one", "two"],
....: ]
....:
In [54]: df
Out[54]:
A B
first second
bar one 1 0
two 1 1
baz one 1 2
two 1 3
foo one 2 4
two 2 5
qux one 3 6
two 3 7
The following example groups df by the second index level and the A column.
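A sketch of that grouping (level 1 is the "second" level):
df.groupby([pd.Grouper(level=1), "A"]).sum()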
Once you have created the GroupBy object from a DataFrame, you might want to do something different for each of
the columns. Thus, using [] similar to getting a column from a DataFrame, you can do:
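For example (a sketch):
grouped = df.groupby(["A"])
grouped_C = grouped["C"]   # select a single column from the GroupBy object
grouped_D = grouped["D"]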
This is mainly syntactic sugar for the alternative and much more verbose:
In [61]: df["C"].groupby(df["A"])
Out[61]: <pandas.core.groupby.generic.SeriesGroupBy object at 0x7fa941fbb370>
Additionally this method avoids recomputing the internal grouping information derived from the passed key.
With the GroupBy object in hand, iterating through the grouped data is very natural and functions similarly to
itertools.groupby():
In the case of grouping by multiple keys, the group name will be a tuple:
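A sketch of both cases:
# single key: the group name is the key value
for name, group in df.groupby("A"):
    print(name)
    print(group)

# multiple keys: the group name is a tuple of key values
for name, group in df.groupby(["A", "B"]):
    print(name)
    print(group)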
2.17.4 Aggregation
Once the GroupBy object has been created, several methods are available to perform a computation on the grouped
data. These operations are similar to the aggregating API, window API, and resample API.
An obvious one is aggregation via the aggregate() or equivalently agg() method:
In [67]: grouped = df.groupby("A")
In [68]: grouped.aggregate(np.sum)
Out[68]:
C D
A
bar 0.392940 1.732707
foo -1.796421 2.824590
In [69]: grouped = df.groupby(["A", "B"])
In [70]: grouped.aggregate(np.sum)
As you can see, the result of the aggregation will have the group names as the new index along the grouped axis. In
the case of multiple keys, the result is a MultiIndex by default, though this can be changed by using the as_index
option:
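The grouping that the following output assumes was presumably re-created with as_index=False, roughly:
grouped = df.groupby(["A", "B"], as_index=False)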
In [72]: grouped.aggregate(np.sum)
Out[72]:
A B C D
0 bar one 0.254161 1.511763
1 bar three 0.215897 -0.990582
2 bar two -0.077118 1.211526
3 foo one -0.983776 1.614581
4 foo three -0.862495 0.024580
5 foo two 0.049851 1.185429
Note that you could use the reset_index DataFrame function to achieve the same result as the column names are
stored in the resulting MultiIndex:
Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size
method. It returns a Series whose index are the group names and whose values are the sizes of each group.
In [75]: grouped.size()
Out[75]:
A B size
0 bar one 1
1 bar three 1
2 bar two 1
3 foo one 2
4 foo three 1
5 foo two 2
In [76]: grouped.describe()
Out[76]:
C ... D
[6 rows x 16 columns]
Another aggregation example is to compute the number of unique values of each group. This is similar to the
value_counts function, except that it only counts unique values.
In [77]: ll = [['foo', 1], ['foo', 2], ['foo', 2], ['bar', 1], ['bar', 1]]
In [78]: df4 = pd.DataFrame(ll, columns=["A", "B"])
In [79]: df4
Out[79]:
A B
0 foo 1
1 foo 2
2 foo 2
3 bar 1
4 bar 1
In [80]: df4.groupby("A")["B"].nunique()
Out[80]:
A
bar 1
foo 2
Name: B, dtype: int64
Note: Aggregation functions will not return the groups that you are aggregating over if they are named columns,
when as_index=True, the default. The grouped columns will be the indices of the returned object.
Passing as_index=False will return the groups that you are aggregating over, if they are named columns.
Aggregating functions are the ones that reduce the dimension of the returned objects. Some common aggregating
functions are tabulated below:
Function Description
mean() Compute mean of groups
sum() Compute sum of group values
size() Compute group sizes
count() Compute count of group
std() Standard deviation of groups
var() Compute variance of groups
sem() Standard error of the mean of groups
describe() Generates descriptive statistics
first() Compute first of group values
last() Compute last of group values
nth() Take nth value, or a subset if n is a list
min() Compute min of group values
max() Compute max of group values
The aggregating functions above will exclude NA values. Any function which reduces a Series to a scalar value is
an aggregation function and will work, a trivial example is df.groupby('A').agg(lambda ser: 1). Note
that nth() can act as a reducer or a filter, see here.
With grouped Series you can also pass a list or dict of functions to do aggregation with, outputting a DataFrame:
In [81]: grouped = df.groupby("A")
On a grouped DataFrame, you can pass a list of functions to apply to each column, which produces an aggregated
result with a hierarchical index:
In [83]: grouped.agg([np.sum, np.mean, np.std])
Out[83]:
C D
sum mean std sum mean std
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a
chained operation for a Series like this:
In [84]: (
....: grouped["C"]
....: .agg([np.sum, np.mean, np.std])
....: .rename(columns={"sum": "foo", "mean": "bar", "std": "baz"})
....: )
....:
Out[84]:
foo bar baz
In [85]: (
....: grouped.agg([np.sum, np.mean, np.std]).rename(
....: columns={"sum": "foo", "mean": "bar", "std": "baz"}
....: )
....: )
....:
Out[85]:
C D
foo bar baz foo bar baz
A
bar 0.392940 0.130980 0.181231 1.732707 0.577569 1.366330
foo -1.796421 -0.359284 0.912265 2.824590 0.564918 0.884785
Note: In general, the output column names should be unique. You can’t apply the same function (or two functions
with the same name) to the same column.
pandas does allow you to provide multiple lambdas. In this case, pandas will mangle the name of the (nameless)
lambda functions, appending _<i> to each subsequent lambda.
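The call that produces the output below is not shown in this excerpt; it is presumably along these lines:
grouped["C"].agg([lambda x: x.max() - x.min(), lambda x: x.median() - x.mean()])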
Out[87]:
<lambda_0> <lambda_1>
A
bar 0.331279 0.084917
foo 2.337259 -0.215962
Named aggregation
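Named aggregation lets the keyword arguments name the output columns, with pd.NamedAgg (or plain (column, aggfunc) tuples) selecting the input column and the function to apply. The animals frame used below is not constructed in this excerpt; a setup consistent with the output would be:
animals = pd.DataFrame(
    {
        "kind": ["cat", "dog", "cat", "dog"],
        "height": [9.1, 6.0, 9.5, 34.0],
        "weight": [7.9, 7.5, 9.9, 198.0],
    }
)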
In [89]: animals
Out[89]:
kind height weight
0 cat 9.1 7.9
1 dog 6.0 7.5
2 cat 9.5 9.9
3 dog 34.0 198.0
In [90]: animals.groupby("kind").agg(
....: min_height=pd.NamedAgg(column="height", aggfunc="min"),
....: max_height=pd.NamedAgg(column="height", aggfunc="max"),
....: average_weight=pd.NamedAgg(column="weight", aggfunc=np.mean),
....: )
....:
Out[90]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
In [91]: animals.groupby("kind").agg(
....: min_height=("height", "min"),
....: max_height=("height", "max"),
....: average_weight=("weight", np.mean),
....: )
....:
Out[91]:
min_height max_height average_weight
kind
cat 9.1 9.5 8.90
dog 6.0 34.0 102.75
If your desired output column names are not valid Python keywords, construct a dictionary and unpack the keyword
arguments
In [92]: animals.groupby("kind").agg(
....: **{
....: "total weight": pd.NamedAgg(column="weight", aggfunc=sum)
....: }
....: )
....:
Out[92]:
total weight
kind
cat 17.8
dog 205.5
Additional keyword arguments are not passed through to the aggregation functions. Only pairs of (column,
aggfunc) should be passed as **kwargs. If your aggregation function requires additional arguments, partially
apply them with functools.partial().
Note: For Python 3.5 and earlier, the order of **kwargs in a functions was not preserved. This means that the
output column ordering would not be consistent. To ensure consistent ordering, the keys (and so output columns) will
always be sorted for Python 3.5.
Named aggregation is also valid for Series groupby aggregations. In this case there’s no column selection, so the
values are just the functions.
In [93]: animals.groupby("kind").height.agg(
....: min_height="min",
....: max_height="max",
....: )
....:
Out[93]:
min_height max_height
kind
cat 9.1 9.5
dog 6.0 34.0
By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame:
In [94]: grouped.agg({"C": np.sum, "D": lambda x: np.std(x, ddof=1)})
Out[94]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
The function names can also be strings. In order for a string to be valid it must be either implemented on GroupBy or
available via dispatching:
In [95]: grouped.agg({"C": "sum", "D": "std"})
Out[95]:
C D
A
bar 0.392940 1.366330
foo -1.796421 0.884785
Some common aggregations, currently only sum, mean, std, and sem, have optimized Cython implementations:
In [96]: df.groupby("A").sum()
Out[96]:
C D
A
bar 0.392940 1.732707
foo -1.796421 2.824590
Of course sum and mean are implemented on pandas objects, so the above code would work even without the special
versions via dispatching (see below).
2.17.5 Transformation
The transform method returns an object that is indexed the same (same size) as the one being grouped. The
transform function must:
• Return a result that is either the same size as the group chunk or broadcastable to the size of the group chunk
(e.g., a scalar, grouped.transform(lambda x: x.iloc[-1])).
• Operate column-by-column on the group chunk. The transform is applied to the first group chunk using
chunk.apply.
• Not perform in-place operations on the group chunk. Group chunks should be treated as immutable, and changes
to a group chunk may produce unexpected results. For example, when using fillna, inplace must be
False (grouped.transform(lambda x: x.fillna(inplace=False))).
• (Optionally) operates on the entire group chunk. If this is supported, a fast path is used starting from the second
chunk.
For example, suppose we wished to standardize the data within each group:
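The ts series inspected below is not constructed in this excerpt; a setup consistent with the outputs (a smoothed, date-indexed random series) would be roughly:
index = pd.date_range("10/1/1999", periods=1100)
ts = pd.Series(np.random.normal(0.5, 2, 1100), index)
ts = ts.rolling(window=100, min_periods=100).mean().dropna()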
In [101]: ts.head()
Out[101]:
2000-01-08 0.779333
2000-01-09 0.778852
2000-01-10 0.786476
2000-01-11 0.782797
2000-01-12 0.798110
Freq: D, dtype: float64
In [102]: ts.tail()
Out[102]:
2002-09-30 0.660294
2002-10-01 0.631095
2002-10-02 0.673601
2002-10-03 0.709213
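The standardizing transform itself is not shown in this excerpt; a sketch would be:
transformed = ts.groupby(lambda x: x.year).transform(lambda x: (x - x.mean()) / x.std())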
We would expect the result to now have mean 0 and standard deviation 1 within each group, which we can easily
check:
# Original Data
In [104]: grouped = ts.groupby(lambda x: x.year)
In [105]: grouped.mean()
Out[105]:
2000 0.442441
2001 0.526246
2002 0.459365
dtype: float64
In [106]: grouped.std()
Out[106]:
2000 0.131752
2001 0.210945
2002 0.128753
dtype: float64
# Transformed Data
In [107]: grouped_trans = transformed.groupby(lambda x: x.year)
In [108]: grouped_trans.mean()
Out[108]:
2000 1.167126e-15
2001 2.190637e-15
2002 1.088580e-15
dtype: float64
In [109]: grouped_trans.std()
Out[109]:
2000 1.0
2001 1.0
2002 1.0
dtype: float64
We can also visually compare the original and transformed data sets.
In [111]: compare.plot()
Out[111]: <AxesSubplot:>
Transformation functions that have lower dimension outputs are broadcast to match the shape of the input array.
Alternatively, the built-in methods could be used to produce the same outputs.
Another common data transform is to replace missing data with the group mean.
In [116]: data_df
Out[116]:
A B C
0 1.539708 -1.166480 0.533026
1 1.302092 -0.505754 NaN
2 -0.371983 1.104803 -0.651520
3 -1.309622 1.118697 -1.161657
4 -1.924296 0.396437 0.812436
.. ... ... ...
995 -0.093110 0.683847 -0.774753
996 -0.185043 1.438572 NaN
997 -0.394469 -0.642343 0.011374
998 -1.174126 1.857148 NaN
999 0.234564 0.517098 0.393534
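A sketch of this transform, assuming the rows of data_df are grouped by some key (here a random label, purely illustrative):
countries = np.array(["US", "UK", "GR", "JP"])
key = countries[np.random.randint(0, 4, 1000)]
grouped = data_df.groupby(key)
transformed = grouped.transform(lambda x: x.fillna(x.mean()))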
We can verify that the group means have not changed in the transformed data and that the transformed data contains
no NAs.
Note: Some functions will automatically transform the input when applied to a GroupBy object, but returning an
object of the same shape as the original. Passing as_index=False will not affect these transformation methods.
For example: fillna, ffill, bfill, shift..
In [128]: grouped.ffill()
Out[128]:
A B C
0 1.539708 -1.166480 0.533026
1 1.302092 -0.505754 0.533026
2 -0.371983 1.104803 -0.651520
3 -1.309622 1.118697 -1.161657
4 -1.924296 0.396437 0.812436
.. ... ... ...
995 -0.093110 0.683847 -0.774753
996 -0.185043 1.438572 -0.774753
997 -0.394469 -0.642343 0.011374
998 -1.174126 1.857148 -0.774753
In [130]: df_re
Out[130]:
A B
0 1 0
1 1 1
2 1 2
3 1 3
4 1 4
.. .. ..
15 5 15
16 5 16
17 5 17
18 5 18
19 5 19
In [131]: df_re.groupby("A").rolling(4).B.mean()
Out[131]:
A
1 0 NaN
1 NaN
2 NaN
3 1.5
4 2.5
...
5 15 13.5
16 14.5
17 15.5
18 16.5
19 17.5
Name: B, Length: 20, dtype: float64
The expanding() method will accumulate a given operation (sum() in the example) for all the members of each
particular group.
In [132]: df_re.groupby("A").expanding().sum()
Out[132]:
A B
A
1 0 1.0 0.0
Suppose you want to use the resample() method to get a daily frequency in each group of your dataframe and wish
to complete the missing values with the ffill() method.
In [134]: df_re
Out[134]:
group val
date
2016-01-03 1 5
2016-01-10 1 6
2016-01-17 2 7
2016-01-24 2 8
In [135]: df_re.groupby("group").resample("1D").ffill()
Out[135]:
group val
group date
1 2016-01-03 1 5
2016-01-04 1 5
2016-01-05 1 5
2016-01-06 1 5
2016-01-07 1 5
... ... ...
2 2016-01-20 2 7
2016-01-21 2 7
2016-01-22 2 7
2016-01-23 2 7
2016-01-24 2 8
2.17.6 Filtration
The filter method returns a subset of the original object. Suppose we want to take only elements that belong to
groups with a group sum greater than 2.
The argument of filter must be a function that, applied to the group as a whole, returns True or False.
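For example (a sketch):
sf = pd.Series([1, 1, 2, 3, 3, 3])
sf.groupby(sf).filter(lambda x: x.sum() > 2)   # only the group of 3s has a sum greater than 2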
Another useful operation is filtering out elements that belong to groups with only a couple members.
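A sketch with a small frame (also used in the head example further below); dropna=False keeps the filtered-out rows instead of dropping them:
dff = pd.DataFrame({"A": np.arange(8), "B": list("aabbbbcc"), "C": np.arange(8)})
dff.groupby("B").filter(lambda x: len(x) > 2)                 # drops groups with two or fewer rows
dff.groupby("B").filter(lambda x: len(x) > 2, dropna=False)   # keeps them, filled with NaN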
Alternatively, instead of dropping the offending groups, we can return a like-indexed object where the groups that do
not pass the filter are filled with NaNs.
For DataFrames with multiple columns, filters should explicitly specify a column as the filter criterion.
Note: Some functions when applied to a groupby object will act as a filter on the input, returning a reduced shape of
the original (and potentially eliminating groups), but with the index unchanged. Passing as_index=False will not
affect these methods. For example: head, tail:
In [143]: dff.groupby("B").head(2)
Out[143]:
A B C
0 0 a 0
1 1 a 1
2 2 b 2
3 3 b 3
6 6 c 6
7 7 c 7
When doing an aggregation or transformation, you might just want to call an instance method on each data group.
This is pretty easy to do by passing lambda functions:
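For instance, computing the group standard deviations by hand with a lambda (a sketch):
grouped = df.groupby("A")
grouped.agg(lambda x: x.std())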
But, it’s rather verbose and can be untidy if you need to pass additional arguments. Using a bit of metaprogramming
cleverness, GroupBy now has the ability to “dispatch” method calls to the groups:
In [146]: grouped.std()
Out[146]:
C D
A
bar 0.181231 1.366330
foo 0.912265 0.884785
What is actually happening here is that a function wrapper is being generated. When invoked, it takes any passed
arguments and invokes the function with any arguments on each group (in the above example, the std function). The
results are then combined together much in the style of agg and transform (it actually uses apply to infer the
gluing, documented next). This enables some operations to be carried out rather succinctly:
In [150]: grouped.fillna(method="pad")
In this example, we chopped the collection of time series into yearly chunks then independently called fillna on the
groups.
The nlargest and nsmallest methods work on Series style groupbys:
In [151]: s = pd.Series([9, 8, 7, 5, 19, 1, 4.2, 3.3])
In [152]: g = pd.Series(list("abababab"))
In [153]: gb = s.groupby(g)
In [154]: gb.nlargest(3)
Out[154]:
a 4 19.0
0 9.0
2 7.0
b 1 8.0
3 5.0
7 3.3
dtype: float64
In [155]: gb.nsmallest(3)
Out[155]:
a 6 4.2
2 7.0
0 9.0
b 5 1.0
7 3.3
3 5.0
dtype: float64
Some operations on the grouped data might not fit into either the aggregate or transform categories. Or, you may simply
want GroupBy to infer how to combine the results. For these, use the apply function, which can be substituted for
both aggregate and transform in many standard use cases. However, apply can handle some exceptional use
cases, for example:
In [156]: df
Out[156]:
A B C D
0 foo one -0.575247 1.346061
1 bar one 0.254161 1.511763
2 foo two -1.143704 1.627081
3 bar three 0.215897 -0.990582
4 foo two 1.193555 -0.441652
5 bar two -0.077118 1.211526
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
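The grouping and the function f applied below are not shown in this excerpt; given the output columns, they are presumably along these lines:
grouped = df.groupby("A")["C"]

def f(group):
    return pd.DataFrame({"original": group, "demeaned": group - group.mean()})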
In [161]: grouped.apply(f)
Out[161]:
original demeaned
0 -0.575247 -0.215962
1 0.254161 0.123181
2 -1.143704 -0.784420
3 0.215897 0.084917
4 1.193555 1.552839
5 -0.077118 -0.208098
6 -0.408530 -0.049245
7 -0.862495 -0.503211
apply on a Series can operate on a returned value from the applied function that is itself a Series, and can possibly
upcast the result to a DataFrame:
In [163]: s = pd.Series(np.random.rand(5))
In [164]: s
Out[164]:
0 0.321438
1 0.493496
2 0.139505
3 0.910103
4 0.194158
dtype: float64
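The f used below is likewise not shown; a sketch consistent with the "x"/"x^2" columns in the output (an assumption) is:

def f(x):
    # return a Series so that apply upcasts the result to a DataFrame
    return pd.Series([x, x ** 2], index=["x", "x^2"])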
In [165]: s.apply(f)
Out[165]:
x x^2
0 0.321438 0.103323
1 0.493496 0.243538
2 0.139505 0.019462
3 0.910103 0.828287
4 0.194158 0.037697
Note: apply can act as a reducer, transformer, or filter function, depending on exactly what is passed to it and on
what you are grouping by. Depending on the path taken, the grouped column(s) may be included in the output and
may also set the indices.
Warning: When using engine='numba', there will be no “fall back” behavior internally. The group data
and group index will be passed as NumPy arrays to the JITed user defined function, and no alternative execution
attempts will be tried.
Note: In terms of performance, the first time a function is run using the Numba engine will be slow as Numba
will have some function compilation overhead. However, the compiled functions are cached, and subsequent calls will
be fast. In general, the Numba engine is performant with a larger amount of data points (e.g. 1+ million).
In [166]: df
Out[166]:
A B C D
0 foo one -0.575247 1.346061
1 bar one 0.254161 1.511763
2 foo two -1.143704 1.627081
3 bar three 0.215897 -0.990582
4 foo two 1.193555 -0.441652
5 bar two -0.077118 1.211526
6 foo one -0.408530 0.268520
7 foo three -0.862495 0.024580
Suppose we wish to compute the standard deviation grouped by the A column. There is a slight problem, namely that
we don’t care about the data in column B. We refer to this as a “nuisance” column. If the passed aggregation function
can’t be applied to some columns, the troublesome columns will be (silently) dropped. Thus, this does not pose any
problems:
In [167]: df.groupby("A").std()
Out[167]:
C D
A
bar 0.181231 1.366330
foo 0.912265 0.884785
Note: Any object-dtype column, even if it contains numerical values such as Decimal objects, is considered a
“nuisance” column. Such columns are excluded from aggregate functions automatically in groupby.
If you do wish to include decimal or object columns in an aggregation with other non-nuisance data types, you must
do so explicitly.
# ...but cannot be combined with standard data types or they will be excluded
In [171]: df_dec.groupby(["id"])[["int_column", "dec_column"]].sum()
Out[171]:
int_column
id
1 4
2 6
# Use .agg function to aggregate over standard and "nuisance" data types
# at the same time
In [172]: df_dec.groupby(["id"]).agg({"int_column": "sum", "dec_column": "sum"})
When using a Categorical grouper (as a single grouper, or as part of multiple groupers), the observed keyword
controls whether to return a Cartesian product of all possible grouper values (observed=False) or only those that
are observed (observed=True).
Show all values:
The dtype of the grouped result will always include all of the categories that were grouped.
In [175]: s = (
.....: pd.Series([1, 1, 1])
.....: .groupby(pd.Categorical(["a", "a", "a"], categories=["a", "b"]), observed=False)
.....: .count()
.....: )
.....:
In [176]: s.index.dtype
Out[176]: CategoricalDtype(categories=['a', 'b'], ordered=False)
If there are any NaN or NaT values in the grouping key, these will be automatically excluded. In other words, there will
never be an “NA group” or “NaT group”. This was not the case in older versions of pandas, but users were generally
discarding the NA group anyway (and supporting it was an implementation headache).
Categorical variables represented as instances of pandas’ Categorical class can be used as group keys. If so, the
order of the levels will be preserved:
In [177]: data = pd.Series(np.random.randn(100))
In [179]: data.groupby(factor).mean()
Out[179]:
(-2.645, -0.523] -1.362896
(-0.523, 0.0296] -0.260266
(0.0296, 0.654] 0.361802
(0.654, 2.21] 1.073801
dtype: float64
You may need to specify a bit more data to properly group. You can use the pd.Grouper to provide this local
control.
In [180]: import datetime
In [181]: df = pd.DataFrame(
.....: {
.....: "Branch": "A A A A A A A B".split(),
.....: "Buyer": "Carl Mark Carl Carl Joe Joe Joe Carl".split(),
.....: "Quantity": [1, 3, 5, 1, 8, 1, 9, 3],
.....: "Date": [
.....: datetime.datetime(2013, 1, 1, 13, 0),
.....: datetime.datetime(2013, 1, 1, 13, 5),
.....: datetime.datetime(2013, 10, 1, 20, 0),
.....: datetime.datetime(2013, 10, 2, 10, 0),
.....: datetime.datetime(2013, 10, 1, 20, 0),
.....: datetime.datetime(2013, 10, 2, 10, 0),
.....: datetime.datetime(2013, 12, 2, 12, 0),
.....: datetime.datetime(2013, 12, 2, 14, 0),
.....: ],
.....: }
.....: )
.....:
In [182]: df
Out[182]:
Branch Buyer Quantity Date
0 A Carl 1 2013-01-01 13:00:00
1 A Mark 3 2013-01-01 13:05:00
2 A Carl 5 2013-10-01 20:00:00
Groupby a specific column with the desired frequency. This is like resampling.
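A sketch of such a grouping on the frame above, using pd.Grouper with a monthly frequency keyed on the Date column:

df.groupby([pd.Grouper(freq="1M", key="Date"), "Buyer"]).sum()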
You have an ambiguous specification in that you have a named index and a column that could be potential groupers.
In [184]: df = df.set_index("Date")
Just like for a DataFrame or Series you can call head and tail on a groupby:
In [189]: df
Out[189]:
A B
0 1 2
1 1 4
2 5 6
In [190]: g = df.groupby("A")
In [191]: g.head(1)
Out[191]:
A B
0 1 2
2 5 6
In [192]: g.tail(1)
Out[192]:
A B
1 1 4
2 5 6
To select from a DataFrame or Series the nth item, use nth(). This is a reduction method, and will return a single
row (or no row) per group if you pass an int for n:
In [194]: g = df.groupby("A")
In [195]: g.nth(0)
Out[195]:
B
A
1 NaN
5 6.0
In [196]: g.nth(-1)
Out[196]:
B
A
1 4.0
5 6.0
In [197]: g.nth(1)
Out[197]:
B
A
1 4.0
If you want to select the nth not-null item, use the dropna kwarg. For a DataFrame this should be either 'any' or
'all' just like you would pass to dropna:
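A sketch of the dropna keyword on the grouped object above; the results are comparable to the first() and last() calls shown next:

g.nth(0, dropna="any")   # first non-null entry in each group
g.nth(-1, dropna="any")  # last non-null entry in each group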
In [199]: g.first()
Out[199]:
B
A
1 4.0
5 6.0
In [201]: g.last()
Out[201]:
B
A
1 4.0
5 6.0
As with other methods, passing as_index=False will achieve a filtration, which returns the grouped row.
In [203]: df = pd.DataFrame([[1, np.nan], [1, 4], [5, 6]], columns=["A", "B"])
In [205]: g.nth(0)
Out[205]:
A B
0 1 NaN
2 5 6.0
In [206]: g.nth(-1)
Out[206]:
A B
1 1 4.0
2 5 6.0
You can also select multiple rows from each group by specifying multiple nth values as a list of ints.
In [207]: business_dates = pd.date_range(start="4/1/2014", end="6/30/2014", freq="B")
# get the first, 4th, and last date index for each month
In [209]: df.groupby([df.index.year, df.index.month]).nth([0, 3, -1])
To see the order in which each row appears within its group, use the cumcount method:
In [211]: dfg
Out[211]:
A
0 a
1 a
2 a
3 b
4 b
5 a
In [212]: dfg.groupby("A").cumcount()
Out[212]:
0 0
1 1
2 2
3 0
4 1
5 3
dtype: int64
In [213]: dfg.groupby("A").cumcount(ascending=False)
Out[213]:
0 3
1 2
2 1
3 1
4 0
5 0
dtype: int64
Enumerate groups
To see the ordering of the groups (as opposed to the order of rows within a group given by cumcount) you can use
ngroup().
Note that the numbers given to the groups match the order in which the groups would be seen when iterating over the
groupby object, not the order they are first observed.
In [215]: dfg
Out[215]:
A
0 a
1 a
2 a
3 b
4 b
5 a
In [216]: dfg.groupby("A").ngroup()
Out[216]:
0 0
1 0
2 0
3 1
4 1
5 0
dtype: int64
In [217]: dfg.groupby("A").ngroup(ascending=False)
Out[217]:
0 1
1 1
2 1
3 0
4 0
5 1
dtype: int64
Plotting
Groupby also works with some plotting methods. For example, suppose we suspect that some features in a DataFrame
may differ by group, in this case, the values in column 1 where the group is “B” are 3 higher on average.
In [218]: np.random.seed(1234)
In [222]: df.groupby("g").boxplot()
Out[222]:
A AxesSubplot(0.1,0.15;0.363636x0.75)
B AxesSubplot(0.536364,0.15;0.363636x0.75)
dtype: object
The result of calling boxplot is a dictionary whose keys are the values of our grouping column g (“A” and “B”).
The values of the resulting dictionary can be controlled by the return_type keyword of boxplot. See the
visualization documentation for more.
Similar to the functionality provided by DataFrame and Series, functions that take GroupBy objects can be
chained together using a pipe method to allow for a cleaner, more readable syntax. To read about .pipe in general
terms, see here.
Combining .groupby and .pipe is often useful when you need to reuse GroupBy objects.
As an example, imagine having a DataFrame with columns for stores, products, revenue and quantity sold. We’d
like to do a groupwise calculation of prices (i.e. revenue/quantity) per store and per product. We could do this in a
multi-step operation, but expressing it in terms of piping can make the code more readable. First we set the data:
In [223]: n = 1000
In [224]: df = pd.DataFrame(
.....: {
.....: "Store": np.random.choice(["Store_1", "Store_2"], n),
.....: "Product": np.random.choice(["Product_1", "Product_2"], n),
.....: "Revenue": (np.random.random(n) * 50 + 10).round(2),
.....: "Quantity": np.random.randint(1, 10, size=n),
.....: }
.....: )
.....:
In [225]: df.head(2)
Out[225]:
Store Product Revenue Quantity
0 Store_2 Product_1 26.12 1
1 Store_2 Product_1 28.86 1
In [226]: (
.....: df.groupby(["Store", "Product"])
.....: .pipe(lambda grp: grp.Revenue.sum() / grp.Quantity.sum())
.....: .unstack()
.....: .round(2)
.....: )
.....:
Out[226]:
Product Product_1 Product_2
Store
Store_1 6.82 7.05
Store_2 6.30 6.64
Piping can also be expressive when you want to deliver a grouped object to some arbitrary function, for example:
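A sketch of such a function being piped in (the helper name mean is hypothetical):

def mean(groupby):
    return groupby.mean()

df.groupby(["Store", "Product"]).pipe(mean)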
where mean takes a GroupBy object and finds the mean of the Revenue and Quantity columns respectively for each
Store-Product combination. The mean function can be any function that takes in a GroupBy object; the .pipe will
pass the GroupBy object as a parameter into the function you specify.
2.17.11 Examples
Regrouping by factor
Regroup columns of a DataFrame according to their sum, and sum the aggregated ones.
In [229]: df = pd.DataFrame({"a": [1, 0, 0], "b": [0, 1, 0], "c": [1, 0, 0], "d": [2, 3, 4]})
In [230]: df
Out[230]:
a b c d
0 1 0 1 2
1 0 1 0 3
2 0 0 0 4
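A sketch of the regrouping itself: group the columns by their column sums, then sum within each group.

df.groupby(df.sum(), axis=1).sum()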
Multi-column factorization
By using ngroup(), we can extract information about the groups in a way similar to factorize() (as described
further in the reshaping API) but which applies naturally to multiple columns of mixed type and different sources.
This can be useful as an intermediate categorical-like step in processing, when the relationships between the group
rows are more important than their content, or as input to an algorithm which only accepts the integer encoding.
(For more information about support in pandas for full categorical data, see the Categorical introduction and the API
documentation.)
In [232]: dfg = pd.DataFrame({"A": [1, 1, 2, 3, 2], "B": list("aaaba")})
In [233]: dfg
Out[233]:
A B
0 1 a
1 1 a
2 2 a
3 3 b
4 2 a
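A sketch of the multi-column factorization on this frame:

dfg.groupby(["A", "B"]).ngroup()              # one integer label per distinct (A, B) combination
dfg.groupby(["A", [0, 0, 0, 1, 1]]).ngroup()  # groupers can come from different sources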
Resampling produces new hypothetical samples (resamples) from already existing observed data or from a model that
generates data. These new samples are similar to the pre-existing samples.
In order for resample to work on indices that are not datetime-like, the following procedure can be used.
In the following examples, df.index // 5 returns an integer array which is used to determine which rows are selected for
the groupby operation.
Note: The example below shows how we can downsample by consolidating samples into fewer samples. Here, by
using df.index // 5, we aggregate the samples into bins. By applying the std() function, we aggregate the information
contained in many samples into a small subset of values, namely their standard deviation, thereby reducing the number
of samples.
In [237]: df
Out[237]:
0 1
0 -0.793893 0.321153
1 0.342250 1.618906
2 -0.975807 1.918201
3 -0.810847 -1.405919
4 -1.977759 0.461659
5 0.730057 -1.316938
6 -0.751328 0.528290
7 -0.257759 -1.081009
8 0.505895 -1.701948
9 -1.006349 0.020208
In [238]: df.index // 5
Out[238]: Int64Index([0, 0, 0, 0, 0, 1, 1, 1, 1, 1], dtype='int64')
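The downsampling step itself can be sketched as:

df.groupby(df.index // 5).std()  # one row of standard deviations per bin of five samples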
Group DataFrame columns, compute a set of metrics and return a named Series. The Series name is used as the name
for the column index. This is especially useful in conjunction with reshaping operations such as stacking in which the
column index name will be used as the name of the inserted column:
In [240]: df = pd.DataFrame(
.....: {
.....: "a": [0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2],
.....: "b": [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1],
.....: "c": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
.....: "d": [0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1],
.....: }
.....: )
.....:
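The aggregation that produces the result shown below is not reproduced above; a sketch (the helper name compute_metrics is hypothetical) is:

def compute_metrics(x):
    result = {"b_sum": x["b"].sum(), "c_mean": x["c"].mean()}
    return pd.Series(result, name="metrics")  # the Series name becomes the column index name

result = df.groupby("a").apply(compute_metrics)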
In [243]: result
Out[243]:
metrics b_sum c_mean
a
0 2.0 0.5
1 2.0 0.5
2 2.0 0.5
In [244]: result.stack()
Out[244]:
a metrics
0 b_sum 2.0
c_mean 0.5
1 b_sum 2.0
c_mean 0.5
2 b_sum 2.0
c_mean 0.5
dtype: float64
pandas contains a compact set of APIs for performing windowing operations - an operation that performs an aggre-
gation over a sliding partition of values. The API functions similarly to the groupby API in that Series and
DataFrame call the windowing method with necessary parameters and then subsequently call the aggregation func-
tion.
In [1]: s = pd.Series(range(5))
In [2]: s.rolling(window=2).sum()
Out[2]:
0 NaN
1 1.0
The windows are determined by looking back the length of the window from the current observation. The result above
can be derived by taking the sum of the following windowed partitions of data:
In [3]: for window in s.rolling(window=2):
...: print(window)
...:
0 0
dtype: int64
0 0
1 1
dtype: int64
1 1
2 2
dtype: int64
2 2
3 3
dtype: int64
3 3
4 4
dtype: int64
2.18.1 Overview
As noted above, some operations support specifying a window based on a time offset:
In [4]: s = pd.Series(range(5), index=pd.date_range('2020-01-01', periods=5, freq='1D'))
In [5]: s.rolling(window='2D').sum()
Out[5]:
2020-01-01 0.0
2020-01-02 1.0
Additionally, some methods support chaining a groupby operation with a windowing operation which will first group
the data by the specified keys and then perform a windowing operation per group.
In [6]: df = pd.DataFrame({'A': ['a', 'b', 'a', 'b', 'a'], 'B': range(5)})
In [7]: df.groupby('A').expanding().sum()
Out[7]:
B
A
a 0 0.0
2 2.0
4 6.0
b 1 1.0
3 4.0
Note: Windowing operations currently only support numeric data (integer and float) and will always return float64
values.
Warning: Some windowing aggregation methods (mean, sum, var and std) may suffer from numerical
imprecision due to the underlying windowing algorithms accumulating sums. When values differ in magnitude by
more than 1 / np.finfo(np.double).eps, this results in truncation. It must be noted that large values may have an impact
on windows which do not include these values. Kahan summation is used to compute the rolling sums to preserve
accuracy as much as possible.
All windowing operations support a min_periods argument that dictates the minimum number of non-np.nan
values a window must have; otherwise, the resulting value is np.nan. min_periods defaults to 1 for time-based
windows and to window for fixed windows.
In [8]: s = pd.Series([np.nan, 1, 2, np.nan, np.nan, 3])
# Equivalent to min_periods=3
In [11]: s.rolling(window=3, min_periods=None).sum()
Out[11]:
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
dtype: float64
Additionally, all windowing operations support the aggregate method for returning a result of multiple aggrega-
tions applied to a window.
In [12]: df = pd.DataFrame({"A": range(5), "B": range(10, 15)})
Generic rolling windows support specifying windows as a fixed number of observations or variable number of obser-
vations based on an offset. If a time based offset is provided, the corresponding time based index must be monotonic.
In [14]: times = ['2020-01-01', '2020-01-03', '2020-01-04', '2020-01-05', '2020-01-29']
In [16]: s
Out[16]:
2020-01-01 0
2020-01-03 1
2020-01-04 2
2020-01-05 3
2020-01-29 4
dtype: int64
Centering windows
By default the labels are set to the right edge of the window, but a center keyword is available so the labels can be
set at the center.
In [19]: s = pd.Series(range(10))
In [20]: s.rolling(window=5).mean()
Out[20]:
0 NaN
1 NaN
2 NaN
3 NaN
4 2.0
5 3.0
6 4.0
7 5.0
8 6.0
9 7.0
dtype: float64
The inclusion of the interval endpoints in rolling window calculations can be specified with the closed parameter:
Value Behavior
'right' close right endpoint
'left' close left endpoint
'both' close both endpoints
'neither' open endpoints
For example, having the right endpoint open is useful in many problems that require that there is no contamination
from present information back to past information. This allows the rolling window to compute statistics “up to that
point in time”, but not including that point in time.
In [22]: df = pd.DataFrame(
....: {"x": 1},
....: index=[
....: pd.Timestamp("20130101 09:00:01"),
....: pd.Timestamp("20130101 09:00:02"),
....: pd.Timestamp("20130101 09:00:03"),
....: pd.Timestamp("20130101 09:00:04"),
....: pd.Timestamp("20130101 09:00:06"),
....: ],
....: )
....:
In [27]: df
Out[27]:
x right both left neither
2013-01-01 09:00:01 1 1.0 1.0 NaN NaN
2013-01-01 09:00:02 1 2.0 2.0 1.0 1.0
2013-01-01 09:00:03 1 2.0 3.0 2.0 1.0
2013-01-01 09:00:04 1 2.0 3.0 2.0 1.0
2013-01-01 09:00:06 1 1.0 2.0 1.0 NaN
In [29]: use_expanding
Out[29]: [True, False, True, False, True]
In [31]: df
Out[31]:
values
0 0
1 1
2 2
3 3
4 4
and we want to use an expanding window where use_expanding is True and a window of size 1 otherwise, we can
create the following BaseIndexer subclass:
In [2]: from pandas.api.indexers import BaseIndexer
...:
...: class CustomIndexer(BaseIndexer):
...:
...: def get_window_bounds(self, num_values, min_periods, center, closed):
...: start = np.empty(num_values, dtype=np.int64)
...: end = np.empty(num_values, dtype=np.int64)
...: for i in range(num_values):
...: if self.use_expanding[i]:
...: start[i] = 0
...: end[i] = i + 1
...: else:
...: start[i] = i
...: end[i] = i + self.window_size
...: return start, end
...:
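The indexer passed to rolling below is presumably constructed from this class, for example (a sketch):

indexer = CustomIndexer(window_size=1, use_expanding=use_expanding)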
In [4]: df.rolling(indexer).sum()
Out[4]:
values
0 0.0
1 1.0
2 3.0
3 3.0
4 10.0
In [36]: df
Out[36]:
0
2020-01-01 0
2020-01-02 1
2020-01-03 2
2020-01-04 3
2020-01-05 4
2020-01-06 5
2020-01-07 6
2020-01-08 7
2020-01-09 8
2020-01-10 9
In [37]: df.rolling(indexer).sum()
Out[37]:
0
2020-01-01 0.0
2020-01-02 1.0
2020-01-03 2.0
2020-01-04 3.0
2020-01-05 7.0
2020-01-06 12.0
2020-01-07 6.0
2020-01-08 7.0
2020-01-09 8.0
2020-01-10 9.0
For some problems knowledge of the future is available for analysis. For example, this occurs when each data point is a
full time series read from an experiment, and the task is to extract underlying conditions. In these cases it can be useful
to perform forward-looking rolling window computations. FixedForwardWindowIndexer class is available for
this purpose. This BaseIndexer subclass implements a closed fixed-width forward-looking rolling window, and
we can use it as follows:
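A minimal sketch of a forward-looking window of size 2 on the frame above (the window size is illustrative):

from pandas.api.indexers import FixedForwardWindowIndexer

indexer = FixedForwardWindowIndexer(window_size=2)
df.rolling(indexer, min_periods=1).sum()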
Rolling apply
The apply() function takes an extra func argument and performs generic rolling computations. The func ar-
gument should be a single function that produces a single value from an ndarray input. raw specifies whether the
windows are cast as Series objects (raw=False) or ndarray objects (raw=True).
In [39]: s = pd.Series(range(10))
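For example, a rolling mean absolute deviation over the series above (a sketch; the helper name mad is illustrative):

def mad(x):
    return np.fabs(x - x.mean()).mean()

s.rolling(window=4).apply(mad, raw=True)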
Numba engine
Note: In terms of performance, the first time a function is run using the Numba engine will be slow as Numba
will have some function compilation overhead. However, the compiled functions are cached, and subsequent calls will
be fast. In general, the Numba engine is performant with a larger amount of data points (e.g. 1+ million).
cov() and corr() can compute moving window statistics about two Series or any combination of
DataFrame/Series or DataFrame/DataFrame. Here is the behavior in each case:
• two Series: compute the statistic for the pairing.
• DataFrame/Series: compute the statistics for each column of the DataFrame with the passed Series, thus
returning a DataFrame.
• DataFrame/DataFrame: by default compute the statistic for matching column names, returning a
DataFrame. If the keyword argument pairwise=True is passed then computes the statistic for each pair
of columns, returning a MultiIndexed DataFrame whose index are the dates in question (see the next
section).
For example:
In [41]: df = pd.DataFrame(
....: np.random.randn(10, 4),
....: index=pd.date_range("2020-01-01", periods=10),
....: columns=["A", "B", "C", "D"],
....: )
....:
In [42]: df = df.cumsum()
In [44]: df2.rolling(window=2).corr(df2["B"])
Out[44]:
A B C D
2020-01-01 NaN NaN NaN NaN
2020-01-02 -1.0 1.0 -1.0 1.0
2020-01-03 1.0 1.0 1.0 -1.0
2020-01-04 -1.0 1.0 1.0 -1.0
In financial data analysis and other fields it’s common to compute covariance and correlation matrices for a collection
of time series. Often one is also interested in moving-window covariance and correlation matrices. This can be done
by passing the pairwise keyword argument, which in the case of DataFrame inputs will yield a MultiIndexed
DataFrame whose index are the dates in question. In the case of a single DataFrame argument the pairwise
argument can even be omitted:
Note: Missing values are ignored and each entry is computed using the pairwise complete observations. Please see
the covariance section for caveats associated with this method of calculating covariance and correlation matrices.
In [45]: covs = (
....: df[["B", "C", "D"]]
....: .rolling(window=4)
....: .cov(df[["A", "B", "C"]], pairwise=True)
....: )
....:
In [46]: covs
The win_type argument in .rolling generates weighted windows that are commonly used in filtering and
spectral estimation. win_type must be a string that corresponds to a scipy.signal window function. SciPy must be
installed in order to use these windows, and supplementary arguments that the SciPy window methods take must be
specified in the aggregation function.
In [47]: s = pd.Series(range(10))
In [48]: s.rolling(window=5).mean()
Out[48]:
0 NaN
1 NaN
2 NaN
3 NaN
4 2.0
5 3.0
6 4.0
7 5.0
8 6.0
9 7.0
dtype: float64
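With win_type, the same window produces a weighted mean instead; for example, a triangular window (a sketch, requires SciPy):

s.rolling(window=5, win_type="triang").mean()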
An expanding window yields the value of an aggregation statistic with all the data available up to that point in time.
Since these calculations are a special case of rolling statistics, they are implemented in pandas such that the following
two calls are equivalent:
In [51]: df = pd.DataFrame(range(5))
In [53]: df.expanding(min_periods=1).mean()
Out[53]:
0
0 0.0
1 0.5
2 1.0
3 1.5
4 2.0
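The rolling-window form of the equivalence mentioned above would be (a sketch):

df.rolling(window=len(df), min_periods=1).mean()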
An exponentially weighted window is similar to an expanding window but with each prior point being exponentially
weighted down relative to the current point.
In general, a weighted moving average is calculated as
$$y_t = \frac{\sum_{i=0}^{t} w_i x_{t-i}}{\sum_{i=0}^{t} w_i},$$

where $x_t$ is the input, $y_t$ is the result and the $w_i$ are the weights.
With adjust=False, the exponentially weighted average is computed recursively as

$$y_0 = x_0, \qquad y_t = (1 - \alpha)\, y_{t-1} + \alpha x_t,$$

which is sometimes written in terms of $\alpha' = 1 - \alpha$ as $y_t = \alpha' y_{t-1} + (1 - \alpha') x_t$.
The difference between the above two variants arises because we are dealing with series which have finite history.
Consider a series of infinite history, with adjust=True:
which is the same expression as adjust=False above and therefore shows the equivalence of the two variants for
infinite series. When adjust=False, we have 𝑦0 = 𝑥0 and 𝑦𝑡 = 𝛼𝑥𝑡 + (1 − 𝛼)𝑦𝑡−1 . Therefore, there is an
assumption that 𝑥0 is not an ordinary value but rather an exponentially weighted moment of the infinite series up to
that point.
One must have 0 < 𝛼 ≤ 1, and while it is possible to pass 𝛼 directly, it’s often easier to think about either the span,
center of mass (com) or half-life of an EW moment:
$$\alpha =
\begin{cases}
\dfrac{2}{s + 1}, & \text{for span } s \geq 1,\\
\dfrac{1}{1 + c}, & \text{for center of mass } c \geq 0,\\
1 - \exp\!\left(\dfrac{\log 0.5}{h}\right), & \text{for half-life } h > 0.
\end{cases}$$
One must specify precisely one of span, center of mass, half-life and alpha to the EW functions:
• Span corresponds to what is commonly called an “N-day EW moving average”.
• Center of mass has a more physical interpretation and can be thought of in terms of span: 𝑐 = (𝑠 − 1)/2.
• Half-life is the period of time for the exponential weight to reduce to one half.
• Alpha specifies the smoothing factor directly.
New in version 1.1.0.
You can also specify halflife in terms of a timedelta convertible unit to specify the amount of time it takes for an
observation to decay to half its value when also specifying a sequence of times.
In [54]: df = pd.DataFrame({"B": [0, 1, 2, np.nan, 4]})
In [55]: df
Out[55]:
B
0 0.0
1 1.0
2 2.0
3 NaN
4 4.0
The following formula is used to compute exponentially weighted mean with an input vector of times:
$$y_t = \frac{\sum_{i=0}^{t} 0.5^{\frac{t_t - t_i}{\lambda}} x_{t-i}}{\sum_{i=0}^{t} 0.5^{\frac{t_t - t_i}{\lambda}}},$$
ExponentialMovingWindow also has an ignore_na argument, which determines how intermediate null values affect
the calculation of the weights. When ignore_na=False (the default), weights are calculated based on absolute
positions, so that intermediate null values affect the result. When ignore_na=True, weights are calculated by
ignoring intermediate null values. For example, assuming adjust=True, if ignore_na=False, the weighted
average of 3, NaN, 5 would be calculated as
$$\frac{(1 - \alpha)^2 \cdot 3 + 1 \cdot 5}{(1 - \alpha)^2 + 1}.$$
Whereas if ignore_na=True, the weighted average would be calculated as
$$\frac{(1 - \alpha) \cdot 3 + 1 \cdot 5}{(1 - \alpha) + 1}.$$
The var(), std(), and cov() functions have a bias argument, specifying whether the result should con-
tain biased or unbiased statistics. For example, if bias=True, ewmvar(x) is calculated as ewmvar(x) =
ewma(x**2) - ewma(x)**2; whereas if bias=False (the default), the biased variance statistics are scaled
by debiasing factors
$$\frac{\left(\sum_{i=0}^{t} w_i\right)^2}{\left(\sum_{i=0}^{t} w_i\right)^2 - \sum_{i=0}^{t} w_i^2}.$$

(For $w_i = 1$, this reduces to the usual $N / (N - 1)$ factor, with $N = t + 1$.) See Weighted Sample Variance on
Wikipedia for further details.
pandas contains extensive capabilities and features for working with time series data for all domains. Using the
NumPy datetime64 and timedelta64 dtypes, pandas has consolidated a large number of features from other
Python libraries like scikits.timeseries as well as created a tremendous amount of new functionality for
manipulating time series data.
For example, pandas supports:
Parsing time series information from various sources and formats
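The constructions behind the outputs below are not reproduced here; a sketch consistent with them (an assumption) is:

import datetime

dti = pd.to_datetime(["1/1/2018", np.datetime64("2018-01-01"), datetime.datetime(2018, 1, 1)])
dti = pd.date_range("2018-01-01", periods=3, freq="H")  # generate a fixed-frequency index
dti = dti.tz_localize("UTC")                            # attach a time zone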
In [3]: dti
Out[3]: DatetimeIndex(['2018-01-01', '2018-01-01', '2018-01-01'], dtype='datetime64[ns]', freq=None)
In [5]: dti
Out[5]:
DatetimeIndex(['2018-01-01 00:00:00', '2018-01-01 01:00:00',
'2018-01-01 02:00:00'],
dtype='datetime64[ns]', freq='H')
In [7]: dti
Out[7]:
DatetimeIndex(['2018-01-01 00:00:00+00:00', '2018-01-01 01:00:00+00:00',
'2018-01-01 02:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='H')
In [8]: dti.tz_convert("US/Pacific")
Out[8]:
DatetimeIndex(['2017-12-31 16:00:00-08:00', '2017-12-31 17:00:00-08:00',
'2017-12-31 18:00:00-08:00'],
dtype='datetime64[ns, US/Pacific]', freq='H')
In [11]: ts
Out[11]:
2018-01-01 00:00:00 0
In [12]: ts.resample("2H").mean()
Out[12]:
2018-01-01 00:00:00 0.5
2018-01-01 02:00:00 2.5
2018-01-01 04:00:00 4.0
Freq: 2H, dtype: float64
Performing date and time arithmetic with absolute or relative time increments
In [14]: friday.day_name()
Out[14]: 'Friday'
# Add 1 day
In [15]: saturday = friday + pd.Timedelta("1 day")
In [16]: saturday.day_name()
Out[16]: 'Saturday'
In [18]: monday.day_name()
Out[18]: 'Monday'
pandas provides a relatively compact and self-contained set of tools for performing the above tasks and more.
2.19.1 Overview
Concept | Scalar Class | Array Class | pandas Data Type | Primary Creation Method
Date times | Timestamp | DatetimeIndex | datetime64[ns] or datetime64[ns, tz] | to_datetime or date_range
Time deltas | Timedelta | TimedeltaIndex | timedelta64[ns] | to_timedelta or timedelta_range
Time spans | Period | PeriodIndex | period[freq] | Period or period_range
Date offsets | DateOffset | None | None | DateOffset
For time series data, it’s conventional to represent the time component in the index of a Series or DataFrame so
manipulations can be performed with respect to the time element.
However, Series and DataFrame can also directly contain the time component as data itself.
Series and DataFrame have extended data type support and functionality for datetime, timedelta and
Period data when passed into those constructors. DateOffset data however will be stored as object data.
Lastly, pandas represents null datetimes, timedeltas, and time spans as NaT, which is useful for representing missing
or null date-like values and behaves similarly to np.nan for float data.
In [24]: pd.Timestamp(pd.NaT)
Out[24]: NaT
In [25]: pd.Timedelta(pd.NaT)
Out[25]: NaT
In [26]: pd.Period(pd.NaT)
Out[26]: NaT
Timestamped data is the most basic type of time series data that associates values with points in time. For pandas
objects it means using the points in time.
In [28]: pd.Timestamp(datetime.datetime(2012, 5, 1))
Out[28]: Timestamp('2012-05-01 00:00:00')
In [29]: pd.Timestamp("2012-05-01")
Out[29]: Timestamp('2012-05-01 00:00:00')
In [30]: pd.Timestamp(2012, 5, 1)
Out[30]: Timestamp('2012-05-01 00:00:00')
However, in many cases it is more natural to associate things like change variables with a time span instead. The span
represented by Period can be specified explicitly, or inferred from datetime string format.
For example:
In [31]: pd.Period("2011-01")
Out[31]: Period('2011-01', 'M')
Timestamp and Period can serve as an index. Lists of Timestamp and Period are automatically coerced to
DatetimeIndex and PeriodIndex respectively.
In [33]: dates = [
....: pd.Timestamp("2012-05-01"),
....: pd.Timestamp("2012-05-02"),
....: pd.Timestamp("2012-05-03"),
....: ]
....:
In [35]: type(ts.index)
Out[35]: pandas.core.indexes.datetimes.DatetimeIndex
In [36]: ts.index
Out[36]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
In [37]: ts
Out[37]:
2012-05-01 0.469112
2012-05-02 -0.282863
2012-05-03 -1.509059
dtype: float64
In [40]: type(ts.index)
Out[40]: pandas.core.indexes.period.PeriodIndex
In [41]: ts.index
Out[41]: PeriodIndex(['2012-01', '2012-02', '2012-03'], dtype='period[M]', freq='M')
In [42]: ts
Out[42]:
2012-01 -1.135632
2012-02 1.212112
2012-03 -0.173215
Freq: M, dtype: float64
pandas allows you to capture both representations and convert between them. Under the hood, pandas represents
timestamps using instances of Timestamp and sequences of timestamps using instances of DatetimeIndex. For
regular time spans, pandas uses Period objects for scalar values and PeriodIndex for sequences of spans. Better
support for irregular intervals with arbitrary start and end points is forthcoming in future releases.
To convert a Series or list-like object of date-like objects e.g. strings, epochs, or a mixture, you can use the
to_datetime function. When passed a Series, this returns a Series (with the same index), while a list-like is
converted to a DatetimeIndex:
If you use dates which start with the day first (i.e. European style), you can pass the dayfirst flag:
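For example (a sketch):

pd.to_datetime(["04-01-2012 10:00"], dayfirst=True)          # parsed as the 4th of January
pd.to_datetime(["14-01-2012", "01-14-2012"], dayfirst=True)  # the second date cannot be parsed day-first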
Warning: You see in the above example that dayfirst isn’t strict, so if a date can’t be parsed with the day
being first it will be parsed as if dayfirst were False.
If you pass a single string to to_datetime, it returns a single Timestamp. Timestamp can also accept string
input, but it doesn’t accept string parsing options like dayfirst or format, so use to_datetime if these are
required.
In [47]: pd.to_datetime("2010/11/12")
Out[47]: Timestamp('2010-11-12 00:00:00')
In [48]: pd.Timestamp("2010/11/12")
Out[48]: Timestamp('2010-11-12 00:00:00')
The string ‘infer’ can be passed in order to set the frequency of the index as the inferred frequency upon creation:
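For example (a sketch):

pd.DatetimeIndex(["2018-01-01", "2018-01-03", "2018-01-05"], freq="infer")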
In addition to the required datetime string, a format argument can be passed to ensure specific parsing. This could
also potentially speed up the conversion considerably.
For more information on the choices available when specifying the format option, see the Python datetime docu-
mentation.
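For example (a sketch):

pd.to_datetime("2010/11/12", format="%Y/%m/%d")
pd.to_datetime("12-11-2010 00:00", format="%d-%m-%Y %H:%M")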
You can also pass a DataFrame of integer or string columns to assemble into a Series of Timestamps.
In [53]: df = pd.DataFrame(
....: {"year": [2015, 2016], "month": [2, 3], "day": [4, 5], "hour": [2, 3]}
....: )
....:
In [54]: pd.to_datetime(df)
Out[54]:
0 2015-02-04 02:00:00
1 2016-03-05 03:00:00
dtype: datetime64[ns]
You can pass only the columns that you need to assemble.
In [55]: pd.to_datetime(df[["year", "month", "day"]])
Out[55]:
0 2015-02-04
1 2016-03-05
dtype: datetime64[ns]
pd.to_datetime looks for standard designations of the datetime component in the column names, including:
• required: year, month, day
• optional: hour, minute, second, millisecond, microsecond, nanosecond
Invalid data
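The body of this subsection is elided above; handling of unparseable entries is controlled by the errors parameter of to_datetime (a sketch):

pd.to_datetime(["2009/07/31", "asd"], errors="coerce")  # unparseable entries become NaT
pd.to_datetime(["2009/07/31", "asd"], errors="ignore")  # return the input unchanged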
Epoch timestamps
pandas supports converting integer or float epoch times to Timestamp and DatetimeIndex. The default unit is
nanoseconds, since that is how Timestamp objects are stored internally. However, epochs are often stored in another
unit which can be specified. These are computed from the starting point specified by the origin parameter.
In [58]: pd.to_datetime(
....: [1349720105, 1349806505, 1349892905, 1349979305, 1350065705], unit="s"
....: )
....:
Out[58]:
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
'2012-10-10 18:15:05', '2012-10-11 18:15:05',
'2012-10-12 18:15:05'],
dtype='datetime64[ns]', freq=None)
In [59]: pd.to_datetime(
....: [1349720105100, 1349720105200, 1349720105300, 1349720105400, 1349720105500],
....: unit="ms",
....: )
....:
Out[59]:
DatetimeIndex(['2012-10-08 18:15:05.100000', '2012-10-08 18:15:05.200000',
'2012-10-08 18:15:05.300000', '2012-10-08 18:15:05.400000',
Note: The unit parameter does not use the same strings as the format parameter that was discussed above. The
available units are listed in the documentation for pandas.to_datetime().
In [60]: pd.Timestamp(1262347200000000000).tz_localize("US/Pacific")
Out[60]: Timestamp('2010-01-01 12:00:00-0800', tz='US/Pacific')
In [61]: pd.DatetimeIndex([1262347200000000000]).tz_localize("US/Pacific")
Out[61]: DatetimeIndex(['2010-01-01 12:00:00-08:00'], dtype='datetime64[ns, US/Pacific]', freq=None)
Warning: Conversion of float epoch times can lead to inaccurate and unexpected results. Python floats have
about 15 digits of precision in decimal. Rounding during conversion from float to high precision Timestamp is
unavoidable. The only way to achieve exact precision is to use fixed-width types (e.g. an int64).
In [62]: pd.to_datetime([1490195805.433, 1490195805.433502912], unit="s")
Out[62]: DatetimeIndex(['2017-03-22 15:16:45.433000088', '2017-03-22 15:16:45.433502913'], dtype='datetime64[ns]', freq=None)
See also:
Using the origin Parameter
To invert the operation from above, namely, to convert from a Timestamp to a ‘unix’ epoch:
In [65]: stamps
Out[65]:
DatetimeIndex(['2012-10-08 18:15:05', '2012-10-09 18:15:05',
'2012-10-10 18:15:05', '2012-10-11 18:15:05'],
dtype='datetime64[ns]', freq='D')
We subtract the epoch (midnight at January 1, 1970 UTC) and then floor divide by the “unit” (1 second).
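In code, that inversion looks like the following (a sketch using the stamps index shown above):

(stamps - pd.Timestamp("1970-01-01")) // pd.Timedelta("1s")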
Using the origin parameter, one can specify an alternative starting point for creation of a DatetimeIndex. For
example, to use 1960-01-01 as the starting date:
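For example (a sketch):

pd.to_datetime([1, 2, 3], unit="D", origin=pd.Timestamp("1960-01-01"))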
The default is set at origin='unix', which defaults to 1970-01-01 00:00:00. Commonly called ‘unix
epoch’ or POSIX time.
To generate an index with timestamps, you can use either the DatetimeIndex or Index constructor and pass in a
list of datetime objects:
In [69]: dates = [
....: datetime.datetime(2012, 5, 1),
....: datetime.datetime(2012, 5, 2),
....: datetime.datetime(2012, 5, 3),
....: ]
....:
In [71]: index
Out[71]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
In [73]: index
Out[73]: DatetimeIndex(['2012-05-01', '2012-05-02', '2012-05-03'], dtype='datetime64[ns]', freq=None)
In practice this becomes very cumbersome because we often need a very long index with a large number of timestamps.
If we need timestamps on a regular frequency, we can use the date_range() and bdate_range() functions
to create a DatetimeIndex. The default frequency for date_range is a calendar day while the default for
bdate_range is a business day:
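The constructions that produce the two indexes below are not reproduced above; a sketch consistent with them (the start and end dates are assumptions) is:

start = datetime.datetime(2011, 1, 1)
end = datetime.datetime(2012, 1, 1)
index = pd.date_range(start, end)    # calendar-day frequency
index = pd.bdate_range(start, end)   # business-day frequency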
In [77]: index
Out[77]:
DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-04',
'2011-01-05', '2011-01-06', '2011-01-07', '2011-01-08',
'2011-01-09', '2011-01-10',
...
'2011-12-23', '2011-12-24', '2011-12-25', '2011-12-26',
'2011-12-27', '2011-12-28', '2011-12-29', '2011-12-30',
'2011-12-31', '2012-01-01'],
dtype='datetime64[ns]', length=366, freq='D')
In [79]: index
Out[79]:
DatetimeIndex(['2011-01-03', '2011-01-04', '2011-01-05', '2011-01-06',
'2011-01-07', '2011-01-10', '2011-01-11', '2011-01-12',
'2011-01-13', '2011-01-14',
...
'2011-12-19', '2011-12-20', '2011-12-21', '2011-12-22',
'2011-12-23', '2011-12-26', '2011-12-27', '2011-12-28',
'2011-12-29', '2011-12-30'],
dtype='datetime64[ns]', length=260, freq='B')
Convenience functions like date_range and bdate_range can utilize a variety of frequency aliases:
date_range and bdate_range make it easy to generate a range of dates using various combinations of parame-
ters like start, end, periods, and freq. The start and end dates are strictly inclusive, so dates outside of those
specified will not be generated:
Specifying start, end, and periods will generate a range of evenly spaced dates from start to end inclusively,
with periods number of elements in the resulting DatetimeIndex:
bdate_range can also generate a range of custom frequency dates by using the weekmask and holidays pa-
rameters. These parameters will only be used if a custom frequency string is passed.
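For example (a sketch; the weekmask and holidays are illustrative):

weekmask = "Mon Wed Fri"
holidays = [datetime.datetime(2011, 1, 5), datetime.datetime(2011, 3, 14)]
pd.bdate_range(start, end, freq="C", weekmask=weekmask, holidays=holidays)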
See also:
Custom business days
Since pandas represents timestamps in nanosecond resolution, the time span that can be represented using a 64-bit
integer is limited to approximately 584 years:
In [92]: pd.Timestamp.min
Out[92]: Timestamp('1677-09-21 00:12:43.145225')
In [93]: pd.Timestamp.max
Out[93]: Timestamp('2262-04-11 23:47:16.854775807')
See also:
Representing out-of-bounds spans
2.19.6 Indexing
One of the main uses for DatetimeIndex is as an index for pandas objects. The DatetimeIndex class contains
many time series related optimizations:
• A large range of dates for various offsets are pre-computed and cached under the hood in order to make gener-
ating subsequent date ranges very fast (just have to grab a slice).
• Fast shifting using the shift method on pandas objects.
• Unioning of overlapping DatetimeIndex objects with the same frequency is very fast (important for fast
data alignment).
• Quick access to date fields via properties such as year, month, etc.
• Regularization functions like snap and very fast asof logic.
DatetimeIndex objects have all the basic functionality of regular Index objects, and a smorgasbord of advanced
time series specific methods for easy frequency processing.
See also:
Reindexing methods
Note: While pandas does not force you to have a sorted date index, some of these methods may have unexpected or
incorrect behavior if the dates are unsorted.
DatetimeIndex can be used like a regular index and offers all of its intelligent functionality like selection, slicing,
etc.
In [96]: ts.index
Out[96]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31', '2011-06-30', '2011-07-29', '2011-08-31',
'2011-09-30', '2011-10-31', '2011-11-30', '2011-12-30'],
dtype='datetime64[ns]', freq='BM')
In [97]: ts[:5].index
Out[97]:
DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31', '2011-04-29',
'2011-05-31'],
dtype='datetime64[ns]', freq='BM')
In [98]: ts[::2].index
Out[98]:
DatetimeIndex(['2011-01-31', '2011-03-31', '2011-05-31', '2011-07-29',
'2011-09-30', '2011-11-30'],
dtype='datetime64[ns]', freq='2BM')
Dates and strings that parse to timestamps can be passed as indexing parameters:
In [99]: ts["1/31/2011"]
Out[99]: 0.11920871129693428
In [101]: ts["10/31/2011":"12/31/2011"]
Out[101]:
2011-10-31 0.271860
2011-11-30 -0.424972
2011-12-30 0.567020
Freq: BM, dtype: float64
To provide convenience for accessing longer time series, you can also pass in the year or year and month as strings:
In [102]: ts["2011"]
Out[102]:
2011-01-31 0.119209
2011-02-28 -1.044236
2011-03-31 -0.861849
2011-04-29 -2.104569
2011-05-31 -0.494929
2011-06-30 1.071804
2011-07-29 0.721555
2011-08-31 -0.706771
2011-09-30 -1.039575
2011-10-31 0.271860
2011-11-30 -0.424972
2011-12-30 0.567020
Freq: BM, dtype: float64
In [103]: ts["2011-6"]
Out[103]:
2011-06-30 1.071804
Freq: BM, dtype: float64
This type of slicing will work on a DataFrame with a DatetimeIndex as well. Since the partial string selection
is a form of label slicing, the endpoints will be included. This would include matching times on an included date:
Warning: Indexing DataFrame rows with a single string with getitem (e.g. frame[dtstring]) is depre-
cated starting with pandas 1.2.0 (given the ambiguity whether it is indexing the rows or selecting a column) and will
be removed in a future version. The equivalent with .loc (e.g. frame.loc[dtstring]) is still supported.
In [105]: dft
Out[105]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043
In [106]: dft.loc["2013"]
Out[106]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-03-11 10:35:00 -0.747967
2013-03-11 10:36:00 -0.034523
2013-03-11 10:37:00 -0.201754
2013-03-11 10:38:00 -1.509067
2013-03-11 10:39:00 -1.693043
This starts on the very first time in the month, and includes the last date and time for the month:
In [107]: dft["2013-1":"2013-2"]
Out[107]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-28 23:55:00 0.850929
2013-02-28 23:56:00 0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517
This specifies a stop time that includes all of the times on the last day:
In [108]: dft["2013-1":"2013-2-28"]
Out[108]:
A
2013-01-01 00:00:00 0.276232
2013-01-01 00:01:00 -1.087401
2013-01-01 00:02:00 -0.673690
2013-01-01 00:03:00 0.113648
2013-01-01 00:04:00 -1.478427
... ...
2013-02-28 23:55:00 0.850929
2013-02-28 23:56:00 0.976712
2013-02-28 23:57:00 -2.693884
2013-02-28 23:58:00 -1.575535
2013-02-28 23:59:00 -1.573517
This specifies an exact stop time (and is not the same as the above):
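For example (a sketch):

dft["2013-1":"2013-2-28 00:00:00"]  # includes only times up to and including midnight on Feb 28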
In [112]: dft2
Out[112]:
A
2013-01-01 00:00:00 a -0.298694
b 0.823553
2013-01-01 12:00:00 a 0.943285
b -1.479399
2013-01-02 00:00:00 a -1.643342
... ...
2013-01-04 12:00:00 b 0.069036
2013-01-05 00:00:00 a 0.122297
b 1.422060
2013-01-05 12:00:00 a 0.370079
b 1.016331
In [113]: dft2.loc["2013-01-05"]
Out[113]:
A
2013-01-05 00:00:00 a 0.122297
b 1.422060
2013-01-05 12:00:00 a 0.370079
b 1.016331
In [118]: df
Out[118]:
0
2019-01-01 00:00:00-08:00 0
The same string used as an indexing parameter can be treated either as a slice or as an exact match depending on the
resolution of the index. If the string is less accurate than the index, it will be treated as a slice, otherwise as an exact
match.
Consider a Series object with a minute resolution index:
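The construction of that Series is elided above; a sketch consistent with the resolution shown below (an assumption) is:

series_minute = pd.Series(
    [1, 2, 3],
    pd.DatetimeIndex(
        ["2011-12-31 23:59:00", "2012-01-01 00:00:00", "2012-01-01 00:02:00"]
    ),
)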
In [121]: series_minute.index.resolution
Out[121]: 'minute'
A timestamp string with minute resolution (or more accurate) gives a scalar instead, i.e. it is not cast to a slice.
In [126]: series_second.index.resolution
Out[126]: 'second'
If the timestamp string is treated as a slice, it can be used to index DataFrame with .loc[] as well.
Warning: However, if the string is treated as an exact match, the selection in DataFrame’s [] will be column-
wise and not row-wise, see Indexing Basics. For example dft_minute['2011-12-31 23:59'] will raise
KeyError as '2011-12-31 23:59' has the same resolution as the index and there is no column with such a
name:
To always have unambiguous selection, whether the row is treated as a slice or a single selection, use .loc.
In [130]: dft_minute.loc["2011-12-31 23:59"]
Out[130]:
a 1
b 4
Name: 2011-12-31 23:59:00, dtype: int64
Note also that DatetimeIndex resolution cannot be less precise than day.
In [132]: series_monthly.index.resolution
Out[132]: 'day'
Exact indexing
As discussed in previous section, indexing a DatetimeIndex with a partial string depends on the “accuracy” of the
period, in other words how specific the interval is in relation to the resolution of the index. In contrast, indexing with
Timestamp or datetime objects is exact, because the objects have exact meaning. These also follow the semantics
of including both endpoints.
These Timestamp and datetime objects have exact hours, minutes, and seconds, even though they were
not explicitly specified (they are 0).
With no defaults.
In [135]: dft[
.....: datetime.datetime(2013, 1, 1, 10, 12, 0): datetime.datetime(
.....: 2013, 2, 28, 10, 12, 0
.....: )
.....: ]
.....:
Out[135]:
A
2013-01-01 10:12:00 0.565375
2013-01-01 10:13:00 0.068184
2013-01-01 10:14:00 0.788871
2013-01-01 10:15:00 -0.280343
2013-01-01 10:16:00 0.931536
... ...
2013-02-28 10:08:00 0.148098
2013-02-28 10:09:00 -0.388138
2013-02-28 10:10:00 0.139348
2013-02-28 10:11:00 0.085288
2013-02-28 10:12:00 0.950146
A truncate() convenience function is provided that is similar to slicing. Note that truncate assumes a 0 value
for any unspecified date component in a DatetimeIndex in contrast to slicing which returns any partially matching
dates:
In [139]: ts2["2011-11":"2011-12"]
Out[139]:
2011-11-06 0.437823
2011-11-13 -0.293083
2011-11-20 -0.059881
2011-11-27 1.252450
2011-12-04 0.046611
2011-12-11 0.059478
2011-12-18 -0.286539
2011-12-25 0.841669
Freq: W-SUN, dtype: float64
Even complicated fancy indexing that breaks the DatetimeIndex frequency regularity will result in a
DatetimeIndex, although frequency is lost:
There are several time/date properties that one can access from Timestamp or a collection of timestamps like a
DatetimeIndex.
Property Description
year The year of the datetime
month The month of the datetime
day The days of the datetime
hour The hour of the datetime
minute The minutes of the datetime
second The seconds of the datetime
microsecond The microseconds of the datetime
nanosecond The nanoseconds of the datetime
date Returns datetime.date (does not contain timezone information)
time Returns datetime.time (does not contain timezone information)
timetz Returns datetime.time as local time with timezone information
dayofyear The ordinal day of year
day_of_year The ordinal day of year
weekofyear The week ordinal of the year
week The week ordinal of the year
dayofweek The number of the day of the week with Monday=0, Sunday=6
day_of_week The number of the day of the week with Monday=0, Sunday=6
weekday The number of the day of the week with Monday=0, Sunday=6
quarter Quarter of the date: Jan-Mar = 1, Apr-Jun = 2, etc.
days_in_month The number of days in the month of the datetime
is_month_start Logical indicating if first day of month (defined by frequency)
is_month_end Logical indicating if last day of month (defined by frequency)
is_quarter_start Logical indicating if first day of quarter (defined by frequency)
is_quarter_end Logical indicating if last day of quarter (defined by frequency)
is_year_start Logical indicating if first day of year (defined by frequency)
is_year_end Logical indicating if last day of year (defined by frequency)
is_leap_year Logical indicating if the date belongs to a leap year
Furthermore, if you have a Series with datetimelike values, then you can access these properties via the .dt
accessor, as detailed in the section on .dt accessors.
New in version 1.1.0.
You may obtain the year, week and day components of the ISO year from the ISO 8601 standard:
In [142]: idx.isocalendar()
Out[142]:
year week day
2019-12-29 2019 52 7
2019-12-30 2020 1 1
2019-12-31 2020 1 2
2020-01-01 2020 1 3
In [143]: idx.to_series().dt.isocalendar()
Out[143]:
year week day
2019-12-29 2019 52 7
2019-12-30 2020 1 1
2019-12-31 2020 1 2
2020-01-01 2020 1 3
In the preceding examples, frequency strings (e.g. 'D') were used to specify a frequency that defined:
• how the date times in DatetimeIndex were spaced when using date_range()
• the frequency of a Period or PeriodIndex
These frequency strings map to a DateOffset object and its subclasses. A DateOffset is similar to a
Timedelta that represents a duration of time but follows specific calendar duration rules. For example, a
Timedelta day will always increment datetimes by 24 hours, while a DateOffset day will increment
datetimes to the same time the next day whether a day represents 23, 24 or 25 hours due to daylight savings
time. However, all DateOffset subclasses that are an hour or smaller (Hour, Minute, Second, Milli, Micro,
Nano) behave like Timedelta and respect absolute time.
The basic DateOffset acts similarly to dateutil.relativedelta (relativedelta documentation) in that it shifts a
date time by the corresponding calendar duration specified. The arithmetic operator (+) or the apply method can be
used to perform the shift.
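The timestamp and offset used below are presumably constructed along these lines (a sketch):

friday = pd.Timestamp("2018-01-05")
two_business_days = 2 * pd.offsets.BDay()
friday + two_business_days  # same shift as two_business_days.apply(friday)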
In [148]: friday.day_name()
Out[148]: 'Friday'
In [150]: two_business_days.apply(friday)
Out[150]: Timestamp('2018-01-09 00:00:00')
Most DateOffsets have associated frequencies strings, or offset aliases, that can be passed into freq keyword
arguments. The available date offsets and associated frequency strings can be found below:
DateOffsets additionally have rollforward() and rollback() methods for moving a date forward or back-
ward respectively to a valid offset date relative to the offset. For example, business offsets will roll dates that land on
the weekends (Saturday and Sunday) forward to Monday since business offsets operate on the weekdays.
In [153]: ts = pd.Timestamp("2018-01-06 00:00:00")
In [154]: ts.day_name()
Out[154]: 'Saturday'
# Date is brought to the closest offset date first and then the hour is added
In [157]: ts + offset
Out[157]: Timestamp('2018-01-08 10:00:00')
These operations preserve time (hour, minute, etc) information by default. To reset time to midnight, use
normalize() before or after applying the operation (depending on whether you want the time information included
in the operation).
In [158]: ts = pd.Timestamp("2014-01-01 09:00")
In [160]: day.apply(ts)
Out[160]: Timestamp('2014-01-02 09:00:00')
In [161]: day.apply(ts).normalize()
Out[161]: Timestamp('2014-01-02 00:00:00')
In [164]: hour.apply(ts)
Out[164]: Timestamp('2014-01-01 23:00:00')
In [165]: hour.apply(ts).normalize()
Out[165]: Timestamp('2014-01-01 00:00:00')
Parametric offsets
Some of the offsets can be “parameterized” when created to result in different behaviors. For example, the Week
offset for generating weekly data accepts a weekday parameter which results in the generated dates always lying on
a particular day of the week:
In [167]: d = datetime.datetime(2008, 8, 18, 9, 0)
In [168]: d
Out[168]: datetime.datetime(2008, 8, 18, 9, 0)
In [169]: d + pd.offsets.Week()
Out[169]: Timestamp('2008-08-25 09:00:00')
In [170]: d + pd.offsets.Week(weekday=4)
Out[170]: Timestamp('2008-08-22 09:00:00')
In [172]: d - pd.offsets.Week()
Out[172]: Timestamp('2008-08-11 09:00:00')
In [173]: d + pd.offsets.Week(normalize=True)
Out[173]: Timestamp('2008-08-25 00:00:00')
In [174]: d - pd.offsets.Week(normalize=True)
Out[174]: Timestamp('2008-08-11 00:00:00')
In [175]: d + pd.offsets.YearEnd()
Out[175]: Timestamp('2008-12-31 09:00:00')
In [176]: d + pd.offsets.YearEnd(month=6)
Out[176]: Timestamp('2009-06-30 09:00:00')
Offsets can be used with either a Series or DatetimeIndex to apply the offset to each element.
In [178]: s = pd.Series(rng)
In [179]: rng
Out[179]: DatetimeIndex(['2012-01-01', '2012-01-02', '2012-01-03'], dtype='datetime64[ns]', freq='D')
In [181]: s + pd.DateOffset(months=2)
Out[181]:
0 2012-03-01
1 2012-03-02
2 2012-03-03
dtype: datetime64[ns]
In [182]: s - pd.DateOffset(months=2)
Out[182]:
0 2011-11-01
1 2011-11-02
2 2011-11-03
dtype: datetime64[ns]
If the offset class maps directly to a Timedelta (Day, Hour, Minute, Second, Micro, Milli, Nano) it can be
used exactly like a Timedelta - see the Timedelta section for more examples.
In [183]: s - pd.offsets.Day(2)
Out[183]:
0 2011-12-30
1 2011-12-31
2 2012-01-01
dtype: datetime64[ns]
In [185]: td
Out[185]:
0 3 days
1 3 days
2 3 days
dtype: timedelta64[ns]
In [186]: td + pd.offsets.Minute(15)
Out[186]:
0 3 days 00:15:00
1 3 days 00:15:00
2 3 days 00:15:00
dtype: timedelta64[ns]
Note that some offsets (such as BQuarterEnd) do not have a vectorized implementation. They can still be used but
may calculate significantly slower and will show a PerformanceWarning.
The CDay or CustomBusinessDay class provides a parametric BusinessDay class which can be used to create
customized business day calendars which account for local holidays and local weekend conventions.
As an interesting example, let’s look at Egypt where a Friday-Saturday weekend is observed.
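A sketch of how such a calendar is typically built (the specific holidays are illustrative); the day names below presumably come from a date range generated with this offset:

weekmask_egypt = "Sun Mon Tue Wed Thu"
holidays = ["2012-05-01", datetime.datetime(2013, 5, 1), np.datetime64("2014-05-01")]
bday_egypt = pd.offsets.CustomBusinessDay(holidays=holidays, weekmask=weekmask_egypt)
dt = datetime.datetime(2013, 4, 30)
dt + 2 * bday_egypt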
Out[194]:
2013-04-30 Tue
2013-05-02 Thu
2013-05-05 Sun
2013-05-06 Mon
2013-05-07 Tue
Freq: C, dtype: object
Holiday calendars can be used to provide the list of holidays. See the holiday calendar section for more information.
Monthly offsets that respect a certain holiday calendar can be defined in the usual way.
In [201]: dt + bmth_us
Out[201]: Timestamp('2014-01-02 00:00:00')
Note: The frequency string ‘C’ is used to indicate that a CustomBusinessDay DateOffset is used. It is important to
note that since CustomBusinessDay is a parameterised type, instances of CustomBusinessDay may differ and this is
not detectable from the ‘C’ frequency string. The user therefore needs to ensure that the ‘C’ frequency string is used
consistently within the user’s application.
Business hour
The BusinessHour class provides a business hour representation on BusinessDay, allowing you to use specific start
and end times.
By default, BusinessHour uses 9:00 - 17:00 as business hours. Adding BusinessHour increments a Timestamp
by hourly frequency. If the target Timestamp is outside business hours, it is first moved to the next business hour and
then incremented. If the result exceeds the business hours end, the remaining hours are added to the next business day.
In [203]: bh = pd.offsets.BusinessHour()
In [204]: bh
Out[204]: <BusinessHour: BH=09:00-17:00>
# 2014-08-01 is Friday
In [205]: pd.Timestamp("2014-08-01 10:00").weekday()
Out[205]: 4
# If the result is on the end time, move to the next business day
In [208]: pd.Timestamp("2014-08-01 16:00") + bh
Out[208]: Timestamp('2014-08-04 09:00:00')
You can also specify start and end time by keywords. The argument must be a str with an hour:minute
representation or a datetime.time instance. Specifying seconds, microseconds and nanoseconds as business hour
results in ValueError.
In [212]: bh = pd.offsets.BusinessHour(start="11:00", end=datetime.time(20, 0))
In [213]: bh
Out[213]: <BusinessHour: BH=11:00-20:00>
Passing a start time later than the end time represents a midnight business hour. In this case, the business hour exceeds
midnight and overlaps to the next day. Valid business hours are distinguished by whether they started from a valid BusinessDay.
In [218]: bh
Out[218]: <BusinessHour: BH=17:00-09:00>
Applying BusinessHour.rollforward and rollback to out of business hours results in the next business
hour start or previous day’s end. Different from other offsets, BusinessHour.rollforward may output different
results from apply by definition.
This is because one day’s business hour end is equal to next day’s business hour start. For example, under the default
business hours (9:00 - 17:00), there is no gap (0 minutes) between 2014-08-01 17:00 and 2014-08-04 09:
00.
BusinessHour regards Saturday and Sunday as holidays. To use arbitrary holidays, you can use
CustomBusinessHour offset, as explained in the following subsection.
In [231]: dt + bhour_us
Out[231]: Timestamp('2014-01-17 16:00:00')
You can use keyword arguments supported by both BusinessHour and CustomBusinessDay.
# Monday is skipped because it's a holiday, business hour starts from 10:00
In [234]: dt + bhour_mon * 2
Out[234]: Timestamp('2014-01-21 10:00:00')
Offset aliases
A number of string aliases are given to useful common time series frequencies. We will refer to these aliases as offset
aliases.
Alias Description
B business day frequency
C custom business day frequency
D calendar day frequency
W weekly frequency
M month end frequency
SM semi-month end frequency (15th and end of month)
BM business month end frequency
CBM custom business month end frequency
MS month start frequency
SMS semi-month start frequency (1st and 15th)
BMS business month start frequency
CBMS custom business month start frequency
Q quarter end frequency
BQ business quarter end frequency
QS quarter start frequency
BQS business quarter start frequency
A, Y year end frequency
BA, BY business year end frequency
AS, YS year start frequency
BAS, BYS business year start frequency
BH business hour frequency
H hourly frequency
T, min minutely frequency
S secondly frequency
L, ms milliseconds
U, us microseconds
N nanoseconds
Combining aliases
As we have seen previously, the alias and the offset instance are fungible in most functions:
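For example (a minimal sketch; the start date is an assumption):
start = datetime.datetime(2011, 1, 1)

pd.date_range(start, periods=5, freq="B")
pd.date_range(start, periods=5, freq=pd.offsets.BDay())   # equivalent to the alias above

# day and intraday offsets can also be combined into a single alias
pd.date_range(start, periods=10, freq="2h20min")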
Anchored offsets
Alias Description
W-SUN weekly frequency (Sundays). Same as ‘W’
W-MON weekly frequency (Mondays)
W-TUE weekly frequency (Tuesdays)
W-WED weekly frequency (Wednesdays)
W-THU weekly frequency (Thursdays)
W-FRI weekly frequency (Fridays)
W-SAT weekly frequency (Saturdays)
(B)Q(S)-DEC quarterly frequency, year ends in December. Same as 'Q'
(B)Q(S)-JAN quarterly frequency, year ends in January
(B)Q(S)-FEB quarterly frequency, year ends in February
(B)Q(S)-MAR quarterly frequency, year ends in March
(B)Q(S)-APR quarterly frequency, year ends in April
(B)Q(S)-MAY quarterly frequency, year ends in May
(B)Q(S)-JUN quarterly frequency, year ends in June
(B)Q(S)-JUL quarterly frequency, year ends in July
(B)Q(S)-AUG quarterly frequency, year ends in August
(B)Q(S)-SEP quarterly frequency, year ends in September
(B)Q(S)-OCT quarterly frequency, year ends in October
(B)Q(S)-NOV quarterly frequency, year ends in November
(B)A(S)-DEC annual frequency, anchored end of December. Same as 'A'
These can be used as arguments to date_range, bdate_range, constructors for DatetimeIndex, as well as
various other timeseries-related functions in pandas.
For those offsets that are anchored to the start or end of a specific frequency (MonthEnd, MonthBegin, WeekEnd,
etc), the following rules apply to rolling forwards and backwards.
When n is not 0, if the given date is not on an anchor point, it is snapped to the next (previous) anchor point, and then moved
|n|-1 additional steps forwards or backwards.
If the given date is on an anchor point, it is moved |n| points forwards or backwards.
For the case when n=0, the date is not moved if it is on an anchor point, otherwise it is rolled forward to the next anchor
point.
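A short sketch illustrating these rules with MonthEnd (the results noted in the comments follow from the rules above):
d = pd.Timestamp("2014-01-02")       # not on an anchor point
d + pd.offsets.MonthEnd(n=1)         # snapped to 2014-01-31, then |n|-1 = 0 extra steps
d + pd.offsets.MonthEnd(n=0)         # rolled forward to 2014-01-31

d = pd.Timestamp("2014-01-31")       # already on an anchor point
d + pd.offsets.MonthEnd(n=1)         # moved one point forward, to 2014-02-28
d + pd.offsets.MonthEnd(n=0)         # not moved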
Holidays and calendars provide a simple way to define holiday rules to be used with CustomBusinessDay or
in other analysis that requires a predefined set of holidays. The AbstractHolidayCalendar class provides all
the necessary methods to return a list of holidays and only rules need to be defined in a specific holiday calendar
class. Furthermore, the start_date and end_date class attributes determine over what date range holidays are
generated. These should be overwritten on the AbstractHolidayCalendar class to have the range apply to all
calendar subclasses. USFederalHolidayCalendar is the only calendar that exists and primarily serves as an
example for developing other calendars.
For holidays that occur on fixed dates (e.g., US Memorial Day or July 4th) an observance rule determines when that
holiday is observed if it falls on a weekend or some other non-observed day. Defined observance rules are:
Rule Description
nearest_workday move Saturday to Friday and Sunday to Monday
sunday_to_monday move Sunday to following Monday
next_monday_or_tuesday move Saturday to Monday and Sunday/Monday to Tuesday
previous_friday move Saturday and Sunday to previous Friday
next_monday move Saturday and Sunday to following Monday
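The calendar cal used in the examples below is not defined in this excerpt. Based on the rules and holidays shown later (Memorial Day, July 4th and Columbus Day), it might look roughly like the following sketch; the class name ExampleCalendar is an assumption:
import pandas as pd
from pandas.tseries.holiday import (
    AbstractHolidayCalendar, Holiday, USMemorialDay, nearest_workday, MO,
)

class ExampleCalendar(AbstractHolidayCalendar):
    rules = [
        USMemorialDay,
        Holiday("July 4th", month=7, day=4, observance=nearest_workday),
        Holiday("Columbus Day", month=10, day=1, offset=pd.DateOffset(weekday=MO(2))),
    ]

cal = ExampleCalendar()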
In [259]: pd.date_range(
.....: start="7/1/2012", end="7/10/2012", freq=pd.offsets.CDay(calendar=cal)
.....: ).to_pydatetime()
.....:
Out[259]:
array([datetime.datetime(2012, 7, 2, 0, 0),
datetime.datetime(2012, 7, 3, 0, 0),
datetime.datetime(2012, 7, 5, 0, 0),
datetime.datetime(2012, 7, 6, 0, 0),
datetime.datetime(2012, 7, 9, 0, 0),
datetime.datetime(2012, 7, 10, 0, 0)], dtype=object)
Ranges are defined by the start_date and end_date class attributes of AbstractHolidayCalendar. The
defaults are shown below.
In [265]: AbstractHolidayCalendar.start_date
Out[265]: Timestamp('1970-01-01 00:00:00')
In [266]: AbstractHolidayCalendar.end_date
Out[266]: Timestamp('2200-12-31 00:00:00')
In [269]: cal.holidays()
Out[269]: DatetimeIndex(['2012-05-28', '2012-07-04', '2012-10-08'], dtype=
˓→'datetime64[ns]', freq=None)
Every calendar class is accessible by name using the get_calendar function which returns a holiday class instance.
Any imported calendar class will automatically be available by this function. Also, HolidayCalendarFactory
provides an easy interface to create calendars that are combinations of calendars or calendars with additional rules.
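A sketch of how the cal and new_cal objects below could be obtained (the calendar names follow the assumed ExampleCalendar class above):
from pandas.tseries.holiday import get_calendar, HolidayCalendarFactory, USLaborDay

cal = get_calendar("ExampleCalendar")   # look up a calendar class instance by name
new_cal = HolidayCalendarFactory("NewExampleCalendar", cal, USLaborDay)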
In [272]: cal.rules
Out[272]:
[Holiday: Memorial Day (month=5, day=31, offset=<DateOffset: weekday=MO(-1)>),
Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at
˓→0x7fa919c66f70>),
In [274]: new_cal.rules
Out[274]:
[Holiday: Labor Day (month=9, day=1, offset=<DateOffset: weekday=MO(+1)>),
Holiday: Memorial Day (month=5, day=31, offset=<DateOffset: weekday=MO(-1)>),
Holiday: July 4th (month=7, day=4, observance=<function nearest_workday at
˓→0x7fa919c66f70>),
Shifting / lagging
One may want to shift or lag the values in a time series back and forward in time. The method for this is shift(),
which is available on all of the pandas objects.
In [276]: ts = ts[:5]
In [277]: ts.shift(1)
Out[277]:
2012-01-01 NaN
2012-01-02 0.0
2012-01-03 1.0
Freq: D, dtype: float64
The shift method accepts a freq argument which can accept a DateOffset class or other timedelta-like
object or also an offset alias.
When freq is specified, shift method changes all the dates in the index rather than changing the alignment of the
data and the index:
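A minimal sketch (the series ts here is a stand-in constructed for illustration):
ts = pd.Series(range(5), index=pd.date_range("2012-01-01", periods=5, freq="D"))

ts.shift(2)             # realigns the data against the same index, introducing NaN
ts.shift(2, freq="D")   # shifts the index itself forward by two days instead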
Note that when freq is specified, the leading entry is no longer NaN because the data is not being realigned.
Frequency conversion
The primary function for changing frequencies is the asfreq() method. For a DatetimeIndex, this is basically
just a thin, but convenient wrapper around reindex() which generates a date_range and calls reindex.
In [283]: ts
Out[283]:
2010-01-01 1.494522
2010-01-06 -0.778425
2010-01-11 -0.253355
Freq: 3B, dtype: float64
In [284]: ts.asfreq(pd.offsets.BDay())
Out[284]:
2010-01-01 1.494522
2010-01-04 NaN
2010-01-05 NaN
2010-01-06 -0.778425
2010-01-07 NaN
2010-01-08 NaN
2010-01-11 -0.253355
Freq: B, dtype: float64
asfreq provides a further convenience so you can specify an interpolation method for any gaps that may appear after
the frequency conversion.
Related to asfreq and reindex is fillna(), which is documented in the missing data section.
DatetimeIndex can be converted to an array of Python native datetime.datetime objects using the
to_pydatetime method.
2.19.10 Resampling
pandas has a simple, powerful, and efficient functionality for performing resampling operations during frequency
conversion (e.g., converting secondly data into 5-minutely data). This is extremely common in, but not limited to,
financial applications.
resample() is a time-based groupby, followed by a reduction method on each of its groups. See some cookbook
examples for some advanced strategies.
The resample() method can be used directly from DataFrameGroupBy objects, see the groupby docs.
Basics
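The series ts used in the resampling examples below is not defined in this excerpt; a plausible setup, with one random value per second as the outputs suggest, is:
rng = pd.date_range("1/1/2012", periods=100, freq="S")
ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)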
In [288]: ts.resample("5Min").sum()
Out[288]:
2012-01-01 25103
Freq: 5T, dtype: int64
The resample function is very flexible and allows you to specify many different parameters to control the frequency
conversion and resampling operation.
Any function available via dispatching is available as a method of the returned object, including sum, mean, std,
sem, max, min, median, first, last, ohlc:
In [289]: ts.resample("5Min").mean()
Out[289]:
2012-01-01 251.03
Freq: 5T, dtype: float64
In [290]: ts.resample("5Min").ohlc()
Out[290]:
open high low close
2012-01-01 308 460 9 205
In [291]: ts.resample("5Min").max()
Out[291]:
2012-01-01 460
Freq: 5T, dtype: int64
For downsampling, closed can be set to ‘left’ or ‘right’ to specify which end of the interval is closed:
In [292]: ts.resample("5Min", closed="right").mean()
Out[292]:
2011-12-31 23:55:00 308.000000
2012-01-01 00:00:00 250.454545
Freq: 5T, dtype: float64
Parameters like label are used to manipulate the resulting labels. label specifies whether the result is labeled with
the beginning or the end of the interval.
Warning: The default values for label and closed are 'left' for all frequency offsets except for 'M', 'A', 'Q',
'BM', 'BA', 'BQ', and 'W', which all have a default of 'right'.
This might unintentionally lead to looking ahead, where the value for a later time is pulled back to a previous time,
as in the following example with the BusinessDay frequency:
In [296]: s = pd.date_range("2000-01-01", "2000-01-05").to_series()
In [298]: s.dt.day_name()
Out[298]:
2000-01-01 Saturday
2000-01-02 Sunday
2000-01-03 NaN
2000-01-04 Tuesday
2000-01-05 Wednesday
Freq: D, dtype: object
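The intermediate steps are not shown in this excerpt; a sketch that reproduces the behavior being described (the NaT assignment is an assumption consistent with the NaN shown above):
s.iloc[2] = pd.NaT                     # make the Monday value missing

s.resample("B").last().dt.day_name()   # default label/closed for 'B' pulls Sunday back to Friday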
Notice how the value for Sunday got pulled back to the previous Friday. To get the behavior where the value for
Sunday is pushed to Monday, use instead
In [300]: s.resample("B", label="right", closed="right").last().dt.day_name()
Out[300]:
2000-01-03 Sunday
2000-01-04 Tuesday
2000-01-05 Wednesday
Freq: B, dtype: object
The axis parameter can be set to 0 or 1 and allows you to resample the specified axis for a DataFrame.
kind can be set to 'timestamp' or 'period' to convert the resulting index to/from timestamp and time span representations.
By default resample retains the input representation.
convention can be set to ‘start’ or ‘end’ when resampling period data (detail below). It specifies how low frequency
periods are converted to higher frequency periods.
Upsampling
For upsampling, you can specify a way to upsample and the limit parameter to interpolate over the gaps that are
created:
In [302]: ts[:2].resample("250L").ffill()
Out[302]:
2012-01-01 00:00:00.000 308
2012-01-01 00:00:00.250 308
2012-01-01 00:00:00.500 308
2012-01-01 00:00:00.750 308
2012-01-01 00:00:01.000 204
Freq: 250L, dtype: int64
In [303]: ts[:2].resample("250L").ffill(limit=2)
Out[303]:
2012-01-01 00:00:00.000 308.0
2012-01-01 00:00:00.250 308.0
2012-01-01 00:00:00.500 308.0
2012-01-01 00:00:00.750 NaN
2012-01-01 00:00:01.000 204.0
Freq: 250L, dtype: float64
Sparse resampling
Sparse time series are those where you have far fewer points relative to the amount of time you are looking to
resample. Naively upsampling a sparse series can potentially generate lots of intermediate values. If you don't
want to use a method to fill these values (e.g. fill_method is None), then intermediate values will be filled with
NaN.
Since resample is a time-based groupby, the following is a method to efficiently resample only the groups that are
not all NaN.
In [306]: ts.resample("3T").sum()
Out[306]:
2014-01-01 00:00:00 0
2014-01-01 00:03:00 0
2014-01-01 00:06:00 0
2014-01-01 00:09:00 0
2014-01-01 00:12:00 0
We can instead only resample those groups where we have points as follows:
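The idea, sketched below, is to group by a rounded timestamp instead of resampling the full range (the rounding helper is an illustration of the approach rather than a fixed API):
from functools import partial
from pandas.tseries.frequencies import to_offset

def round(t, freq):
    # round a Timestamp down to the nearest multiple of freq
    freq = to_offset(freq)
    return pd.Timestamp((t.value // freq.delta.value) * freq.delta.value)

ts.groupby(partial(round, freq="3T")).sum()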
Aggregation
Similar to the aggregating API, groupby API, and the window API, a Resampler can be selectively resampled.
Resampling a DataFrame, the default will be to act on all columns with the same function.
In [311]: df = pd.DataFrame(
.....: np.random.randn(1000, 3),
.....: index=pd.date_range("1/1/2012", freq="S", periods=1000),
.....: columns=["A", "B", "C"],
.....: )
.....:
In [312]: r = df.resample("3T")
In [313]: r.mean()
Out[313]:
A B C
2012-01-01 00:00:00 -0.033823 -0.121514 -0.081447
2012-01-01 00:03:00 0.056909 0.146731 -0.024320
2012-01-01 00:06:00 -0.058837 0.047046 -0.052021
In [314]: r["A"].mean()
Out[314]:
2012-01-01 00:00:00 -0.033823
2012-01-01 00:03:00 0.056909
2012-01-01 00:06:00 -0.058837
2012-01-01 00:09:00 0.063123
2012-01-01 00:12:00 0.186340
2012-01-01 00:15:00 -0.085954
Freq: 3T, Name: A, dtype: float64
You can pass a list or dict of functions to do aggregation with, outputting a DataFrame. On a resampled DataFrame,
you can pass a list of functions to apply to each column, which produces an aggregated result with a hierarchical index.
By passing a dict to aggregate you can apply a different aggregation to the columns of a DataFrame. The function
names can also be strings; in order for a string to be valid it must be implemented on the resampled object.
Furthermore, you can also specify multiple aggregation functions for each column separately, as sketched below.
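A minimal sketch of these aggregation variants on the resampled object r from above (assuming numpy is imported as np):
r["A"].agg([np.sum, np.mean, np.std])                   # list of functions -> hierarchical columns
r.agg({"A": np.sum, "B": lambda x: np.std(x, ddof=1)})  # a different function per column
r.agg({"A": "sum", "B": "std"})                         # strings naming methods of the resampled object
r.agg({"A": ["sum", "std"], "B": ["mean", "std"]})      # multiple functions per column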
If a DataFrame does not have a datetimelike index, but instead you want to resample based on a datetimelike column
in the frame, it can be passed to the on keyword.
In [321]: df = pd.DataFrame(
.....: {"date": pd.date_range("2015-01-01", freq="W", periods=5), "a": np.
˓→arange(5)},
.....: index=pd.MultiIndex.from_arrays(
.....: [[1, 2, 3, 4, 5], pd.date_range("2015-01-01", freq="W", periods=5)],
.....: names=["v", "d"],
.....: ),
.....: )
.....:
In [322]: df
Out[322]:
date a
v d
1 2015-01-04 2015-01-04 0
2 2015-01-11 2015-01-11 1
3 2015-01-18 2015-01-18 2
4 2015-01-25 2015-01-25 3
5 2015-02-01 2015-02-01 4
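For example, resampling on the "date" column of the frame above (a sketch of the usage this paragraph describes):
df.resample("M", on="date").sum()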
Similarly, if you instead want to resample by a datetimelike level of MultiIndex, its name or location can be passed
to the level keyword.
In [324]: df.resample("M", level="d").sum()
Out[324]:
a
d
2015-01-31 6
2015-02-28 4
With the Resampler object in hand, iterating through the grouped data is very natural and functions similarly to
itertools.groupby():
In [325]: small = pd.Series(
.....: range(6),
.....: index=pd.to_datetime(
.....: [
.....: "2017-01-01T00:00:00",
.....: "2017-01-01T00:30:00",
.....: "2017-01-01T00:31:00",
.....: "2017-01-01T01:00:00",
.....: "2017-01-01T03:00:00",
.....: "2017-01-01T03:05:00",
.....: ]
.....: ),
.....: )
.....:
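Iterating over the resampled groups might then look like this (a sketch; the one-hour bin frequency is an assumption):
resampled = small.resample("H")

for name, group in resampled:
    print("Group:", name)
    print("-" * 27)
    print(group, end="\n\n")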
In [332]: ts
Out[332]:
2000-10-01 23:30:00 0
2000-10-01 23:37:00 3
2000-10-01 23:44:00 6
2000-10-01 23:51:00 9
2000-10-01 23:58:00 12
2000-10-02 00:05:00 15
2000-10-02 00:12:00 18
2000-10-02 00:19:00 21
2000-10-02 00:26:00 24
Freq: 7T, dtype: int64
Here we can see that, when using origin with its default value ('start_day'), the results after '2000-10-02
00:00:00' differ depending on the start of the time series:
In [333]: ts.resample("17min", origin="start_day").sum()
Out[333]:
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Here we can see that, when setting origin to 'epoch', the results after '2000-10-02 00:00:00' are identical
regardless of the start of the time series:
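For example (a sketch of the corresponding call):
ts.resample("17min", origin="epoch").sum()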
If needed, you can adjust the bins with an offset Timedelta that is added to the default origin. The
two examples below are equivalent for this time series:
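A sketch of the two equivalent calls, consistent with the series above, which starts at 23:30 (i.e. 23h30min after midnight of the first day):
ts.resample("17min", offset="23h30min").sum()   # default origin ('start_day') shifted by 23h30min
ts.resample("17min", origin="start").sum()      # origin set to the first value of the series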
Note the use of 'start' for origin in the last example. In that case, origin will be set to the first value of the
time series.
Regular intervals of time are represented by Period objects in pandas while sequences of Period objects are
collected in a PeriodIndex, which can be created with the convenience function period_range.
Period
A Period represents a span of time (e.g., a day, a month, a quarter, etc). You can specify the span via the freq keyword
using a frequency alias like below. Because freq represents a span of Period, it cannot be negative, like "-3D".
In [341]: pd.Period("2012", freq="A-DEC")
Out[341]: Period('2012', 'A-DEC')
Adding and subtracting integers from periods shifts the period by its own frequency. Arithmetic is not allowed between
Period objects with different freq (span).
In [345]: p = pd.Period("2012", freq="A-DEC")
In [346]: p + 1
Out[346]: Period('2013', 'A-DEC')
In [347]: p - 3
Out[347]: Period('2009', 'A-DEC')
In [349]: p + 2
Out[349]: Period('2012-05', '2M')
In [350]: p - 1
Out[350]: Period('2011-11', '2M')
If the Period freq is daily or higher (D, H, T, S, L, U, N), offsets and timedelta-like objects can be added if the result can
have the same freq. Otherwise, a ValueError will be raised.
In [352]: p = pd.Period("2014-07-01 09:00", freq="H")
In [353]: p + pd.offsets.Hour(2)
Out[353]: Period('2014-07-01 11:00', 'H')
In [354]: p + datetime.timedelta(minutes=120)
Out[354]: Period('2014-07-01 11:00', 'H')
In [1]: p + pd.offsets.Minute(5)
Traceback
...
ValueError: Input has different freq from Period(freq=H)
If Period has other frequencies, only the same offsets can be added. Otherwise, ValueError will be raised.
In [356]: p = pd.Period("2014-07", freq="M")
In [357]: p + pd.offsets.MonthEnd(3)
Out[357]: Period('2014-10', 'M')
In [1]: p + pd.offsets.MonthBegin(3)
Traceback
...
ValueError: Input has different freq from Period(freq=M)
Taking the difference of Period instances with the same frequency will return the number of frequency units between
them:
In [358]: pd.Period("2012", freq="A-DEC") - pd.Period("2002", freq="A-DEC")
Out[358]: <10 * YearEnds: month=12>
Regular sequences of Period objects can be collected in a PeriodIndex, which can be constructed using the
period_range convenience function:
In [359]: prng = pd.period_range("1/1/2011", "1/1/2012", freq="M")
In [360]: prng
Out[360]:
PeriodIndex(['2011-01', '2011-02', '2011-03', '2011-04', '2011-05', '2011-06',
Passing multiplied frequency outputs a sequence of Period which has multiplied span.
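For example (a sketch):
pd.period_range(start="2014-01", freq="3M", periods=4)
# roughly: PeriodIndex(['2014-01', '2014-04', '2014-07', '2014-10'], dtype='period[3M]', freq='3M')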
If start or end are Period objects, they will be used as anchor endpoints for a PeriodIndex with frequency
matching that of the PeriodIndex constructor.
In [363]: pd.period_range(
.....: start=pd.Period("2017Q1", freq="Q"), end=pd.Period("2017Q2", freq="Q"),
˓→freq="M"
.....: )
.....:
Out[363]: PeriodIndex(['2017-03', '2017-04', '2017-05', '2017-06'], dtype='period[M]',
˓→ freq='M')
Just like DatetimeIndex, a PeriodIndex can also be used to index pandas objects:
In [365]: ps
Out[365]:
2011-01 -2.916901
2011-02 0.514474
2011-03 1.346470
2011-04 0.816397
2011-05 2.258648
2011-06 0.494789
2011-07 0.301239
2011-08 0.464776
2011-09 -1.393581
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
2012-01 -0.329583
Freq: M, dtype: float64
PeriodIndex supports addition and subtraction with the same rule as Period.
In [367]: idx
Out[367]:
PeriodIndex(['2014-07-01 09:00', '2014-07-01 10:00', '2014-07-01 11:00',
'2014-07-01 12:00', '2014-07-01 13:00'],
dtype='period[H]', freq='H')
In [370]: idx
Out[370]: PeriodIndex(['2014-07', '2014-08', '2014-09', '2014-10', '2014-11'], dtype=
˓→'period[M]', freq='M')
PeriodIndex has its own dtype named period, refer to Period Dtypes.
Period dtypes
PeriodIndex has a custom period dtype. This is a pandas extension dtype similar to the timezone aware dtype
(datetime64[ns, tz]).
The period dtype holds the freq attribute and is represented with period[freq] like period[D] or
period[M], using frequency strings.
In [373]: pi
Out[373]: PeriodIndex(['2016-01', '2016-02', '2016-03'], dtype='period[M]', freq='M')
In [374]: pi.dtype
Out[374]: period[M]
The period dtype can be used in .astype(...). It allows one to change the freq of a PeriodIndex like
.asfreq() and convert a DatetimeIndex to PeriodIndex like to_period():
# convert to DatetimeIndex
In [376]: pi.astype("datetime64[ns]")
Out[376]: DatetimeIndex(['2016-01-01', '2016-02-01', '2016-03-01'], dtype=
˓→'datetime64[ns]', freq='MS')
# convert to PeriodIndex
In [377]: dti = pd.date_range("2011-01-01", freq="M", periods=3)
In [378]: dti
Out[378]: DatetimeIndex(['2011-01-31', '2011-02-28', '2011-03-31'], dtype=
˓→'datetime64[ns]', freq='M')
In [382]: ps["10/31/2011":"12/31/2011"]
Out[382]:
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
Freq: M, dtype: float64
Passing a string representing a lower frequency than PeriodIndex returns partial sliced data.
In [383]: ps["2011"]
Out[383]:
2011-01 -2.916901
2011-02 0.514474
2011-03 1.346470
2011-04 0.816397
2011-05 2.258648
2011-06 0.494789
2011-07 0.301239
2011-08 0.464776
2011-09 -1.393581
2011-10 0.056780
2011-11 0.197035
2011-12 2.261385
Freq: M, dtype: float64
In [385]: dfp
Out[385]:
A
As with DatetimeIndex, the endpoints will be included in the result. The example below slices data starting from
10:00 to 11:59.
The frequency of Period and PeriodIndex can be converted via the asfreq method. Let’s start with the fiscal
year 2011, ending in December:
In [388]: p = pd.Period("2011", freq="A-DEC")
In [389]: p
Out[389]: Period('2011', 'A-DEC')
We can convert it to a monthly frequency. Using the how parameter, we can specify whether to return the starting or
ending month:
In [390]: p.asfreq("M", how="start")
Out[390]: Period('2011-01', 'M')
Converting to a “super-period” (e.g., annual frequency is a super-period of quarterly frequency) automatically returns
the super-period that includes the input period:
In [394]: p = pd.Period("2011-12", freq="M")
In [395]: p.asfreq("A-NOV")
Out[395]: Period('2012', 'A-NOV')
Note that since we converted to an annual frequency that ends the year in November, the monthly period of December
2011 is actually in the 2012 A-NOV period.
Period conversions with anchored frequencies are particularly useful for working with various quarterly data common
to economics, business, and other fields. Many organizations define quarters relative to the month in which their
fiscal year starts and ends. Thus, the first quarter of 2011 could start in 2010 or a few months into 2011. Via anchored
frequencies, pandas works with all quarterly frequencies Q-JAN through Q-DEC.
Q-DEC defines regular calendar quarters:
In [396]: p = pd.Period("2012Q1", freq="Q-DEC")
Timestamped data can be converted to PeriodIndex-ed data using to_period and vice-versa using
to_timestamp:
In [404]: ts
Out[404]:
2012-01-31 1.931253
2012-02-29 -0.184594
2012-03-31 0.249656
2012-04-30 -0.978151
2012-05-31 -0.873389
Freq: M, dtype: float64
In [405]: ps = ts.to_period()
In [406]: ps
Out[406]:
2012-01 1.931253
2012-02 -0.184594
2012-03 0.249656
2012-04 -0.978151
2012-05 -0.873389
Freq: M, dtype: float64
In [407]: ps.to_timestamp()
Out[407]:
2012-01-01 1.931253
2012-02-01 -0.184594
2012-03-01 0.249656
2012-04-01 -0.978151
2012-05-01 -0.873389
Freq: MS, dtype: float64
Remember that ‘s’ and ‘e’ can be used to return the timestamps at the start or end of the period:
Converting between period and timestamp enables some convenient arithmetic functions to be used. In the following
example, we convert a quarterly frequency with year ending in November to 9am of the end of the month following
the quarter end:
In [412]: ts.head()
Out[412]:
1990-03-01 09:00 -0.109291
1990-06-01 09:00 -0.637235
1990-09-01 09:00 -1.735925
1990-12-01 09:00 2.096946
1991-03-01 09:00 -1.039926
Freq: H, dtype: float64
If you have data that is outside of the Timestamp bounds (see Timestamp limitations), then you can use a
PeriodIndex and/or a Series of Periods to do computations.
In [414]: span
Out[414]:
PeriodIndex(['1215-01-01', '1215-01-02', '1215-01-03', '1215-01-04',
'1215-01-05', '1215-01-06', '1215-01-07', '1215-01-08',
'1215-01-09', '1215-01-10',
...
'1380-12-23', '1380-12-24', '1380-12-25', '1380-12-26',
'1380-12-27', '1380-12-28', '1380-12-29', '1380-12-30',
'1380-12-31', '1381-01-01'],
dtype='period[D]', length=60632, freq='D')
In [416]: s
Out[416]:
0 20121231
1 20141130
2 99991231
dtype: int64
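The conversion function conv applied below is not shown in this excerpt; it might look like the following sketch, which turns integers of the form YYYYMMDD into daily Periods:
def conv(x):
    return pd.Period(year=x // 10000, month=x // 100 % 100, day=x % 100, freq="D")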
In [418]: s.apply(conv)
Out[418]:
0 2012-12-31
1 2014-11-30
In [419]: s.apply(conv)[2]
Out[419]: Period('9999-12-31', 'D')
In [421]: span
Out[421]: PeriodIndex(['2012-12-31', '2014-11-30', '9999-12-31'], dtype='period[D]',
˓→freq='D')
pandas provides rich support for working with timestamps in different time zones using the pytz and dateutil
libraries or datetime.timezone objects from the standard library.
To localize naive dates to a time zone (assign a particular time zone to a naive date), you can use the tz_localize
method or the tz keyword argument in date_range(), Timestamp, or DatetimeIndex. You can either pass
pytz or dateutil time zone objects or Olson time zone database strings. Olson time zone strings will return pytz
time zone objects by default. To return dateutil time zone objects, append dateutil/ before the string.
• In pytz you can find a list of common (and less common) time zones using from pytz import
common_timezones, all_timezones.
• dateutil uses the OS time zones so there isn’t a fixed list available. For common zones, the names are the
same as pytz.
In [424]: import dateutil
# pytz
In [425]: rng_pytz = pd.date_range("3/6/2012 00:00", periods=3, freq="D", tz="Europe/
˓→London")
In [426]: rng_pytz.tz
Out[426]: <DstTzInfo 'Europe/London' LMT-1 day, 23:59:00 STD>
# dateutil
In [427]: rng_dateutil = pd.date_range("3/6/2012 00:00", periods=3, freq="D")
In [429]: rng_dateutil.tz
In [431]: rng_utc.tz
Out[431]: tzutc()
# datetime.timezone
In [432]: rng_utc = pd.date_range(
.....: "3/6/2012 00:00",
.....: periods=3,
.....: freq="D",
.....: tz=datetime.timezone.utc,
.....: )
.....:
In [433]: rng_utc.tz
Out[433]: datetime.timezone.utc
Note that the UTC time zone is a special case in dateutil and should be constructed explicitly as an instance of
dateutil.tz.tzutc. You can also construct other time zone objects explicitly first.
# pytz
In [435]: tz_pytz = pytz.timezone("Europe/London")
# dateutil
In [439]: tz_dateutil = dateutil.tz.gettz("Europe/London")
To convert a time zone aware pandas object from one time zone to another, you can use the tz_convert method.
In [442]: rng_pytz.tz_convert("US/Eastern")
Out[442]:
Note: When using pytz time zones, DatetimeIndex will construct a different time zone object than a
Timestamp for the same time zone input. A DatetimeIndex can hold a collection of Timestamp objects
that may have different UTC offsets and cannot be succinctly represented by one pytz time zone instance while one
Timestamp represents one point in time with a specific UTC offset.
In [443]: dti = pd.date_range("2019-01-01", periods=3, freq="D", tz="US/Pacific")
In [444]: dti.tz
Out[444]: <DstTzInfo 'US/Pacific' LMT-1 day, 16:07:00 STD>
In [446]: ts.tz
Out[446]: <DstTzInfo 'US/Pacific' PST-1 day, 16:00:00 STD>
Warning: Be wary of conversions between libraries. For some time zones, pytz and dateutil have different
definitions of the zone. This is more of a problem for unusual time zones than for ‘standard’ zones like US/
Eastern.
Warning: Be aware that a time zone definition across versions of time zone libraries may not be considered equal.
This may cause problems when working with stored data that is localized using one version and operated on with
a different version. See here for how to handle such a situation.
Warning: For pytz time zones, it is incorrect to pass a time zone object directly into the datetime.
datetime constructor (e.g., datetime.datetime(2011, 1, 1, tz=pytz.timezone('US/
Eastern')). Instead, the datetime needs to be localized using the localize method on the pytz time zone
object.
Warning: Be aware that for times in the future, correct conversion between time zones (and UTC) cannot be
guaranteed by any time zone library because a timezone’s offset from UTC may be changed by the respective
government.
Warning: If you are using dates beyond 2038-01-18, due to current deficiencies in the underlying libraries caused
by the year 2038 problem, daylight saving time (DST) adjustments to timezone aware dates will not be applied. If
and when the underlying libraries are fixed, the DST transitions will be applied.
For example, for two dates that are in British Summer Time (and so would normally be GMT+1), both the following
asserts evaluate as true:
In [447]: d_2037 = "2037-03-31T010101"
Under the hood, all timestamps are stored in UTC. Values from a time zone aware DatetimeIndex or Timestamp
will have their fields (day, hour, minute, etc.) localized to the time zone. However, timestamps with the same UTC
value are still considered to be equal even if they are in different time zones:
In [452]: rng_eastern = rng_utc.tz_convert("US/Eastern")
In [454]: rng_eastern[2]
Out[454]: Timestamp('2012-03-07 19:00:00-0500', tz='US/Eastern', freq='D')
In [455]: rng_berlin[2]
Out[455]: Timestamp('2012-03-08 01:00:00+0100', tz='Europe/Berlin', freq='D')
Operations between Series in different time zones will yield UTC Series, aligning the data on the UTC times-
tamps:
In [457]: ts_utc = pd.Series(range(3), pd.date_range("20130101", periods=3, tz="UTC"))
In [461]: result
Out[461]:
2013-01-01 00:00:00+00:00 0
2013-01-02 00:00:00+00:00 2
2013-01-03 00:00:00+00:00 4
Freq: D, dtype: int64
In [462]: result.index
Out[462]:
DatetimeIndex(['2013-01-01 00:00:00+00:00', '2013-01-02 00:00:00+00:00',
'2013-01-03 00:00:00+00:00'],
dtype='datetime64[ns, UTC]', freq='D')
In [465]: didx.tz_localize(None)
Out[465]:
DatetimeIndex(['2014-08-01 09:00:00', '2014-08-01 10:00:00',
'2014-08-01 11:00:00'],
dtype='datetime64[ns]', freq=None)
In [466]: didx.tz_convert(None)
Out[466]:
DatetimeIndex(['2014-08-01 13:00:00', '2014-08-01 14:00:00',
'2014-08-01 15:00:00'],
dtype='datetime64[ns]', freq='H')
Fold
In [469]: pd.Timestamp(
.....: year=2019,
.....: month=10,
.....: day=27,
.....: hour=1,
.....: minute=30,
tz_localize may not be able to determine the UTC offset of a timestamp because daylight savings time (DST)
in a local time zone causes some times to occur twice within one day (“clocks fall back”). The following options are
available:
• 'raise': Raises a pytz.AmbiguousTimeError (the default behavior)
• 'infer': Attempt to determine the correct offset based on the monotonicity of the timestamps
• 'NaT': Replaces ambiguous times with NaT
• bool: True represents a DST time, False represents non-DST time. An array-like of bool values is sup-
ported for a sequence of times.
In [2]: rng_hourly.tz_localize('US/Eastern')
AmbiguousTimeError: Cannot infer dst time from Timestamp('2011-11-06 01:00:00'), try
˓→using the 'ambiguous' argument
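The index rng_hourly in the example above is not defined in this excerpt; a plausible construction, together with the use of the ambiguous argument to resolve the error, might look like:
rng_hourly = pd.DatetimeIndex(
    ["11/06/2011 00:00", "11/06/2011 01:00", "11/06/2011 01:00", "11/06/2011 02:00"]
)

rng_hourly.tz_localize("US/Eastern", ambiguous="infer")
rng_hourly.tz_localize("US/Eastern", ambiguous="NaT")
rng_hourly.tz_localize("US/Eastern", ambiguous=[True, True, False, False])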
A DST transition may also shift the local time ahead by 1 hour creating nonexistent local times (“clocks spring
forward”). The behavior of localizing a timeseries with nonexistent times can be controlled by the nonexistent
argument. The following options are available:
• 'raise': Raises a pytz.NonExistentTimeError (the default behavior)
• 'NaT': Replaces nonexistent times with NaT
• 'shift_forward': Shifts nonexistent times forward to the closest real time
• 'shift_backward': Shifts nonexistent times backward to the closest real time
• timedelta object: Shifts nonexistent times by the timedelta duration
In [2]: dti.tz_localize('Europe/Warsaw')
NonExistentTimeError: 2015-03-29 02:30:00
In [475]: dti
Out[475]:
DatetimeIndex(['2015-03-29 02:30:00', '2015-03-29 03:30:00',
'2015-03-29 04:30:00'],
dtype='datetime64[ns]', freq='H')
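To handle these nonexistent times, pass the nonexistent argument to tz_localize; for example (a sketch using the dti index shown above):
dti.tz_localize("Europe/Warsaw", nonexistent="shift_forward")
dti.tz_localize("Europe/Warsaw", nonexistent="NaT")
dti.tz_localize("Europe/Warsaw", nonexistent=pd.Timedelta(1, unit="H"))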
A Series with time zone naive values is represented with a dtype of datetime64[ns].
In [480]: s_naive = pd.Series(pd.date_range("20130101", periods=3))
In [481]: s_naive
Out[481]:
0 2013-01-01
1 2013-01-02
2 2013-01-03
dtype: datetime64[ns]
A Series with time zone aware values is represented with a dtype of datetime64[ns, tz], where tz is the
time zone.
In [482]: s_aware = pd.Series(pd.date_range("20130101", periods=3, tz="US/Eastern"))
In [483]: s_aware
Out[483]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
The time zone information of both of these Series can be manipulated via the .dt accessor; see the dt accessor section.
For example, to localize and convert a naive timestamp to a time zone aware one:
In [484]: s_naive.dt.tz_localize("UTC").dt.tz_convert("US/Eastern")
Out[484]:
0 2012-12-31 19:00:00-05:00
1 2013-01-01 19:00:00-05:00
2 2013-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
Time zone information can also be manipulated using the astype method. This method can localize and convert
time zone naive timestamps or convert time zone aware timestamps.
# localize and convert a naive time zone
In [485]: s_naive.astype("datetime64[ns, US/Eastern]")
Out[485]:
0 2012-12-31 19:00:00-05:00
1 2013-01-01 19:00:00-05:00
2 2013-01-02 19:00:00-05:00
dtype: datetime64[ns, US/Eastern]
Note: Using Series.to_numpy() on a Series returns a NumPy array of the data. NumPy does not currently
support time zones (even though it is printing in the local time zone!), therefore an object array of Timestamps is
returned for time zone aware data:
In [488]: s_naive.to_numpy()
Out[488]:
array(['2013-01-01T00:00:00.000000000', '2013-01-02T00:00:00.000000000',
'2013-01-03T00:00:00.000000000'], dtype='datetime64[ns]')
In [489]: s_aware.to_numpy()
Out[489]:
array([Timestamp('2013-01-01 00:00:00-0500', tz='US/Eastern', freq='D'),
Timestamp('2013-01-02 00:00:00-0500', tz='US/Eastern', freq='D'),
Timestamp('2013-01-03 00:00:00-0500', tz='US/Eastern', freq='D')],
dtype=object)
Converting to an object array of Timestamps preserves the time zone information. For example, when converting
back to a Series:
In [490]: pd.Series(s_aware.to_numpy())
Out[490]:
0 2013-01-01 00:00:00-05:00
1 2013-01-02 00:00:00-05:00
2 2013-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
However, if you want an actual NumPy datetime64[ns] array (with the values converted to UTC) instead of an
array of objects, you can specify the dtype argument:
In [491]: s_aware.to_numpy(dtype="datetime64[ns]")
Out[491]:
array(['2013-01-01T05:00:00.000000000', '2013-01-02T05:00:00.000000000',
'2013-01-03T05:00:00.000000000'], dtype='datetime64[ns]')
Timedeltas are differences in times, expressed in different units, e.g. days, hours, minutes, seconds. They can be
both positive and negative.
Timedelta is a subclass of datetime.timedelta, and behaves in a similar manner, but allows compatibility
with np.timedelta64 types as well as a host of custom representations, parsing, and attributes.
2.20.1 Parsing
You can construct a Timedelta scalar through various arguments, including ISO 8601 Duration strings.
# strings
In [2]: pd.Timedelta("1 days")
Out[2]: Timedelta('1 days 00:00:00')
# like datetime.timedelta
# note: these MUST be specified as keyword arguments
In [6]: pd.Timedelta(days=1, seconds=1)
Out[6]: Timedelta('1 days 00:00:01')
# from a datetime.timedelta/np.timedelta64
In [8]: pd.Timedelta(datetime.timedelta(days=1, seconds=1))
Out[8]: Timedelta('1 days 00:00:01')
# a NaT
In [11]: pd.Timedelta("nan")
Out[11]: NaT
In [12]: pd.Timedelta("nat")
Out[12]: NaT
In [14]: pd.Timedelta("P0DT0H0M0.000000123S")
Out[14]: Timedelta('0 days 00:00:00.000000123')
DateOffsets (Day, Hour, Minute, Second, Milli, Micro, Nano) can also be used in construction.
In [15]: pd.Timedelta(pd.offsets.Second(2))
Out[15]: Timedelta('0 days 00:00:02')
....: "00:00:00.000123"
....: )
....:
Out[16]: Timedelta('2 days 00:00:02.000123')
to_timedelta
Using the top-level pd.to_timedelta, you can convert a scalar, array, list, or Series from a recognized timedelta
format / value into a Timedelta type. It will construct a Series if the input is a Series, a scalar if the input is scalar-like;
otherwise, it will output a TimedeltaIndex.
You can parse a single string to a Timedelta:
In [18]: pd.to_timedelta("15.5us")
Out[18]: Timedelta('0 days 00:00:00.000015500')
or a list/array of strings:
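For example (a sketch):
pd.to_timedelta(["1 days 06:05:01.00003", "15.5us", "nan"])
# returns a TimedeltaIndex, with the unparseable 'nan' becoming NaT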
Timedelta limitations
pandas represents Timedeltas in nanosecond resolution using 64 bit integers. As such, the 64 bit integer limits
determine the Timedelta limits.
In [22]: pd.Timedelta.min
Out[22]: Timedelta('-106752 days +00:12:43.145224193')
In [23]: pd.Timedelta.max
Out[23]: Timedelta('106751 days 23:47:16.854775807')
2.20.2 Operations
You can operate on Series/DataFrames and construct timedelta64[ns] Series through subtraction operations on
datetime64[ns] Series, or Timestamps.
In [24]: s = pd.Series(pd.date_range("2012-1-1", periods=3, freq="D"))
In [27]: df
Out[27]:
A B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
In [29]: df
Out[29]:
A B C
0 2012-01-01 0 days 2012-01-01
1 2012-01-02 1 days 2012-01-03
2 2012-01-03 2 days 2012-01-05
In [30]: df.dtypes
Out[30]:
A datetime64[ns]
B timedelta64[ns]
C datetime64[ns]
dtype: object
In [31]: s - s.max()
Out[31]:
0 -2 days
1 -1 days
2 0 days
dtype: timedelta64[ns]
In [32]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[32]:
0 364 days 20:55:00
1 365 days 20:55:00
2 366 days 20:55:00
dtype: timedelta64[ns]
In [33]: s + datetime.timedelta(minutes=5)
Out[33]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [34]: s + pd.offsets.Minute(5)
Out[34]:
0 2012-01-01 00:05:00
In [36]: y = s - s[0]
In [37]: y
Out[37]:
0 0 days
1 1 days
2 2 days
dtype: timedelta64[ns]
In [38]: y = s - s.shift()
In [39]: y
Out[39]:
0 NaT
1 1 days
2 1 days
dtype: timedelta64[ns]
In [41]: y
Out[41]:
0 NaT
1 NaT
2 1 days
dtype: timedelta64[ns]
Operands can also appear in a reversed order (a singular object operated with a Series):
In [42]: s.max() - s
Out[42]:
0 2 days
1 1 days
2 0 days
dtype: timedelta64[ns]
In [43]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[43]:
0 -365 days +03:05:00
In [44]: datetime.timedelta(minutes=5) + s
Out[44]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
min, max and the corresponding idxmin, idxmax operations are supported on frames:
In [45]: A = s - pd.Timestamp("20120101") - pd.Timedelta("00:05:05")
In [48]: df
Out[48]:
A B
0 -1 days +23:54:55 -1 days
1 0 days 23:54:55 -1 days
2 1 days 23:54:55 -1 days
In [49]: df.min()
Out[49]:
A -1 days +23:54:55
B -1 days +00:00:00
dtype: timedelta64[ns]
In [50]: df.min(axis=1)
Out[50]:
0 -1 days
1 -1 days
2 -1 days
dtype: timedelta64[ns]
In [51]: df.idxmin()
Out[51]:
A 0
B 0
dtype: int64
In [52]: df.idxmax()
Out[52]:
A 2
B 0
dtype: int64
min, max, idxmin, idxmax operations are supported on Series as well. A scalar result will be a Timedelta.
In [53]: df.min().max()
Out[53]: Timedelta('-1 days +23:54:55')
In [54]: df.min(axis=1).min()
In [55]: df.min().idxmax()
Out[55]: 'A'
In [56]: df.min(axis=1).idxmin()
Out[56]: 0
In [57]: y.fillna(pd.Timedelta(0))
Out[57]:
0 0 days
1 0 days
2 1 days
dtype: timedelta64[ns]
In [61]: td1
Out[61]: Timedelta('-2 days +21:59:57')
In [62]: -1 * td1
Out[62]: Timedelta('1 days 02:00:03')
In [63]: -td1
Out[63]: Timedelta('1 days 02:00:03')
In [64]: abs(td1)
Out[64]: Timedelta('1 days 02:00:03')
2.20.3 Reductions
Numeric reduction operations for timedelta64[ns] will return Timedelta objects. As usual, NaT values are skipped
during evaluation.
In [65]: y2 = pd.Series(
....: pd.to_timedelta(["-1 days +00:00:05", "nat", "-1 days +00:00:05", "1 days
˓→"])
....: )
....:
In [66]: y2
Out[66]:
0 -1 days +00:00:05
1 NaT
2 -1 days +00:00:05
3 1 days 00:00:00
dtype: timedelta64[ns]
In [67]: y2.mean()
Out[67]: Timedelta('-1 days +16:00:03.333333334')
In [68]: y2.median()
Out[68]: Timedelta('-1 days +00:00:05')
In [69]: y2.quantile(0.1)
Out[69]: Timedelta('-1 days +00:00:05')
In [70]: y2.sum()
Out[70]: Timedelta('-1 days +00:00:10')
Timedelta Series, TimedeltaIndex, and Timedelta scalars can be converted to other 'frequencies' by dividing
by another timedelta, or by astyping to a specific timedelta type. These operations yield Series and propagate NaT ->
nan. Note that division by the NumPy scalar is true division, while astyping is equivalent to floor division.
In [71]: december = pd.Series(pd.date_range("20121201", periods=4))
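The series td shown below is built from the difference of two date ranges; a sketch consistent with the output that follows (the adjustments to the last two elements are assumptions matching the displayed values):
january = pd.Series(pd.date_range("20130101", periods=4))
td = january - december                              # one month apart -> 31 days each

td[2] += datetime.timedelta(minutes=5, seconds=3)    # 31 days 00:05:03
td[3] = np.nan                                       # NaT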
In [76]: td
Out[76]:
0 31 days 00:00:00
1 31 days 00:00:00
2 31 days 00:05:03
3 NaT
dtype: timedelta64[ns]
# to days
In [78]: td.astype("timedelta64[D]")
Out[78]:
0 31.0
1 31.0
2 31.0
3 NaN
dtype: float64
# to seconds
In [79]: td / np.timedelta64(1, "s")
Out[79]:
0 2678400.0
1 2678400.0
2 2678703.0
3 NaN
dtype: float64
In [80]: td.astype("timedelta64[s]")
Out[80]:
0 2678400.0
1 2678400.0
2 2678703.0
3 NaN
dtype: float64
In [82]: td * -1
Out[82]:
0 -31 days +00:00:00
1 -31 days +00:00:00
2 -32 days +23:54:57
3 NaT
dtype: timedelta64[ns]
Rounded division (floor-division) of a timedelta64[ns] Series by a scalar Timedelta gives a series of integers.
The mod (%) and divmod operations are defined for Timedelta when operating with another timedelta-like or with
a numeric argument.
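For example (a sketch):
td // pd.Timedelta(days=3, hours=4)                     # floor-division by a scalar Timedelta

pd.Timedelta(hours=37) % datetime.timedelta(hours=2)    # modulo -> a Timedelta of 1 hour
divmod(pd.Timedelta(hours=25), pd.Timedelta(hours=24))  # -> (1, Timedelta('0 days 01:00:00'))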
2.20.5 Attributes
You can access various components of the Timedelta or TimedeltaIndex directly using the attributes
days, seconds, microseconds, nanoseconds. These are identical to the values returned by datetime.timedelta,
in that, for example, the .seconds attribute represents the number of seconds >= 0 and < 1 day.
These are signed according to whether the Timedelta is signed.
These operations can also be directly accessed via the .dt property of the Series as well.
Note: Note that the attributes are NOT the displayed values of the Timedelta. Use .components to retrieve the
displayed values.
For a Series:
In [89]: td.dt.days
Out[89]:
In [90]: td.dt.seconds
Out[90]:
0 0.0
1 0.0
2 303.0
3 NaN
dtype: float64
You can access the value of the fields for a scalar Timedelta directly.
In [92]: tds.days
Out[92]: 31
In [93]: tds.seconds
Out[93]: 303
In [94]: (-tds).seconds
Out[94]: 86097
You can use the .components property to access a reduced form of the timedelta. This returns a DataFrame
indexed similarly to the Series. These are the displayed values of the Timedelta.
In [95]: td.dt.components
Out[95]:
days hours minutes seconds milliseconds microseconds nanoseconds
0 31.0 0.0 0.0 0.0 0.0 0.0 0.0
1 31.0 0.0 0.0 0.0 0.0 0.0 0.0
2 31.0 0.0 5.0 3.0 0.0 0.0 0.0
3 NaN NaN NaN NaN NaN NaN NaN
In [96]: td.dt.components.seconds
Out[96]:
0 0.0
1 0.0
2 3.0
3 NaN
Name: seconds, dtype: float64
You can convert a Timedelta to an ISO 8601 Duration string with the .isoformat method
In [97]: pd.Timedelta(
....: days=6, minutes=50, seconds=3, milliseconds=10, microseconds=10,
˓→nanoseconds=12
....: ).isoformat()
....:
Out[97]: 'P6DT0H50M3.010010012S'
2.20.6 TimedeltaIndex
To generate an index with time delta, you can use either the TimedeltaIndex or the timedelta_range()
constructor.
Using TimedeltaIndex you can pass string-like, Timedelta, timedelta, or np.timedelta64 objects.
Passing np.nan/pd.NaT/nat will represent missing values.
In [98]: pd.TimedeltaIndex(
....: [
....: "1 days",
....: "1 days, 00:00:05",
....: np.timedelta64(2, "D"),
....: datetime.timedelta(days=2, seconds=2),
....: ]
....: )
....:
Out[98]:
TimedeltaIndex(['1 days 00:00:00', '1 days 00:00:05', '2 days 00:00:00',
'2 days 00:00:02'],
dtype='timedelta64[ns]', freq=None)
The string 'infer' can be passed in order to set the frequency of the index as the inferred frequency upon creation.
Various combinations of start, end, and periods can be used with timedelta_range. Specifying start, end, and
periods will generate a range of evenly spaced timedeltas from start to end inclusively, with periods number of
elements in the resulting TimedeltaIndex:
In [105]: pd.timedelta_range("0 days", "4 days", periods=5)
Out[105]: TimedeltaIndex(['0 days', '1 days', '2 days', '3 days', '4 days'], dtype=
˓→'timedelta64[ns]', freq=None)
Similar to other datetime-like indices (DatetimeIndex and PeriodIndex), you can use
TimedeltaIndex as the index of pandas objects.
In [107]: s = pd.Series(
.....: np.arange(100),
.....: index=pd.timedelta_range("1 days", periods=100, freq="h"),
.....: )
.....:
In [108]: s
Out[108]:
1 days 00:00:00 0
1 days 01:00:00 1
1 days 02:00:00 2
1 days 03:00:00 3
1 days 04:00:00 4
..
4 days 23:00:00 95
Furthermore, you can use partial string selection and the range will be inferred:
In [112]: s["1 day":"1 day 5 hours"]
Out[112]:
1 days 00:00:00 0
1 days 01:00:00 1
1 days 02:00:00 2
1 days 03:00:00 3
1 days 04:00:00 4
1 days 05:00:00 5
Freq: H, dtype: int64
Operations
Finally, the combination of TimedeltaIndex with DatetimeIndex allows certain combination operations that
are NaT preserving:
In [113]: tdi = pd.TimedeltaIndex(["1 days", pd.NaT, "2 days"])
In [114]: tdi.to_list()
Out[114]: [Timedelta('1 days 00:00:00'), NaT, Timedelta('2 days 00:00:00')]
In [116]: dti.to_list()
Out[116]:
Conversions
Similarly to frequency conversion on a Series above, you can convert these indices to yield another Index.
In [120]: tdi.astype("timedelta64[s]")
Out[120]: Float64Index([86400.0, nan, 172800.0], dtype='float64')
Scalar type ops work as well. These can potentially return a different type of index.
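For example (a sketch using the tdi index from above; the result type depends on the operation):
tdi + pd.Timestamp("20130101")    # timedelta + datetime  -> DatetimeIndex
pd.Timestamp("20130101") - tdi    # datetime - timedelta  -> DatetimeIndex
tdi + pd.Timedelta("10 days")     # timedelta + timedelta -> TimedeltaIndex
tdi / np.timedelta64(1, "s")      # division by a timedelta -> Float64Index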
2.20.7 Resampling
In [126]: s.resample("D").mean()
Out[126]:
1 days 11.5
2 days 35.5
3 days 59.5
4 days 83.5
5 days 97.5
Freq: D, dtype: float64
2.21 Styling
This document is written as a Jupyter Notebook, and can be viewed or downloaded here.
You can apply conditional formatting, the visual styling of a DataFrame depending on the data within, by using
the DataFrame.style property. This is a property that returns a Styler object, which has useful methods for
formatting and displaying DataFrames.
The styling is accomplished using CSS. You write “style functions” that take scalars, DataFrames or Series, and
return like-indexed DataFrames or Series with CSS "attribute: value" pairs for the values. These functions
can be incrementally passed to the Styler which collects the styles before rendering.
np.random.seed(24)
df = pd.DataFrame({'A': np.linspace(1, 10, 10)})
df = pd.concat([df, pd.DataFrame(np.random.randn(10, 4), columns=list('BCDE'))],
axis=1)
df.iloc[3, 3] = np.nan
df.iloc[0, 2] = np.nan
[3]: df.style
[3]: <pandas.io.formats.style.Styler at 0x7f5f1ca6dee0>
Note: The DataFrame.style attribute is a property that returns a Styler object. Styler has a _repr_html_
method defined on it so it is rendered automatically. If you want the actual HTML back for further processing or
for writing to a file, call the .render() method, which returns a string.
The above output looks very similar to the standard DataFrame HTML representation. But we’ve done some work
behind the scenes to attach CSS classes to each cell. We can view these by calling the .render method.
[4]: df.style.highlight_null().render().split('\n')[:10]
[4]: ['<style type="text/css" >',
'#T_2747c_row0_col2,#T_2747c_row3_col3{',
' background-color: red;',
' }</style><table id="T_2747c_" ><thead> <tr> <th class="blank
˓→level0" ></th> <th class="col_heading level0 col0" >A</th> <th class=
˓→"col_heading level0 col1" >B</th> <th class="col_heading level0 col2" >C</th>
˓→ <th class="col_heading level0 col3" >D</th> <th class="col_heading
˓→level0 col4" >E</th> </tr></thead><tbody>',
' <tr>',
' <th id="T_2747c_level0_row0" class="row_heading level0 row0
˓→" >0</th>',
The row0_col2 is the identifier for that particular cell. We’ve also prepended each row/column identifier with a
UUID unique to each DataFrame so that the style from one doesn’t collide with the styling from another within the
same notebook or page (you can set the uuid if you’d like to tie together the styling of two DataFrames).
When writing style functions, you take care of producing the CSS attribute / value pairs you want. Pandas matches
those up with the CSS classes that identify each cell.
Let’s write a simple style function that will color negative numbers red and positive numbers black.
In this case, the cell’s style depends only on its own value. That means we should use the Styler.applymap
method which works elementwise.
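The style function used below is not shown in this excerpt; a minimal sketch consistent with its name and usage:
def color_negative_red(val):
    """
    Takes a scalar and returns a string with
    the CSS property 'color: red' for negative
    values, black otherwise.
    """
    color = 'red' if val < 0 else 'black'
    return 'color: %s' % color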
[6]: s = df.style.applymap(color_negative_red)
s
[6]: <pandas.io.formats.style.Styler at 0x7f5ee52d1490>
Notice the similarity with the standard df.applymap, which operates on DataFrames elementwise. We want you to
be able to reuse your existing knowledge of how to interact with DataFrames.
Notice also that our function returned a string containing the CSS attribute and value, separated by a colon just like in
a <style> tag. This will be a common theme.
Finally, the input shapes matched. Styler.applymap calls the function on each scalar input, and the function
returns a scalar output.
Now suppose you wanted to highlight the maximum value in each column. We can't use .applymap anymore since
that operates elementwise. Instead, we'll turn to .apply which operates columnwise (or rowwise using the axis
keyword). Later on we'll see that something like highlight_max is already defined on Styler so you wouldn't
need to write this yourself.
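A minimal sketch of such a function, consistent with the columnwise usage below:
def highlight_max(s):
    """Highlight the maximum in a Series yellow."""
    is_max = s == s.max()
    return ['background-color: yellow' if v else '' for v in is_max]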
[8]: df.style.apply(highlight_max)
[8]: <pandas.io.formats.style.Styler at 0x7f5ee53321c0>
In this case the input is a Series, one column at a time. Notice that the output shape of highlight_max matches
the input shape, an array with len(s) items.
We encourage you to use method chains to build up a style piecewise, before finally rendering at the end of the chain.
[9]: df.style.\
applymap(color_negative_red).\
apply(highlight_max)
[9]: <pandas.io.formats.style.Styler at 0x7f5ee53322b0>
When using Styler.apply(func, axis=None), the function must return a DataFrame with the same index
and column labels.
Style functions should return strings with one or more CSS attribute: value pairs, delimited by semicolons. Use
• Styler.applymap(func) for elementwise styles
• Styler.apply(func, axis=0) for columnwise styles
• Styler.apply(func, axis=1) for rowwise styles
• Styler.apply(func, axis=None) for tablewise styles
And crucially, the input and output shapes of func must match. If x is the input then func(x).shape == x.shape.
Both Styler.apply, and Styler.applymap accept a subset keyword. This allows you to apply styles to
specific rows or columns, without having to code that logic into your style function.
The value passed to subset behaves similarly to slicing a DataFrame.
• A scalar is treated as a column label
• A list (or Series or NumPy array) is treated as multiple column labels
• A tuple is treated as (row_indexer, column_indexer)
Consider using pd.IndexSlice to construct the tuple for the last one.
For row and column slicing, any valid indexer to .loc will work.
[13]: df.style.applymap(color_negative_red,
subset=pd.IndexSlice[2:5, ['B', 'D']])
[13]: <pandas.io.formats.style.Styler at 0x7f5ee520c760>
We distinguish the display value from the actual value in Styler. To control the display value, the text that is printed in
each cell, use Styler.format. Cells can be formatted according to a format spec string or a callable that takes a
single value and returns a string.
[14]: df.style.format("{:.2%}")
[14]: <pandas.io.formats.style.Styler at 0x7f5ee5204850>
You can format the text displayed for missing values by na_rep.
Finally, we expect certain styling functions to be common enough that we’ve included a few “built-in” to the Styler,
so you don’t have to write them yourself.
[19]: df.style.highlight_null(null_color='red')
[19]: <pandas.io.formats.style.Styler at 0x7f5ee5204f70>
You can create “heatmaps” with the background_gradient method. These require matplotlib, and we’ll use
Seaborn to get a nice colormap.
cm = sns.light_palette("green", as_cmap=True)
s = df.style.background_gradient(cmap=cm)
s
[20]: <pandas.io.formats.style.Styler at 0x7f5f18ec5070>
Styler.background_gradient takes the keyword arguments low and high. Roughly speaking these extend
the range of your data by low and high percent so that when we convert the colors, the colormap's entire range isn't
used. This is useful so that you can still actually read the text.
[23]: df.style.highlight_max(axis=0)
[23]: <pandas.io.formats.style.Styler at 0x7f5ee2897520>
Use Styler.set_properties when the style doesn’t actually depend on the values.
Bar charts
New in version 0.20.0 is the ability to further customize the bar chart: you can now have df.style.bar be
centered on zero or a midpoint value (in addition to the already existing way of having the min value at the left side of
the cell), and you can pass a list of [color_negative, color_positive].
Here’s how you can change the above with the new align='mid' option:
The following example aims to highlight the behavior of the new align options:
# Test series
test1 = pd.Series([-100, -60, -30, -20], name='All Negative')
test2 = pd.Series([10, 20, 50, 100], name='All Positive')
test3 = pd.Series([-10, -5, 0, 90], name='Both Pos and Neg')

head = """
<table>
    <thead>
    </thead>
    <tbody>
"""

aligns = ['left', 'zero', 'mid']
for align in aligns:
    row = "<tr><th>{}</th>".format(align)
    for series in [test1, test2, test3]:
        s = series.copy()
        s.name = ''
        row += "<td>{}</td>".format(s.to_frame().style.bar(align=align,
                                                           color=['#d65f5f', '#5fba7d'],
                                                           width=100).render())
    row += '</tr>'
    head += row

head += """
    </tbody>
</table>"""

HTML(head)
[27]: <IPython.core.display.HTML object>
Say you have a lovely style built up for a DataFrame, and now you want to apply the same style to a second DataFrame.
Export the style with df1.style.export, and import it on the second DataFrame with df1.style.set
Notice that you’re able to share the styles even though they’re data aware. The styles are re-evaluated on the new
DataFrame they’ve been used upon.
You’ve seen a few methods for data-driven styling. Styler also provides a few other options for styles that don’t
depend on the data.
• precision
• captions
• table-wide styles
• missing values representation
• hiding the index or columns
Each of these can be specified in two ways:
• A keyword argument to Styler.__init__
• A call to one of the .set_ or .hide_ methods, e.g. .set_caption or .hide_columns
The best method to use depends on the context. Use the Styler constructor when building many styled DataFrames
that should all share the same properties. For interactive use, the .set_ and .hide_ methods are more convenient.
Precision
You can control the precision of floats using pandas’ regular display.precision option.
[31]: df.style\
.applymap(color_negative_red)\
.apply(highlight_max)\
.set_precision(2)
[31]: <pandas.io.formats.style.Styler at 0x7f5ee28c4c10>
Setting the precision only affects the printed number; the full-precision values are always passed to your style functions.
You can always use df.round(2).style if you'd prefer to round from the start.
Captions
Table styles
The next option you have is "table styles". These are styles that apply to the table as a whole, but don't look at the
data. Certain stylings, including pseudo-selectors like :hover, can only be used this way. These can also be used to
set specific row or column based class selectors, as will be shown.
def hover(hover_color="#ffff99"):
    return dict(selector="tr:hover",
                props=[("background-color", "%s" % hover_color)])

styles = [
    hover(),
    dict(selector="th", props=[("font-size", "150%"),
                               ("text-align", "center")]),
    dict(selector="caption", props=[("caption-side", "bottom")])
]

html = (df.style.set_table_styles(styles)
          .set_caption("Hover to highlight."))
html
[33]: <pandas.io.formats.style.Styler at 0x7f5ee28bcd30>
table_styles should be a list of dictionaries. Each dictionary should have the selector and props keys.
The value for selector should be a valid CSS selector. Recall that all the styles are already attached to an id,
unique to each Styler. This selector is in addition to that id. The value for props should be a list of tuples of
('attribute', 'value').
table_styles are extremely flexible, but not as fun to type out by hand. We hope to collect some useful ones
either in pandas, or preferably in a new package that builds on top of the tools here.
table_styles can be used to add column and row based class descriptors. For large tables this can increase
performance by avoiding repetitive individual css for each cell, and it can also simplify style construction in some
cases. If table_styles is given as a dictionary each key should be a specified column or index value and this will
map to specific class CSS selectors of the given column or row.
Note that Styler.set_table_styles will overwrite existing styles but can be chained by setting the
overwrite argument to False.
Missing values
You can control the default missing values representation for the entire table through set_na_rep method.
[35]: (df.style
.set_na_rep("FAIL")
.format(None, na_rep="PASS", subset=["D"])
.highlight_null("yellow"))
[35]: <pandas.io.formats.style.Styler at 0x7f5ee28976d0>
The index can be hidden from rendering by calling Styler.hide_index. Columns can be hidden from rendering
by calling Styler.hide_columns and passing in the name of a column, or a slice of columns.
[36]: df.style.hide_index()
[36]: <pandas.io.formats.style.Styler at 0x7f5ee2904f70>
[37]: df.style.hide_columns(['C','D'])
[37]: <pandas.io.formats.style.Styler at 0x7f5ee2904cd0>
CSS classes
Limitations
Terms
• Style function: a function that’s passed into Styler.apply or Styler.applymap and returns values like
'css attribute: value'
• Builtin style functions: style functions that are methods on Styler
• table style: a dictionary with the two keys selector and props. selector is the CSS selector that props
will apply to. props is a list of (attribute, value) tuples. A list of table styles is passed into Styler.set_table_styles.
warn("The `IPython.html` package has been deprecated since IPython 4.0. "
interactive(children=(IntSlider(value=179, description='h_neg', max=359),
˓→IntSlider(value=179, description='h_...
[40]: np.random.seed(25)
      cmap = sns.diverging_palette(5, 250, as_cmap=True)
      bigdf = pd.DataFrame(np.random.randn(20, 25)).cumsum()

      bigdf.style.background_gradient(cmap, axis=1)\
          .set_properties(**{'max-width': '80px', 'font-size': '1pt'})\
          .set_caption("Hover to magnify")\
          .set_precision(2)\
          .set_table_styles(magnify())
[40]: <pandas.io.formats.style.Styler at 0x7f5ee28bc430>
[41]: df.style.\
          applymap(color_negative_red).\
          apply(highlight_max).\
          to_excel('styled.xlsx', engine='openpyxl')
2.21.9 Extensibility
The core of pandas is, and will remain, its “high-performance, easy-to-use data structures”. With that in mind, we
hope that DataFrame.style accomplishes two goals
• Provide an API that is pleasing to use interactively and is “good enough” for many tasks
• Provide the foundations for dedicated libraries to build on
If you build a great library on top of this, let us know and we’ll link to it.
Subclassing
If the default template doesn’t quite suit your needs, you can subclass Styler and extend or override the template. We’ll
show an example of extending the default template to insert a custom header before each table.
Now that we’ve created a template, we need to set up a subclass of Styler that knows about it.
Notice that we include the original loader in our environment’s loader. That’s because we extend the original template,
so the Jinja environment needs to be able to find it.
Now we can use that custom styler. Its __init__ takes a DataFrame.
[45]: MyStyler(df)
[45]: <__main__.MyStyler at 0x7f5ee0072520>
Our custom template accepts a table_title keyword. We can provide the value in the .render method.
[46]: HTML(MyStyler(df).render(table_title="Extending Example"))
[46]: <IPython.core.display.HTML object>
For convenience, we provide the Styler.from_custom_template method that does the same as the custom
subclass.
[47]: EasyStyler = Styler.from_custom_template("templates", "myhtml.tpl")
EasyStyler(df)
[47]: <pandas.io.formats.style.Styler.from_custom_template.<locals>.MyStyler at 0x7f5ee28b1250>
[48]: HTML(structure)
[48]: <IPython.core.display.HTML object>
2.22.1 Overview
pandas has an options system that lets you customize some aspects of its behaviour, display-related options being those
the user is most likely to adjust.
Options have a full “dotted-style”, case-insensitive name (e.g. display.max_rows). You can get/set options
directly as attributes of the top-level options attribute:
In [1]: import pandas as pd
In [2]: pd.options.display.max_rows
Out[2]: 15
The API is composed of 5 relevant functions, available directly from the pandas namespace:
• get_option() / set_option() - get/set the value of a single option.
• reset_option() - reset one or more options to their default value.
• describe_option() - print the descriptions of one or more options.
• option_context() - execute a codeblock with a set of options that revert to prior settings after execution.
Note: Developers can check out pandas/core/config.py for more information.
All of the functions above accept a regexp pattern (re.search style) as an argument, and so passing in a substring
will work - as long as it is unambiguous:
In [5]: pd.get_option("display.max_rows")
Out[5]: 999
In [7]: pd.get_option("display.max_rows")
Out[7]: 101
In [9]: pd.get_option("display.max_rows")
Out[9]: 102
The following will not work because it matches multiple option names, e.g. display.max_colwidth,
display.max_rows, display.max_columns:
In [10]: try:
   ....:     pd.get_option("column")
   ....: except KeyError as e:
   ....:     print(e)
   ....:
'Pattern matched multiple keys'
Note: Using this form of shorthand may cause your code to break if new options with similar names are added in
future versions.
You can get a list of available options and their descriptions with describe_option. When called with no argu-
ment describe_option will print out the descriptions for all available options.
As described above, get_option() and set_option() are available from the pandas namespace. To change an
option, call set_option('option regex', new_value).
In [11]: pd.get_option("mode.sim_interactive")
Out[11]: False
In [14]: pd.get_option("display.max_rows")
Out[14]: 60
In [16]: pd.get_option("display.max_rows")
Out[16]: 999
In [17]: pd.reset_option("display.max_rows")
In [18]: pd.get_option("display.max_rows")
Out[18]: 60
In [19]: pd.reset_option("^display")
The option_context() context manager is exposed through the top-level API, allowing you to execute code with
given option values. Option values are restored automatically when you exit the with block:
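A sketch of such a block (the values printed after it, below, are the restored defaults):

with pd.option_context("display.max_rows", 10, "display.max_columns", 5):
    print(pd.get_option("display.max_rows"))      # 10 inside the block
    print(pd.get_option("display.max_columns"))   # 5 inside the block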
In [21]: print(pd.get_option("display.max_rows"))
60
In [22]: print(pd.get_option("display.max_columns"))
0
Using startup scripts for the Python/IPython environment to import pandas and set options makes working with pandas
more efficient. To do this, create a .py or .ipy script in the startup directory of the desired profile. An example where
the startup folder is in a default IPython profile can be found at:
$IPYTHONDIR/profile_default/startup
More information can be found in the IPython documentation. An example startup script for pandas is displayed
below:
import pandas as pd
pd.set_option("display.max_rows", 999)
pd.set_option("precision", 5)
In [24]: pd.set_option("max_rows", 7)
In [25]: df
Out[25]:
0 1
0 0.469112 -0.282863
1 -1.509059 -1.135632
2 1.212112 -0.173215
3 0.119209 -1.044236
4 -0.861849 -2.104569
5 -0.494929 1.071804
6 0.721555 -0.706771
In [26]: pd.set_option("max_rows", 5)
In [27]: df
Out[27]:
0 1
0 0.469112 -0.282863
1 -1.509059 -1.135632
.. ... ...
5 -0.494929 1.071804
6 0.721555 -0.706771
[7 rows x 2 columns]
In [28]: pd.reset_option("max_rows")
Once display.max_rows is exceeded, the display.min_rows option determines how many rows are
shown in the truncated repr.
In [29]: pd.set_option("max_rows", 8)
In [30]: pd.set_option("min_rows", 4)
In [32]: df
Out[32]:
0 1
0 -1.039575 0.271860
1 -0.424972 0.567020
2 0.276232 -1.087401
3 -0.673690 0.113648
4 -1.478427 0.524988
5 0.404705 0.577046
6 -1.715002 -1.039268
In [34]: df
Out[34]:
0 1
0 -0.370647 -1.157892
1 -1.344312 0.844885
.. ... ...
7 0.276662 -0.472035
8 -0.013960 -0.362543
[9 rows x 2 columns]
In [35]: pd.reset_option("max_rows")
In [36]: pd.reset_option("min_rows")
display.expand_frame_repr controls whether the representation of a wide DataFrame may stretch across multiple lines ("pages"), wrapping over the full set of columns, rather than being truncated.
In [39]: df
Out[39]:
          0         1         2         3         4         5         6         7         8         9
0 -0.006154 -0.923061  0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
In [41]: df
Out[41]:
          0         1         2         3         4         5         6         7         8         9
0 -0.006154 -0.923061  0.895717  0.805244 -1.206412  2.565646  1.431256  1.340309 -1.170299 -0.226169
In [42]: pd.reset_option("expand_frame_repr")
display.large_repr lets you select whether to display dataframes that exceed max_columns or max_rows
as a truncated frame, or as a summary.
In [44]: pd.set_option("max_rows", 5)
In [46]: df
Out[46]:
           0         1         2         3         4         5         6         7         8         9
0  -0.954208  1.462696 -1.743161 -0.826591 -0.345352  1.314232  0.690579  0.995761  2.396780  0.014871
1   3.357427 -0.317441 -1.236269  0.896171 -0.487602 -0.082240 -2.182937  0.380396  0.084844  0.432390
..       ...       ...       ...       ...       ...       ...       ...       ...       ...       ...
8  -0.303421 -0.858447  0.306996 -0.028665  0.384316  1.574159  1.588931  0.476720  0.473424 -0.242861
In [48]: df
Out[48]:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 10 non-null float64
1 1 10 non-null float64
2 2 10 non-null float64
3 3 10 non-null float64
4 4 10 non-null float64
5 5 10 non-null float64
6 6 10 non-null float64
7 7 10 non-null float64
8 8 10 non-null float64
9 9 10 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [49]: pd.reset_option("large_repr")
In [50]: pd.reset_option("max_rows")
display.max_colwidth sets the maximum width of columns. Cells of this length or longer will be truncated
with an ellipsis.
In [51]: df = pd.DataFrame(
   ....:     np.array(
   ....:         [
   ....:             ["foo", "bar", "bim", "uncomfortably long string"],
   ....:             ["horse", "cow", "banana", "apple"],
   ....:         ]
   ....:     )
   ....: )
   ....:
In [53]: df
Out[53]:
0 1 2 3
0 foo bar bim uncomfortably long string
1 horse cow banana apple
In [54]: pd.set_option("max_colwidth", 6)
In [55]: df
Out[55]:
0 1 2 3
0 foo bar bim un...
1 horse cow ba... apple
In [56]: pd.reset_option("max_colwidth")
In [59]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 10 non-null float64
1 1 10 non-null float64
2 2 10 non-null float64
3 3 10 non-null float64
4 4 10 non-null float64
5 5 10 non-null float64
6 6 10 non-null float64
7 7 10 non-null float64
8 8 10 non-null float64
9 9 10 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [60]: pd.set_option("max_info_columns", 5)
In [61]: df.info()
<class 'pandas.core.frame.DataFrame'>
In [62]: pd.reset_option("max_info_columns")
display.max_info_rows: df.info() will usually show null-counts for each column. For large frames this
can be quite slow. max_info_rows and max_info_cols limit this null check to frames with smaller
dimensions than specified. Note that you can pass df.info(null_counts=True) to override this for a
particular frame.
In [64]: df
Out[64]:
0 1 2 3 4 5 6 7 8 9
0 0.0 NaN 1.0 NaN NaN 0.0 NaN 0.0 NaN 1.0
1 1.0 NaN 1.0 1.0 1.0 1.0 NaN 0.0 0.0 NaN
2 0.0 NaN 1.0 0.0 0.0 NaN NaN NaN NaN 0.0
3 NaN NaN NaN 0.0 1.0 1.0 NaN 1.0 NaN 1.0
4 0.0 NaN NaN NaN 0.0 NaN NaN NaN 1.0 0.0
5 0.0 1.0 1.0 1.0 1.0 0.0 NaN NaN 1.0 0.0
6 1.0 1.0 1.0 NaN 1.0 NaN 1.0 0.0 NaN NaN
7 0.0 0.0 1.0 0.0 1.0 0.0 1.0 1.0 0.0 NaN
8 NaN NaN NaN 0.0 NaN NaN NaN NaN 1.0 NaN
9 0.0 NaN 0.0 NaN NaN 0.0 NaN 1.0 1.0 0.0
In [66]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 0 8 non-null float64
1 1 3 non-null float64
2 2 7 non-null float64
3 3 6 non-null float64
4 4 7 non-null float64
5 5 6 non-null float64
6 6 2 non-null float64
7 7 6 non-null float64
8 8 6 non-null float64
9 9 6 non-null float64
dtypes: float64(10)
memory usage: 928.0 bytes
In [67]: pd.set_option("max_info_rows", 5)
In [68]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10 entries, 0 to 9
Data columns (total 10 columns):
# Column Dtype
In [69]: pd.reset_option("max_info_rows")
display.precision sets the output display precision in terms of decimal places. This is only a suggestion.
In [71]: pd.set_option("precision", 7)
In [72]: df
Out[72]:
0 1 2 3 4
0 -1.1506406 -0.7983341 -0.5576966 0.3813531 1.3371217
1 -1.5310949 1.3314582 -0.5713290 -0.0266708 -1.0856630
2 -1.1147378 -0.0582158 -0.4867681 1.6851483 0.1125723
3 -1.4953086 0.8984347 -0.1482168 -1.5960698 0.1596530
4 0.2621358 0.0362196 0.1847350 -0.2550694 -0.2710197
In [73]: pd.set_option("precision", 4)
In [74]: df
Out[74]:
0 1 2 3 4
0 -1.1506 -0.7983 -0.5577 0.3814 1.3371
1 -1.5311 1.3315 -0.5713 -0.0267 -1.0857
2 -1.1147 -0.0582 -0.4868 1.6851 0.1126
3 -1.4953 0.8984 -0.1482 -1.5961 0.1597
4 0.2621 0.0362 0.1847 -0.2551 -0.2710
display.chop_threshold sets at what level pandas rounds to zero when it displays a Series or DataFrame. This
setting does not change the precision at which the number is stored.
In [76]: pd.set_option("chop_threshold", 0)
In [77]: df
Out[77]:
0 1 2 3 4 5
0 1.2884 0.2946 -1.1658 0.8470 -0.6856 0.6091
1 -0.3040 0.6256 -0.0593 0.2497 1.1039 -1.0875
2 1.9980 -0.2445 0.1362 0.8863 -1.3507 -0.8863
3 -1.0133 1.9209 -0.3882 -2.3144 0.6655 0.4026
In [79]: df
Out[79]:
0 1 2 3 4 5
0 1.2884 0.0000 -1.1658 0.8470 -0.6856 0.6091
1 0.0000 0.6256 0.0000 0.0000 1.1039 -1.0875
2 1.9980 0.0000 0.0000 0.8863 -1.3507 -0.8863
3 -1.0133 1.9209 0.0000 -2.3144 0.6655 0.0000
4 0.0000 -1.7660 0.8504 0.0000 0.9923 0.7441
5 -0.7398 -1.0549 0.0000 0.6396 1.5850 1.9067
In [80]: pd.reset_option("chop_threshold")
display.colheader_justify controls the justification of the headers. The options are ‘right’, and ‘left’.
In [81]: df = pd.DataFrame(
   ....:     np.array([np.random.randn(6), np.random.randint(1, 9, 6) * 0.1, np.zeros(6)]).T,
In [83]: df
Out[83]:
A B C
0 0.1040 0.1 0.0
1 0.1741 0.5 0.0
2 -0.4395 0.4 0.0
3 -0.7413 0.8 0.0
4 -0.0797 0.4 0.0
5 -0.9229 0.3 0.0
In [85]: df
Out[85]:
A B C
0 0.1040 0.1 0.0
1 0.1741 0.5 0.0
2 -0.4395 0.4 0.0
3 -0.7413 0.8 0.0
4 -0.0797 0.4 0.0
5 -0.9229 0.3 0.0
In [86]: pd.reset_option("colheader_justify")
pandas also allows you to set how numbers are displayed in the console. This option is not set through the
set_options API.
Use the set_eng_float_format function to alter the floating-point formatting of pandas objects to produce a
particular format.
For instance:
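A sketch of the setup behind the engineering-notation output shown below; s is assumed to be a small random Series.

pd.set_eng_float_format(accuracy=3, use_eng_prefix=True)
s = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])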
In [90]: s / 1.0e3
Out[90]:
a 303.638u
b -721.084u
c -622.696u
d 648.250u
e -1.945m
dtype: float64
In [91]: s / 1.0e6
Out[91]:
a 303.638n
b -721.084n
c -622.696n
d 648.250n
e -1.945u
dtype: float64
To round floats on a case-by-case basis, you can also use Series.round() and DataFrame.round().
Warning: Enabling this option will affect the performance for printing of DataFrame and Series (about 2 times
slower). Use only when it is actually required.
Some East Asian countries use Unicode characters whose width corresponds to two Latin characters. If a DataFrame
or Series contains these characters, the default output mode may not align them properly.
Note: Screen captures are attached for each output to show the actual results.
In [93]: df
Out[93]:
0 UK Alice
1
Enabling display.unicode.east_asian_width allows pandas to check each character’s “East Asian Width”
property. These characters can be aligned properly by setting this option to True. However, this will result in longer
render times than the standard len function.
In [95]: df
Out[95]:
0 UK Alice
1
In addition, Unicode characters whose width is “Ambiguous” can either be 1 or 2 characters wide depending on the
terminal setting or encoding. The option display.unicode.ambiguous_as_wide can be used to handle the
ambiguity.
By default, an “Ambiguous” character’s width, such as “¡” (inverted exclamation) in the example below, is taken to be
1.
In [97]: df
Out[97]:
a b
In [99]: df
Out[99]:
a b
0 xxx yyy
1 ¡¡ ¡¡
DataFrame and Series can publish a Table Schema representation. This is disabled by default, but can be enabled
globally with the display.html.table_schema option:
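pd.set_option("display.html.table_schema", True)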
In this part of the tutorial, we will investigate how to speed up certain functions operating on pandas DataFrames
using three different techniques: Cython, Numba and pandas.eval(). We will see a speed improvement of roughly 200x
when we use Cython and Numba on a test function operating row-wise on the DataFrame. Using pandas.eval()
we will speed up a sum by a factor of ~2.
Note: In addition to following the steps in this tutorial, users interested in enhancing performance are highly encour-
aged to install the recommended dependencies for pandas. These dependencies are often not installed by default, but
will offer speed improvements if present.
For many use cases writing pandas in pure Python and NumPy is sufficient. In some computationally heavy applica-
tions, however, it can be possible to achieve sizable speed-ups by offloading work to Cython.
This tutorial assumes you have refactored as much as possible in Python, for example by trying to remove for-loops
and making use of NumPy vectorization. It’s always worth optimising in Python first.
This tutorial walks through a “typical” process of cythonizing a slow computation. We use an example from the
Cython documentation but in the context of pandas. Our final cythonized solution is around 100 times faster than the
pure Python solution.
Pure Python
In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "a": np.random.randn(1000),
   ...:         "b": np.random.randn(1000),
   ...:         "N": np.random.randint(100, 1000, (1000)),
   ...:         "x": "x",
   ...:     }
   ...: )
   ...:
In [2]: df
Out[2]:
a b N x
0 0.469112 -0.218470 585 x
1 -0.282863 -0.061645 841 x
2 -1.509059 -0.723780 251 x
3 -1.135632 0.551225 972 x
4 1.212112 -0.497767 181 x
.. ... ... ... ..
995 -1.512743 0.874737 374 x
996 0.933753 1.120790 246 x
997 -0.308013 0.198768 157 x
998 -0.079915 1.757555 977 x
999 -1.010589 -1.115680 770 x
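The row-wise computation being timed is a pair of plain-Python functions; a sketch, reconstructed from the Cython copies shown below (the exact names are assumed):

def f(x):
    return x * (x - 1)

def integrate_f(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f(a + i * dx)
    return s * dx

# The slow baseline: apply the integration row-wise over the DataFrame.
df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)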
But clearly this isn't fast enough for us. Let's take a look and see where the time is spent during this operation (limited
to the four most time-consuming calls) using the %prun IPython magic function:
By far the majority of the time is spent inside either integrate_f or f, hence we'll concentrate our efforts cythonizing
these two functions.
Plain Cython
First we’re going to need to import the Cython magic function to IPython:
Now, let’s simply copy our functions over to Cython as is (the suffix is here to distinguish between function versions):
In [7]: %%cython
   ...: def f_plain(x):
   ...:     return x * (x - 1)
   ...: def integrate_f_plain(a, b, N):
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f_plain(a + i * dx)
   ...:     return s * dx
   ...:
Note: If you’re having trouble pasting the above into your ipython, you may need to be using bleeding edge IPython
for paste to play well with cell magics.
Already this has shaved a third off, not too bad for a simple copy and paste.
Adding type
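The typed cell itself is not reproduced at this point in the extract; it is the same pair of functions annotated with C types, and it reappears verbatim inside the more advanced example further below (the cell numbering here is assumed):

In [8]: %%cython
   ...: cdef double f_typed(double x) except? -2:
   ...:     return x * (x - 1)
   ...: cpdef double integrate_f_typed(double a, double b, int N):
   ...:     cdef int i
   ...:     cdef double s, dx
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f_typed(a + i * dx)
   ...:     return s * dx
   ...: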
Now, we’re talking! It’s now over ten times faster than the original Python implementation, and we haven’t really
modified the code. Let’s have another look at what’s eating up time:
In [9]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]),
˓→axis=1)
Using ndarray
It's calling Series... a lot! It's creating a Series from each row, and calling get from both the index and the series (three
times for each row). Function calls are expensive in Python, so maybe we could minimize these by cythonizing the
apply part.
Note: We are now passing ndarrays into the Cython function, fortunately Cython plays very nicely with NumPy.
In [10]: %%cython
   ....: cimport numpy as np
   ....: import numpy as np
   ....: cdef double f_typed(double x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef double integrate_f_typed(double a, double b, int N):
   ....:     cdef int i
   ....:     cdef double s, dx
   ....:     s = 0
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx
   ....:
The implementation is simple, it creates an array of zeros and loops over the rows, applying our
integrate_f_typed, and putting this in the zeros array.
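The helper itself is omitted from the cell above; a minimal sketch of what it might look like, given that description (the name, signature and typing are assumptions):

cpdef np.ndarray[double] apply_integrate_f(np.ndarray[double] col_a,
                                           np.ndarray[double] col_b,
                                           np.ndarray col_N):
    cdef int i, n = len(col_N)
    # Array of zeros to hold one result per row.
    cdef np.ndarray[double] res = np.zeros(n)
    for i in range(n):
        res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    return res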
Warning: You cannot pass a Series directly as an ndarray-typed parameter to a Cython function. Instead
pass the actual ndarray using Series.to_numpy(). The reason is that the Cython definition is specific
to an ndarray, not to the passed Series.
So, do not do this:
apply_integrate_f(df["a"], df["b"], df["N"])
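But rather pass the underlying NumPy arrays, along the lines of:

apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())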
Note: Loops like this would be extremely slow in Python, but in Cython looping over NumPy arrays is fast.
We’ve gotten another big improvement. Let’s check again where the time is spent:
As one might expect, the majority of the time is now spent in apply_integrate_f, so if we want to squeeze out
any more efficiency we must continue to concentrate our efforts here.
There is still hope for improvement. Here’s an example of using some more advanced Cython techniques:
In [12]: %%cython
   ....: cimport cython
   ....: cimport numpy as np
   ....: import numpy as np
   ....: cdef double f_typed(double x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef double integrate_f_typed(double a, double b, int N):
   ....:     cdef int i
   ....:     cdef double s, dx
   ....:     s = 0
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx
   ....: @cython.boundscheck(False)
   ....: @cython.wraparound(False)
   ....: cpdef np.ndarray[double] apply_integrate_f_wrap(np.ndarray[double] col_a,
   ....:                                                 np.ndarray[double] col_b,
   ....:                                                 np.ndarray[int] col_N):
   ....:     cdef int i, n = len(col_N)
   ....:     assert len(col_a) == len(col_b) == n
   ....:     cdef np.ndarray[double] res = np.empty(n)
   ....:     for i in range(n):
   ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
   ....:     return res
   ....:
Even faster, with the caveat that a bug in our Cython code (an off-by-one error, for example) might cause a segfault
because memory access isn’t checked. For more about boundscheck and wraparound, see the Cython docs on
compiler directives.
A recent alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler, Numba.
Numba gives you the power to speed up your applications with high performance functions written directly in Python.
With a few annotations, array-oriented and math-heavy Python code can be just-in-time compiled to native machine
instructions, similar in performance to C, C++ and Fortran, without having to switch languages or Python interpreters.
Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime,
or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU
hardware, and is designed to integrate with the Python scientific software stack.
Note: You will need to install Numba. This is easy with conda, by using: conda install numba, see installing
using miniconda.
Note: As of Numba version 0.20, pandas objects cannot be passed directly to Numba-compiled functions. Instead,
one must pass the NumPy array underlying the pandas object to the Numba-compiled function as demonstrated below.
Jit
We demonstrate how to use Numba to just-in-time compile our code. We simply take the plain Python code from
above and annotate with the @jit decorator.
import numba

@numba.jit
def f_plain(x):
    return x * (x - 1)

@numba.jit
def integrate_f_numba(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx

@numba.jit
def apply_integrate_f_numba(col_a, col_b, col_N):
    n = len(col_N)
    result = np.empty(n, dtype="float64")
    assert len(col_a) == len(col_b) == n
    for i in range(n):
        result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
    return result

def compute_numba(df):
    result = apply_integrate_f_numba(
        df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy()
    )
    return pd.Series(result, index=df.index, name="result")
Note that we directly pass NumPy arrays to the Numba function. compute_numba is just a wrapper that provides a
nicer interface by passing/returning pandas objects.
Numba as an argument
Additionally, we can leverage the power of Numba by calling it as an argument in apply(). See Computation tools
for an extensive example.
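One place this is supported (a sketch, not taken from this document's example) is rolling aggregation: with engine="numba" pandas itself JIT-compiles the user function, and raw=True is required so the function receives NumPy arrays rather than Series.

# Roll a window of 10 over column "a" and let pandas hand the function to Numba.
df["a"].rolling(10).apply(lambda x: np.sum(x) + 5, engine="numba", raw=True)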
Vectorize
Numba can also be used to write vectorized functions that do not require the user to explicitly loop over the observations of a vector; a vectorized function will be applied to each row automatically. Consider the following toy example of doubling each observation:
import numba

def double_every_value_nonumba(x):
    return x * 2

@numba.vectorize
def double_every_value_withnumba(x):  # noqa E501
    return x * 2
Caveats
Note: Numba will execute on any function, but can only accelerate certain classes of functions.
Numba is best at accelerating functions that apply numerical functions to NumPy arrays. When passed a function that
only uses operations it knows how to accelerate, it will execute in nopython mode.
If Numba is passed a function that includes something it doesn’t know how to work with – a category that currently
includes sets, lists, dictionaries, or string functions – it will revert to object mode. In object mode, Numba
will execute but your code will not speed up significantly. If you would prefer that Numba throw an error if it cannot
compile a function in a way that speeds up your code, pass Numba the argument nopython=True (e.g. @numba.jit(nopython=True)). For more on troubleshooting Numba modes, see the Numba troubleshooting page.
Read more in the Numba docs.
The top-level function pandas.eval() implements expression evaluation of Series and DataFrame objects.
Note: To benefit from using eval() you need to install numexpr. See the recommended dependencies section for
more details.
The point of using eval() for expression evaluation rather than plain Python is two-fold: 1) large DataFrame
objects are evaluated more efficiently and 2) large arithmetic and boolean expressions are evaluated all at once by the
underlying engine (by default numexpr is used for evaluation).
Note: You should not use eval() for simple expressions or for expressions involving small DataFrames. In fact,
eval() is many orders of magnitude slower for smaller expressions/objects than plain ol’ Python. A good rule of
thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.
eval() supports all arithmetic expressions supported by the engine in addition to some extensions available only in
pandas.
Note: The larger the frame and the larger the expression the more speedup you will see from using eval().
Supported syntax
The following Python syntax is not allowed:
• Expressions
– yield expressions
– Generator expressions
– Boolean expressions consisting of only scalar values
• Statements
– Neither simple nor compound statements are allowed. This includes things like for, while, and if.
eval() examples
Now let’s compare adding them together using plain ol’ Python versus eval():
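The construction of the frames being timed is omitted in this extract; a sketch of a comparable setup with four largish DataFrames of random floats (the sizes are assumptions):

nrows, ncols = 20000, 100
df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]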
In [15]: %timeit df1 + df2 + df3 + df4
12.6 ms +- 180 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [18]: %timeit pd.eval("(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)")
9.63 ms +- 39.9 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
Boolean and bitwise operations on scalar operands should be performed in Python. An exception will be raised if you
try to perform any boolean/bitwise operations with scalar operands that are not of type bool or np.bool_; again,
perform these kinds of operations in plain Python.
In addition to the top level pandas.eval() function you can also evaluate an expression in the “context” of a
DataFrame.
In [22]: df = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"])
Any expression that is a valid pandas.eval() expression is also a valid DataFrame.eval() expression, with
the added benefit that you don’t have to prefix the name of the DataFrame to the column(s) you’re interested in
evaluating.
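A minimal sketch of that convenience, using the frame created above:

pd.eval("df.a + df.b")   # top-level eval: columns are prefixed with the frame name
df.eval("a + b")         # DataFrame.eval: bare column names resolve to columns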
In addition, you can perform assignment of columns within an expression. This allows for formulaic evaluation. The
assignment target can be a new column name or an existing column name, and it must be a valid Python identifier.
The inplace keyword determines whether this assignment will be performed on the original DataFrame or return a
copy with the new column.
In [24]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
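The assignment expressions themselves are omitted in this extract; a sketch consistent with the frame shown below:

df.eval("c = a + b", inplace=True)
df.eval("d = a + b + c", inplace=True)
df.eval("a = 1", inplace=True)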
In [28]: df
Out[28]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
When inplace is set to False, the default, a copy of the DataFrame with the new or modified columns is returned
and the original frame is unchanged.
In [29]: df
Out[29]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
In [31]: df
Out[31]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
In [36]: df["a"] = 1
In [37]: df
Out[37]:
a b c d
0 1 5 5 10
1 1 6 7 14
2 1 7 9 18
3 1 8 11 22
4 1 9 13 26
The query method has an inplace keyword which determines whether the query modifies the original frame.
In [38]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))
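The call itself is omitted here; a sketch consistent with the filtered frame shown below. With inplace=True, the original frame is modified:

df.query("a > 2", inplace=True)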
In [41]: df
Out[41]:
a b
3 3 8
4 4 9
Local variables
You must explicitly reference any local variable that you want to use in an expression by placing the @ character in
front of the name. For example,
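A sketch, where newcol is a hypothetical local variable:

df = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"])
newcol = np.random.randn(len(df))
df.query("b < @newcol")   # @newcol is the local variable, b is a column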
If you don’t prefix the local variable with @, pandas will raise an exception telling you the variable is undefined.
When using DataFrame.eval() and DataFrame.query(), this allows you to have a local variable and a
DataFrame column with the same name in an expression.
In [46]: a = np.random.randn()
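A sketch of how the two resolve: the @-prefixed name refers to the local variable, the bare name to the column.

df.query("@a < a")
df.loc[a < df["a"]]   # equivalent, without query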
With pandas.eval() you cannot use the @ prefix at all, because it isn’t defined in that context. pandas will let you
know this if you try to use @ in a top-level call to pandas.eval(). For example,
In [49]: a, b = 1, 2
File "/opt/conda/envs/pandas/lib/python3.8/site-packages/IPython/core/
˓→ interactiveshell.py", line 3418, in run_code
exec(code_obj, self.user_global_ns, self.user_ns)
In this case, you should simply refer to the variables like you would in standard Python.
pandas.eval() parsers
There are two different parsers and two different engines you can use as the backend.
The default 'pandas' parser allows a more intuitive syntax for expressing query-like operations (comparisons,
conjunctions and disjunctions). In particular, the precedence of the & and | operators is made equal to the precedence
of the corresponding boolean operations and and or.
For example, the above conjunction can be written without parentheses. Alternatively, you can use the 'python'
parser to enforce strict Python semantics.
In [52]: expr = "(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)"
In [54]: expr_no_parens = "df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0"
In [56]: np.all(x == y)
Out[56]: True
The same expression can be “anded” together with the word and as well:
In [57]: expr = "(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)"
In [59]: expr_with_ands = "df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0"
In [61]: np.all(x == y)
Out[61]: True
The and and or operators here have the same precedence that they would in vanilla Python.
pandas.eval() backends
There's also the option to make eval() operate identically to plain ol' Python.
Note: Using the 'python' engine is generally not useful, except for testing other evaluation engines against it. You
will achieve no performance benefits using eval() with engine='python' and in fact may incur a performance
hit.
You can see this by using pandas.eval() with the 'python' engine. It is a bit slower (not by much) than
evaluating the same expression in Python.
pandas.eval() performance
eval() is intended to speed up certain kinds of operations. In particular, those operations involving complex expres-
sions with large DataFrame/Series objects should see a significant performance benefit. Here is a plot showing
the running time of pandas.eval() as a function of the size of the frame involved in the computation. The two lines
are two different engines.
Note: Operations with smallish objects (around 15k-20k rows) are faster using plain Python:
This plot was created using a DataFrame with 3 columns each containing floating point values generated using
numpy.random.randn().
Expressions that would result in an object dtype or involve datetime operations (because of NaT) must be evaluated
in Python space. The main reason for this behavior is to maintain backwards compatibility with versions of NumPy <
1.7. In those versions of NumPy a call to ndarray.astype(str) will truncate any strings that are more than 60
characters in length. Second, we can't pass object arrays to numexpr, and thus string comparisons must be evaluated
in Python space.
The upshot is that this only applies to object-dtype expressions. So, if you have an expression–for example
In [64]: df = pd.DataFrame(
   ....:     {"strings": np.repeat(list("cba"), 3), "nums": np.repeat(range(3), 3)}
   ....: )
   ....:
In [65]: df
Out[65]:
strings nums
0 c 0
1 c 0
2 c 0
3 b 1
4 b 1
5 b 1
6 a 2
7 a 2
8 a 2
pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger
than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some
pandas operations need to make intermediate copies.
This document provides a few recommendations for scaling your analysis to larger datasets. It’s a complement to
Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.
But first, it’s worth considering not using pandas. pandas isn’t the right tool for all situations. If you’re working with
very large datasets and a tool like PostgreSQL fits your needs, then you should probably be using that. Assuming you
want or need the expressiveness and power of pandas, let’s carry on.
To load the columns we want, we have two options. Option 1 loads in all the data and then filters to what we need.
In [4]: pd.read_parquet("timeseries_wide.parquet")[columns]
Out[4]:
id_0 name_0 x_0 y_0
timestamp
2000-01-01 00:00:00 1015 Michael -0.399453 0.095427
2000-01-01 00:01:00 969 Patricia 0.650773 -0.874275
2000-01-01 00:02:00 1016 Victor -0.721465 -0.584710
2000-01-01 00:03:00 939 Alice -0.746004 -0.908008
2000-01-01 00:04:00 1017 Dan 0.919451 -0.803504
... ... ... ... ...
2000-12-30 23:56:00 999 Tim 0.162578 0.512817
2000-12-30 23:57:00 970 Laura -0.433586 -0.600289
2000-12-30 23:58:00 1065 Edith 0.232211 -0.454540
2000-12-30 23:59:00 1019 Ingrid 0.322208 -0.615974
2000-12-31 00:00:00 937 Ursula -0.906523 0.943178
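Option 2, the second of "the two calls" referenced just below, pushes the column selection into the reader itself; a sketch:

pd.read_parquet("timeseries_wide.parquet", columns=columns)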
If we were to measure the memory usage of the two calls, we’d see that specifying columns uses about 1/10th the
memory in this case.
With pandas.read_csv(), you can specify usecols to limit the columns read into memory. Not all file formats
that can be read by pandas provide an option to read a subset of columns.
The default pandas data types are not the most memory efficient. This is especially true for text data columns with
relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you
can store larger datasets in memory.
In [6]: ts = pd.read_parquet("timeseries.parquet")
In [7]: ts
Out[7]:
id name x y
timestamp
2000-01-01 00:00:00 1029 Michael 0.278837 0.247932
2000-01-01 00:00:30 1010 Patricia 0.077144 0.490260
2000-01-01 00:01:00 1001 Victor 0.214525 0.258635
2000-01-01 00:01:30 1018 Alice -0.646866 0.822104
2000-01-01 00:02:00 991 Dan 0.902389 0.466665
... ... ... ... ...
2000-12-30 23:58:00 992 Sarah 0.721155 0.944118
2000-12-30 23:58:30 1007 Ursula 0.409277 0.133227
2000-12-30 23:59:00 1009 Hannah -0.452802 0.184318
2000-12-30 23:59:30 978 Kevin -0.904728 -0.179146
2000-12-31 00:00:00 973 Ingrid -0.370763 -0.794667
Now, let’s inspect the data types and memory usage to see where we should focus our attention.
In [8]: ts.dtypes
Out[8]:
id int64
name object
x float64
y float64
dtype: object
The name column is taking up much more memory than any other. It has just a few unique values, so it’s a good
candidate for converting to a Categorical. With a Categorical, we store each unique name once and use space-
efficient integers to know which specific name is used in each row.
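A sketch of the omitted conversion behind the memory-usage figures that follow:

ts2 = ts.copy()
ts2["name"] = ts2["name"].astype("category")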
In [12]: ts2.memory_usage(deep=True)
Out[12]:
Index 8409608
id 8409608
We can go a bit further and downcast the numeric columns to their smallest types using pandas.to_numeric().
In [13]: ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")
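The float columns were presumably downcast the same way; a sketch of the omitted lines that would yield the float32 dtypes shown below:

ts2["x"] = pd.to_numeric(ts2["x"], downcast="float")
ts2["y"] = pd.to_numeric(ts2["y"], downcast="float")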
In [15]: ts2.dtypes
Out[15]:
id uint16
name category
x float32
y float32
dtype: object
In [16]: ts2.memory_usage(deep=True)
Out[16]:
Index 8409608
id 2102402
name 1053894
x 4204804
y 4204804
dtype: int64
In [18]: print(f"{reduction:0.2f}")
0.20
In all, we’ve reduced the in-memory footprint of this dataset to 1/5 of its original size.
See Categorical data for more on Categorical and dtypes for an overview of all of pandas’ dtypes.
Some workloads can be achieved with chunking: splitting a large problem like “convert this directory of CSVs to
parquet” into a bunch of small problems (“convert this individual CSV file into a Parquet file. Now repeat that for each
file in this directory.”). As long as each chunk fits in memory, you can work with datasets that are much larger than
memory.
Note: Chunking works well when the operation you’re performing requires zero or minimal coordination between
chunks. For more complicated workflows, you’re better off using another library.
Suppose we have an even larger “logical dataset” on disk that’s a directory of parquet files. Each file in the directory
represents a different year of the entire dataset.
data
timeseries
Now we’ll implement an out-of-core value_counts. The peak memory usage of this workflow is the single largest
chunk, plus a small series storing the unique value counts up to this point. As long as each individual file fits in
memory, this will work for arbitrary-sized datasets.
In [19]: %%time
   ....: files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
   ....: counts = pd.Series(dtype=int)
   ....: for path in files:
   ....:     df = pd.read_parquet(path)
   ....:     counts = counts.add(df["name"].value_counts(), fill_value=0)
   ....: counts.astype(int)
   ....:
CPU times: user 850 ms, sys: 82.8 ms, total: 932 ms
Wall time: 687 ms
Out[19]:
Alice 229802
Bob 229211
Charlie 229303
Dan 230621
Edith 230349
...
Victor 230502
Wendy 230038
Xavier 229553
Yvonne 228766
Zelda 229909
Length: 26, dtype: int64
Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file.
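For instance, read_csv's chunksize parameter yields an iterator of DataFrames, so the value_counts workflow above could be written against a single large file (a sketch, assuming a hypothetical timeseries.csv with the same name column):

counts = pd.Series(dtype=int)
for chunk in pd.read_csv("data/timeseries.csv", chunksize=100_000):
    counts = counts.add(chunk["name"].value_counts(), fill_value=0)
counts.astype(int)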
Manually chunking is an OK option for workflows that don't require overly sophisticated operations. Some operations,
like groupby, are much harder to do chunkwise. In these cases, you may be better off switching to a different library
that implements these out-of-core algorithms for you.
pandas is just one library offering a DataFrame API. Because of its popularity, pandas’ API has become something
of a standard that other libraries implement. The pandas documentation maintains a list of libraries implementing a
DataFrame API in our ecosystem page.
For example, Dask, a parallel computing library, has dask.dataframe, a pandas-like API for working with larger than
memory datasets in parallel. Dask can use multiple threads or processes on a single machine, or a cluster of machines
to process data in parallel.
We’ll import dask.dataframe and notice that the API feels similar to pandas. We can use Dask’s
read_parquet function, but provide a globstring of files to read in.
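A sketch of the omitted import and read; the glob matches the per-year files used in the chunked example above:

import dask.dataframe as dd

ddf = dd.read_parquet("data/timeseries/ts*.parquet")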
In [22]: ddf
Out[22]:
Dask DataFrame Structure:
id name x y
npartitions=12
int64 object float64 float64
... ... ... ...
... ... ... ... ...
... ... ... ...
... ... ... ...
Dask Name: read-parquet, 12 tasks
In [23]: ddf.columns
Out[23]: Index(['id', 'name', 'x', 'y'], dtype='object')
In [24]: ddf.dtypes
Out[24]:
id int64
name object
x float64
y float64
dtype: object
In [25]: ddf.npartitions
Out[25]: 12
One major difference: the dask.dataframe API is lazy. If you look at the repr above, you'll notice that the values
aren't actually printed out; just the column names and dtypes. That's because Dask hasn't actually read the data yet.
Rather than executing immediately, operations build up a task graph.
In [26]: ddf
Out[26]:
Dask DataFrame Structure:
id name x y
npartitions=12
int64 object float64 float64
... ... ... ...
... ... ... ... ...
... ... ... ...
... ... ... ...
Dask Name: read-parquet, 12 tasks
In [27]: ddf["name"]
Out[27]:
Dask Series Structure:
npartitions=12
object
...
...
...
...
Name: name, dtype: object
Dask Name: getitem, 24 tasks
In [28]: ddf["name"].value_counts()
Out[28]:
Dask Series Structure:
npartitions=1
int64
...
Name: name, dtype: int64
Dask Name: value-counts-agg, 39 tasks
Each of these calls is instant because the result isn't being computed yet. We're just building up a list of computations
to do when someone needs the result. Dask knows that the return type of a pandas.Series.value_counts is a
pandas Series with a certain dtype and a certain name. So the Dask version returns a Dask Series with the same dtype
and the same name.
To get the actual result you can call .compute().
At that point, you get back the same thing you’d get with pandas, in this case a concrete pandas Series with the count
of each name.
Calling .compute causes the full task graph to be executed. This includes reading the data, selecting the columns,
and doing the value_counts. The execution is done in parallel where possible, and Dask tries to keep the overall
memory footprint small. You can work with datasets that are much larger than memory, as long as each partition (a
regular pandas DataFrame) fits in memory.
By default, dask.dataframe operations use a threadpool to do operations in parallel. We can also connect to a
cluster to distribute the work on many machines. In this case we’ll connect to a local “cluster” made up of several
processes on this single machine.
>>> from dask.distributed import Client, LocalCluster
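>>> # A sketch of creating the local cluster and attaching a client to it.
>>> cluster = LocalCluster()
>>> client = Client(cluster)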
Once this client is created, all of Dask’s computation will take place on the cluster (which is just processes in this
case).
Dask implements the most used parts of the pandas API. For example, we can do a familiar groupby aggregation.
In [30]: %time ddf.groupby("name")[["x", "y"]].mean().compute().head()
CPU times: user 1.63 s, sys: 289 ms, total: 1.92 s
Wall time: 1.02 s
Out[30]:
x y
name
Alice 0.000086 -0.001170
Bob -0.000843 -0.000799
Charlie 0.000564 -0.000038
Dan 0.000584 0.000818
Edith -0.000116 -0.000044
In [36]: ddf
Out[36]:
Dask DataFrame Structure:
id name x y
npartitions=12
2000-01-01 int64 object float64 float64
2001-01-01 ... ... ... ...
... ... ... ... ...
Dask knows to just look in the 3rd partition for selecting values in 2002. It doesn’t need to look at any other data.
Many workflows involve a large amount of data and processing it in a way that reduces the size to something that fits
in memory. In this case, we’ll resample to daily frequency and take the mean. Once we’ve taken the mean, we know
the results will fit in memory, so we can safely call compute without running out of memory. At that point it’s just a
regular pandas object.
These Dask examples have all been done using multiple processes on a single machine. Dask can be deployed on a
cluster to scale up to even larger datasets.
You can see more Dask examples at https://examples.dask.org.
pandas provides data structures for efficiently storing sparse data. These are not necessarily sparse in the typical
"mostly 0" sense. Rather, you can view these objects as being "compressed" where any data matching a specific value (NaN
/ missing value, though any value can be chosen, including 0) is omitted. The compressed values are not actually
stored in the array.
In [3]: ts = pd.Series(pd.arrays.SparseArray(arr))
In [4]: ts
Out[4]:
0 0.469112
1 -0.282863
2 NaN
Notice the dtype, Sparse[float64, nan]. The nan means that elements in the array that are nan aren’t actually
stored, only the non-nan elements are. Those non-nan elements have a float64 dtype.
The sparse objects exist for memory efficiency reasons. Suppose you had a large, mostly NA DataFrame:
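A sketch of the omitted construction behind the output below (the shape is chosen to match the reported density of 0.0002):

df = pd.DataFrame(np.random.randn(10000, 4))
df.iloc[:9998] = np.nan
sdf = df.astype(pd.SparseDtype("float", np.nan))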
In [8]: sdf.head()
Out[8]:
0 1 2 3
0 NaN NaN NaN NaN
1 NaN NaN NaN NaN
2 NaN NaN NaN NaN
3 NaN NaN NaN NaN
4 NaN NaN NaN NaN
In [9]: sdf.dtypes
Out[9]:
0 Sparse[float64, nan]
1 Sparse[float64, nan]
2 Sparse[float64, nan]
3 Sparse[float64, nan]
dtype: object
In [10]: sdf.sparse.density
Out[10]: 0.0002
As you can see, the density (% of values that have not been “compressed”) is extremely low. This sparse object takes
up much less memory on disk (pickled) and in the Python interpreter.
2.25.1 SparseArray
arrays.SparseArray is an ExtensionArray for storing an array of sparse values (see dtypes for more on
extension arrays). It is a 1-dimensional ndarray-like object storing only values distinct from the fill_value:
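A sketch of a construction consistent with the output below (NaNs at positions 2-4 and 7):

arr = np.random.randn(10)
arr[2:5] = np.nan
arr[7] = np.nan
sparr = pd.arrays.SparseArray(arr)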
In [17]: sparr
Out[17]:
[-1.9556635297215477, -1.6588664275960427, nan, nan, nan, 1.1589328886422277, 0.14529711373305043, nan, 0.6060271905134522, 1.3342113401317768]
Fill: nan
IntIndex
Indices: array([0, 1, 5, 6, 8, 9], dtype=int32)
In [18]: np.asarray(sparr)
Out[18]:
array([-1.9557, -1.6589, nan, nan, nan, 1.1589, 0.1453,
nan, 0.606 , 1.3342])
2.25.2 SparseDtype
In [19]: sparr.dtype
Out[19]: Sparse[float64, nan]
In [20]: pd.SparseDtype(np.dtype('datetime64[ns]'))
Out[20]: Sparse[datetime64[ns], NaT]
If only a dtype is passed (as above), a default fill value will be used (for NumPy dtypes this is often the "missing" value for that dtype). To
override this default, an explicit fill value may be passed instead:
In [21]: pd.SparseDtype(np.dtype('datetime64[ns]'),
....: fill_value=pd.Timestamp('2017-01-01'))
....:
Out[21]: Sparse[datetime64[ns], Timestamp('2017-01-01 00:00:00')]
Finally, the string alias 'Sparse[dtype]' may be used to specify a sparse dtype in many places
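For example, a sketch matching the density and fill value shown below:

s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")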
In [24]: s.sparse.density
Out[24]: 0.5
In [25]: s.sparse.fill_value
Out[25]: 0
This accessor is available only on data with SparseDtype, and on the Series class itself for creating a Series with
sparse data from a scipy COO matrix.
New in version 0.25.0.
A .sparse accessor has been added for DataFrame as well. See Sparse accessor for more.
You can apply NumPy ufuncs to SparseArray and get a SparseArray as a result.
In [27]: np.abs(arr)
Out[27]:
[1.0, nan, nan, 2.0, nan]
Fill: nan
IntIndex
Indices: array([0, 3], dtype=int32)
The ufunc is also applied to fill_value. This is needed to get the correct dense result.
In [29]: np.abs(arr)
Out[29]:
[1.0, 1, 1, 2.0, 1]
Fill: 1
IntIndex
Indices: array([0, 3], dtype=int32)
In [30]: np.abs(arr).to_dense()
Out[30]: array([1., 1., 1., 2., 1.])
2.25.5 Migrating
Note: SparseSeries and SparseDataFrame were removed in pandas 1.0.0. This migration guide is present
to aid in migrating from previous versions.
In older versions of pandas, the SparseSeries and SparseDataFrame classes (documented below) were the
preferred way to work with sparse data. With the advent of extension arrays, these subclasses are no longer needed.
Their purpose is better served by using a regular Series or DataFrame with sparse values instead.
Note: There’s no performance or memory penalty to using a Series or DataFrame with sparse values, rather than a
SparseSeries or SparseDataFrame.
This section provides some guidance on migrating your code to the new style. As a reminder, you can use the Python
warnings module to control warnings. But we recommend modifying your code, rather than ignoring the warning.
Construction
From an array-like, use the regular Series or DataFrame constructors with SparseArray values.
# Previous way
>>> pd.SparseDataFrame({"A": [0, 1]})
# New way
In [31]: pd.DataFrame({"A": pd.arrays.SparseArray([0, 1])})
Out[31]:
A
0 0
1 1
# Previous way
>>> from scipy import sparse
>>> mat = sparse.eye(3)
>>> df = pd.SparseDataFrame(mat, columns=['A', 'B', 'C'])
# New way
In [32]: from scipy import sparse
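A sketch of the omitted construction that yields the sparse dtypes shown below:

mat = sparse.eye(3)
df = pd.DataFrame.sparse.from_spmatrix(mat, columns=["A", "B", "C"])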
In [35]: df.dtypes
Out[35]:
A Sparse[float64, 0]
B Sparse[float64, 0]
C Sparse[float64, 0]
dtype: object
Conversion
From sparse to dense, use the .sparse accessors
In [36]: df.sparse.to_dense()
Out[36]:
A B C
0 1.0 0.0 0.0
1 0.0 1.0 0.0
2 0.0 0.0 1.0
In [37]: df.sparse.to_coo()
Out[37]:
<3x3 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
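From dense to sparse, use DataFrame.astype() with a SparseDtype; a sketch of the omitted setup behind the output below:

dense = pd.DataFrame({"A": [1, 0, 0, 1]})
dtype = pd.SparseDtype(int, fill_value=0)
dense.astype(dtype)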
In [40]: dense.astype(dtype)
Out[40]:
A
0 1
1 0
2 0
3 1
Sparse Properties
Sparse-specific properties, like density, are available on the .sparse accessor.
In [41]: df.sparse.density
Out[41]: 0.3333333333333333
General differences
In a SparseDataFrame, all columns were sparse. A DataFrame can have a mixture of sparse and dense columns.
As a consequence, assigning new columns to a DataFrame with sparse values will not automatically convert the input
to be sparse.
# Previous Way
>>> df = pd.SparseDataFrame({"A": [0, 1]})
>>> df['B'] = [0, 0] # implicitly becomes Sparse
>>> df['B'].dtype
Sparse[int64, nan]
Instead, you’ll need to ensure that the values being assigned are sparse
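A sketch of the assignments behind the dtypes shown below:

df["B"] = [0, 0]                         # stays a dense int64 column
df["B"] = pd.arrays.SparseArray([0, 0])  # explicitly sparse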
In [44]: df['B'].dtype
Out[44]: dtype('int64')
In [46]: df['B'].dtype
Out[46]: Sparse[int64, 0]
Use DataFrame.sparse.from_spmatrix() to create a DataFrame with sparse values from a sparse matrix.
New in version 0.25.0.
In [51]: sp_arr
Out[51]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 517 stored elements in Compressed Sparse Row format>
In [53]: sdf.head()
Out[53]:
0 1 2 3 4
0 0.956380 0.0 0.0 0.000000 0.0
1 0.000000 0.0 0.0 0.000000 0.0
2 0.000000 0.0 0.0 0.000000 0.0
3 0.000000 0.0 0.0 0.000000 0.0
4 0.999552 0.0 0.0 0.956153 0.0
In [54]: sdf.dtypes
Out[54]:
0 Sparse[float64, 0]
1 Sparse[float64, 0]
2 Sparse[float64, 0]
3 Sparse[float64, 0]
4 Sparse[float64, 0]
dtype: object
All sparse formats are supported, but matrices that are not in COOrdinate format will be converted, copying data as
needed. To convert back to a sparse SciPy matrix in COO format, you can use the DataFrame.sparse.to_coo()
method:
In [55]: sdf.sparse.to_coo()
Out[55]:
<1000x5 sparse matrix of type '<class 'numpy.float64'>'
with 517 stored elements in COOrdinate format>
Series.sparse.to_coo() is implemented for transforming a Series with sparse values indexed by a MultiIndex
to a scipy.sparse.coo_matrix.
The method requires a MultiIndex with two or more levels.
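A sketch of the omitted construction, with values and index levels chosen to match the output below:

s = pd.Series([3.0, np.nan, 1.0, 3.0, np.nan, np.nan])
s.index = pd.MultiIndex.from_tuples(
    [
        (1, 2, "a", 0),
        (1, 2, "a", 1),
        (1, 1, "b", 0),
        (1, 1, "b", 1),
        (2, 1, "b", 0),
        (2, 1, "b", 1),
    ],
    names=["A", "B", "C", "D"],
)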
In [58]: ss = s.astype('Sparse')
In [59]: ss
Out[59]:
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: Sparse[float64, nan]
In the example below, we transform the Series to a sparse representation of a 2-d array by specifying that the first
and second MultiIndex levels define labels for the rows and the third and fourth levels define labels for the columns.
We also specify that the column and row labels should be sorted in the final sparse representation.
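A sketch of the omitted call, consistent with the 3x4 matrix and sorted labels shown below:

A, rows, columns = ss.sparse.to_coo(
    row_levels=["A", "B"], column_levels=["C", "D"], sort_labels=True
)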
In [61]: A
Out[61]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [62]: A.todense()
Out[62]:
matrix([[0., 0., 1., 3.],
[3., 0., 0., 0.],
[0., 0., 0., 0.]])
In [63]: rows
Out[63]: [(1, 1), (1, 2), (2, 1)]
In [64]: columns
Out[64]: [('a', 0), ('a', 1), ('b', 0), ('b', 1)]
Specifying different row and column labels (and not sorting them) yields a different sparse matrix:
In [66]: A
Out[66]:
<3x2 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [67]: A.todense()
Out[67]:
matrix([[3., 0.],
[1., 3.],
[0., 0.]])
In [68]: rows
Out[68]: [(1, 2, 'a'), (1, 1, 'b'), (2, 1, 'b')]
In [69]: columns
Out[69]: [0, 1]
A convenience method Series.sparse.from_coo() is implemented for creating a Series with sparse values
from a scipy.sparse.coo_matrix.
In [72]: A
Out[72]:
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
In [73]: A.todense()
Out[73]:
matrix([[0., 0., 1., 2.],
[3., 0., 0., 0.],
[0., 0., 0., 0.]])
The default behaviour (with dense_index=False) simply returns a Series containing only the non-null entries.
In [74]: ss = pd.Series.sparse.from_coo(A)
In [75]: ss
Out[75]:
0 2 1.0
3 2.0
1 0 3.0
dtype: Sparse[float64, nan]
Specifying dense_index=True will result in an index that is the Cartesian product of the row and columns coordi-
nates of the matrix. Note that this will consume a significant amount of memory (relative to dense_index=False)
if the sparse matrix is large (and sparse) enough.
In [77]: ss_dense
Out[77]:
0 0 NaN
1 NaN
2 1.0
3 2.0
1 0 3.0
1 NaN
2 NaN
3 NaN
2 0 NaN
1 NaN
2 NaN
3 NaN
dtype: Sparse[float64, nan]
The memory usage of a DataFrame (including the index) is shown when calling info(). A configuration
option, display.memory_usage (see the list of options), specifies whether the DataFrame's memory usage will be
displayed when invoking the df.info() method.
For example, the memory usage of the DataFrame below is shown when calling info():
In [1]: dtypes = [
   ...:     "int64",
   ...:     "float64",
   ...:     "datetime64[ns]",
   ...:     "timedelta64[ns]",
   ...:     "complex128",
   ...:     "object",
   ...:     "bool",
   ...: ]
   ...:
In [2]: n = 5000
In [4]: df = pd.DataFrame(data)
In [6]: df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 int64 5000 non-null int64
1 float64 5000 non-null float64
The + symbol indicates that the true memory usage could be higher, because pandas does not count the memory used
by values in columns with dtype=object.
Passing memory_usage='deep' will enable a more accurate memory usage report, accounting for the full usage
of the contained objects. This is optional as it can be expensive to do this deeper introspection.
In [7]: df.info(memory_usage="deep")
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 int64 5000 non-null int64
1 float64 5000 non-null float64
2 datetime64[ns] 5000 non-null datetime64[ns]
3 timedelta64[ns] 5000 non-null timedelta64[ns]
4 complex128 5000 non-null complex128
5 object 5000 non-null object
6 bool 5000 non-null bool
7 categorical 5000 non-null category
dtypes: bool(1), category(1), complex128(1), datetime64[ns](1), float64(1), int64(1), object(1), timedelta64[ns](1)
By default the display option is set to True but can be explicitly overridden by passing the memory_usage argument
when invoking df.info().
The memory usage of each column can be found by calling the memory_usage() method. This returns a Series
with an index represented by column names and memory usage of each column shown in bytes. For the DataFrame
above, the memory usage of each column and the total memory usage can be found with the memory_usage method:
In [8]: df.memory_usage()
Out[8]:
Index 128
int64 40000
float64 40000
datetime64[ns] 40000
timedelta64[ns] 40000
complex128 80000
object 40000
bool 5000
categorical 9968
dtype: int64
By default the memory usage of the DataFrame's index is shown in the returned Series; it can be suppressed by
passing the index=False argument:
In [10]: df.memory_usage(index=False)
Out[10]:
int64 40000
float64 40000
datetime64[ns] 40000
timedelta64[ns] 40000
complex128 80000
object 40000
bool 5000
categorical 9968
dtype: int64
The memory usage displayed by the info() method utilizes the memory_usage() method to determine the mem-
ory usage of a DataFrame while also formatting the output in human-readable units (base-2 representation; i.e. 1KB
= 1024 bytes).
See also Categorical Memory Usage.
pandas follows the NumPy convention of raising an error when you try to convert something to a bool. This happens
in an if-statement or when using the boolean operations: and, or, and not. It is not clear what the result of the
following code should be:
Should it be True because it’s not zero-length, or False because there are False values? It is unclear, so instead,
pandas raises a ValueError:
You need to explicitly choose what you want to do with the DataFrame, e.g. use any(), all() or empty().
Alternatively, you might want to compare if the pandas object is None:
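A sketch of the explicit alternatives (df here is any DataFrame):

if df is not None:
    print("I was not None")

if not df.empty:
    print("df has at least one row")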
To evaluate single-element pandas objects in a boolean context, use the method bool():
In [11]: pd.Series([True]).bool()
Out[11]: True
In [12]: pd.Series([False]).bool()
Out[12]: False
In [13]: pd.DataFrame([[True]]).bool()
Out[13]: True
In [14]: pd.DataFrame([[False]]).bool()
Out[14]: False
Bitwise boolean
Bitwise boolean operators like == and != return a boolean Series, which is almost always what you want anyway.
>>> s = pd.Series(range(5))
>>> s == 4
0 False
1 False
2 False
3 False
4 True
dtype: bool
Using the Python in operator on a Series tests for membership in the index, not membership among the values.
In [16]: 2 in s
Out[16]: False
In [17]: 'b' in s
Out[17]: True
If this behavior is surprising, keep in mind that using in on a Python dictionary tests keys, not values, and Series
are dict-like. To test for membership in the values, use the method isin():
In [18]: s.isin([2])
Out[18]:
a False
b False
c True
d False
e False
dtype: bool
In [19]: s.isin([2]).any()
Out[19]: True
For DataFrames, likewise, in applies to the column axis, testing for membership in the list of column names.
Choice of NA representation
For lack of NA (missing) support from the ground up in NumPy and Python in general, we were given the difficult
choice between either:
• A masked array solution: an array of data and an array of boolean values indicating whether a value is there or
is missing.
• Using a special sentinel value, bit pattern, or set of sentinel values to denote NA across the dtypes.
For many reasons we chose the latter. After years of production use it has proven, at least in my opinion, to be the best
decision given the state of affairs in NumPy and Python in general. The special value NaN (Not-A-Number) is used
everywhere as the NA value, and there are API functions isna and notna which can be used across the dtypes to
detect NA values.
However, it comes with a couple of trade-offs which I most certainly have not ignored.
In the absence of high performance NA support being built into NumPy from the ground up, the primary casualty is
the ability to represent NAs in integer arrays. For example:
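A sketch of the omitted construction and the reindex that introduces the NAs (and, with it, the cast to float64 shown below):

s = pd.Series([1, 2, 3, 4, 5], index=list("abcde"))
s2 = s.reindex(["a", "b", "c", "f", "u"])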
In [21]: s
Out[21]:
a 1
b 2
c 3
d 4
e 5
dtype: int64
In [22]: s.dtype
Out[22]: dtype('int64')
In [24]: s2
Out[24]:
a 1.0
b 2.0
c 3.0
f NaN
u NaN
dtype: float64
In [25]: s2.dtype
Out[25]: dtype('float64')
This trade-off is made largely for memory and performance reasons, and also so that the resulting Series continues
to be “numeric”.
If you need to represent integers with possibly missing values, use one of the nullable-integer extension dtypes pro-
vided by pandas:
• Int8Dtype
• Int16Dtype
• Int32Dtype
• Int64Dtype
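For example, a sketch of the omitted construction behind the Int64 output below:

s_int = pd.Series([1, 2, 3, 4, 5], index=list("abcde"), dtype=pd.Int64Dtype())
s2_int = s_int.reindex(["a", "b", "c", "f", "u"])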
In [27]: s_int
Out[27]:
a 1
b 2
c 3
d 4
e 5
dtype: Int64
In [28]: s_int.dtype
Out[28]: Int64Dtype()
In [30]: s2_int
Out[30]:
a 1
b 2
c 3
f <NA>
u <NA>
dtype: Int64
In [31]: s2_int.dtype
Out[31]: Int64Dtype()
NA type promotions
When introducing NAs into an existing Series or DataFrame via reindex() or some other means, boolean and
integer types will be promoted to a different dtype in order to store the NAs. The promotions are summarized in this
table:
Typeclass Promotion dtype for storing NAs
floating no change
object no change
integer cast to float64
boolean cast to object
While this may seem like a heavy trade-off, I have found very few cases where this is an issue in practice i.e. storing
values greater than 2**53. Some explanation for the motivation is in the next section.
Many people have suggested that NumPy should simply emulate the NA support present in the more domain-specific
statistical programming language R. Part of the reason is the NumPy type hierarchy:
Typeclass Dtypes
numpy.floating float16, float32, float64, float128
numpy.integer int8, int16, int32, int64
numpy.unsignedinteger uint8, uint16, uint32, uint64
numpy.object_ object_
numpy.bool_ bool_
numpy.character string_, unicode_
The R language, by contrast, only has a handful of built-in data types: integer, numeric (floating-point),
character, and boolean. NA types are implemented by reserving special bit patterns for each type to be used
as the missing value. While doing this with the full NumPy type hierarchy would be possible, it would be a more
substantial trade-off (especially for the 8- and 16-bit data types) and implementation undertaking.
An alternate approach is that of using masked arrays. A masked array is an array of data with an associated boolean
mask denoting whether each value should be considered NA or not. I am personally not in love with this approach as I
feel that overall it places a fairly heavy burden on the user and the library implementer. Additionally, it exacts a fairly
high performance cost when working with numerical data compared with the simple approach of using NaN. Thus,
I have chosen the Pythonic “practicality beats purity” approach and traded integer NA capability for a much simpler
approach of using a special value in float and object arrays to denote NA, and promoting integer arrays to floating when
NAs must be introduced.
For Series and DataFrame objects, var() normalizes by N-1 to produce unbiased estimates of the sample vari-
ance, while NumPy’s var normalizes by N, which measures the variance of the sample. Note that cov() normalizes
by N-1 in both pandas and NumPy.
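A quick illustration of the difference (the array and variable names here are our own):

import numpy as np
import pandas as pd

arr = np.array([1.0, 2.0, 3.0, 4.0])
s = pd.Series(arr)

s.var()          # 1.666..., pandas normalizes by N-1 (ddof=1)
arr.var()        # 1.25, NumPy normalizes by N (ddof=0)
arr.var(ddof=1)  # 1.666..., matches the pandas result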
2.26.5 Thread-safety
As of pandas 0.11, pandas is not 100% thread safe. The known issues relate to the copy() method. If you are doing
a lot of copying of DataFrame objects shared among threads, we recommend holding locks inside the threads where
the data copying occurs.
See this link for more information.
Occasionally you may have to deal with data that were created on a machine with a different byte order than the one
on which you are running Python. A common symptom of this issue is an error like:
Traceback
...
ValueError: Big-endian buffer not supported on little-endian compiler
To deal with this issue you should convert the underlying NumPy array to the native system byte order before passing
it to Series or DataFrame constructors using something similar to the following:
In [34]: s = pd.Series(newx)
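A minimal sketch of the whole conversion (assuming the data arrived as a big-endian array; newx is the name used above):

import numpy as np
import pandas as pd

x = np.array(list(range(10)), dtype=">i4")   # big-endian 32-bit integers
newx = x.byteswap().newbyteorder()           # convert to the native byte order
s = pd.Series(newx)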
2.27 Cookbook
This is a repository for short and sweet examples and links for useful pandas recipes. We encourage users to add to
this documentation.
Adding interesting links and/or inline examples to this section is a great First Pull Request.
Simplified, condensed, new-user friendly, in-line examples have been inserted where possible to augment the Stack-
Overflow and GitHub links. Many of the links contain expanded information, above what the in-line examples offer.
pandas (pd) and Numpy (np) are the only two abbreviated imported modules. The rest are kept explicitly imported for
newer users.
2.27.1 Idioms
In [1]: df = pd.DataFrame(
...: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]}
...: )
...:
In [2]: df
Out[2]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
if-then. . .
In [4]: df
Out[4]:
AAA BBB CCC
0 4 10 100
1 5 -1 50
2 6 -1 -30
3 7 -1 -50
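The assignments that produced Out[4] and the multi-column variants that follow were dropped from this extract; the if-then idiom is boolean assignment through .loc, e.g. (illustrative, not the original cookbook code):

df = pd.DataFrame({"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -50]})

# if AAA >= 5 then set BBB to -1
df.loc[df["AAA"] >= 5, "BBB"] = -1

# the same idiom works for several columns at once
df.loc[df["AAA"] >= 5, ["BBB", "CCC"]] = 555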
In [6]: df
Out[6]:
AAA BBB CCC
0 4 10 100
1 5 555 555
2 6 555 555
3 7 555 555
In [8]: df
Out[8]:
AAA BBB CCC
0 4 2000 2000
1 5 555 555
2 6 555 555
3 7 555 555
In [12]: df
Out[12]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [14]: df
Splitting
In [15]: df = pd.DataFrame(
....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -
˓→50]}
....: )
....:
In [16]: df
Out[16]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
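The split operation itself is not shown here; splitting a frame with a boolean mask looks like this (illustrative):

below = df[df["AAA"] <= 5]   # rows where AAA is at most 5
above = df[df["AAA"] > 5]    # the remaining rows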
Building criteria
In [19]: df = pd.DataFrame(
....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -
˓→50]}
....: )
....:
In [20]: df
Out[20]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [24]: df
Out[24]:
AAA BBB CCC
0 0.1 10 100
1 5.0 20 50
2 0.1 30 -30
3 0.1 40 -50
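The statements that built the criteria for Out[24] are omitted; conditions are combined with & (and) and | (or) and applied through .loc, e.g. (illustrative):

# set AAA to 0.1 where BBB is greater than 25 or CCC is at least 75
df.loc[(df["BBB"] > 25) | (df["CCC"] >= 75), "AAA"] = 0.1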
In [25]: df = pd.DataFrame(
....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -
˓→50]}
....: )
....:
In [26]: df
Out[26]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [29]: df = pd.DataFrame(
....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -
˓→50]}
....: )
....:
In [30]: df
Out[30]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
In [38]: df[AllCrit]
Out[38]:
AAA BBB CCC
0 4 10 100
2.27.2 Selection
Dataframes
In [40]: df
Out[40]:
AAA BBB CCC
0 4 10 100
1 5 20 50
# Generic
In [44]: df[0:3]
Out[44]:
AAA BBB CCC
foo 4 10 100
bar 5 20 50
boo 6 30 -30
In [45]: df["bar":"kar"]
Out[45]:
AAA BBB CCC
bar 5 20 50
boo 6 30 -30
kar 7 40 -50
Ambiguity arises when an index consists of integers with a non-zero start or non-unit increment.
In [46]: data = {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -
˓→50]}
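A sketch of the ambiguity (the frame df2 and its integer labels are our own illustration):

df2 = pd.DataFrame(data=data, index=[1, 2, 3, 4])  # integer labels that do not start at 0

df2.iloc[1:3]  # position-based: the second and third rows (labels 2 and 3)
df2.loc[1:3]   # label-based: labels 1 through 3, inclusive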
In [50]: df = pd.DataFrame(
....: {"AAA": [4, 5, 6, 7], "BBB": [10, 20, 30, 40], "CCC": [100, 50, -30, -
˓→50]}
....: )
....:
In [51]: df
Out[51]:
AAA BBB CCC
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
New columns
In [54]: df
Out[54]:
AAA BBB CCC
0 1 1 2
1 2 1 1
2 1 2 3
3 3 2 1
In [60]: df = pd.DataFrame(
....: {"AAA": [1, 1, 1, 2, 2, 2, 3, 3], "BBB": [2, 1, 3, 4, 5, 1, 2, 3]}
....: )
....:
In [61]: df
Out[61]:
AAA BBB
0 1 2
1 1 1
2 1 3
3 2 4
4 2 5
5 2 1
6 3 2
7 3 3
In [62]: df.loc[df.groupby("AAA")["BBB"].idxmin()]
Out[62]:
AAA BBB
1 1 1
5 2 1
6 3 2
2.27.3 Multiindexing
In [64]: df = pd.DataFrame(
....: {
....: "row": [0, 1, 2],
....: "One_X": [1.1, 1.1, 1.1],
....: "One_Y": [1.2, 1.2, 1.2],
....: "Two_X": [1.11, 1.11, 1.11],
....: "Two_Y": [1.22, 1.22, 1.22],
....: }
....: )
....:
In [65]: df
Out[65]:
row One_X One_Y Two_X Two_Y
0 0 1.1 1.2 1.11 1.22
1 1 1.1 1.2 1.11 1.22
2 2 1.1 1.2 1.11 1.22
# As Labelled Index
In [66]: df = df.set_index("row")
In [67]: df
Out[67]:
One_X One_Y Two_X Two_Y
row
0 1.1 1.2 1.11 1.22
1 1.1 1.2 1.11 1.22
2 1.1 1.2 1.11 1.22
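The step that turned the flat 'One_X'-style names into a MultiIndex is not shown above; a common idiom (illustrative) is:

df.columns = pd.MultiIndex.from_tuples([tuple(c.split("_")) for c in df.columns])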
In [69]: df
Out[69]:
One Two
X Y X Y
row
0 1.1 1.2 1.11 1.22
1 1.1 1.2 1.11 1.22
2 1.1 1.2 1.11 1.22
In [71]: df
Out[71]:
level_1 X Y
row
0 One 1.10 1.20
0 Two 1.11 1.22
1 One 1.10 1.20
# And fix the labels (Notice the label 'level_1' got added automatically)
In [72]: df.columns = ["Sample", "All_X", "All_Y"]
In [73]: df
Out[73]:
Sample All_X All_Y
row
0 One 1.10 1.20
0 Two 1.11 1.22
1 One 1.10 1.20
1 Two 1.11 1.22
2 One 1.10 1.20
2 Two 1.11 1.22
Arithmetic
In [76]: df
Out[76]:
A B C
O I O I O I
n 0.469112 -0.282863 -1.509059 -1.135632 1.212112 -0.173215
m 0.119209 -1.044236 -0.861849 -2.104569 -0.494929 1.071804
In [78]: df
Out[78]:
A B C
O I O I O I
n 0.387021 1.633022 -1.244983 6.556214 1.0 1.0
m -0.240860 -0.974279 1.741358 -1.963577 1.0 1.0
Slicing
In [82]: df
Out[82]:
MyData
AA one 11
six 22
BB one 33
two 44
six 55
To take a cross section of the first level along the first axis (the index):
# Note : level and axis are optional, and default to zero
In [83]: df.xs("BB", level=0, axis=0)
Out[83]:
MyData
one 33
two 44
six 55
In [92]: df
Out[92]:
Exams Labs
In [94]: df.loc["Violet"]
Out[94]:
Exams Labs
I II I II
Course
Comp 76 77 78 79
Math 77 79 81 80
Sci 78 81 81 81
Sorting
Levels
In [100]: df = pd.DataFrame(
.....: np.random.randn(6, 1),
.....: index=pd.date_range("2013-08-01", periods=6, freq="B"),
.....: columns=list("A"),
.....: )
.....:
In [102]: df
Out[102]:
A
2013-08-01 0.721555
2013-08-02 -0.706771
2013-08-05 -1.039575
2013-08-06 NaN
2013-08-07 -0.424972
2013-08-08 0.567020
In [103]: df.reindex(df.index[::-1]).ffill()
Replace
2.27.5 Grouping
In [104]: df = pd.DataFrame(
.....: {
.....: "animal": "cat dog cat fish dog cat cat".split(),
.....: "size": list("SSMMMLL"),
.....: "weight": [8, 10, 11, 1, 20, 12, 12],
.....: "adult": [False] * 5 + [True] * 2,
.....: }
.....: )
.....:
In [105]: df
Out[105]:
animal size weight adult
0 cat S 8 False
1 dog S 10 False
2 cat M 11 False
3 fish M 1 False
4 dog M 20 False
5 cat L 12 True
6 cat L 12 True
Out[106]:
animal
cat L
dog M
fish M
dtype: object
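The expression that produced Out[106] is omitted; one way to return the size of the heaviest animal in each group (a sketch, not necessarily the original cookbook code) is:

df.groupby("animal").apply(lambda subf: subf["size"][subf["weight"].idxmax()])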
Using get_group
In [107]: gb = df.groupby(["animal"])
In [108]: gb.get_group("cat")
Out[108]:
animal size weight adult
0 cat S 8 False
2 cat M 11 False
5 cat L 12 True
6 cat L 12 True
In [111]: expected_df
Out[111]:
size weight adult
animal
cat L 12.4375 True
dog L 20.0000 True
fish L 1.2500 True
Expanding apply
In [117]: gb = df.groupby("A")
In [119]: gb.transform(replace)
Out[119]:
B
0 1.0
1 -1.0
2 1.5
3 1.5
In [120]: df = pd.DataFrame(
.....: {
.....: "code": ["foo", "bar", "baz"] * 2,
.....: "data": [0.16, -0.21, 0.33, 0.45, -0.59, 0.62],
.....: "flag": [False, True] * 3,
.....: }
.....: )
.....:
In [124]: sorted_df
Out[124]:
code data flag
1 bar -0.21 True
4 bar -0.59 False
0 foo 0.16 False
3 foo 0.45 True
2 baz 0.33 False
5 baz 0.62 True
In [129]: ts.resample("5min").apply(mhc)
Out[129]:
Mean 2014-10-07 00:00:00 1.000
2014-10-07 00:05:00 3.500
2014-10-07 00:10:00 6.000
2014-10-07 00:15:00 8.500
Max 2014-10-07 00:00:00 2
2014-10-07 00:05:00 4
2014-10-07 00:10:00 7
2014-10-07 00:15:00 9
Custom 2014-10-07 00:00:00 1.234
2014-10-07 00:05:00 NaT
2014-10-07 00:10:00 7.404
2014-10-07 00:15:00 NaT
dtype: object
In [130]: ts
Out[130]:
2014-10-07 00:00:00 0
2014-10-07 00:02:00 1
2014-10-07 00:04:00 2
2014-10-07 00:06:00 3
2014-10-07 00:08:00 4
2014-10-07 00:10:00 5
2014-10-07 00:12:00 6
2014-10-07 00:14:00 7
2014-10-07 00:16:00 8
2014-10-07 00:18:00 9
Freq: 2T, dtype: int64
In [131]: df = pd.DataFrame(
.....: {"Color": "Red Red Red Blue".split(), "Value": [100, 150, 50, 50]}
.....: )
.....:
In [132]: df
Out[132]:
Color Value
0 Red 100
1 Red 150
2 Red 50
3 Blue 50
In [134]: df
Out[134]:
Color Value Counts
0 Red 100 3
1 Red 150 3
2 Red 50 3
3 Blue 50 1
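The assignment behind the Counts column is not shown; a transform that broadcasts the per-group count back to the original rows (illustrative) is:

df["Counts"] = df.groupby("Color")["Value"].transform("count")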
In [135]: df = pd.DataFrame(
.....: {"line_race": [10, 10, 8, 10, 10, 8], "beyer": [99, 102, 103, 103, 88,
˓→100]},
.....: index=[
.....: "Last Gunfighter",
.....: "Last Gunfighter",
.....: "Last Gunfighter",
.....: "Paynter",
.....: "Paynter",
.....: "Paynter",
.....: ],
.....: )
.....:
In [136]: df
Out[136]:
line_race beyer
Last Gunfighter 10 99
Last Gunfighter 10 102
Last Gunfighter 8 103
Paynter 10 103
Paynter 10 88
Paynter 8 100
In [138]: df
Out[138]:
line_race beyer beyer_shifted
Last Gunfighter 10 99 NaN
Last Gunfighter 10 102 99.0
Last Gunfighter 8 103 102.0
Paynter 10 103 NaN
Paynter 10 88 103.0
Paynter 8 100 88.0
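The shift that created beyer_shifted is omitted above; shifting within each index group (a sketch) looks like:

df["beyer_shifted"] = df.groupby(level=0)["beyer"].shift(1)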
In [139]: df = pd.DataFrame(
.....: {
.....: "host": ["other", "other", "that", "this", "this"],
.....: "service": ["mail", "web", "mail", "mail", "web"],
.....: "no": [1, 2, 1, 2, 1],
.....: }
.....: ).set_index(["host", "service"])
.....:
In [142]: df_count
Out[142]:
host service no
0 other web 2
1 that mail 1
Expanding data
Splitting
Splitting a frame
Create a list of dataframes, split using a delineation based on logic included in rows.
In [146]: df = pd.DataFrame(
.....: data={
.....: "Case": ["A", "A", "A", "B", "A", "A", "B", "A", "A"],
.....: "Data": np.random.randn(9),
.....: }
.....: )
.....:
In [148]: dfs[0]
Out[148]:
Case Data
0 A 0.276232
1 A -1.087401
2 A -0.673690
3 B 0.113648
In [149]: dfs[1]
Out[149]:
Case Data
4 A -1.478427
5 A 0.524988
6 B 0.404705
In [150]: dfs[2]
Out[150]:
Case Data
7 A 0.577046
8 A -1.715002
Pivot
In [151]: df = pd.DataFrame(
.....: data={
.....: "Province": ["ON", "QC", "BC", "AL", "AL", "MN", "ON"],
.....: "City": [
.....: "Toronto",
.....: "Montreal",
.....: "Vancouver",
.....: "Calgary",
.....: "Edmonton",
.....: "Winnipeg",
.....: "Windsor",
.....: ],
.....: "Sales": [13, 6, 16, 8, 4, 3, 1],
.....: }
.....: )
.....:
In [155]: df = pd.DataFrame(
.....: {
.....: "ID": ["x%d" % r for r in range(10)],
.....: "Gender": ["F", "M", "F", "M", "F", "M", "F", "M", "M", "M"],
.....: "ExamYear": [
.....: "2007",
.....: "2007",
.....: "2007",
.....: "2008",
.....: "2008",
.....: "2008",
.....: "2008",
.....: "2009",
.....: "2009",
.....: "2009",
.....: ],
.....: "Class": [
.....: "algebra",
.....: "stats",
.....: "bio",
.....: "algebra",
.....: "algebra",
.....: "stats",
.....: "stats",
.....: "algebra",
.....: "bio",
.....: "bio",
.....: ],
.....: "Participated": [
.....: "yes",
.....: "yes",
.....: "yes",
.....: "yes",
.....: "no",
.....: "yes",
In [156]: df.groupby("ExamYear").agg(
.....: {
.....: "Participated": lambda x: x.value_counts()["yes"],
.....: "Passed": lambda x: sum(x == "yes"),
.....: "Employed": lambda x: sum(x),
.....: "Grade": lambda x: sum(x) / len(x),
.....: }
.....: )
.....:
Out[156]:
Participated Passed Employed Grade
ExamYear
2007 3 2 3 74.000000
2008 3 3 0 68.500000
2009 3 2 2 60.666667
In [157]: df = pd.DataFrame(
.....: {"value": np.random.randn(36)},
.....: index=pd.date_range("2011-01-01", freq="M", periods=36),
.....: )
.....:
In [158]: pd.pivot_table(
.....: df, index=df.index.month, columns=df.index.year, values="value", aggfunc="sum"
.....: )
.....:
Out[158]:
2011 2012 2013
1 -1.039268 -0.968914 2.565646
Apply
In [159]: df = pd.DataFrame(
.....: data={
.....: "A": [[2, 4, 8, 16], [100, 200], [10, 20, 30]],
.....: "B": [["a", "b", "c"], ["jj", "kk"], ["ccc"]],
.....: },
.....: index=["I", "II", "III"],
.....: )
.....:
In [162]: df_orgz
Out[162]:
0 1 2 3
I A 2 4 8 16.0
B a b c NaN
II A 100 200 NaN NaN
B jj kk NaN NaN
III A 10 20.0 30.0 NaN
B ccc NaN NaN NaN
In [163]: df = pd.DataFrame(
.....: data=np.random.randn(2000, 2) / 10000,
.....: index=pd.date_range("2001-01-01", periods=2000),
.....: columns=["A", "B"],
.....: )
.....:
In [166]: s = pd.Series(
.....: {
.....: df.index[i]: gm(df.iloc[i: min(i + 51, len(df) - 1)], 5)
.....: for i in range(len(df) - 50)
.....: }
.....: )
.....:
In [167]: s
Out[167]:
2001-01-01 0.000930
2001-01-02 0.002615
2001-01-03 0.001281
2001-01-04 0.001117
2001-01-05 0.002772
...
2006-04-30 0.003296
2006-05-01 0.002629
2006-05-02 0.002081
2006-05-03 0.004247
2006-05-04 0.003928
Length: 1950, dtype: float64
In [169]: df = pd.DataFrame(
.....: {
.....: "Open": np.random.randn(len(rng)),
.....: "Close": np.random.randn(len(rng)),
.....: "Volume": np.random.randint(100, 2000, len(rng)),
.....: },
In [170]: df
Out[170]:
Open Close Volume
2014-01-01 -1.611353 -0.492885 1219
2014-01-02 -3.000951 0.445794 1054
2014-01-03 -0.138359 -0.076081 1381
2014-01-04 0.301568 1.198259 1253
2014-01-05 0.276381 -0.669831 1728
... ... ... ...
2014-04-06 -0.040338 0.937843 1188
2014-04-07 0.359661 -0.285908 1864
2014-04-08 0.060978 1.714814 941
2014-04-09 1.759055 -0.455942 1065
2014-04-10 0.138185 -1.147008 1453
In [172]: window = 5
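The vwap helper used below is not included in this extract; a volume-weighted average price over a window of bars can be written as (illustrative):

def vwap(bars):
    # weight each closing price by its volume and normalize
    return (bars["Close"] * bars["Volume"]).sum() / bars["Volume"].sum()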
In [173]: s = pd.concat(
.....: [
.....: (pd.Series(vwap(df.iloc[i: i + window]), index=[df.index[i + window]]))
.....: for i in range(len(df) - window)
.....: ]
.....: )
.....:
In [174]: s.round(2)
Out[174]:
2014-01-06 0.02
2014-01-07 0.11
2014-01-08 0.10
2014-01-09 0.07
2014-01-10 -0.29
...
2014-04-06 -0.63
2014-04-07 -0.02
2014-04-08 -0.03
2014-04-09 0.34
2014-04-10 0.29
Length: 95, dtype: float64
2.27.6 Timeseries
Between times
Using indexer between time
Constructing a datetime range that excludes weekends and includes only certain times
Vectorized Lookup
Aggregation and plotting time series
Turn a matrix with hours in columns and days in rows into a continuous row sequence in the form of a time series.
How to rearrange a Python pandas DataFrame?
Dealing with duplicates when reindexing a timeseries to a specified frequency
Calculate the first day of the month for each entry in a DatetimeIndex
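dates itself is not defined in this extract; any DatetimeIndex of timestamps falling in January 2000 would reproduce the output below, e.g.:

dates = pd.DatetimeIndex(["2000-01-03", "2000-01-08", "2000-01-14", "2000-01-21", "2000-01-29"])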
In [176]: dates.to_period(freq="M").to_timestamp()
Out[176]:
DatetimeIndex(['2000-01-01', '2000-01-01', '2000-01-01', '2000-01-01',
'2000-01-01'],
dtype='datetime64[ns]', freq=None)
Resampling
2.27.7 Merge
In [181]: df
Out[181]:
A B C
0 -0.870117 -0.479265 -0.790855
1 0.144817 1.726395 -0.464535
2 -0.821906 1.597605 0.187307
3 -0.128342 -1.511638 -0.289858
4 0.399194 -1.430030 -0.639760
5 1.115116 -2.012600 1.810662
6 -0.870117 -0.479265 -0.790855
7 0.144817 1.726395 -0.464535
8 -0.821906 1.597605 0.187307
9 -0.128342 -1.511638 -0.289858
10 0.399194 -1.430030 -0.639760
11 1.115116 -2.012600 1.810662
In [182]: df = pd.DataFrame(
.....: data={
.....: "Area": ["A"] * 5 + ["C"] * 2,
.....: "Bins": [110] * 2 + [160] * 3 + [40] * 2,
.....: "Test_0": [0, 1, 0, 1, 2, 0, 1],
.....: "Data": np.random.randn(7),
.....: }
.....: )
.....:
In [183]: df
Out[183]:
Area Bins Test_0 Data
0 A 110 0 -0.433937
1 A 110 1 -0.160552
2 A 160 0 0.744434
3 A 160 1 1.754213
4 A 160 2 0.000850
5 C 40 0 0.342243
6 C 40 1 1.070599
In [185]: pd.merge(
.....: df,
.....: df,
.....: left_on=["Bins", "Area", "Test_0"],
.....: right_on=["Bins", "Area", "Test_1"],
.....: suffixes=("_L", "_R"),
.....: )
.....:
Out[185]:
Area Bins Test_0_L Data_L Test_1_L Test_0_R Data_R Test_1_R
0 A 110 0 -0.433937 -1 1 -0.160552 0
1 A 160 0 0.744434 -1 1 1.754213 0
2 A 160 1 1.754213 0 2 0.000850 1
2.27.8 Plotting
In [186]: df = pd.DataFrame(
.....: {
.....: "stratifying_var": np.random.uniform(0, 100, 20),
.....: "price": np.random.normal(100, 5, 20),
.....: }
.....: )
.....:
CSV
The best way to combine multiple files into a single DataFrame is to read the individual frames one by one, put all of
the individual frames into a list, and then combine the frames in the list using pd.concat():
In [189]: for i in range(3):
.....: data = pd.DataFrame(np.random.randn(10, 4))
.....: data.to_csv("file_{}.csv".format(i))
.....:
You can use the same approach to read all files matching a pattern. Here is an example using glob:
In [192]: import glob
In [193]: import os
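The concatenation step itself is not shown here; a sketch using glob (file_*.csv matches the files written above) is:

files = glob.glob("file_*.csv")
result = pd.concat([pd.read_csv(f, index_col=0) for f in files], ignore_index=True)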
Finally, this strategy will work with the other pd.read_*(...) functions described in the io docs.
In [198]: df.head()
Out[198]:
year month day
0 2000 1 1
1 2000 1 2
2 2000 1 3
3 2000 1 4
4 2000 1 5
.....: ds.head()
.....: %timeit pd.to_datetime(ds)
.....:
In [202]: pd.read_csv(
.....: StringIO(data),
.....: sep=";",
.....: skiprows=[11, 12],
.....: index_col=0,
.....: parse_dates=True,
.....: header=10,
.....: )
.....:
Out[202]:
Param1 Param2 Param4 Param5
date
1990-01-01 00:00:00 1 1 2 3
1990-01-01 01:00:00 5 3 4 5
1990-01-01 02:00:00 9 5 6 7
1990-01-01 03:00:00 13 7 8 9
1990-01-01 04:00:00 17 9 10 11
1990-01-01 05:00:00 21 11 12 13
In [205]: pd.read_csv(
.....: StringIO(data), sep=";", index_col=0, header=12, parse_dates=True,
˓→names=columns
.....: )
.....:
Out[205]:
Param1 Param2 Param4 Param5
date
1990-01-01 00:00:00 1 1 2 3
1990-01-01 01:00:00 5 3 4 5
1990-01-01 02:00:00 9 5 6 7
1990-01-01 03:00:00 13 7 8 9
1990-01-01 04:00:00 17 9 10 11
1990-01-01 05:00:00 21 11 12 13
SQL
Excel
HTML
Reading HTML tables from a server that cannot handle the default request header
HDFStore
In [210]: store.get_storer("df").attrs.my_attribute
Out[210]: {'A': 10}
You can create or load an HDFStore in-memory by passing the driver parameter to PyTables. Changes are only
written to disk when the HDFStore is closed.
In [211]: store = pd.HDFStore("test.h5", "w", driver="H5FD_CORE")
In [213]: store["test"] = df
Binary files
pandas readily accepts NumPy record arrays, if you need to read in a binary file consisting of an array of C structs.
For example, given this C program in a file called main.c compiled with gcc main.c -std=gnu99 on a 64-bit
machine,
#include <stdio.h>
#include <stdint.h>
return 0;
}
the following Python code will read the binary file 'binary.dat' into a pandas DataFrame, where each element
of the struct corresponds to a column in the frame:
names = "count", "avg", "scale"
# note that the offsets are larger than the size of the type because of
# struct padding
offsets = 0, 8, 16
formats = "i4", "f8", "f4"
dt = np.dtype({"names": names, "offsets": offsets, "formats": formats}, align=True)
df = pd.DataFrame(np.fromfile("binary.dat", dt))
Note: The offsets of the structure elements may be different depending on the architecture of the machine on which
the file was created. Using a raw binary file format like this for general data storage is not recommended, as it is not
cross platform. We recommend either HDF5 or parquet, both of which are supported by pandas' IO facilities.
2.27.10 Computation
Correlation
Often it’s useful to obtain the lower (or upper) triangular form of a correlation matrix calculated from DataFrame.
corr(). This can be achieved by passing a boolean mask to where as follows:
In [215]: df = pd.DataFrame(np.random.random(size=(100, 5)))
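The mask construction that was dropped from this extract can be sketched as follows (np.tril keeps the strictly lower triangle):

corr_mat = df.corr()
mask = np.tril(np.ones_like(corr_mat, dtype=bool), k=-1)   # True below the diagonal
corr_mat.where(mask)                                        # upper triangle becomes NaN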
The method argument within DataFrame.corr can accept a callable in addition to the named correlation
types. Here we compute the distance correlation (https://en.wikipedia.org/wiki/Distance_correlation) matrix for a
DataFrame object.
In [221]: df.corr(method=distcorr)
Out[221]:
0 1 2
0 1.000000 0.197613 0.216328
1 0.197613 1.000000 0.208749
2 0.216328 0.208749 1.000000
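The distcorr function itself is not reproduced here. Any callable that takes two 1-D arrays and returns a scalar can be passed as method; pandas places 1 along the diagonal and symmetrizes the result regardless of the callable. A toy example (not a real correlation measure):

def max_abs_diff(a, b):
    # largest element-wise gap between the two columns
    return float(np.max(np.abs(a - b)))

df.corr(method=max_abs_diff)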
2.27.11 Timedeltas
In [224]: s - s.max()
Out[224]:
0 -2 days
In [225]: s.max() - s
Out[225]:
0 2 days
1 1 days
2 0 days
dtype: timedelta64[ns]
In [226]: s - datetime.datetime(2011, 1, 1, 3, 5)
Out[226]:
0 364 days 20:55:00
1 365 days 20:55:00
2 366 days 20:55:00
dtype: timedelta64[ns]
In [227]: s + datetime.timedelta(minutes=5)
Out[227]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [228]: datetime.datetime(2011, 1, 1, 3, 5) - s
Out[228]:
0 -365 days +03:05:00
1 -366 days +03:05:00
2 -367 days +03:05:00
dtype: timedelta64[ns]
In [229]: datetime.timedelta(minutes=5) + s
Out[229]:
0 2012-01-01 00:05:00
1 2012-01-02 00:05:00
2 2012-01-03 00:05:00
dtype: datetime64[ns]
In [232]: df
Out[232]:
A B
0 2012-01-01 0 days
1 2012-01-02 1 days
2 2012-01-03 2 days
In [235]: df
In [236]: df.dtypes
Out[236]:
A datetime64[ns]
B timedelta64[ns]
New Dates datetime64[ns]
Delta timedelta64[ns]
dtype: object
Another example
Values can be set to NaT using np.nan, similar to datetime
In [237]: y = s - s.shift()
In [238]: y
Out[238]:
0 NaT
1 1 days
2 1 days
dtype: timedelta64[ns]
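The assignment that produced the next output is not shown; setting an element to np.nan is enough (illustrative):

y[1] = np.nan   # stored as NaT in a timedelta64[ns] Series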
In [240]: y
Out[240]:
0 NaT
1 NaT
2 1 days
dtype: timedelta64[ns]
To create a dataframe from every combination of some given values, like R’s expand.grid() function, we can
create a dict where the keys are column names and the values are lists of the data values:
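The expand_grid helper is not defined in this extract; a small implementation built on itertools.product (a sketch) is:

import itertools

def expand_grid(data_dict):
    # Cartesian product of the value lists, one column per dict key
    rows = itertools.product(*data_dict.values())
    return pd.DataFrame.from_records(rows, columns=list(data_dict.keys()))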
In [242]: df = expand_grid(
.....: {"height": [60, 70], "weight": [100, 140, 180], "sex": ["Male", "Female
˓→"]}
.....: )
.....:
In [243]: df
Out[243]:
height weight sex
3 API reference
This page gives an overview of all public pandas objects, functions and methods. All classes and functions exposed in
pandas.* namespace are public.
Some subpackages are public, including pandas.errors, pandas.plotting, and pandas.testing.
Public functions in pandas.io and pandas.tseries submodules are mentioned in the documentation.
pandas.api.types subpackage holds some public functions related to data types in pandas.
Warning: The pandas.core, pandas.compat, and pandas.util top-level modules are PRIVATE. Sta-
ble functionality in such modules is not guaranteed.
3.1 Input/output
3.1.1 Pickling
read_pickle(filepath_or_buffer[, . . . ]) Load pickled pandas object (or any object) from file.
pandas.read_pickle
Warning: Loading pickled data received from untrusted sources can be unsafe. See here.
Parameters
filepath_or_buffer [str, path object or file-like object] File path, URL, or buffer where the
pickled object will be loaded from.
Changed in version 1.0.0: Accept URL. URL is not limited to S3 and GCS.
compression [{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’] If ‘infer’ and
‘path_or_url’ is path-like, then detect compression from the following extensions: ‘.gz’,
‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no compression) If ‘infer’ and ‘path_or_url’ is not path-
like, then use None (= no decompression).
storage_options [dict, optional] Extra options that make sense for a particular storage con-
nection, e.g. host, port, username, password, etc., if using a URL that will be parsed by
fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument
with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
New in version 1.2.0.
Returns
unpickled [same type as object stored in file]
See also:
Notes
Examples
>>> import os
>>> os.remove("./dummy.pkl")
pandas.read_table
skipfooter [int, default 0] Number of lines at bottom of file to skip (Unsupported with en-
gine=’c’).
nrows [int, optional] Number of rows of file to read. Useful for reading pieces of large files.
na_values [scalar, str, list-like, or dict, optional] Additional strings to recognize as NA/NaN. If
dict passed, specific per-column NA values. By default the following values are interpreted
as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’,
‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
keep_default_na [bool, default True] Whether or not to include the default NaN values when
parsing the data. Depending on whether na_values is passed in, the behavior is as follows:
• If keep_default_na is True, and na_values are specified, na_values is appended to the
default NaN values used for parsing.
• If keep_default_na is True, and na_values are not specified, only the default NaN values
are used for parsing.
• If keep_default_na is False, and na_values are specified, only the NaN values specified
na_values are used for parsing.
• If keep_default_na is False, and na_values are not specified, no strings will be parsed as
NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters
will be ignored.
na_filter [bool, default True] Detect missing value markers (empty strings and the value of
na_values). In data without any NAs, passing na_filter=False can improve the performance
of reading a large file.
verbose [bool, default False] Indicate number of NA values placed in non-numeric columns.
skip_blank_lines [bool, default True] If True, skip over blank lines rather than interpreting as
NaN values.
parse_dates [bool or list of int or names or list of lists or dict, default False] The behavior is as
follows:
• boolean. If True -> try parsing the index.
• list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date
column.
• list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
• dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index cannot be represented as an array of datetimes, say because of an
unparsable value or a mixture of timezones, the column or index will be returned unal-
tered as an object data type. For non-standard datetime parsing, use pd.to_datetime
after pd.read_csv. To parse an index or column with a mixture of timezones, specify
date_parser to be a partially-applied pandas.to_datetime() with utc=True.
See Parsing a CSV with mixed timezones for more.
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format [bool, default False] If True and parse_dates is enabled, pandas will
attempt to infer the format of the datetime strings in the columns, and if it can be inferred,
switch to a faster method of parsing them. In some cases this can increase the parsing speed
by 5-10x.
keep_date_col [bool, default False] If True and parse_dates specifies combining multiple
columns then keep the original columns.
date_parser [function, optional] Function to use for converting a sequence of string columns to
an array of datetime instances. The default uses dateutil.parser.parser to do the
conversion. Pandas will try to call date_parser in three different ways, advancing to the next
if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments;
2) concatenate (row-wise) the string values from the columns defined by parse_dates into
a single array and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst [bool, default False] DD/MM format dates, international and European format.
cache_dates [bool, default True] If True, use a cache of unique, converted dates to apply the
datetime conversion. May produce significant speed-up when parsing duplicate date strings,
especially ones with timezone offsets.
New in version 0.25.0.
iterator [bool, default False] Return TextFileReader object for iteration or getting chunks with
get_chunk().
Changed in version 1.2: TextFileReader is a context manager.
chunksize [int, optional] Return TextFileReader object for iteration. See the IO Tools docs for
more information on iterator and chunksize.
Changed in version 1.2: TextFileReader is a context manager.
compression [{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’] For on-the-fly decom-
pression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compres-
sion from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompres-
sion). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None
for no decompression.
thousands [str, optional] Thousands separator.
decimal [str, default ‘.’] Character to recognize as decimal point (e.g. use ‘,’ for European data).
lineterminator [str (length 1), optional] Character to break file into lines. Only valid with C
parser.
quotechar [str (length 1), optional] The character used to denote the start and end of a quoted
item. Quoted items can include the delimiter and it will be ignored.
quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per
csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1),
QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [bool, default True] When quotechar is specified and quoting is not
QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements
INSIDE a field as a single quotechar element.
escapechar [str (length 1), optional] One-character string used to escape other characters.
comment [str, optional] Indicates remainder of line should not be parsed. If found at the begin-
ning of a line, the line will be ignored altogether. This parameter must be a single character.
Like empty lines (as long as skip_blank_lines=True), fully commented lines are
ignored by the parameter header but not by skiprows. For example, if comment='#',
parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated
as the header.
encoding [str, optional] Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of
Python standard encodings .
dialect [str or csv.Dialect, optional] If provided, this parameter will override values (default
or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace,
quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued.
See csv.Dialect documentation for more details.
error_bad_lines [bool, default True] Lines with too many fields (e.g. a csv line with too many
commas) will by default cause an exception to be raised, and no DataFrame will be returned.
If False, then these “bad lines” will dropped from the DataFrame that is returned.
warn_bad_lines [bool, default True] If error_bad_lines is False, and warn_bad_lines is True, a
warning for each “bad line” will be output.
delim_whitespace [bool, default False] Specifies whether or not whitespace (e.g. ' ' or '\t') will be
used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be
passed in for the delimiter parameter.
low_memory [bool, default True] Internally process the file in chunks, resulting in lower mem-
ory use while parsing, but possibly mixed type inference. To ensure no mixed types either
set False, or specify the type with the dtype parameter. Note that the entire file is read into
a single DataFrame regardless, use the chunksize or iterator parameter to return the data in
chunks. (Only valid with C parser).
memory_map [bool, default False] If a filepath is provided for filepath_or_buffer, map the file
object directly onto memory and access the data directly from there. Using this option can
improve performance because there is no longer any I/O overhead.
float_precision [str, optional] Specifies which converter the C engine should use for floating-
point values. The options are None or ‘high’ for the ordinary converter, ‘legacy’ for the
original lower precision pandas converter, and ‘round_trip’ for the round-trip converter.
Changed in version 1.2.
storage_options [dict, optional] Extra options that make sense for a particular storage con-
nection, e.g. host, port, username, password, etc., if using a URL that will be parsed by
fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument
with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
New in version 1.2.
Returns
DataFrame or TextParser A comma-separated values (csv) file is returned as a two-dimensional
data structure with labeled axes.
See also:
Examples
>>> pd.read_table('data.csv')
pandas.read_csv
index_col [int, str, sequence of int / str, or False, default None] Column(s) to use as the row
labels of the DataFrame, either given as string name or column index. If a sequence of int
/ str is given, a MultiIndex is used.
Note: index_col=False can be used to force pandas to not use the first column as the
index, e.g. when you have a malformed file with delimiters at the end of each line.
usecols [list-like or callable, optional] Return a subset of the columns. If list-like, all elements
must either be positional (i.e. integer indices into the document columns) or strings that
correspond to column names provided either by the user in names or inferred from the
document header row(s). For example, a valid list-like usecols parameter would be [0,
1, 2] or ['foo', 'bar', 'baz']. Element order is ignored, so usecols=[0,
1] is the same as [1, 0]. To instantiate a DataFrame from data with element or-
der preserved use pd.read_csv(data, usecols=['foo', 'bar'])[['foo',
'bar']] for columns in ['foo', 'bar'] order or pd.read_csv(data,
usecols=['foo', 'bar'])[['bar', 'foo']] for ['bar', 'foo'] order.
If callable, the callable function will be evaluated against the column names, returning
names where the callable function evaluates to True. An example of a valid callable ar-
gument would be lambda x: x.upper() in ['AAA', 'BBB', 'DDD']. Using
this parameter results in much faster parsing time and lower memory usage.
squeeze [bool, default False] If the parsed data only contains one column then return a Series.
prefix [str, optional] Prefix to add to column numbers when no header, e.g. ‘X’ for X0, X1, . . .
mangle_dupe_cols [bool, default True] Duplicate columns will be specified as ‘X’, ‘X.1’,
. . . ’X.N’, rather than ‘X’. . . ’X’. Passing in False will cause data to be overwritten if there
are duplicate names in the columns.
dtype [Type name or dict of column -> type, optional] Data type for data or columns. E.g. {‘a’:
np.float64, ‘b’: np.int32, ‘c’: ‘Int64’} Use str or object together with suitable na_values
settings to preserve and not interpret dtype. If converters are specified, they will be applied
INSTEAD of dtype conversion.
engine [{‘c’, ‘python’}, optional] Parser engine to use. The C engine is faster while the python
engine is currently more feature-complete.
converters [dict, optional] Dict of functions for converting values in certain columns. Keys can
either be integers or column labels.
true_values [list, optional] Values to consider as True.
false_values [list, optional] Values to consider as False.
skipinitialspace [bool, default False] Skip spaces after delimiter.
skiprows [list-like, int or callable, optional] Line numbers to skip (0-indexed) or number of
lines to skip (int) at the start of the file.
If callable, the callable function will be evaluated against the row indices, returning True if
the row should be skipped and False otherwise. An example of a valid callable argument
would be lambda x: x in [0, 2].
skipfooter [int, default 0] Number of lines at bottom of file to skip (Unsupported with en-
gine=’c’).
nrows [int, optional] Number of rows of file to read. Useful for reading pieces of large files.
na_values [scalar, str, list-like, or dict, optional] Additional strings to recognize as NA/NaN. If
dict passed, specific per-column NA values. By default the following values are interpreted
as NaN: ‘’, ‘#N/A’, ‘#N/A N/A’, ‘#NA’, ‘-1.#IND’, ‘-1.#QNAN’, ‘-NaN’, ‘-nan’, ‘1.#IND’,
‘1.#QNAN’, ‘<NA>’, ‘N/A’, ‘NA’, ‘NULL’, ‘NaN’, ‘n/a’, ‘nan’, ‘null’.
keep_default_na [bool, default True] Whether or not to include the default NaN values when
parsing the data. Depending on whether na_values is passed in, the behavior is as follows:
• If keep_default_na is True, and na_values are specified, na_values is appended to the
default NaN values used for parsing.
• If keep_default_na is True, and na_values are not specified, only the default NaN values
are used for parsing.
• If keep_default_na is False, and na_values are specified, only the NaN values specified
na_values are used for parsing.
• If keep_default_na is False, and na_values are not specified, no strings will be parsed as
NaN.
Note that if na_filter is passed in as False, the keep_default_na and na_values parameters
will be ignored.
na_filter [bool, default True] Detect missing value markers (empty strings and the value of
na_values). In data without any NAs, passing na_filter=False can improve the performance
of reading a large file.
verbose [bool, default False] Indicate number of NA values placed in non-numeric columns.
skip_blank_lines [bool, default True] If True, skip over blank lines rather than interpreting as
NaN values.
parse_dates [bool or list of int or names or list of lists or dict, default False] The behavior is as
follows:
• boolean. If True -> try parsing the index.
• list of int or names. e.g. If [1, 2, 3] -> try parsing columns 1, 2, 3 each as a separate date
column.
• list of lists. e.g. If [[1, 3]] -> combine columns 1 and 3 and parse as a single date column.
• dict, e.g. {‘foo’ : [1, 3]} -> parse columns 1, 3 as date and call result ‘foo’
If a column or index cannot be represented as an array of datetimes, say because of an
unparsable value or a mixture of timezones, the column or index will be returned unal-
tered as an object data type. For non-standard datetime parsing, use pd.to_datetime
after pd.read_csv. To parse an index or column with a mixture of timezones, specify
date_parser to be a partially-applied pandas.to_datetime() with utc=True.
See Parsing a CSV with mixed timezones for more.
Note: A fast-path exists for iso8601-formatted dates.
infer_datetime_format [bool, default False] If True and parse_dates is enabled, pandas will
attempt to infer the format of the datetime strings in the columns, and if it can be inferred,
switch to a faster method of parsing them. In some cases this can increase the parsing speed
by 5-10x.
keep_date_col [bool, default False] If True and parse_dates specifies combining multiple
columns then keep the original columns.
date_parser [function, optional] Function to use for converting a sequence of string columns to
an array of datetime instances. The default uses dateutil.parser.parser to do the
conversion. Pandas will try to call date_parser in three different ways, advancing to the next
if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments;
2) concatenate (row-wise) the string values from the columns defined by parse_dates into
a single array and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
dayfirst [bool, default False] DD/MM format dates, international and European format.
cache_dates [bool, default True] If True, use a cache of unique, converted dates to apply the
datetime conversion. May produce significant speed-up when parsing duplicate date strings,
especially ones with timezone offsets.
New in version 0.25.0.
iterator [bool, default False] Return TextFileReader object for iteration or getting chunks with
get_chunk().
Changed in version 1.2: TextFileReader is a context manager.
chunksize [int, optional] Return TextFileReader object for iteration. See the IO Tools docs for
more information on iterator and chunksize.
Changed in version 1.2: TextFileReader is a context manager.
compression [{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’] For on-the-fly decom-
pression of on-disk data. If ‘infer’ and filepath_or_buffer is path-like, then detect compres-
sion from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise no decompres-
sion). If using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None
for no decompression.
thousands [str, optional] Thousands separator.
decimal [str, default ‘.’] Character to recognize as decimal point (e.g. use ‘,’ for European data).
lineterminator [str (length 1), optional] Character to break file into lines. Only valid with C
parser.
quotechar [str (length 1), optional] The character used to denote the start and end of a quoted
item. Quoted items can include the delimiter and it will be ignored.
quoting [int or csv.QUOTE_* instance, default 0] Control field quoting behavior per
csv.QUOTE_* constants. Use one of QUOTE_MINIMAL (0), QUOTE_ALL (1),
QUOTE_NONNUMERIC (2) or QUOTE_NONE (3).
doublequote [bool, default True] When quotechar is specified and quoting is not
QUOTE_NONE, indicate whether or not to interpret two consecutive quotechar elements
INSIDE a field as a single quotechar element.
escapechar [str (length 1), optional] One-character string used to escape other characters.
comment [str, optional] Indicates remainder of line should not be parsed. If found at the begin-
ning of a line, the line will be ignored altogether. This parameter must be a single character.
Like empty lines (as long as skip_blank_lines=True), fully commented lines are
ignored by the parameter header but not by skiprows. For example, if comment='#',
parsing #empty\na,b,c\n1,2,3 with header=0 will result in ‘a,b,c’ being treated
as the header.
encoding [str, optional] Encoding to use for UTF when reading/writing (ex. ‘utf-8’). List of
Python standard encodings .
dialect [str or csv.Dialect, optional] If provided, this parameter will override values (default
or not) for the following parameters: delimiter, doublequote, escapechar, skipinitialspace,
quotechar, and quoting. If it is necessary to override values, a ParserWarning will be issued.
See csv.Dialect documentation for more details.
error_bad_lines [bool, default True] Lines with too many fields (e.g. a csv line with too many
commas) will by default cause an exception to be raised, and no DataFrame will be returned.
If False, then these “bad lines” will dropped from the DataFrame that is returned.
warn_bad_lines [bool, default True] If error_bad_lines is False, and warn_bad_lines is True, a
warning for each “bad line” will be output.
delim_whitespace [bool, default False] Specifies whether or not whitespace (e.g. ' ' or '\t') will be
used as the sep. Equivalent to setting sep='\s+'. If this option is set to True, nothing should be
passed in for the delimiter parameter.
low_memory [bool, default True] Internally process the file in chunks, resulting in lower mem-
ory use while parsing, but possibly mixed type inference. To ensure no mixed types either
set False, or specify the type with the dtype parameter. Note that the entire file is read into
a single DataFrame regardless, use the chunksize or iterator parameter to return the data in
chunks. (Only valid with C parser).
memory_map [bool, default False] If a filepath is provided for filepath_or_buffer, map the file
object directly onto memory and access the data directly from there. Using this option can
improve performance because there is no longer any I/O overhead.
float_precision [str, optional] Specifies which converter the C engine should use for floating-
point values. The options are None or ‘high’ for the ordinary converter, ‘legacy’ for the
original lower precision pandas converter, and ‘round_trip’ for the round-trip converter.
Changed in version 1.2.
storage_options [dict, optional] Extra options that make sense for a particular storage con-
nection, e.g. host, port, username, password, etc., if using a URL that will be parsed by
fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument
with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
New in version 1.2.
Returns
DataFrame or TextParser A comma-separated values (csv) file is returned as a two-dimensional
data structure with labeled axes.
See also:
Examples
>>> pd.read_csv('data.csv')
pandas.read_fwf
Examples
>>> pd.read_fwf('data.csv')
3.1.3 Clipboard
pandas.read_clipboard
pandas.read_clipboard(sep='\\s+', **kwargs)
Read text from clipboard and pass to read_csv.
Parameters
sep [str, default ‘\s+’] A string or regex delimiter. The default of ‘\s+’ denotes one or more
whitespace characters.
**kwargs See read_csv for the full argument list.
Returns
DataFrame A parsed DataFrame object.
3.1.4 Excel
read_excel(io[, sheet_name, header, names, . . . ]) Read an Excel file into a pandas DataFrame.
ExcelFile.parse([sheet_name, header, names, Parse specified sheet(s) into a DataFrame.
. . . ])
pandas.read_excel
If a column or index contains an unparseable date, the entire column or index will be returned
unaltered as an object data type. If you don't want to parse some cells as date, just change their
type in Excel to “Text”. For non-standard datetime parsing, use pd.to_datetime after
pd.read_excel.
Note: A fast-path exists for iso8601-formatted dates.
date_parser [function, optional] Function to use for converting a sequence of string columns to
an array of datetime instances. The default uses dateutil.parser.parser to do the
conversion. Pandas will try to call date_parser in three different ways, advancing to the next
if an exception occurs: 1) Pass one or more arrays (as defined by parse_dates) as arguments;
2) concatenate (row-wise) the string values from the columns defined by parse_dates into
a single array and pass that; and 3) call date_parser once for each row using one or more
strings (corresponding to the columns defined by parse_dates) as arguments.
thousands [str, default None] Thousands separator for parsing string columns to numeric. Note
that this parameter is only necessary for columns stored as TEXT in Excel, any numeric
columns will automatically be parsed, regardless of display format.
comment [str, default None] Comments out remainder of line. Pass a character or characters to
this argument to indicate comments in the input file. Any data between the comment string
and the end of the current line is ignored.
skipfooter [int, default 0] Rows at the end to skip (0-indexed).
convert_float [bool, default True] Convert integral floats to int (i.e., 1.0 –> 1). If False, all
numeric data will be read in as floats: Excel stores all numbers as floats internally.
mangle_dupe_cols [bool, default True] Duplicate columns will be specified as ‘X’, ‘X.1’,
. . . ’X.N’, rather than ‘X’. . . ’X’. Passing in False will cause data to be overwritten if there
are duplicate names in the columns.
storage_options [dict, optional] Extra options that make sense for a particular storage con-
nection, e.g. host, port, username, password, etc., if using a URL that will be parsed by
fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument
with a local path or a file-like buffer. See the fsspec and backend storage implementation
docs for the set of allowed keys and values.
New in version 1.2.0.
Returns
DataFrame or dict of DataFrames DataFrame from the passed in Excel file. See notes in
sheet_name argument for more information on when a dict of DataFrames is returned.
See also:
Examples
The file can be read using the file name as string or an open file object:
Index and header can be specified via the index_col and header arguments
True, False, and NA values, and thousands separators have defaults, but can be explicitly specified, too. Supply
the values you would like as strings or lists of strings!
Comment lines in the excel input file can be skipped using the comment kwarg
pandas.ExcelFile.parse
ExcelWriter(path[, engine]) Class for writing DataFrame objects into excel sheets.
pandas.ExcelWriter
Notes
Examples
Default usage:
>>> import io
>>> buffer = io.BytesIO()
>>> with pd.ExcelWriter(buffer) as writer:
... df.to_excel(writer)
Attributes
None
Methods
None
3.1.5 JSON
pandas.read_json
– default is 'columns'
– The DataFrame index must be unique for orients 'index' and 'columns'.
– The DataFrame columns must be unique for orients 'index', 'columns', and
'records'.
typ [{‘frame’, ‘series’}, default ‘frame’] The type of object to recover.
dtype [bool or dict, default None] If True, infer dtypes; if a dict of column to dtype, then use
those; if False, then don’t infer dtypes at all, applies only to the data.
For all orient values except 'table', default is True.
Changed in version 0.25.0: Not applicable for orient='table'.
convert_axes [bool, default None] Try to convert the axes to the proper dtypes.
For all orient values except 'table', default is True.
Changed in version 0.25.0: Not applicable for orient='table'.
convert_dates [bool or list of str, default True] If True then default datelike columns may be
converted (depending on keep_default_dates). If False, no dates will be converted. If a list
of column names, then those columns will be converted and default datelike columns may
also be converted (depending on keep_default_dates).
keep_default_dates [bool, default True] If parsing dates (convert_dates is not False), then try
to parse the default datelike columns. A column label is datelike if
• it ends with '_at',
• it ends with '_time',
• it begins with 'timestamp',
• it is 'modified', or
• it is 'date'.
numpy [bool, default False] Direct decoding to numpy arrays. Supports numeric data only,
but non-numeric column and index labels are supported. Note also that the JSON ordering
MUST be the same for each term if numpy=True.
Deprecated since version 1.0.0.
precise_float [bool, default False] Set to enable usage of higher precision (strtod) function when
decoding string to double values. Default (False) is to use fast but less precise builtin func-
tionality.
date_unit [str, default None] The timestamp unit to detect if converting dates. The default be-
haviour is to try and detect the correct precision, but if this is not desired then pass one of ‘s’,
‘ms’, ‘us’ or ‘ns’ to force parsing only seconds, milliseconds, microseconds or nanoseconds
respectively.
encoding [str, default is ‘utf-8’] The encoding to use to decode py3 bytes.
lines [bool, default False] Read the file as a json object per line.
chunksize [int, optional] Return JsonReader object for iteration. See the line-delimited json
docs for more information on chunksize. This can only be passed if lines=True. If this
is None, the file will be read into memory all at once.
Changed in version 1.2: JsonReader is a context manager.
compression [{‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}, default ‘infer’] For on-the-fly decom-
pression of on-disk data. If ‘infer’, then use gzip, bz2, zip or xz if path_or_buf is a string
ending in ‘.gz’, ‘.bz2’, ‘.zip’, or ‘xz’, respectively, and no decompression otherwise. If
using ‘zip’, the ZIP file must contain only one data file to be read in. Set to None for no
decompression.
nrows [int, optional] The number of lines from the line-delimited jsonfile that has to be read.
This can only be passed if lines=True. If this is None, all the rows will be returned.
New in version 1.1.
storage_options [dict, optional] Extra options that make sense for a particular storage con-
nection, e.g. host, port, username, password, etc., if using a URL that will be parsed by
fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument
with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
New in version 1.2.0.
Returns
Series or DataFrame The type returned depends on the value of typ.
See also:
Notes
Specific to orient='table', if a DataFrame with a literal Index name of index gets written with
to_json(), the subsequent read operation will incorrectly set the Index name to None. This is because
index is also used by DataFrame.to_json() to denote a missing Index name, and the subsequent
read_json() operation cannot distinguish between the two. The same limitation is encountered with a
MultiIndex and any names beginning with 'level_'.
Examples
>>> df.to_json(orient='split')
'{"columns":["col 1","col 2"],
"index":["row 1","row 2"],
"data":[["a","b"],["c","d"]]}'
>>> pd.read_json(_, orient='split')
col 1 col 2
row 1 a b
row 2 c d
>>> df.to_json(orient='index')
'{"row 1":{"col 1":"a","col 2":"b"},"row 2":{"col 1":"c","col 2":"d"}}'
>>> pd.read_json(_, orient='index')
col 1 col 2
row 1 a b
row 2 c d
Encoding/decoding a Dataframe using 'records' formatted JSON. Note that index labels are not preserved
with this encoding.
>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'
>>> pd.read_json(_, orient='records')
col 1 col 2
0 a b
1 c d
>>> df.to_json(orient='table')
'{"schema": {"fields": [{"name": "index", "type": "string"},
{"name": "col 1", "type": "string"},
{"name": "col 2", "type": "string"}],
"primaryKey": "index",
"pandas_version": "0.20.0"},
"data": [{"index": "row 1", "col 1": "a", "col 2": "b"},
{"index": "row 2", "col 1": "c", "col 2": "d"}]}'
pandas.json_normalize
max_level [int, default None] Max number of levels (depth of dict) to normalize. If None,
normalizes all levels.
New in version 0.25.0.
Returns
frame [DataFrame]
Normalize semi-structured JSON data into a flat table.
Examples
Returns normalized data with columns prefixed with the given string.
pandas.io.json.build_table_schema
Notes
See Table Schema for conversion types. Timedeltas are converted to ISO8601 duration format with 9 decimal
places after the seconds field for nanosecond precision.
Categoricals are converted to the any dtype, and use the enum field constraint to list the allowed values. The
ordered attribute is included in an ordered field.
Examples
>>> df = pd.DataFrame(
... {'A': [1, 2, 3],
... 'B': ['a', 'b', 'c'],
... 'C': pd.date_range('2016-01-01', freq='d', periods=3),
... }, index=pd.Index(range(3), name='idx'))
>>> build_table_schema(df)
{'fields': [{'name': 'idx', 'type': 'integer'},
{'name': 'A', 'type': 'integer'},
{'name': 'B', 'type': 'string'},
{'name': 'C', 'type': 'datetime'}],
'pandas_version': '0.20.0',
'primaryKey': ['idx']}
3.1.6 HTML
read_html(io[, match, flavor, header, . . . ]) Read HTML tables into a list of DataFrame ob-
jects.
pandas.read_html
Soup. However, these attributes must be valid HTML table attributes to work correctly. For
example,
is a valid attribute dictionary because the ‘id’ HTML tag attribute is a valid HTML attribute
for any HTML tag as per this document.
is not a valid attribute dictionary because ‘asdf’ is not a valid HTML attribute even if it is a
valid XML attribute. Valid HTML 4.01 table attributes can be found here. A working draft
of the HTML 5 spec can be found here. It contains the latest information on table attributes
for the modern web.
parse_dates [bool, optional] See read_csv() for more details.
thousands [str, optional] Separator to use to parse thousands. Defaults to ','.
encoding [str, optional] The encoding used to decode the web page. Defaults to None. None
preserves the previous encoding behavior, which depends on the underlying parser library
(e.g., the parser library will try to use the encoding provided by the document).
decimal [str, default ‘.’] Character to recognize as decimal point (e.g. use ‘,’ for European data).
converters [dict, default None] Dict of functions for converting values in certain columns. Keys
can either be integers or column labels, values are functions that take one input argument,
the cell (not column) content, and return the transformed content.
na_values [iterable, default None] Custom NA values.
keep_default_na [bool, default True] If na_values are specified and keep_default_na is False
the default NaN values are overridden, otherwise they’re appended to.
displayed_only [bool, default True] Whether elements with “display: none” should be parsed.
Returns
dfs A list of DataFrames.
See also:
Notes
Before using this function you should read the gotchas about the HTML parsing libraries.
Expect to do some cleanup after you call this function. For example, you might need to manually assign column
names if the column names are converted to NaN when you pass the header=0 argument. We try to assume as
little as possible about the structure of the table and push the idiosyncrasies of the HTML contained in the table
to the user.
This function searches for <table> elements and only for <tr> and <th> rows and <td> elements within
each <tr> or <th> element in the table. <td> stands for “table data”. This function attempts to properly
handle colspan and rowspan attributes. If the table has a <thead>, it is used to construct
the header; otherwise the function attempts to find the header within the body (by putting rows with only <th>
elements into the header).
Similar to read_csv() the header argument is applied after skiprows is applied.
This function will always return a list of DataFrame or it will fail, i.e., it will not return an empty list.
Examples
See the read_html documentation in the IO section of the docs for some examples of reading in HTML tables.
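A minimal sketch, assuming one of the parser libraries (lxml, or BeautifulSoup4 with html5lib) is installed:
>>> html = ("<table><tr><th>a</th><th>b</th></tr>"
...         "<tr><td>1</td><td>2</td></tr></table>")
>>> pd.read_html(html)[0]
   a  b
0  1  2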
3.1.7 HDFStore: PyTables (HDF5)
read_hdf(path_or_buf[, key, mode, errors, . . . ]) Read from the store, close it if we opened it.
HDFStore.put(key, value[, format, index, . . . ]) Store object in HDFStore.
HDFStore.append(key, value[, format, axes, . . . ]) Append to Table in file.
HDFStore.get(key) Retrieve pandas object stored in file.
HDFStore.select(key[, where, start, stop, . . . ]) Retrieve pandas object stored in file, optionally based
on where criteria.
HDFStore.info() Print detailed information on the store.
HDFStore.keys([include]) Return a list of keys corresponding to objects stored in
HDFStore.
HDFStore.groups() Return a list of all the top-level nodes.
HDFStore.walk([where]) Walk the pytables group hierarchy for pandas objects.
pandas.read_hdf
Warning: Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype
data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can
be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
Parameters
path_or_buf [str, path object, pandas.HDFStore or file-like object] Any valid string path is
acceptable. The string could be a URL. Valid URL schemes include http, ftp, s3, and file.
For file URLs, a host is expected. A local file could be: file://localhost/path/
to/table.h5.
If you want to pass in a path object, pandas accepts any os.PathLike.
Alternatively, pandas accepts an open pandas.HDFStore object.
By file-like object, we refer to objects with a read() method, such as a file handle (e.g.
via builtin open function) or StringIO.
key [object, optional] The group identifier in the store. Can be omitted if the HDF file contains
a single pandas object.
mode [{‘r’, ‘r+’, ‘a’}, default ‘r’] Mode to use when opening the file. Ignored if path_or_buf is
a pandas.HDFStore. Default is ‘r’.
errors [str, default ‘strict’] Specifies how encoding and decoding errors are to be handled. See
the errors argument for open() for a full list of options.
where [list, optional] A list of Term (or convertible) objects.
start [int, optional] Row number to start selection.
stop [int, optional] Row number to stop selection.
columns [list, optional] A list of columns names to return.
iterator [bool, optional] Return an iterator object.
chunksize [int, optional] Number of rows to include in an iteration when using an iterator.
**kwargs Additional keyword arguments passed to HDFStore.
Returns
item [object] The selected object. Return type depends on the object stored.
See also:
Examples
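A minimal round-trip sketch (assumes PyTables is installed; the file name is arbitrary):
>>> df = pd.DataFrame([[1, 1.0, 'a']], columns=['x', 'y', 'z'])
>>> df.to_hdf('./store.h5', 'data')
>>> pd.read_hdf('./store.h5')
   x    y  z
0  1  1.0  a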
pandas.HDFStore.put
pandas.HDFStore.append
Notes
Does not check if data being appended overlaps with existing data in the table, so be careful.
pandas.HDFStore.get
HDFStore.get(key)
Retrieve pandas object stored in file.
Parameters
key [str]
Returns
object Same type as object stored in file.
pandas.HDFStore.select
Warning: Pandas uses PyTables for reading and writing HDF5 files, which allows serializing object-dtype
data with pickle when using the “fixed” format. Loading pickled data received from untrusted sources can
be unsafe.
See: https://docs.python.org/3/library/pickle.html for more.
Parameters
key [str] Object being retrieved from file.
where [list or None] List of Term (or convertible) objects, optional.
start [int or None] Row number to start selection.
stop [int, default None] Row number to stop selection.
columns [list or None] A list of columns that if not None, will limit the return columns.
iterator [bool, default False] Return an iterator.
chunksize [int or None] Number of rows to include in each iteration; return an iterator.
auto_close [bool or False] Should automatically close the store when finished.
Returns
object Retrieved object from file.
pandas.HDFStore.info
HDFStore.info()
Print detailed information on the store.
Returns
str
pandas.HDFStore.keys
HDFStore.keys(include='pandas')
Return a list of keys corresponding to objects stored in HDFStore.
Parameters
include [str, default ‘pandas’] When include equals ‘pandas’, return pandas objects. When include
equals ‘native’, return native HDF5 Table objects.
New in version 1.1.0.
Returns
list List of ABSOLUTE path-names (e.g. have the leading ‘/’).
Raises
pandas.HDFStore.groups
HDFStore.groups()
Return a list of all the top-level nodes.
Each node returned is not a pandas storage object.
Returns
list List of objects.
pandas.HDFStore.walk
HDFStore.walk(where='/')
Walk the pytables group hierarchy for pandas objects.
This generator will yield the group path, subgroups and pandas object names for each group.
Any non-pandas PyTables objects that are not a group will be ignored.
The where group itself is listed first (preorder), then each of its child groups (following an alphanumerical order)
is also traversed, following the same procedure.
New in version 0.24.0.
Parameters
where [str, default “/”] Group where to start walking.
Yields
path [str] Full path to a group (without trailing ‘/’).
groups [list] Names (strings) of the groups contained in path.
leaves [list] Names (strings) of the pandas objects contained in path.
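A combined sketch of the HDFStore methods above (assumes PyTables; the file and key names are arbitrary):
>>> store = pd.HDFStore('store.h5')
>>> store.put('data/df', pd.DataFrame({'a': [1, 2]}))
>>> store.keys()
['/data/df']
>>> for path, groups, leaves in store.walk():
...     print(path, groups, leaves)
/ ['data'] []
/data [] ['df']
>>> store.close()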
3.1.8 Feather
read_feather(path[, columns, use_threads, . . . ]) Load a feather-format object from the file path.
pandas.read_feather
columns [sequence, default None] If not provided, all columns are read.
New in version 0.24.0.
use_threads [bool, default True]
Whether to parallelize reading using multiple threads.
New in version 0.24.0.
storage_options [dict, optional] Extra options that make sense for a particular storage con-
nection, e.g. host, port, username, password, etc., if using a URL that will be parsed by
fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing this argument
with a non-fsspec URL. See the fsspec and backend storage implementation docs for the set
of allowed keys and values.
New in version 1.2.0.
Returns
type of object stored in file
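A minimal sketch (requires pyarrow; the file name is arbitrary):
>>> df = pd.DataFrame({'a': [1, 2, 3], 'b': list('xyz')})
>>> df.to_feather('data.feather')
>>> pd.read_feather('data.feather', columns=['a'])
   a
0  1
1  2
2  3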
3.1.9 Parquet
read_parquet(path[, engine, columns, . . . ]) Load a parquet object from the file path, returning a
DataFrame.
pandas.read_parquet
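A minimal sketch (requires pyarrow or fastparquet; the file name is arbitrary):
>>> df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
>>> df.to_parquet('data.parquet')
>>> pd.read_parquet('data.parquet', columns=['b'])
   b
0  x
1  y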
3.1.10 ORC
read_orc(path[, columns]) Load an ORC object from the file path, returning a
DataFrame.
pandas.read_orc
3.1.11 SAS
pandas.read_sas
3.1.12 SPSS
read_spss(path[, usecols, convert_categoricals]) Load an SPSS file from the file path, returning a
DataFrame.
pandas.read_spss
3.1.13 SQL
pandas.read_sql_table
Notes
Any datetime values with time zone information will be converted to UTC.
Examples
pandas.read_sql_query
Notes
Any datetime values with time zone information parsed via the parse_dates parameter will be converted to UTC.
pandas.read_sql
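A minimal sketch using an in-memory SQLite connection (the table and column names are illustrative):
>>> import sqlite3
>>> conn = sqlite3.connect(':memory:')
>>> pd.DataFrame({'a': [1, 2]}).to_sql('t', conn, index=False)
>>> pd.read_sql('SELECT * FROM t', conn)
   a
0  1
1  2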
3.1.14 Google BigQuery
pandas.read_gbq
3.1.15 STATA
pandas.read_stata
Notes
Categorical variables read through an iterator may not have the same categories and dtype. This occurs when a
variable stored in a DTA file is associated to an incomplete set of value labels that only label a strict subset of
the values.
Examples
>>> df = pd.read_stata('filename.dta')
pandas.io.stata.StataReader.data_label
property StataReader.data_label
Return data label of Stata file.
pandas.io.stata.StataReader.value_labels
StataReader.value_labels()
Return a dict that associates each variable name with a dict, which in turn associates each value with its corresponding label.
Returns
dict
pandas.io.stata.StataReader.variable_labels
StataReader.variable_labels()
Return variable labels as a dict, associating each variable name with its corresponding label.
Returns
dict
pandas.io.stata.StataWriter.write_file
StataWriter.write_file()
3.2 General functions
3.2.1 Data manipulations
melt(frame[, id_vars, value_vars, var_name, . . . ]) Unpivot a DataFrame from wide to long format, option-
ally leaving identifiers set.
pivot(data[, index, columns, values]) Return reshaped DataFrame organized by given index /
column values.
pivot_table(data[, values, index, columns, . . . ]) Create a spreadsheet-style pivot table as a DataFrame.
crosstab(index, columns[, values, rownames, . . . ]) Compute a simple cross tabulation of two (or more) fac-
tors.
cut(x, bins[, right, labels, retbins, . . . ]) Bin values into discrete intervals.
qcut(x, q[, labels, retbins, precision, . . . ]) Quantile-based discretization function.
merge(left, right[, how, on, left_on, . . . ]) Merge DataFrame or named Series objects with a
database-style join.
merge_ordered(left, right[, on, left_on, . . . ]) Perform merge with optional filling/interpolation.
merge_asof(left, right[, on, left_on, . . . ]) Perform an asof merge.
concat(objs[, axis, join, ignore_index, . . . ]) Concatenate pandas objects along a particular axis with
optional set logic along the other axes.
get_dummies(data[, prefix, prefix_sep, . . . ]) Convert categorical variable into dummy/indicator vari-
ables.
factorize(values[, sort, na_sentinel, size_hint]) Encode the object as an enumerated type or categorical
variable.
unique(values) Hash table-based unique.
wide_to_long(df, stubnames, i, j[, sep, suffix]) Wide panel to long format.
pandas.melt
ignore_index [bool, default True] If True, original index is ignored. If False, the original index
is retained. Index labels will be repeated as necessary.
New in version 1.1.0.
Returns
DataFrame Unpivoted DataFrame.
See also:
Examples
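A small sketch of the ignore_index behaviour (the frame below is illustrative):
>>> df = pd.DataFrame({'A': ['a', 'b'], 'B': [1, 2]}, index=['x', 'y'])
>>> pd.melt(df, id_vars=['A'], value_vars=['B'], ignore_index=False)
   A variable  value
x  a        B      1
y  b        B      2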
pandas.pivot
DataFrame.pivot_table Generalization of pivot that can handle duplicate values for one index/column
pair.
DataFrame.unstack Pivot based on the index values instead of a column.
wide_to_long Wide panel to long format. Less flexible but more user-friendly than melt.
Notes
For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack methods.
Examples
You could also assign a list of column names or a list of index names.
>>> df = pd.DataFrame({
... "lev1": [1, 1, 1, 2, 2, 2],
... "lev2": [1, 1, 2, 1, 1, 2],
... "lev3": [1, 2, 1, 2, 1, 2],
... "lev4": [1, 2, 3, 4, 5, 6],
... "values": [0, 1, 2, 3, 4, 5]})
>>> df
lev1 lev2 lev3 lev4 values
0 1 1 1 1 0
1 1 1 2 2 1
2 1 2 1 3 2
3 2 1 2 4 3
4 2 1 1 5 4
5 2 2 2 6 5
Notice that the first two rows are the same for our index and columns arguments.
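A frame whose first two rows repeat the same index/columns combination cannot be pivoted; a minimal sketch (illustrative data):
>>> bad = pd.DataFrame({"foo": ['one', 'one'], "bar": ['A', 'A'], "baz": [1, 2]})
>>> bad.pivot(index='foo', columns='bar', values='baz')
Traceback (most recent call last):
  ...
ValueError: Index contains duplicate entries, cannot reshape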
pandas.pivot_table
Examples
The next example aggregates by taking the mean across multiple columns.
We can also calculate multiple types of aggregations for any given value column.
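A minimal sketch of such an aggregation (the data is made up; note that the mean is returned as float):
>>> df = pd.DataFrame({"A": ["foo", "foo", "bar", "bar"],
...                    "B": ["one", "two", "one", "two"],
...                    "C": [1, 3, 2, 4]})
>>> pd.pivot_table(df, values="C", index="A", columns="B", aggfunc="mean")
B    one  two
A
bar  2.0  4.0
foo  1.0  3.0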
pandas.crosstab
Notes
Any Series passed will have their name attributes used unless row or column names for the cross-tabulation are
specified.
Any input passed containing Categorical data will have all of its categories included in the cross-tabulation,
even if the actual data does not contain any instances of a particular category.
In the event that there aren’t overlapping indexes an empty DataFrame will be returned.
Examples
Here ‘c’ and ‘f’ are not represented in the data and will not be shown in the output because dropna is True by
default. Set dropna=False to preserve categories with no data.
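A minimal sketch of the dropna behaviour with Categorical inputs:
>>> foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
>>> bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])
>>> pd.crosstab(foo, bar, dropna=False)
col_0  d  e  f
row_0
a      1  0  0
b      0  1  0
c      0  0  0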
pandas.cut
bins [numpy.ndarray or IntervalIndex] The computed or specified bins. Only returned when
retbins=True. For scalar or sequence bins, this is an ndarray with the computed bins. If
duplicates='drop' is set, non-unique bins are dropped. For an IntervalIndex bins, this is equal to
bins.
See also:
qcut Discretize variable into equal-sized buckets based on rank or based on sample quantiles.
Categorical Array type for storing data that come from a fixed set of values.
Series One-dimensional array with axis labels (including time series).
IntervalIndex Immutable Index implementing an ordered, sliceable set.
Notes
Any NA values will be NA in the result. Out of bounds values will be NA in the resulting Series or Categorical
object.
Examples
Discovers the same bins, but assigns them specific labels. Notice that the returned Categorical’s categories are
labels and are ordered.
ordered=False will result in unordered categories when labels are passed. This parameter can be used to
allow non-unique labels:
Passing a Series as input returns a Series with the mapped values; this is useful for mapping values numerically to intervals based
on bins.
Passing an IntervalIndex for bins results in those categories exactly. Notice that values not covered by the
IntervalIndex are set to NaN. 0 is to the left of the first bin (which is closed on the right), and 1.5 falls between
two bins.
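A minimal sketch of binning with an IntervalIndex (values not covered by the intervals become NaN):
>>> bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
>>> pd.cut([0, 0.5, 1.5, 2.5, 4.5], bins)
[NaN, (0.0, 1.0], NaN, (2.0, 3.0], (4.0, 5.0]]
Categories (3, interval[float64]): [(0, 1] < (2, 3] < (4, 5]]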
pandas.qcut
Notes
Examples
>>> pd.qcut(range(5), 4)
...
[(-0.001, 1.0], (-0.001, 1.0], (1.0, 2.0], (2.0, 3.0], (3.0, 4.0]]
Categories (4, interval[float64]): [(-0.001, 1.0] < (1.0, 2.0] ...
pandas.merge
indicator [bool or str, default False] If True, adds a column to the output DataFrame called
“_merge” with information on the source of each row. The column can be given a different
name by providing a string argument. The column will have a Categorical type with the
value of “left_only” for observations whose merge key only appears in the left DataFrame,
“right_only” for observations whose merge key only appears in the right DataFrame, and
“both” if the observation’s merge key is found in both DataFrames.
validate [str, optional] If specified, checks if merge is of specified type.
• “one_to_one” or “1:1”: check if merge keys are unique in both left and right datasets.
• “one_to_many” or “1:m”: check if merge keys are unique in left dataset.
• “many_to_one” or “m:1”: check if merge keys are unique in right dataset.
• “many_to_many” or “m:m”: allowed, but does not result in checks.
Returns
DataFrame A DataFrame of the two merged objects.
See also:
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0.
Support for merging named Series objects was added in version 0.24.0.
Examples
Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y,
appended.
Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey',
... suffixes=('_left', '_right'))
lkey value_left rkey value_right
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
Traceback (most recent call last):
...
ValueError: columns overlap but no suffix specified:
Index(['value'], dtype='object')
pandas.merge_ordered
DataFrame The merged DataFrame output type will be the same as ‘left’, if it is a subclass of
DataFrame.
See also:
Examples
pandas.merge_asof
Returns
merged [DataFrame]
See also:
Examples
We only asof within 2ms between the quote time and the trade time
>>> pd.merge_asof(
... trades, quotes, on="time", by="ticker", tolerance=pd.Timedelta("2ms")
... )
time ticker price quantity bid ask
0 2016-05-25 13:30:00.023 MSFT 51.95 75 51.95 51.96
1 2016-05-25 13:30:00.038 MSFT 51.95 155 NaN NaN
2 2016-05-25 13:30:00.048 GOOG 720.77 100 720.50 720.93
3 2016-05-25 13:30:00.048 GOOG 720.92 100 720.50 720.93
4 2016-05-25 13:30:00.048 AAPL 98.00 100 NaN NaN
We only asof within 10ms between the quote time and the trade time, and we exclude exact matches on time.
However, prior data will propagate forward:
>>> pd.merge_asof(
... trades,
... quotes,
... on="time",
... by="ticker",
pandas.concat
See also:
Notes
Examples
Clear the existing index and reset it in the result by setting the ignore_index option to True.
Add a hierarchical index at the outermost level of the data with the keys option.
Label the index keys you create with the names option.
Combine DataFrame objects with overlapping columns and return everything. Columns outside the intersec-
tion will be filled with NaN values.
Combine DataFrame objects with overlapping columns and return only those that are shared by passing
inner to the join keyword argument.
Prevent the result from including duplicate index values with the verify_integrity option.
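A minimal sketch of a few of these options (the Series below are illustrative):
>>> s1 = pd.Series(['a', 'b'])
>>> s2 = pd.Series(['c', 'd'])
>>> pd.concat([s1, s2], ignore_index=True)
0    a
1    b
2    c
3    d
dtype: object
>>> pd.concat([s1, s2], keys=['s1', 's2'], names=['Series name', 'Row ID'])
Series name  Row ID
s1           0         a
             1         b
s2           0         c
             1         d
dtype: object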
pandas.get_dummies
Examples
>>> s = pd.Series(list('abca'))
>>> pd.get_dummies(s)
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
>>> s1 = ['a', 'b', np.nan]
>>> pd.get_dummies(s1)
a b
0 1 0
1 0 1
2 0 0
>>> pd.get_dummies(pd.Series(list('abcaa')))
a b c
0 1 0 0
1 0 1 0
2 0 0 1
3 1 0 0
4 1 0 0
pandas.factorize
Note: Even if there’s a missing value in values, uniques will not contain an entry for it.
See also:
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results are
identical for methods like Series.factorize().
With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.
Missing values are indicated in codes with na_sentinel (-1 by default). Note that missing values are never
included in uniques.
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing pandas
objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by setting
na_sentinel=None.
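A minimal sketch of the basic behaviour:
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)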
pandas.unique
pandas.unique(values)
Hash table-based unique. Uniques are returned in order of appearance. This does NOT sort.
Significantly faster than numpy.unique. Includes NA values.
Parameters
values [1d array-like]
Returns
numpy.ndarray or ExtensionArray The return can be:
• Index : when the input is an Index
• Categorical : when the input is a Categorical dtype
• ndarray : when the input is a Series/ndarray
Return numpy.ndarray or ExtensionArray.
See also:
Examples
>>> pd.unique(pd.Series([pd.Timestamp('20160101'),
... pd.Timestamp('20160101')]))
array(['2016-01-01T00:00:00.000000000'], dtype='datetime64[ns]')
>>> pd.unique(list('baabc'))
array(['b', 'a', 'c'], dtype=object)
>>> pd.unique(pd.Series(pd.Categorical(list('baabc'),
... categories=list('abc'))))
[b, a, c]
Categories (3, object): [b, a, c]
>>> pd.unique(pd.Series(pd.Categorical(list('baabc'),
... categories=list('abc'),
... ordered=True)))
[b, a, c]
Categories (3, object): [a < b < c]
An array of tuples
pandas.wide_to_long
melt Unpivot a DataFrame from wide to long format, optionally leaving identifiers set.
pivot Create a spreadsheet-style pivot table as a DataFrame.
DataFrame.pivot Pivot without aggregation that can handle non-numeric data.
DataFrame.pivot_table Generalization of pivot that can handle duplicate values for one index/column
pair.
DataFrame.unstack Pivot based on the index values instead of a column.
Notes
All extra variables are left untouched. This simply uses pandas.melt under the hood, but is hard-coded to “do
the right thing” in a typical case.
Examples
>>> np.random.seed(123)
>>> df = pd.DataFrame({"A1970" : {0 : "a", 1 : "b", 2 : "c"},
... "A1980" : {0 : "d", 1 : "e", 2 : "f"},
... "B1970" : {0 : 2.5, 1 : 1.2, 2 : .7},
... "B1980" : {0 : 3.2, 1 : 1.3, 2 : .1},
... "X" : dict(zip(range(3), np.random.randn(3)))
... })
>>> df["id"] = df.index
>>> df
A1970 A1980 B1970 B1980 X id
0 a d 2.5 3.2 -1.085631 0
1 b e 1.2 1.3 0.997345 1
2 c f 0.7 0.1 0.282978 2
>>> pd.wide_to_long(df, ["A", "B"], i="id", j="year")
...
X A B
id year
0 1970 -1.085631 a 2.5
1 1970 0.997345 b 1.2
2 1970 0.282978 c 0.7
0 1980 -1.085631 d 3.2
1 1980 0.997345 e 1.3
2 1980 0.282978 f 0.1
Going from long back to wide just takes some creative use of unstack
>>> w = l.unstack()
>>> w.columns = w.columns.map('{0[0]}{0[1]}'.format)
>>> w.reset_index()
famid birth ht1 ht2
0 1 1 2.8 3.4
1 1 2 2.9 3.8
2 1 3 2.2 2.9
3 2 1 2.0 3.2
4 2 2 1.8 2.8
5 2 3 1.9 2.4
6 3 1 2.2 3.3
7 3 2 2.3 3.4
8 3 3 2.1 2.9
If we have many columns, we could also use a regex to find our stubnames and pass that list on to wide_to_long
>>> stubnames = sorted(
... set([match[0] for match in df.columns.str.findall(
... r'[A-B]\(.*\)').values if match != []])
... )
>>> list(stubnames)
['A(weekly)', 'B(weekly)']
All of the above examples have integers as suffixes. It is possible to have non-integers as suffixes.
>>> df = pd.DataFrame({
... 'famid': [1, 1, 1, 2, 2, 2, 3, 3, 3],
... 'birth': [1, 2, 3, 1, 2, 3, 1, 2, 3],
... 'ht_one': [2.8, 2.9, 2.2, 2, 1.8, 1.9, 2.2, 2.3, 2.1],
... 'ht_two': [3.4, 3.8, 2.9, 3.2, 2.8, 2.4, 3.3, 3.4, 2.9]
... })
>>> df
famid birth ht_one ht_two
0 1 1 2.8 3.4
1 1 2 2.9 3.8
2 1 3 2.2 2.9
3 2 1 2.0 3.2
4 2 2 1.8 2.8
5 2 3 1.9 2.4
6 3 1 2.2 3.3
7 3 2 2.3 3.4
8 3 3 2.1 2.9
pandas.isna
pandas.isna(obj)
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays,
None or NaN in object arrays, NaT in datetimelike).
Parameters
obj [scalar or array-like] Object to check for null or missing values.
Returns
bool or array-like of bool For scalar input, returns a scalar boolean. For array input, returns an
array of boolean indicating whether each corresponding element is missing.
See also:
Examples
>>> pd.isna('dog')
False
>>> pd.isna(pd.NA)
True
>>> pd.isna(np.nan)
True
For Series and DataFrame, the same type is returned, containing booleans.
>>> pd.isna(df[1])
0 False
1 True
Name: 1, dtype: bool
pandas.isnull
pandas.isnull(obj)
Detect missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are missing (NaN in numeric arrays,
None or NaN in object arrays, NaT in datetimelike).
Parameters
obj [scalar or array-like] Object to check for null or missing values.
Returns
bool or array-like of bool For scalar input, returns a scalar boolean. For array input, returns an
array of boolean indicating whether each corresponding element is missing.
See also:
Examples
>>> pd.isna('dog')
False
>>> pd.isna(pd.NA)
True
>>> pd.isna(np.nan)
True
For Series and DataFrame, the same type is returned, containing booleans.
>>> pd.isna(df[1])
0 False
1 True
Name: 1, dtype: bool
pandas.notna
pandas.notna(obj)
Detect non-missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are valid (not missing, which is
NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
Parameters
obj [array-like or object value] Object to check for not null or non-missing values.
Returns
bool or array-like of bool For scalar input, returns a scalar boolean. For array input, returns an
array of boolean indicating whether each corresponding element is valid.
See also:
Examples
>>> pd.notna('dog')
True
>>> pd.notna(pd.NA)
False
>>> pd.notna(np.nan)
False
For Series and DataFrame, the same type is returned, containing booleans.
>>> df = pd.DataFrame([['ant', 'bee', 'cat'], ['dog', None, 'fly']])
>>> df
0 1 2
0 ant bee cat
1 dog None fly
>>> pd.notna(df)
0 1 2
0 True True True
1 True False True
>>> pd.notna(df[1])
0 True
1 False
Name: 1, dtype: bool
pandas.notnull
pandas.notnull(obj)
Detect non-missing values for an array-like object.
This function takes a scalar or array-like object and indicates whether values are valid (not missing, which is
NaN in numeric arrays, None or NaN in object arrays, NaT in datetimelike).
Parameters
obj [array-like or object value] Object to check for not null or non-missing values.
Returns
bool or array-like of bool For scalar input, returns a scalar boolean. For array input, returns an
array of boolean indicating whether each corresponding element is valid.
See also:
Examples
>>> pd.notna('dog')
True
>>> pd.notna(pd.NA)
False
>>> pd.notna(np.nan)
False
For Series and DataFrame, the same type is returned, containing booleans.
>>> pd.notna(df[1])
0 True
1 False
Name: 1, dtype: bool
pandas.to_numeric
Examples
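A minimal sketch, including the downcast option:
>>> s = pd.Series(['1.0', '2', -3])
>>> pd.to_numeric(s)
0    1.0
1    2.0
2   -3.0
dtype: float64
>>> pd.to_numeric(s, downcast='integer')
0    1
1    2
2   -3
dtype: int8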
pandas.to_datetime
The presence of out-of-bounds values will render the cache unusable and may slow down
parsing.
Changed in version 0.25.0: changed default value from False to True.
Returns
datetime If parsing succeeded. Return type depends on input:
• list-like: DatetimeIndex
• Series: Series of datetime64 dtype
• scalar: Timestamp
In case when it is not possible to return designated types (e.g. when any element of input is
before Timestamp.min or after Timestamp.max) return will have datetime.datetime type (or
corresponding array/Series).
See also:
Examples
Assembling a datetime from multiple columns of a DataFrame. The keys can be common abbreviations like
['year', 'month', 'day', 'minute', 'second', 'ms', 'us', 'ns'] or plurals of the same.
If a date does not meet the timestamp limitations, passing errors=’ignore’ will return the original input instead
of raising any exception.
Passing errors=’coerce’ will force an out-of-bounds date to NaT, in addition to forcing non-dates (or non-
parseable dates) to NaT.
Passing infer_datetime_format=True can often speed up parsing if the input is not exactly in ISO 8601 format
but is in a regular format.
Warning: For float arg, precision rounding might happen. To prevent unexpected behavior use a fixed-
width exact type.
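Minimal sketches of assembling a datetime from columns and of coercing an out-of-bounds date:
>>> df = pd.DataFrame({'year': [2015, 2016], 'month': [2, 3], 'day': [4, 5]})
>>> pd.to_datetime(df)
0   2015-02-04
1   2016-03-05
dtype: datetime64[ns]
>>> pd.to_datetime('13000101', format='%Y%m%d', errors='coerce')
NaT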
pandas.to_timedelta
Notes
If the precision is higher than nanoseconds, the precision of the duration is truncated to nanoseconds for string
inputs.
Examples
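A minimal sketch of string and numeric inputs:
>>> pd.to_timedelta('1 days 06:05:01.00003')
Timedelta('1 days 06:05:01.000030')
>>> pd.to_timedelta(np.arange(3), unit='d')
TimedeltaIndex(['0 days', '1 days', '2 days'], dtype='timedelta64[ns]', freq=None)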
pandas.date_range
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omitted,
the resulting DatetimeIndex will have periods linearly spaced elements between start and end (closed
on both sides).
To learn more about the frequency strings, please see this link.
Examples
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
Other Parameters
Changed the freq (frequency) to 'M' (month end frequency).
closed controls whether to include start and end that are on the boundary. The default includes boundary points
on either end.
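Minimal sketches of a fixed frequency and of linearly spaced periods:
>>> pd.date_range(start='1/1/2018', periods=3, freq='M')
DatetimeIndex(['2018-01-31', '2018-02-28', '2018-03-31'],
              dtype='datetime64[ns]', freq='M')
>>> pd.date_range(start='2018-04-24', end='2018-04-27', periods=3)
DatetimeIndex(['2018-04-24 00:00:00', '2018-04-25 12:00:00',
               '2018-04-27 00:00:00'],
              dtype='datetime64[ns]', freq=None)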
pandas.bdate_range
Notes
Of the four parameters: start, end, periods, and freq, exactly three must be specified. Specifying freq
is a requirement for bdate_range. Use date_range if specifying freq is not desired.
To learn more about the frequency strings, please see this link.
Examples
Note how the two weekend days are skipped in the result.
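A minimal sketch (2018-01-06 and 2018-01-07 fall on a weekend and are skipped):
>>> pd.bdate_range(start='1/1/2018', end='1/08/2018')
DatetimeIndex(['2018-01-01', '2018-01-02', '2018-01-03', '2018-01-04',
               '2018-01-05', '2018-01-08'],
              dtype='datetime64[ns]', freq='B')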
pandas.period_range
Notes
Of the three parameters: start, end, and periods, exactly two must be specified.
To learn more about the frequency strings, please see this link.
Examples
If start or end are Period objects, they will be used as anchor endpoints for a PeriodIndex with
frequency matching that of the period_range constructor.
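A minimal sketch with a monthly frequency:
>>> pd.period_range(start='2017-01-01', end='2017-04-01', freq='M')
PeriodIndex(['2017-01', '2017-02', '2017-03', '2017-04'],
            dtype='period[M]', freq='M')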
pandas.timedelta_range
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omit-
ted, the resulting TimedeltaIndex will have periods linearly spaced elements between start and end
(closed on both sides).
To learn more about the frequency strings, please see this link.
Examples
The closed parameter specifies which endpoint is included. The default behavior is to include both endpoints.
The freq parameter specifies the frequency of the TimedeltaIndex. Only fixed frequencies can be passed,
non-fixed frequencies such as ‘M’ (month end) will raise.
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
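A minimal sketch with the default daily frequency:
>>> pd.timedelta_range(start='1 day', periods=4)
TimedeltaIndex(['1 days', '2 days', '3 days', '4 days'],
               dtype='timedelta64[ns]', freq='D')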
pandas.infer_freq
pandas.infer_freq(index, warn=True)
Infer the most likely frequency given the input index. If the frequency is uncertain, a warning will be printed.
Parameters
index [DatetimeIndex or TimedeltaIndex] If passed a Series will use the values of the series
(NOT THE INDEX).
warn [bool, default True]
Returns
str or None None if no discernible frequency.
Raises
TypeError If the index is not datetime-like.
ValueError If there are fewer than three values.
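A minimal sketch:
>>> idx = pd.date_range(start='2020/12/01', end='2020/12/30', periods=30)
>>> pd.infer_freq(idx)
'D'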
pandas.interval_range
IntervalIndex An Index of intervals that are all closed on the same side.
Notes
Of the four parameters start, end, periods, and freq, exactly three must be specified. If freq is omit-
ted, the resulting IntervalIndex will have periods linearly spaced elements between start and end,
inclusively.
To learn more about datetime-like frequency strings, please see this link.
Examples
>>> pd.interval_range(start=pd.Timestamp('2017-01-01'),
... end=pd.Timestamp('2017-01-04'))
IntervalIndex([(2017-01-01, 2017-01-02], (2017-01-02, 2017-01-03],
               (2017-01-03, 2017-01-04]],
              closed='right', dtype='interval[datetime64[ns]]')
The freq parameter specifies the frequency between the left and right endpoints of the individual intervals
within the IntervalIndex. For numeric start and end, the frequency must also be numeric.
Similarly, for datetime-like start and end, the frequency must be convertible to a DateOffset.
>>> pd.interval_range(start=pd.Timestamp('2017-01-01'),
... periods=3, freq='MS')
IntervalIndex([(2017-01-01, 2017-02-01], (2017-02-01, 2017-03-01],
(2017-03-01, 2017-04-01]],
closed='right', dtype='interval[datetime64[ns]]')
Specify start, end, and periods; the frequency is generated automatically (linearly spaced).
The closed parameter specifies which endpoints of the individual intervals within the IntervalIndex are
closed.
eval(expr[, parser, engine, truediv, . . . ]) Evaluate a Python expression as a string using various
backends.
pandas.eval
from the expression. The default of 'pandas' parses code slightly different than standard
Python. Alternatively, you can parse an expression using the 'python' parser to retain
strict Python semantics. See the enhancing performance documentation for more details.
engine [{‘python’, ‘numexpr’}, default ‘numexpr’] The engine used to evaluate the expression.
Supported engines are
• None : tries to use numexpr, falls back to python
• 'numexpr': This default engine evaluates pandas objects using numexpr for large
speed ups in complex expressions with large frames.
• 'python': Performs operations as if you had eval’d in top level python. This en-
gine is generally not that useful.
More backends may be available in the future.
truediv [bool, optional] Whether to use true division, like in Python >= 3.
Deprecated since version 1.0.0.
local_dict [dict or None, optional] A dictionary of local variables, taken from locals() by de-
fault.
global_dict [dict or None, optional] A dictionary of global variables, taken from globals() by
default.
resolvers [list of dict-like or None, optional] A list of objects implementing the __getitem__
special method that you can use to inject an additional collection of namespaces to use
for variable lookup. For example, this is used in the query() method to inject the
DataFrame.index and DataFrame.columns variables that refer to their respective
DataFrame instance attributes.
level [int, optional] The number of prior stack frames to traverse and add to the current scope.
Most users will not need to change this parameter.
target [object, optional, default None] This is the target object for assignment. It is used when
there is variable assignment in the expression. If so, then target must support item assign-
ment with string keys, and if a copy is being returned, it must also support .copy().
inplace [bool, default False] If target is provided, and the expression mutates target, whether to
modify target inplace. Otherwise, return a copy of target with the mutation.
Returns
ndarray, numeric scalar, DataFrame, Series, or None The completion value of evaluating
the given code or None if inplace=True.
Raises
ValueError There are many instances where such an error can be raised:
• target=None, but the expression is multiline.
• The expression is multiline, but not all of them have item assignment. An example of such
an arrangement is this:
a = b + 1
a + 2
Here, there are expressions on different lines, making it multiline, but the last line has no
variable assigned to the output of a + 2.
• inplace=True, but the expression is missing item assignment.
• Item assignment is provided, but the target does not support string item assignment.
• Item assignment is provided and inplace=False, but the target does not support the .copy()
method
See also:
Notes
The dtype of any objects involved in an arithmetic % operation are recursively cast to float64.
See the enhancing performance documentation for more details.
Examples
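A minimal sketch using the target machinery described above (the column names are illustrative):
>>> df = pd.DataFrame({"animal": ["dog", "pig"], "age": [10, 20]})
>>> pd.eval("double_age = df.age * 2", target=df)
  animal  age  double_age
0    dog   10          20
1    pig   20          40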
3.2.7 Hashing
pandas.util.hash_array
pandas.util.hash_pandas_object
3.2.8 Testing
test([extra_args])
pandas.test
pandas.test(extra_args=None)
3.3 Series
3.3.1 Constructor
Series([data, index, dtype, name, copy, . . . ]) One-dimensional ndarray with axis labels (including
time series).
pandas.Series
index [array-like or Index (1d)] Values must be hashable and have the same length as data.
Non-unique index values are allowed. Will default to RangeIndex (0, 1, 2, . . . , n) if not
provided. If data is dict-like and index is None, then the values in the index are used to
reindex the Series after it is created using the keys in the data.
dtype [str, numpy.dtype, or ExtensionDtype, optional] Data type for the output Series. If not
specified, this will be inferred from data. See the user guide for more usages.
name [str, optional] The name to give to the Series.
copy [bool, default False] Copy input data.
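A minimal construction sketch from a dict:
>>> d = {'a': 1, 'b': 2, 'c': 3}
>>> pd.Series(data=d, index=['a', 'b', 'c'])
a    1
b    2
c    3
dtype: int64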
Attributes
pandas.Series.T
property Series.T
Return the transpose, which is by definition self.
pandas.Series.array
property Series.array
The ExtensionArray of the data backing this Series or Index.
New in version 0.24.0.
Returns
ExtensionArray An ExtensionArray of the values stored within. For extension types, this is
the actual array. For NumPy native types, this is a thin (no copy) wrapper around numpy.
ndarray.
.array differs from .values, which may require converting the data to a different form.
See also:
Notes
This table lays out the different array types for each extension dtype within pandas.
For any 3rd-party extension types, the array type will be an ExtensionArray.
For all remaining dtypes .array will be an arrays.PandasArray wrapping the actual
ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing data), then
use Series.to_numpy() instead.
Examples
For regular NumPy types like int and float, a PandasArray is returned.
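A minimal sketch:
>>> pd.Series([1, 2, 3]).array
<PandasArray>
[1, 2, 3]
Length: 3, dtype: int64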
pandas.Series.at
property Series.at
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single
value in a DataFrame or Series.
Raises
KeyError If ‘label’ does not exist in DataFrame.
See also:
Examples
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   index=[4, 5, 6], columns=['A', 'B', 'C'])
>>> df.loc[5].at['B']
4
pandas.Series.attrs
property Series.attrs
Dictionary of global attributes of this dataset.
See also:
pandas.Series.axes
property Series.axes
Return a list of the row axis labels.
pandas.Series.dtype
property Series.dtype
Return the dtype object of the underlying data.
pandas.Series.dtypes
property Series.dtypes
Return the dtype object of the underlying data.
pandas.Series.flags
property Series.flags
Get the properties associated with this pandas object.
The available flags are
• Flags.allows_duplicate_labels
See also:
Notes
“Flags” differ from “metadata”. Flags reflect properties of the pandas object (the Series or DataFrame).
Metadata refer to properties of the dataset, and should be stored in DataFrame.attrs.
Examples
>>> df.flags.allows_duplicate_labels
True
>>> df.flags.allows_duplicate_labels = False
>>> df.flags["allows_duplicate_labels"]
False
>>> df.flags["allows_duplicate_labels"] = True
pandas.Series.hasnans
property Series.hasnans
Return True if there are any NaNs; enables various performance speedups.
pandas.Series.iat
property Series.iat
Access a single value for a row/column pair by integer position.
Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or set a
single value in a DataFrame or Series.
Raises
IndexError When integer position is out of bounds.
See also:
Examples
>>> df.iat[1, 2]
1
>>> df.iat[1, 2] = 10
>>> df.iat[1, 2]
10
>>> df.loc[0].iat[1]
2
pandas.Series.iloc
property Series.iloc
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used
with a boolean array.
Allowed inputs are:
• An integer, e.g. 5.
• A list or array of integers, e.g. [4, 3, 0].
• A slice object with ints, e.g. 1:7.
• A boolean array.
• A callable function with one argument (the calling Series or DataFrame) and that returns valid
output for indexing (one of the above). This is useful in method chains, when you don’t have a
reference to the calling object, but would like to base your selection on some value.
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which allow
out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
See also:
Examples
>>> mydict = [{'a': 1, 'b': 2, 'c': 3, 'd': 4},
...           {'a': 100, 'b': 200, 'c': 300, 'd': 400},
...           {'a': 1000, 'b': 2000, 'c': 3000, 'd': 4000}]
>>> df = pd.DataFrame(mydict)
>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a 1
b 2
c 3
d 4
Name: 0, dtype: int64
>>> df.iloc[[0]]
a b c d
0 1 2 3 4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>
>>> df.iloc[:3]
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000
With a callable, useful in method chains. The x passed to the lambda is the DataFrame being sliced. This
selects the rows whose index label is even.
>>> df.iloc[0, 1]
2
pandas.Series.index
Series.index: pandas.core.indexes.base.Index
The index (axis labels) of the Series.
pandas.Series.is_monotonic
property Series.is_monotonic
Return boolean if values in the object are monotonic_increasing.
Returns
bool
pandas.Series.is_monotonic_decreasing
property Series.is_monotonic_decreasing
Return boolean if values in the object are monotonic_decreasing.
Returns
bool
pandas.Series.is_monotonic_increasing
property Series.is_monotonic_increasing
Alias for is_monotonic.
pandas.Series.is_unique
property Series.is_unique
Return boolean if values in the object are unique.
Returns
bool
pandas.Series.loc
property Series.loc
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
• A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an integer
position along the index).
• A list or array of labels, e.g. ['a', 'b', 'c'].
• A slice object with labels, e.g. 'a':'f'.
Warning: Note that contrary to usual python slices, both the start and the stop are included
• A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
• An alignable boolean Series. The index of the key will be aligned before masking.
• An alignable Index. The Index of the returned selection will be the input.
• A callable function with one argument (the calling Series or DataFrame) and that returns valid
output for indexing (one of the above)
See more at Selection by Label.
Raises
KeyError If any items are not found.
IndexingError If an indexed key is passed and its index is unalignable to the frame index.
See also:
Examples
Getting values
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
... index=['cobra', 'viper', 'sidewinder'],
... columns=['max_speed', 'shield'])
>>> df
max_speed shield
cobra 1 2
viper 4 5
sidewinder 7 8
Slice with labels for row and single label for column. As mentioned above, note that both the start and
stop of the slice are included.
>>> df.loc['cobra':'viper', 'max_speed']
cobra 1
viper 4
Name: max_speed, dtype: int64
Setting values
Set value for all items matching the list of labels
>>> df.loc[['viper', 'sidewinder'], ['shield']] = 50
Set value for an entire row
>>> df.loc['cobra'] = 10
>>> df
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50
Slice with integer labels for rows. As mentioned above, note that both the start and stop of the slice are
included.
>>> df = pd.DataFrame([[1, 2], [4, 5], [7, 8]],
...                   index=[7, 8, 9], columns=['max_speed', 'shield'])
>>> df.loc[7:9]
max_speed shield
7 1 2
8 4 5
9 7 8
>>> tuples = [
... ('cobra', 'mark i'), ('cobra', 'mark ii'),
... ('sidewinder', 'mark i'), ('sidewinder', 'mark ii'),
... ('viper', 'mark ii'), ('viper', 'mark iii')
... ]
>>> index = pd.MultiIndex.from_tuples(tuples)
>>> values = [[12, 2], [0, 4], [10, 20],
... [1, 4], [7, 1], [16, 36]]
>>> df = pd.DataFrame(values, columns=['max_speed', 'shield'], index=index)
>>> df
max_speed shield
cobra mark i 12 2
mark ii 0 4
sidewinder mark i 10 20
mark ii 1 4
viper mark ii 7 1
mark iii 16 36
>>> df.loc['cobra']
max_speed shield
mark i 12 2
mark ii 0 4
Single label for row and column. Similar to passing in a tuple, this returns a Series.
Single tuple for the index with a single label for the column
pandas.Series.name
property Series.name
Return the name of the Series.
The name of a Series becomes its index or column name if it is used to form a DataFrame. It is also used
whenever displaying the Series using the interpreter.
Returns
label (hashable object) The name of the Series, also the column name if part of a
DataFrame.
See also:
Examples
The Series name can be set initially when calling the constructor.
pandas.Series.nbytes
property Series.nbytes
Return the number of bytes in the underlying data.
pandas.Series.ndim
property Series.ndim
Number of dimensions of the underlying data, by definition 1.
pandas.Series.shape
property Series.shape
Return a tuple of the shape of the underlying data.
pandas.Series.size
property Series.size
Return the number of elements in the underlying data.
pandas.Series.values
property Series.values
Return Series as ndarray or ndarray-like depending on the dtype.
Returns
numpy.ndarray or ndarray-like
See also:
Examples
>>> pd.Series(list('aabc')).values
array(['a', 'a', 'b', 'c'], dtype=object)
>>> pd.Series(list('aabc')).astype('category').values
['a', 'a', 'b', 'c']
Categories (3, object): ['a', 'b', 'c']
empty
Methods
pandas.Series.abs
Series.abs()
Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns
abs Series/DataFrame containing the absolute value of each element.
See also:
Notes
For complex inputs, 1.2 + 1j, the absolute value is √(a² + b²).
Examples
Select rows with data closest to a certain value using argsort (from StackOverflow).
>>> df = pd.DataFrame({
... 'a': [4, 5, 6, 7],
... 'b': [10, 20, 30, 40],
... 'c': [100, 50, -30, -50]
... })
>>> df
a b c
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
>>> df.loc[(df.c - 43).abs().argsort()]
a b c
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
pandas.Series.add
Series.radd Reverse of the Addition operator, see Python documentation for more details.
Examples
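A minimal sketch with fill_value (the Series below are illustrative):
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> a.add(b, fill_value=0)
a    2.0
b    1.0
c    1.0
d    1.0
e    NaN
dtype: float64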
pandas.Series.add_prefix
Series.add_prefix(prefix)
Prefix labels with string prefix.
For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.
Parameters
prefix [str] The string to add before each label.
Returns
Series or DataFrame New Series or DataFrame with updated labels.
See also:
Examples
>>> s.add_prefix('item_')
item_0 1
item_1 2
item_2 3
item_3 4
dtype: int64
>>> df.add_prefix('col_')
col_A col_B
0 1 3
1 2 4
2 3 5
3 4 6
pandas.Series.add_suffix
Series.add_suffix(suffix)
Suffix labels with string suffix.
For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.
Parameters
suffix [str] The string to add after each label.
Returns
Series or DataFrame New Series or DataFrame with updated labels.
See also:
Examples
>>> s.add_suffix('_item')
0_item 1
1_item 2
2_item 3
3_item 4
dtype: int64
>>> df.add_suffix('_col')
A_col B_col
0 1 3
1 2 4
2 3 5
3 4 6
pandas.Series.agg
Notes
Examples
>>> s.agg('min')
1
pandas.Series.aggregate
Notes
Examples
>>> s.agg('min')
1
pandas.Series.align
method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None] Method to use for filling
holes in reindexed Series:
• pad / ffill: propagate last valid observation forward to next valid.
• backfill / bfill: use NEXT valid observation to fill gap.
limit [int, default None] If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is a gap with more than this
number of consecutive NaNs, it will only be partially filled. If method is not specified, this
is the maximum number of entries along the entire axis where NaNs will be filled. Must
be greater than 0 if not None.
fill_axis [{0 or ‘index’}, default 0] Filling axis, method and limit.
broadcast_axis [{0 or ‘index’}, default None] Broadcast values along this axis, if aligning
two objects of different dimensions.
Returns
(left, right) [(Series, type of other)] Aligned objects.
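A minimal sketch of an outer alignment (the default join; the Series are illustrative):
>>> s1 = pd.Series([1, 2], index=['a', 'b'])
>>> s2 = pd.Series([3, 4], index=['b', 'c'])
>>> left, right = s1.align(s2)
>>> left
a    1.0
b    2.0
c    NaN
dtype: float64
>>> right
a    NaN
b    3.0
c    4.0
dtype: float64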
pandas.Series.all
Examples
Series
DataFrames
Create a dataframe from a dictionary.
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df.all()
col1 True
col2 False
dtype: bool
>>> df.all(axis='columns')
0 True
1 False
dtype: bool
>>> df.all(axis=None)
False
pandas.Series.any
• 1 / ‘columns’ : reduce the columns, return a Series whose index is the original index.
• None : reduce all axes, return a scalar.
bool_only [bool, default None] Include only boolean columns. If None, will attempt to use
everything, then use only boolean data. Not implemented for Series.
skipna [bool, default True] Exclude NA/null values. If the entire row/column is NA and
skipna is True, then the result will be False, as for an empty row/column. If skipna is
False, then NA are treated as True, because these are not equal to zero.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical), count along
a particular level, collapsing into a scalar.
**kwargs [any, default None] Additional keywords have no effect but might be accepted for
compatibility with NumPy.
Returns
scalar or Series If level is specified, then a Series is returned; otherwise, a scalar is returned.
See also:
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
A B C
0 1 0 0
1 2 2 0
>>> df.any()
A True
B True
C False
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 2]})
>>> df.any(axis='columns')
0 True
1 True
dtype: bool
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df.any(axis='columns')
0 True
1 False
dtype: bool
>>> df.any(axis=None)
True
>>> pd.DataFrame([]).any()
Series([], dtype: bool)
pandas.Series.append
See also:
Notes
Iteratively appending to a Series can be more computationally intensive than a single concatenate. A better
solution is to append values to a list and then concatenate the list with the original Series all at once.
Examples
>>> s1 = pd.Series([1, 2, 3])
>>> s3 = pd.Series([4, 5, 6], index=[3, 4, 5])
>>> s1.append(s3)
0 1
1 2
2 3
3 4
4 5
5 6
dtype: int64
pandas.Series.apply
Examples
>>> s = pd.Series([20, 21, 12],
...               index=['London', 'New York', 'Helsinki'])
>>> s.apply(lambda x: x ** 2)
London 400
New York 441
Helsinki 144
dtype: int64
Define a custom function that needs additional positional arguments and pass these additional arguments
using the args keyword.
Define a custom function that takes keyword arguments and pass these arguments to apply.
>>> s.apply(np.log)
London 2.995732
New York 3.044522
Helsinki 2.484907
dtype: float64
pandas.Series.argmax
Examples
>>> s.argmax()
2
>>> s.argmin()
0
The maximum cereal calories is the third element and the minimum cereal calories is the first element,
since the series is zero-indexed.
pandas.Series.argmin
Examples
>>> s.argmax()
2
>>> s.argmin()
0
The maximum cereal calories is the third element and the minimum cereal calories is the first element,
since the series is zero-indexed.
pandas.Series.argsort
pandas.Series.asfreq
method [{‘backfill’/’bfill’, ‘pad’/’ffill’}, default None] Method to use for filling holes in
reindexed Series (note this does not fill NaNs that already were present):
• ‘pad’ / ‘ffill’: propagate last valid observation forward to next valid
• ‘backfill’ / ‘bfill’: use NEXT valid observation to fill.
how [{‘start’, ‘end’}, default end] For PeriodIndex only (see PeriodIndex.asfreq).
normalize [bool, default False] Whether to reset output index to midnight.
fill_value [scalar, optional] Value to use for missing values, applied during upsampling (note
this does not fill NaNs that already were present).
Returns
Same type as caller Object converted to the specified frequency.
See also:
Notes
To learn more about the frequency strings, please see this link.
Examples
>>> df.asfreq(freq='30S')
s
2000-01-01 00:00:00 0.0
2000-01-01 00:00:30 NaN
2000-01-01 00:01:00 NaN
2000-01-01 00:01:30 NaN
2000-01-01 00:02:00 2.0
2000-01-01 00:02:30 NaN
2000-01-01 00:03:00 3.0
pandas.Series.asof
Series.asof(where, subset=None)
Return the last row(s) without any NaNs before where.
The last row (for each element in where, if list) without any NaN is taken. In the case of a DataFrame, the
last row without NaN is taken considering only the subset of columns (if not None).
If there is no good value, NaN is returned for a Series, or a Series of NaN values for a DataFrame.
Parameters
where [date or array-like of dates] Date(s) before which the last row(s) are returned.
subset [str or array-like of str, default None] For DataFrame, if not None, only use these
columns to check for NaNs.
Returns
scalar, Series, or DataFrame The return can be:
• scalar : when self is a Series and where is a scalar
• Series: when self is a Series and where is an array-like, or when self is a DataFrame
and where is a scalar
• DataFrame : when self is a DataFrame and where is an array-like
Return scalar, Series, or DataFrame.
See also:
Notes
Examples
>>> s = pd.Series([1, 2, np.nan, 4], index=[10, 20, 30, 40])
>>> s.asof(20)
2.0
For a sequence where, a Series is returned. The first value is NaN, because the first element of where is
before the first index value.
Missing values are not considered. The following is 2.0, not NaN, even though NaN is at the index
location for 30.
>>> s.asof(30)
2.0
pandas.Series.astype
Examples
Create a DataFrame:
>>> df.astype('int32').dtypes
col1 int32
col2 int32
dtype: object
Create a series:
>>> ser.astype('category')
0 1
1 2
dtype: category
Categories (2, int64): [1, 2]
Note that using copy=False and changing data on a new pandas object may propagate changes:
Datetimes are localized to UTC first before converting to the specified timezone:
pandas.Series.at_time
Examples
>>> ts.at_time('12:00')
A
2018-04-09 12:00:00 2
2018-04-10 12:00:00 4
pandas.Series.autocorr
Series.autocorr(lag=1)
Compute the lag-N autocorrelation.
This method computes the Pearson correlation between the Series and its shifted self.
Parameters
lag [int, default 1] Number of lags to apply before performing autocorrelation.
Returns
float The Pearson correlation between self and self.shift(lag).
See also:
Notes
Examples
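A minimal sketch (return values truncated):
>>> s = pd.Series([0.25, 0.5, 0.2, -0.05])
>>> s.autocorr()  # doctest: +ELLIPSIS
0.10355...
>>> s.autocorr(lag=2)  # doctest: +ELLIPSIS
-0.99999...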
pandas.Series.backfill
pandas.Series.between
See also:
Notes
This function is equivalent to (left <= ser) & (ser <= right)
Examples
>>> s.between(1, 4)
0 True
1 False
2 True
3 False
4 False
dtype: bool
pandas.Series.between_time
Examples
You get the times that are not between two times by setting start_time later than end_time:
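A minimal sketch (the index values below are 1 day and 20 minutes apart):
>>> i = pd.date_range('2018-04-09', periods=4, freq='1D20min')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts.between_time('0:15', '0:45')
                     A
2018-04-10 00:20:00  2
2018-04-11 00:40:00  3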
pandas.Series.bfill
pandas.Series.bool
Series.bool()
Return the bool of a single element Series or DataFrame.
This must be a boolean scalar value, either True or False. It will raise a ValueError if the Series or
DataFrame does not have exactly 1 element, or that element is not boolean (integer values 0 and 1 will also
raise an exception).
Returns
bool The value in the Series or DataFrame.
See also:
Examples
The method will only work for single element objects with a boolean value:
>>> pd.Series([True]).bool()
True
>>> pd.Series([False]).bool()
False
pandas.Series.cat
Series.cat()
Accessor object for categorical properties of the Series values.
Be aware that assigning to categories is an inplace operation, while all methods return new categorical data
per default (but can be called with inplace=True).
Parameters
data [Series or CategoricalIndex]
Examples
>>> s = pd.Series(list("abbccc")).astype("category")
>>> s
0 a
1 b
2 b
3 c
4 c
5 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
>>> s.cat.categories
Index(['a', 'b', 'c'], dtype='object')
>>> s.cat.rename_categories(list("cba"))
0 c
1 b
2 b
3 a
4 a
5 a
dtype: category
Categories (3, object): ['c', 'b', 'a']
>>> s.cat.reorder_categories(list("cba"))
0 a
1 b
2 b
3 c
4 c
5 c
dtype: category
Categories (3, object): ['c', 'b', 'a']
>>> s.cat.set_categories(list("abcde"))
0 a
1 b
2 b
3 c
4 c
5 c
dtype: category
Categories (5, object): ['a', 'b', 'c', 'd', 'e']
>>> s.cat.as_ordered()
0 a
1 b
2 b
3 c
4 c
5 c
dtype: category
Categories (3, object): ['a' < 'b' < 'c']
>>> s.cat.as_unordered()
0 a
1 b
2 b
3 c
4 c
5 c
dtype: category
Categories (3, object): ['a', 'b', 'c']
pandas.Series.clip
axis [int or str axis name, optional] Align object with lower and upper along the given axis.
inplace [bool, default False] Whether to perform the operation in place on the data.
*args, **kwargs Additional keywords have no effect but might be accepted for compatibil-
ity with numpy.
Returns
Series or DataFrame or None Same type as calling object with the values outside the clip
boundaries replaced or None if inplace=True.
See also:
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
col_0 col_1
0 9 -2
1 -3 -7
2 0 6
3 -1 8
4 5 -5
>>> df.clip(-4, 6)
col_0 col_1
0 6 -2
1 -3 -4
2 0 6
3 -1 6
4 5 -4
Clips using specific lower and upper thresholds per column element:
pandas.Series.combine
Series.combine_first Combine Series values, choosing the calling Series’ values first.
Examples
Now, to combine the two datasets and view the highest speeds of the birds across the two datasets
In the previous example, the resulting value for duck is missing, because the maximum of a NaN and a
float is a NaN. So, in the example, we set fill_value=0, so the maximum value returned will be the
value from some dataset.
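A minimal sketch of the fill_value case described above (the bird speeds are illustrative):
>>> s1 = pd.Series({'falcon': 330.0, 'eagle': 160.0})
>>> s2 = pd.Series({'falcon': 345.0, 'eagle': 200.0, 'duck': 30.0})
>>> s1.combine(s2, max, fill_value=0)
duck       30.0
eagle     200.0
falcon    345.0
dtype: float64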
pandas.Series.combine_first
Series.combine_first(other)
Combine Series values, choosing the calling Series’s values first.
Parameters
other [Series] The value(s) to be combined with the Series.
Returns
Series The result of combining the Series with the other object.
See also:
Notes
Examples
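A minimal sketch:
>>> s1 = pd.Series([1, np.nan])
>>> s2 = pd.Series([3, 4])
>>> s1.combine_first(s2)
0    1.0
1    4.0
dtype: float64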
pandas.Series.compare
keep_shape [bool, default False] If true, all rows and columns are kept. Otherwise, only the
ones with different values are kept.
keep_equal [bool, default False] If true, the result keeps values that are equal. Otherwise,
equal values are shown as NaNs.
Returns
Series or DataFrame If axis is 0 or ‘index’ the result will be a Series. The resulting index
will be a MultiIndex with ‘self’ and ‘other’ stacked alternately at the inner level.
If axis is 1 or ‘columns’ the result will be a DataFrame. It will have two columns namely
‘self’ and ‘other’.
See also:
Notes
Examples
>>> s1 = pd.Series(["a", "b", "c", "d", "e"])
>>> s2 = pd.Series(["a", "a", "c", "b", "e"])
>>> s1.compare(s2)
self other
1 b a
3 d b
pandas.Series.convert_dtypes
Notes
By default, convert_dtypes will attempt to convert a Series (or each Series in a DataFrame)
to dtypes that support pd.NA. By using the options convert_string, convert_integer,
convert_boolean and convert_floating, it is possible to turn off individual conversions to
StringDtype, the integer extension types, BooleanDtype or floating extension types, respectively.
For object-dtyped columns, if infer_objects is True, use the inference rules as during normal
Series/DataFrame construction. Then, if possible, convert to StringDtype, BooleanDtype or an
appropriate integer or floating extension type, otherwise leave as object.
If the dtype is integer, convert to an appropriate integer extension type.
If the dtype is numeric, and consists of all integers, convert to an appropriate integer extension type.
Otherwise, convert to an appropriate floating extension type.
Changed in version 1.2: Starting with pandas 1.2, this method also converts float columns to the nullable
floating extension type.
In the future, as new dtypes are added that support pd.NA, the results of this method will change to
support those new dtypes.
Examples
>>> df = pd.DataFrame(
... {
... "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
... "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
... "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
... "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
... "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
... "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
... }
... )
>>> df.dtypes
a int32
b object
c object
d object
e float64
f float64
dtype: object
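The converted frame dfn inspected below is presumably obtained with:
>>> dfn = df.convert_dtypes()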
>>> dfn.dtypes
a Int32
b string
c boolean
d string
e Int64
f Float64
dtype: object
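The Series example that follows presumably starts from an object Series such as:
>>> s = pd.Series(["a", "b", np.nan])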
>>> s.convert_dtypes()
0 a
1 b
2 <NA>
dtype: string
pandas.Series.copy
Series.copy(deep=True)
Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data and
indices. Modifications to the data or indices of the copy will not be reflected in the original object (see
notes below).
When deep=False, a new object will be created without copying the calling object’s data or index
(only references to the data and index are copied). Any changes to the data of the original will be reflected
in the shallow copy (and vice versa).
Parameters
deep [bool, default True] Make a deep copy, including a copy of the data and the indices.
With deep=False neither the indices nor the data are copied.
Returns
copy [Series or DataFrame] Object type matches caller.
Notes
When deep=True, data is copied but actual Python objects will not be copied recursively, only the
reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively
copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for performance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not needed.
Examples
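The objects used below are not constructed above; presumably:
>>> s = pd.Series([1, 2], index=["a", "b"])
>>> deep = s.copy()
>>> shallow = s.copy(deep=False)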
Updates to the data shared by the shallow copy and the original are reflected in both; the deep copy remains unchanged.
>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a 3
b 4
dtype: int64
>>> shallow
a 3
b 4
dtype: int64
Note that when copying an object containing Python objects, a deep copy will copy the data, but will not
do so recursively. Updating a nested data object will be reflected in the deep copy.
pandas.Series.corr
Examples
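The example block is empty in this rendering; a sketch along the lines of the upstream docstring, using a callable correlation method:
>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> s1 = pd.Series([.2, .0, .6, .2])
>>> s2 = pd.Series([.3, .6, .0, .1])
>>> s1.corr(s2, method=histogram_intersection)
0.3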
pandas.Series.count
Series.count(level=None)
Return number of non-NA/null observations in the Series.
Parameters
level [int or level name, default None] If the axis is a MultiIndex (hierarchical), count
along a particular level, collapsing into a smaller Series.
Returns
int or Series (if level specified) Number of non-null values in the Series.
See also:
Examples
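A minimal sketch:
>>> s = pd.Series([0.0, 1.0, np.nan])
>>> s.count()
2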
pandas.Series.cov
Examples
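The example block is empty in this rendering; a quick sketch (the values here are illustrative, not from the upstream docstring):
>>> s1 = pd.Series([1, 2, 3])
>>> s2 = pd.Series([2, 4, 6])
>>> s1.cov(s2)
2.0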
pandas.Series.cummax
Examples
Series
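The Series s cumulated here is not shown above; presumably it is the same one built in the cumsum example further below (the DataFrame df used afterwards is likewise the one constructed there):
>>> s = pd.Series([2, np.nan, 5, -1, 0])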
>>> s.cummax()
0 2.0
1 NaN
2 5.0
3 5.0
4 5.0
dtype: float64
>>> s.cummax(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None
or axis='index'.
>>> df.cummax()
A B
0 2.0 1.0
1 3.0 NaN
2 3.0 1.0
To iterate over columns and find the maximum in each row, use axis=1
>>> df.cummax(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 1.0
pandas.Series.cummin
Examples
Series
>>> s.cummin()
0 2.0
1 NaN
2 2.0
3 -1.0
4 -1.0
dtype: float64
>>> s.cummin(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None
or axis='index'.
>>> df.cummin()
A B
0 2.0 1.0
1 2.0 NaN
2 1.0 0.0
To iterate over columns and find the minimum in each row, use axis=1
>>> df.cummin(axis=1)
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0
pandas.Series.cumprod
See also:
Examples
Series
>>> s.cumprod()
0 2.0
1 NaN
2 10.0
3 -10.0
4 -0.0
dtype: float64
>>> s.cumprod(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or
axis='index'.
>>> df.cumprod()
A B
0 2.0 1.0
1 6.0 NaN
2 6.0 0.0
To iterate over columns and find the product in each row, use axis=1
>>> df.cumprod(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 0.0
pandas.Series.cumsum
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0 2.0
1 NaN
2 5.0
3 -1.0
4 0.0
dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0],
... [3.0, np.nan],
... [1.0, 0.0]],
... columns=list('AB'))
>>> df
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0
By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or
axis='index'.
>>> df.cumsum()
A B
0 2.0 1.0
1 5.0 NaN
2 6.0 1.0
To iterate over columns and find the sum in each row, use axis=1
>>> df.cumsum(axis=1)
A B
pandas.Series.describe
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and
upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile
is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and
freq. The top is the most common value. The freq is the most common value's frequency. Timestamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen
from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric
columns. If the dataframe consists only of object and categorical data without any numeric columns,
the default is to return an analysis of both the object and categorical columns. If include='all' is
provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed
for the output. The parameters are ignored when analyzing a Series.
Examples
>>> s = pd.Series([
... np.datetime64("2000-01-01"),
... np.datetime64("2010-01-01"),
... np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count 3
mean 2006-09-01 08:00:00
min 2000-01-01 00:00:00
25% 2004-12-31 12:00:00
50% 2010-01-01 00:00:00
75% 2010-01-01 00:00:00
max 2010-01-01 00:00:00
dtype: object
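The frame analysed in the remaining examples is not constructed above; presumably it is the mixed-dtype frame from the upstream docstring:
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']})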
>>> df.describe(include='all')
categorical numeric object
count 3 3.0 3
unique 3 NaN 3
top f NaN a
freq 1 NaN 1
mean NaN 2.0 NaN
std NaN 1.0 NaN
min NaN 1.0 NaN
25% NaN 1.5 NaN
50% NaN 2.0 NaN
75% NaN 2.5 NaN
max NaN 3.0 NaN
>>> df.numeric.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
>>> df.describe(include=[np.number])
numeric
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
>>> df.describe(include=[object])
object
count 3
unique 3
top a
freq 1
>>> df.describe(include=['category'])
categorical
count 3
unique 3
top d
freq 1
>>> df.describe(exclude=[np.number])
categorical object
count 3 3
unique 3 3
top f a
freq 1 1
>>> df.describe(exclude=[object])
categorical numeric
count 3 3.0
unique 3 NaN
top f NaN
freq 1 NaN
mean NaN 2.0
std NaN 1.0
min NaN 1.0
25% NaN 1.5
pandas.Series.diff
Series.diff(periods=1)
First discrete difference of element.
Calculates the difference of a Series element compared with another element in the Series (default is
element in previous row).
Parameters
periods [int, default 1] Periods to shift for calculating difference, accepts negative values.
Returns
Series First differences of the Series.
See also:
Notes
For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated according to the current dtype in the Series; however, the dtype of the result is always float64.
Examples
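The Series differenced below is not shown above; presumably:
>>> s = pd.Series([1, 1, 2, 3, 5, 8])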
>>> s.diff(periods=3)
0 NaN
1 NaN
2 NaN
3 2.0
4 4.0
>>> s.diff(periods=-1)
0 0.0
1 -1.0
2 -1.0
3 -2.0
4 -3.0
5 NaN
dtype: float64
pandas.Series.div
Series.rtruediv Reverse of the Floating division operator, see Python documentation for more
details.
Examples
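The example block is empty in this rendering; a sketch matching the upstream docstring:
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> a.div(b, fill_value=0)
a    1.0
b    inf
c    inf
d    0.0
e    NaN
dtype: float64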
pandas.Series.divide
Series.rtruediv Reverse of the Floating division operator, see Python documentation for more
details.
Examples
pandas.Series.divmod
Series.rdivmod Reverse of the Integer division and modulo operator, see Python documentation for
more details.
Examples
pandas.Series.dot
Series.dot(other)
Compute the dot product between the Series and the columns of other.
This method computes the dot product between the Series and another one, or the Series and each column of a DataFrame, or the Series and each column of an array.
It can also be called using self @ other in Python >= 3.5.
Parameters
other [Series, DataFrame or array-like] The other object to compute the dot product with
its columns.
Returns
scalar, Series or numpy.ndarray The dot product of the Series and other if other is a Series; a Series of the dot products between the Series and each column of other if other is a DataFrame; or a numpy.ndarray of the dot products between the Series and each column if other is a numpy array.
See also:
Notes
The Series and other have to share the same index if other is a Series or a DataFrame.
Examples
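A minimal sketch:
>>> s = pd.Series([0, 1, 2, 3])
>>> other = pd.Series([-1, 2, -3, 4])
>>> s.dot(other)
8
>>> s @ other
8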
pandas.Series.drop
Examples
Drop labels B and C.
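The omitted snippet presumably resembles the upstream docstring example:
>>> s = pd.Series(data=np.arange(3), index=['A', 'B', 'C'])
>>> s.drop(labels=['B', 'C'])
A    0
dtype: int64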
pandas.Series.drop_duplicates
Series.drop_duplicates(keep='first', inplace=False)
Return Series with duplicate values removed.
Parameters
keep [{‘first’, ‘last’, False}, default ‘first’] Method to handle dropping duplicates:
• ‘first’ : Drop duplicates except for the first occurrence.
• ‘last’ : Drop duplicates except for the last occurrence.
• False : Drop all duplicates.
inplace [bool, default False] If True, performs operation inplace and returns None.
Returns
Series or None Series with duplicates dropped or None if inplace=True.
See also:
Examples
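The Series used in these examples is not shown above; presumably:
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'],
...               name='animal')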
With the ‘keep’ parameter, the selection behaviour of duplicated values can be changed. The value ‘first’
keeps the first occurrence for each set of duplicated entries. The default value of keep is ‘first’.
>>> s.drop_duplicates()
0 lama
1 cow
3 beetle
5 hippo
Name: animal, dtype: object
The value ‘last’ for parameter ‘keep’ keeps the last occurrence for each set of duplicated entries.
>>> s.drop_duplicates(keep='last')
1 cow
3 beetle
4 lama
The value False for parameter ‘keep’ discards all sets of duplicated entries. Setting the value of ‘inplace’
to True performs the operation inplace and returns None.
pandas.Series.droplevel
Series.droplevel(level, axis=0)
Return DataFrame with requested index / column level(s) removed.
New in version 0.24.0.
Parameters
level [int, str, or list-like] If a string is given, it must be the name of a level. If list-like, elements must be names or positional indexes of levels.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Axis along which the level(s) is removed:
• 0 or ‘index’: remove level(s) from the row index.
• 1 or ‘columns’: remove level(s) from the column index.
Returns
DataFrame DataFrame with requested index / column level(s) removed.
Examples
>>> df = pd.DataFrame([
... [1, 2, 3, 4],
... [5, 6, 7, 8],
... [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])
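>>> # (assumed missing step) give df the MultiIndex columns displayed below
>>> df.columns = pd.MultiIndex.from_tuples(
...     [('c', 'e'), ('d', 'f')], names=['level_1', 'level_2'])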
>>> df
level_1 c d
level_2 e f
a b
1 2 3 4
5 6 7 8
9 10 11 12
>>> df.droplevel('a')
level_1 c d
level_2 e f
b
2 3 4
6 7 8
10 11 12
pandas.Series.dropna
Examples
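The Series used here is not shown above; presumably:
>>> ser = pd.Series([1., 2., np.nan])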
>>> ser.dropna()
0 1.0
1 2.0
dtype: float64
>>> ser.dropna(inplace=True)
>>> ser
0 1.0
1 2.0
dtype: float64
pandas.Series.dt
Series.dt()
Accessor object for datetimelike properties of the Series values.
Examples
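The datetime Series used in these examples are not constructed above; presumably they are built from date ranges, for example:
>>> seconds_series = pd.Series(pd.date_range("2000-01-01", periods=3, freq="s"))
hours_series and quarters_series further below are presumably built the same way with freq="h" and freq="q".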
>>> seconds_series
0 2000-01-01 00:00:00
1 2000-01-01 00:00:01
2 2000-01-01 00:00:02
dtype: datetime64[ns]
>>> seconds_series.dt.second
0 0
1 1
2 2
dtype: int64
>>> hours_series
>>> quarters_series
0 2000-03-31
1 2000-06-30
2 2000-09-30
dtype: datetime64[ns]
>>> quarters_series.dt.quarter
0 1
1 2
2 3
dtype: int64
Returns a Series indexed like the original Series. Raises TypeError if the Series does not contain datetimelike values.
pandas.Series.duplicated
Series.duplicated(keep='first')
Indicate duplicate Series values.
Duplicated values are indicated as True values in the resulting Series. Either all duplicates, all except
the first or all except the last occurrence of duplicates can be indicated.
Parameters
keep [{‘first’, ‘last’, False}, default ‘first’] Method to handle dropping duplicates:
• ‘first’ : Mark duplicates as True except for the first occurrence.
• ‘last’ : Mark duplicates as True except for the last occurrence.
• False : Mark all duplicates as True.
Returns
Series Series indicating whether each value has occurred in the preceding values.
See also:
Examples
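The Series used in these examples is not shown above; presumably:
>>> animals = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama'])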
By default, for each set of duplicated values, the first occurrence is set to False and all others to True:
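>>> animals.duplicated()
0    False
1    False
2     True
3    False
4     True
dtype: bool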
which is equivalent to
>>> animals.duplicated(keep='first')
0 False
1 False
2 True
3 False
4 True
dtype: bool
By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True:
>>> animals.duplicated(keep='last')
0 True
1 False
2 True
3 False
4 False
dtype: bool
>>> animals.duplicated(keep=False)
0 True
1 False
2 True
3 False
4 True
dtype: bool
pandas.Series.eq
Examples
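The example block is empty in this rendering; a sketch matching the upstream docstring:
>>> a = pd.Series([1, 1, 1, np.nan], index=['a', 'b', 'c', 'd'])
>>> b = pd.Series([1, np.nan, 1, np.nan], index=['a', 'b', 'd', 'e'])
>>> a.eq(b, fill_value=0)
a     True
b    False
c    False
d    False
e    False
dtype: bool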
pandas.Series.equals
Series.equals(other)
Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against each other to see if they have the
same shape and elements. NaNs in the same location are considered equal.
The row/column index do not need to have the same type, as long as the values are considered equal.
Corresponding columns must be of the same dtype.
Parameters
other [Series or DataFrame] The other Series or DataFrame to be compared with the first.
Returns
bool True if all elements are the same in both objects, False otherwise.
See also:
Series.eq Compare two Series objects of the same length and return a Series where each element is
True if the element in each Series is equal, False otherwise.
DataFrame.eq Compare two DataFrame objects of the same shape and return a DataFrame where
each element is True if the respective element in each DataFrame is equal, False otherwise.
testing.assert_series_equal Raises an AssertionError if left and right are not equal. Provides
an easy interface to ignore inequality in dtypes, indexes and precision among others.
testing.assert_frame_equal Like assert_series_equal, but targets DataFrames.
numpy.array_equal Return True if two arrays have the same shape and elements, False otherwise.
Examples
DataFrames df and exactly_equal have the same types and values for their elements and column labels,
which will return True.
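The frames discussed here are not constructed above; presumably they follow the upstream docstring, for example:
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> df.equals(exactly_equal)
True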
DataFrames df and different_column_type have the same element types and values, but have different
types for the column labels, which will still return True.
DataFrames df and different_data_type have different types for the same values for their elements, and
will return False even though their column labels are the same values and types.
pandas.Series.ewm
ignore_na [bool, default False] Ignore missing values when calculating weights; specify
True to reproduce pre-0.15.0 behavior.
• When ignore_na=False (default), weights are based on absolute positions. For example, the weights of x0 and x2 used in calculating the final weighted average of [x0, None, x2] are (1 − α)^2 and 1 if adjust=True, and (1 − α)^2 and α if adjust=False.
• When ignore_na=True (reproducing pre-0.15.0 behavior), weights are based on relative positions. For example, the weights of x0 and x2 used in calculating the final weighted average of [x0, None, x2] are 1 − α and 1 if adjust=True, and 1 − α and α if adjust=False.
axis [{0, 1}, default 0] The axis to use. The value 0 identifies the rows, and 1 identifies
the columns.
Notes
Examples
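The frame used below is not shown above; presumably:
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})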
>>> df.ewm(com=0.5).mean()
B
0 0.000000
1 0.750000
2 1.615385
3 1.615385
4 3.670213
pandas.Series.expanding
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the
window by setting center=True.
Examples
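The frame used below is not shown above; presumably the same one as in the ewm example:
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})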
>>> df.expanding(2).sum()
B
0 NaN
1 1.0
2 3.0
3 3.0
4 7.0
pandas.Series.explode
Series.explode(ignore_index=False)
Transform each element of a list-like to a row.
New in version 0.25.0.
Parameters
ignore_index [bool, default False] If True, the resulting index will be labeled 0, 1, . . . , n
- 1.
New in version 1.1.0.
Returns
Series Exploded lists to rows; index will be duplicated for these rows.
See also:
Notes
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype
of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in
a np.nan for that row. In addition, the ordering of elements in the output will be non-deterministic when
exploding sets.
Examples
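The Series exploded below is not shown above; presumably:
>>> s = pd.Series([[1, 2, 3], 'foo', [], [3, 4]])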
>>> s.explode()
0 1
0 2
0 3
1 foo
2 NaN
3 3
3 4
dtype: object
pandas.Series.factorize
Series.factorize(sort=False, na_sentinel=- 1)
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying
distinct values. factorize is available as both a top-level function pandas.factorize(), and as a
method Series.factorize() and Index.factorize().
Parameters
sort [bool, default False] Sort uniques and shuffle codes to maintain the relationship.
na_sentinel [int or None, default -1] Value to mark “not found”. If None, will not drop
the NaN from the uniques of the values.
Changed in version 1.1.2.
Returns
codes [ndarray] An integer ndarray that’s an indexer into uniques. uniques.take(codes) will have the same values as values.
uniques [ndarray, Index, or Categorical] The unique valid values. When values is Categorical, uniques is a Categorical. When values is some other pandas object, an Index is returned. Otherwise, a 1-D ndarray is returned.
Note: Even if there’s a missing value in values, uniques will not contain an entry for
it.
See also:
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results
are identical for methods like Series.factorize().
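A minimal sketch of the basic case:
>>> codes, uniques = pd.factorize(['b', 'b', 'a', 'c', 'b'])
>>> codes
array([0, 0, 1, 2, 0])
>>> uniques
array(['b', 'a', 'c'], dtype=object)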
With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.
Missing values are indicated in codes with na_sentinel (-1 by default). Note that missing values are never
included in uniques.
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing
pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by
setting na_sentinel=None.
pandas.Series.ffill
pandas.Series.fillna
Examples
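The frame used in these examples is not shown above; presumably:
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))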
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
A B C D
0 0.0 2.0 2.0 0
1 3.0 4.0 2.0 1
2 0.0 1.0 2.0 5
3 0.0 3.0 2.0 4
pandas.Series.filter
See also:
Notes
The items, like, and regex parameters are enforced to be mutually exclusive.
axis defaults to the info axis that is used when indexing with [].
Examples
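The example block is empty in this rendering; a minimal sketch for a Series:
>>> s = pd.Series([1, 2, 3], index=['one', 'two', 'three'])
>>> s.filter(like='t')
two      2
three    3
dtype: int64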
pandas.Series.first
Series.first(offset)
Select initial periods of time series data based on a date offset.
When having a DataFrame with dates as index, this function can select the first few rows based on a date
offset.
Parameters
offset [str, DateOffset or dateutil.relativedelta] The offset length of the data that will be
selected. For instance, ‘1M’ will display all the rows having their index within the
first month.
Returns
Series or DataFrame A subset of the caller.
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
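The time series used below is not shown above; presumably:
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)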
>>> ts.first('3D')
A
2018-04-09 1
2018-04-11 2
Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the dataset, and therefore data for 2018-04-13 was not returned.
pandas.Series.first_valid_index
Series.first_valid_index()
Return index for first non-NA/null value.
Returns
scalar [type of index]
Notes
If all elements are non-NA/null, returns None. Also returns None for empty Series/DataFrame.
pandas.Series.floordiv
Series.rfloordiv Reverse of the Integer division operator, see Python documentation for more
details.
Examples
pandas.Series.ge
Examples
pandas.Series.get
Series.get(key, default=None)
Get item from object for given key (ex: DataFrame column).
Returns default value if not found.
Parameters
key [object]
Returns
value [same type as items contained in object]
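A quick sketch (the fallback value used here is illustrative):
>>> s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
>>> s.get('b')
2
>>> s.get('z', default=-1)
-1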
pandas.Series.groupby
dropna [bool, default True] If True, and if group keys contain NA values, NA values
together with row/column will be dropped. If False, NA values will also be treated as
the key in groups
New in version 1.1.0.
Returns
SeriesGroupBy Returns a groupby object that contains information about the groups.
See also:
resample Convenience method for frequency conversion and resampling of time series.
Notes
Examples
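The Series grouped below is not shown above; presumably:
>>> ser = pd.Series([390., 350., 30., 20.],
...                 index=['Falcon', 'Falcon', 'Parrot', 'Parrot'],
...                 name="Max Speed")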
>>> ser
Falcon 390.0
Falcon 350.0
Parrot 30.0
Parrot 20.0
Name: Max Speed, dtype: float64
>>> ser.groupby(["a", "b", "a", "b"]).mean()
a 210.0
b 185.0
Name: Max Speed, dtype: float64
>>> ser.groupby(level=0).mean()
Falcon 370.0
Parrot 25.0
Name: Max Speed, dtype: float64
>>> ser.groupby(ser > 100).mean()
Max Speed
False 25.0
True 370.0
Name: Max Speed, dtype: float64
Grouping by Indexes
We can groupby different levels of a hierarchical index using the level parameter:
>>> arrays = [['Falcon', 'Falcon', 'Parrot', 'Parrot'],
... ['Captive', 'Wild', 'Captive', 'Wild']]
>>> index = pd.MultiIndex.from_arrays(arrays, names=('Animal', 'Type'))
>>> ser = pd.Series([390., 350., 30., 20.], index=index, name="Max Speed")
>>> ser
Animal Type
Falcon Captive 390.0
Wild 350.0
Parrot Captive 30.0
Wild 20.0
We can also choose to include NA in group keys or not by defining dropna parameter, the default setting
is True:
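>>> ser = pd.Series([1, 2, 3, 3], index=["a", "a", "b", np.nan])
>>> ser.groupby(level=0).sum()
a    3
b    3
dtype: int64
>>> ser.groupby(level=0, dropna=False).sum()
a      3
b      3
NaN    3
dtype: int64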
pandas.Series.gt
Examples
pandas.Series.head
Series.head(n=5)
Return the first n rows.
This function returns the first n rows for the object based on position. It is useful for quickly testing if
your object has the right type of data in it.
For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
Parameters
n [int, default 5] Number of rows to select.
Returns
same type as caller The first n rows of the caller object.
See also:
Examples
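The frame used below is not shown above; presumably:
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                    'monkey', 'parrot', 'shark', 'whale', 'zebra']})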
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
pandas.Series.hist
xlabelsize [int, default None] If specified changes the x-axis label size.
xrot [float, default None] Rotation of x axis labels.
ylabelsize [int, default None] If specified changes the y-axis label size.
yrot [float, default None] Rotation of y axis labels.
figsize [tuple, default None] Figure size in inches by default.
bins [int or sequence, default 10] Number of histogram bins to be used. If an integer is
given, bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin
edges, including left edge of first bin and right edge of last bin. In this case, bins is
returned unmodified.
backend [str, default None] Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
New in version 1.0.0.
legend [bool, default False] Whether to show the legend.
New in version 1.1.0.
**kwargs To be passed to the actual plotting function.
Returns
matplotlib.AxesSubplot A histogram plot.
See also:
pandas.Series.idxmax
numpy.argmax Return indices of the maximum values along the given axis.
DataFrame.idxmax Return index of first occurrence of maximum over requested axis.
Series.idxmin Return index label of the first occurrence of minimum of values.
Notes
This method is the Series version of ndarray.argmax. This method returns the label of the maximum,
while ndarray.argmax returns the position. To get the position, use series.values.argmax().
Examples
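The Series used below is not shown above; presumably:
>>> s = pd.Series(data=[1, None, 4, 3, 4],
...               index=['A', 'B', 'C', 'D', 'E'])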
>>> s.idxmax()
'C'
If skipna is False and there is an NA value in the data, the function returns nan.
>>> s.idxmax(skipna=False)
nan
pandas.Series.idxmin
numpy.argmin Return indices of the minimum values along the given axis.
DataFrame.idxmin Return index of first occurrence of minimum over requested axis.
Series.idxmax Return index label of the first occurrence of maximum of values.
Notes
This method is the Series version of ndarray.argmin. This method returns the label of the minimum,
while ndarray.argmin returns the position. To get the position, use series.values.argmin().
Examples
>>> s.idxmin()
'A'
If skipna is False and there is an NA value in the data, the function returns nan.
>>> s.idxmin(skipna=False)
nan
pandas.Series.infer_objects
Series.infer_objects()
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns unchanged. The inference rules are the same as during normal Series/DataFrame construction.
Returns
converted [same type as input object]
See also:
Examples
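The frame used below is not shown above; presumably:
>>> df = pd.DataFrame({"A": ["a", 1, 2, 3]})
>>> df = df.iloc[1:]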
>>> df.dtypes
A object
dtype: object
>>> df.infer_objects().dtypes
A int64
dtype: object
pandas.Series.interpolate
If limit is specified:
• If ‘method’ is ‘pad’ or ‘ffill’, ‘limit_direction’ must be ‘forward’.
• If ‘method’ is ‘backfill’ or ‘bfill’, ‘limit_direction’ must be ‘backward’.
If ‘limit’ is not specified:
• If ‘method’ is ‘backfill’ or ‘bfill’, the default is ‘backward’
• else the default is ‘forward’
Changed in version 1.1.0: raises ValueError if limit_direction is ‘forward’ or ‘both’
and method is ‘backfill’ or ‘bfill’. raises ValueError if limit_direction is ‘backward’
or ‘both’ and method is ‘pad’ or ‘ffill’.
limit_area [{None, ‘inside’, ‘outside’}, default None] If limit is specified, consecutive NaNs will be filled with this restriction.
• None: No fill restriction.
• ‘inside’: Only fill NaNs surrounded by valid values (interpolate).
• ‘outside’: Only fill NaNs outside valid values (extrapolate).
downcast [optional, ‘infer’ or None, defaults to None] Downcast dtypes if possible.
**kwargs [optional] Keyword arguments to pass on to the interpolating function.
Returns
Series or DataFrame or None Returns the same object type as the caller, interpolated at
some or all NaN values or None if inplace=True.
See also:
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the
respective SciPy implementations of similar names. These use the actual numerical values of the index.
For more information on their behavior, see the SciPy documentation and SciPy tutorial.
Examples
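A minimal sketch of linear interpolation on a Series:
>>> s = pd.Series([0, 1, np.nan, 3])
>>> s.interpolate()
0    0.0
1    1.0
2    2.0
3    3.0
dtype: float64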
Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.
Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’ methods
require that you also specify an order (int).
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use
for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is no entry before it
to use for interpolation.
pandas.Series.isin
Series.isin(values)
Whether elements in Series are contained in values.
Return a boolean Series showing whether each element in the Series matches an element in the passed
sequence of values exactly.
Parameters
values [set or list-like] The sequence of values to test. Passing in a single string will raise
a TypeError. Instead, turn a single string into a list of one element.
Returns
Series Series of booleans indicating if each element is in values.
Raises
TypeError
• If values is a string
See also:
Examples
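The Series used below is not shown above; presumably:
>>> s = pd.Series(['lama', 'cow', 'lama', 'beetle', 'lama', 'hippo'],
...               name='animal')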
Passing a single string as s.isin('lama') will raise an error. Use a list of one element instead:
>>> s.isin(['lama'])
0 True
1 False
2 True
3 False
4 True
5 False
Name: animal, dtype: bool
pandas.Series.isna
Series.isna()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns
Series Mask of bool values for each element in Series that indicates whether an element
is an NA value.
See also:
Examples
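The objects used below are not constructed above; presumably:
>>> df = pd.DataFrame(dict(age=[5, 6, np.NaN],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> ser = pd.Series([5, 6, np.NaN])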
>>> df.isna()
age born name toy
0 False True False True
1 False False False False
2 True False False False
>>> ser.isna()
0 False
1 False
2 True
dtype: bool
pandas.Series.isnull
Series.isnull()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns
Series Mask of bool values for each element in Series that indicates whether an element
is an NA value.
See also:
Examples
>>> df.isna()
age born name toy
0 False True False True
1 False False False False
2 True False False False
>>> ser.isna()
0 False
1 False
2 True
dtype: bool
pandas.Series.item
Series.item()
Return the first element of the underlying data as a Python scalar.
Returns
scalar The first element of the Series.
Raises
ValueError If the data is not length-1.
pandas.Series.items
Series.items()
Lazily iterate over (index, value) tuples.
This method returns an iterable tuple (index, value). This is convenient if you want to create a lazy iterator.
Returns
iterable Iterable of tuples containing the (index, value) pairs from a Series.
See also:
Examples
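A minimal sketch:
>>> s = pd.Series(['A', 'B', 'C'])
>>> for index, value in s.items():
...     print(f"Index : {index}, Value : {value}")
Index : 0, Value : A
Index : 1, Value : B
Index : 2, Value : C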
pandas.Series.iteritems
Series.iteritems()
Lazily iterate over (index, value) tuples.
This method returns an iterable tuple (index, value). This is convenient if you want to create a lazy iterator.
Returns
iterable Iterable of tuples containing the (index, value) pairs from a Series.
See also:
Examples
pandas.Series.keys
Series.keys()
Return alias for index.
Returns
Index Index of the Series.
pandas.Series.kurt
pandas.Series.kurtosis
pandas.Series.last
Series.last(offset)
Select final periods of time series data based on a date offset.
When having a DataFrame with dates as index, this function can select the last few rows based on a date
offset.
Parameters
offset [str, DateOffset, dateutil.relativedelta] The offset length of the data that will be
selected. For instance, ‘3D’ will display all the rows having their index within the last
3 days.
Returns
Series or DataFrame A subset of the caller.
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
>>> ts.last('3D')
A
2018-04-13 3
2018-04-15 4
Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset, and therefore data for 2018-04-11 was not returned.
pandas.Series.last_valid_index
Series.last_valid_index()
Return index for last non-NA/null value.
Returns
scalar [type of index]
Notes
If all elements are non-NA/null, returns None. Also returns None for empty Series/DataFrame.
pandas.Series.le
Examples
pandas.Series.lt
Examples
pandas.Series.mad
pandas.Series.map
Series.map(arg, na_action=None)
Map values of Series according to an input correspondence.
Used for substituting each value in a Series with another value, which may be derived from a function, a dict or a Series.
Parameters
arg [function, collections.abc.Mapping subclass or Series] Mapping correspondence.
na_action [{None, ‘ignore’}, default None] If ‘ignore’, propagate NaN values, without
passing them to the mapping correspondence.
Returns
Series Same index as caller.
See also:
Notes
When arg is a dictionary, values in Series that are not in the dictionary (as keys) are converted to NaN.
However, if the dictionary is a dict subclass that defines __missing__ (i.e. provides a method for
default values), then this default is used rather than NaN.
Examples
map accepts a dict or a Series. Values that are not found in the dict are converted to NaN, unless
the dict has a default value (e.g. defaultdict):
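>>> s = pd.Series(['cat', 'dog', np.nan, 'rabbit'])
>>> s.map({'cat': 'kitten', 'dog': 'puppy'})
0    kitten
1     puppy
2       NaN
3       NaN
dtype: object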
To avoid applying the function to missing values (and keep them as NaN) na_action='ignore' can
be used:
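>>> s.map('I am a {}'.format, na_action='ignore')
0       I am a cat
1       I am a dog
2              NaN
3    I am a rabbit
dtype: object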
pandas.Series.mask
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if
cond is False the element is used; otherwise the corresponding element from the DataFrame other
is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m,
df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
>>> s.mask(s > 0)
0 0.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
pandas.Series.max
Examples
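The Series used below is not shown above; presumably it is the MultiIndexed legs Series from the upstream docstring:
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)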
>>> s.max()
8
>>> s.max(level='blooded')
blooded
warm 4
cold 8
Name: legs, dtype: int64
>>> s.max(level=0)
blooded
warm 4
cold 8
Name: legs, dtype: int64
pandas.Series.mean
pandas.Series.median
pandas.Series.memory_usage
Series.memory_usage(index=True, deep=False)
Return the memory usage of the Series.
The memory usage can optionally include the contribution of the index and of elements of object dtype.
Parameters
index [bool, default True] Specifies whether to include the memory usage of the Series
index.
deep [bool, default False] If True, introspect the data deeply by interrogating object
dtypes for system-level memory consumption, and include it in the returned value.
Returns
int Bytes of memory consumed.
See also:
Examples
>>> s = pd.Series(range(3))
>>> s.memory_usage()
152
Not including the index gives the size of the rest of the data, which is necessarily smaller:
>>> s.memory_usage(index=False)
24
pandas.Series.min
Examples
>>> s.min()
0
>>> s.min(level='blooded')
blooded
warm 2
cold 0
Name: legs, dtype: int64
>>> s.min(level=0)
blooded
warm 2
cold 0
Name: legs, dtype: int64
pandas.Series.mod
Series.rmod Reverse of the Modulo operator, see Python documentation for more details.
Examples
pandas.Series.mode
Series.mode(dropna=True)
Return the mode(s) of the Series.
The mode is the value that appears most often. There can be multiple modes.
Always returns Series even if only one value is returned.
Parameters
dropna [bool, default True] Don’t consider counts of NaN/NaT.
New in version 0.24.0.
Returns
Series Modes of the Series in sorted order.
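A quick sketch (illustrative values, not from the upstream docstring):
>>> pd.Series([2, 4, 2, 2, 4, None]).mode()
0    2.0
dtype: float64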
pandas.Series.mul
fill_value [None or float value, default None (NaN)] Fill existing missing (NaN) values,
and any new element needed for successful Series alignment, with this value before
computation. If data in both corresponding Series locations is missing the result of
filling (at that location) will be missing.
level [int or name] Broadcast across a level, matching Index values on the passed Multi-
Index level.
Returns
Series The result of the operation.
See also:
Series.rmul Reverse of the Multiplication operator, see Python documentation for more details.
Examples
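The example block is empty in this rendering; a minimal sketch:
>>> a = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
>>> b = pd.Series([4, 5, 6], index=['a', 'b', 'c'])
>>> a.mul(b)
a     4
b    10
c    18
dtype: int64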
pandas.Series.multiply
level [int or name] Broadcast across a level, matching Index values on the passed Multi-
Index level.
Returns
Series The result of the operation.
See also:
Series.rmul Reverse of the Multiplication operator, see Python documentation for more details.
Examples
pandas.Series.ne
Examples
pandas.Series.nlargest
Series.nlargest(n=5, keep='first')
Return the largest n elements.
Parameters
n [int, default 5] Return this many descending sorted values.
keep [{‘first’, ‘last’, ‘all’}, default ‘first’] When there are duplicate values that cannot all
fit in a Series of n elements:
• first : return the first n occurrences in order of appearance.
• last : return the last n occurrences in reverse order of appearance.
• all : keep all occurrences. This can result in a Series of size larger than n.
Returns
Series The n largest values in the Series, sorted in decreasing order.
See also:
Notes
Examples
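The population Series used in these examples is not constructed above; presumably:
>>> countries_population = {"Italy": 59000000, "France": 65000000,
...                         "Malta": 434000, "Maldives": 434000,
...                         "Brunei": 434000, "Iceland": 337000,
...                         "Nauru": 11300, "Tuvalu": 11300,
...                         "Anguilla": 11300, "Montserrat": 5200}
>>> s = pd.Series(countries_population)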
>>> s.nlargest()
France 65000000
Italy 59000000
Malta 434000
Maldives 434000
Brunei 434000
dtype: int64
The n largest elements where n=3. Default keep value is ‘first’ so Malta will be kept.
>>> s.nlargest(3)
France 65000000
Italy 59000000
Malta 434000
dtype: int64
The n largest elements where n=3 and keeping the last duplicates. Brunei will be kept since it is the last
with value 434000 based on the index order.
The n largest elements where n=3 with all duplicates kept. Note that the returned Series has five elements
due to the three duplicates.
pandas.Series.notna
Series.notna()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
Returns
Series Mask of bool values for each element in Series that indicates whether an element
is not an NA value.
See also:
Examples
>>> df.notna()
age born name toy
0 True False True False
1 True True True True
2 False True True True
>>> ser.notna()
0 True
1 True
2 False
dtype: bool
pandas.Series.notnull
Series.notnull()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN, get mapped to False values.
Returns
Series Mask of bool values for each element in Series that indicates whether an element
is not an NA value.
See also:
Examples
>>> df.notna()
age born name toy
>>> ser.notna()
0 True
1 True
2 False
dtype: bool
pandas.Series.nsmallest
Series.nsmallest(n=5, keep='first')
Return the smallest n elements.
Parameters
n [int, default 5] Return this many ascending sorted values.
keep [{‘first’, ‘last’, ‘all’}, default ‘first’] When there are duplicate values that cannot all
fit in a Series of n elements:
• first : return the first n occurrences in order of appearance.
• last : return the last n occurrences in reverse order of appearance.
• all : keep all occurrences. This can result in a Series of size larger than n.
Returns
Series The n smallest values in the Series, sorted in increasing order.
See also:
Notes
Faster than .sort_values().head(n) for small n relative to the size of the Series object.
Examples
>>> s.nsmallest()
Montserrat 5200
Nauru 11300
Tuvalu 11300
Anguilla 11300
Iceland 337000
dtype: int64
The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.
>>> s.nsmallest(3)
Montserrat 5200
Nauru 11300
Tuvalu 11300
dtype: int64
The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be kept
since they are the last with value 11300 based on the index order.
The n smallest elements where n=3 with all duplicates kept. Note that the returned Series has four
elements due to the three duplicates.
pandas.Series.nunique
Series.nunique(dropna=True)
Return number of unique elements in the object.
Excludes NA values by default.
Parameters
dropna [bool, default True] Don’t include NaN in the count.
Returns
int
See also:
Examples
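The Series used below is not shown above; presumably:
>>> s = pd.Series([1, 3, 5, 7, 7])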
>>> s.nunique()
4
pandas.Series.pad
pandas.Series.pct_change
Examples
Series
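The Series used below is not shown above; presumably:
>>> s = pd.Series([90, 91, 85])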
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
>>> s.pct_change(periods=2)
0 NaN
1 NaN
2 -0.055556
dtype: float64
See the percentage change in a Series where NAs are filled with the last valid observation, carried forward to the next valid one.
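Presumably the Series here contains a missing value:
>>> s = pd.Series([90, 91, None, 85])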
>>> s.pct_change(fill_method='ffill')
0 NaN
1 0.011111
2 0.000000
3 -0.065934
dtype: float64
DataFrame
Percentage change in French franc, Deutsche Mark, and Italian lira from 1980-01-01 to 1980-03-01.
>>> df = pd.DataFrame({
... 'FR': [4.0405, 4.0963, 4.3149],
... 'GR': [1.7246, 1.7482, 1.8519],
... 'IT': [804.74, 810.01, 860.13]},
... index=['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df
FR GR IT
1980-01-01 4.0405 1.7246 804.74
1980-02-01 4.0963 1.7482 810.01
1980-03-01 4.3149 1.8519 860.13
>>> df.pct_change()
FR GR IT
1980-01-01 NaN NaN NaN
1980-02-01 0.013810 0.013684 0.006549
1980-03-01 0.053365 0.059318 0.061876
Percentage of change in GOOG and APPL stock volume. Shows computing the percentage change be-
tween columns.
>>> df = pd.DataFrame({
... '2016': [1769950, 30586265],
... '2015': [1500923, 40912316],
... '2014': [1371819, 41403351]},
... index=['GOOG', 'APPL'])
>>> df
2016 2015 2014
GOOG 1769950 1500923 1371819
APPL 30586265 40912316 41403351
>>> df.pct_change(axis='columns')
2016 2015 2014
GOOG NaN -0.151997 -0.086016
APPL NaN 0.337604 0.012002
pandas.Series.pipe
Notes
Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead
of writing
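>>> func(g(h(df), arg1=a), arg2=b, arg3=c)
you can write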
>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe(func, arg2=b, arg3=c)
... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which
keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe((func, 'arg2'), arg1=a, arg3=c)
... )
pandas.Series.plot
Series.plot(*args, **kwargs)
Make plots of Series or DataFrame.
Uses the backend specified by the option plotting.backend. By default, matplotlib is used.
Parameters
data [Series or DataFrame] The object for which the method is called.
x [label or position, default None] Only used if data is a DataFrame.
y [label, position or list of label, positions, default None] Allows plotting of one column
versus another. Only used if data is a DataFrame.
kind [str] The kind of plot to produce:
• ‘line’ : line plot (default)
• ‘bar’ : vertical bar plot
• ‘barh’ : horizontal bar plot
• ‘hist’ : histogram
• ‘box’ : boxplot
• ‘kde’ : Kernel Density Estimation plot
• ‘density’ : same as ‘kde’
• ‘area’ : area plot
• ‘pie’ : pie plot
• ‘scatter’ : scatter plot
• ‘hexbin’ : hexbin plot.
ax [matplotlib axes object, default None] An axes of the current figure.
subplots [bool, default False] Make separate subplots for each column.
sharex [bool, default True if ax is None else False] In case subplots=True, share x axis and set some x axis labels to invisible; defaults to True if ax is None, otherwise False if an ax is passed in. Be aware that passing in both an ax and sharex=True will alter all x axis labels for all axes in a figure.
sharey [bool, default False] In case subplots=True, share y axis and set some y axis
labels to invisible.
layout [tuple, optional] (rows, columns) for the layout of subplots.
figsize [a tuple (width, height) in inches] Size of a figure object.
use_index [bool, default True] Use index as ticks for x axis.
title [str or list] Title to use for the plot. If a string is passed, print the string at the top of
the figure. If a list is passed and subplots is True, print each item in the list above the
corresponding subplot.
grid [bool, default None (matlab style default)] Axis grid lines.
legend [bool or {‘reverse’}] Place legend on axis subplots.
style [list or dict] The matplotlib line style per column.
logx [bool or ‘sym’, default False] Use log scaling or symlog scaling on x axis. Changed in version 0.25.0.
logy [bool or ‘sym’, default False] Use log scaling or symlog scaling on y axis. Changed in version 0.25.0.
loglog [bool or ‘sym’, default False] Use log scaling or symlog scaling on both x and y axes. Changed in version 0.25.0.
xticks [sequence] Values to use for the xticks.
yticks [sequence] Values to use for the yticks.
xlim [2-tuple/list] Set the x limits of the current axes.
ylim [2-tuple/list] Set the y limits of the current axes.
xlabel [label, optional] Name to use for the xlabel on x-axis. Default uses index name as
xlabel, or the x-column name for planar plots.
New in version 1.1.0.
Changed in version 1.2.0: Now applicable to planar plots (scatter, hexbin).
ylabel [label, optional] Name to use for the ylabel on y-axis. Default will show no ylabel,
or the y-column name for planar plots.
New in version 1.1.0.
Changed in version 1.2.0: Now applicable to planar plots (scatter, hexbin).
rot [int, default None] Rotation for ticks (xticks for vertical, yticks for horizontal plots).
fontsize [int, default None] Font size for xticks and yticks.
colormap [str or matplotlib colormap object, default None] Colormap to select colors
from. If string, load colormap with that name from matplotlib.
colorbar [bool, optional] If True, plot colorbar (only relevant for ‘scatter’ and ‘hexbin’
plots).
position [float] Specify relative alignments for bar plot layout. From 0 (left/bottom-end)
to 1 (right/top-end). Default is 0.5 (center).
table [bool, Series or DataFrame, default False] If True, draw a table using the data in
the DataFrame and the data will be transposed to meet matplotlib’s default layout. If
a Series or DataFrame is passed, use passed data to draw a table.
yerr [DataFrame, Series, array-like, dict and str] See Plotting with Error Bars for detail.
xerr [DataFrame, Series, array-like, dict and str] Equivalent to yerr.
stacked [bool, default False in line and bar plots, and True in area plot] If True, create
stacked plot.
sort_columns [bool, default False] Sort column names to determine plot ordering.
secondary_y [bool or sequence, default False] Whether to plot on the secondary y-axis; if a list/tuple, which columns to plot on the secondary y-axis.
mark_right [bool, default True] When using a secondary_y axis, automatically mark the
column labels with “(right)” in the legend.
include_bool [bool, default is False] If True, boolean values can be plotted.
backend [str, default None] Backend to use instead of the backend specified in the option plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the plotting.backend for the whole session, set pd.options.plotting.backend.
New in version 1.0.0.
**kwargs Options to pass to matplotlib plotting method.
Returns
matplotlib.axes.Axes or numpy.ndarray of them If the backend is not the de-
fault matplotlib one, the return value will be the object returned by the backend.
Notes
pandas.Series.pop
Series.pop(item)
Return item and drop it from the series. Raise KeyError if not found.
Parameters
item [label] Index of the element that needs to be removed.
Returns
Value that is popped from series.
Examples
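The Series used below is not shown above; presumably:
>>> ser = pd.Series([1, 2, 3])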
>>> ser.pop(0)
1
>>> ser
1 2
2 3
dtype: int64
pandas.Series.pow
Series.rpow Reverse of the Exponential power operator, see Python documentation for more details.
Examples
pandas.Series.prod
Examples
>>> pd.Series([]).prod()
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
pandas.Series.product
Examples
>>> pd.Series([]).prod()
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
pandas.Series.quantile
Series.quantile(q=0.5, interpolation='linear')
Return value at the given quantile.
Parameters
q [float or array-like, default 0.5 (50% quantile)] The quantile(s) to compute, which can
lie in range: 0 <= q <= 1.
interpolation [{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}] This optional parame-
ter specifies the interpolation method to use, when the desired quantile lies between
two data points i and j:
• linear: i + (j - i) * fraction, where fraction is the fractional part of the index sur-
rounded by i and j.
• lower: i.
• higher: j.
• nearest: i or j whichever is nearest.
• midpoint: (i + j) / 2.
Returns
float or Series If q is an array, a Series will be returned where the index is q and the
values are the quantiles, otherwise a float will be returned.
See also:
Examples
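The example block is empty in this rendering; a sketch matching the upstream docstring:
>>> s = pd.Series([1, 2, 3, 4])
>>> s.quantile(.5)
2.5
>>> s.quantile([.25, .5, .75])
0.25    1.75
0.50    2.50
0.75    3.25
dtype: float64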
pandas.Series.radd
Examples
pandas.Series.rank
Examples
The following example shows how the method behaves with the above parameters:
• default_rank: this is the default behaviour obtained without using any parameter.
• max_rank: setting method = 'max' the records that have the same values are ranked using the
highest rank (e.g.: since ‘cat’ and ‘dog’ are both in the 2nd and 3rd position, rank 3 is assigned.)
• NA_bottom: choosing na_option = 'bottom', if there are records with NaN values they are
placed at the bottom of the ranking.
• pct_rank: when setting pct = True, the ranking is expressed as percentile rank.
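The frame behind these columns is not shown in this rendering; presumably it is built as in the upstream docstring (output omitted):
>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)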
pandas.Series.ravel
Series.ravel(order='C')
Return the flattened underlying data as an ndarray.
Returns
numpy.ndarray or ndarray-like Flattened data of the Series.
See also:
pandas.Series.rdiv
Series.truediv Element-wise Floating division, see Python documentation for more details.
Examples
pandas.Series.rdivmod
Series.divmod Element-wise Integer division and modulo, see Python documentation for more de-
tails.
Examples
pandas.Series.reindex
Series.reindex(index=None, **kwargs)
Conform Series to new index with optional filling logic.
Places NA/NaN in locations having no value in the previous index. A new object is produced unless the
new index is equivalent to the current one and copy=False.
Parameters
index [array-like, optional] New labels / index to conform to, should be specified using
keywords. Preferably an Index object to avoid duplicating data.
method [{None, ‘backfill’/’bfill’, ‘pad’/’ffill’, ‘nearest’}] Method to use for filling holes
in reindexed DataFrame. Please note: this is only applicable to DataFrames/Series
with a monotonically increasing/decreasing index.
• None (default): don’t fill gaps
• pad / ffill: Propagate last valid observation forward to next valid.
• backfill / bfill: Use next valid observation to fill gap.
• nearest: Use nearest valid observations to fill gap.
copy [bool, default True] Return a new object, even if the passed indexes are the same.
level [int or name] Broadcast across a level, matching Index values on the passed Multi-
Index level.
fill_value [scalar, default np.NaN] Value to use for missing values. Defaults to NaN, but
can be any “compatible” value.
limit [int, default None] Maximum number of consecutive elements to forward or back-
ward fill.
tolerance [optional] Maximum distance between original and new labels for inexact matches. The values of the index at the matching locations must satisfy the equation abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values, or
list-like, which applies variable tolerance per element. List-like includes list, tuple,
array, Series, and must be the same size as the index and its dtype must exactly match
the index’s type.
Returns
Series with changed index.
See also:
Examples
Create a new index and reindex the dataframe. By default values in the new index that do not have
corresponding records in the dataframe are assigned NaN.
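The omitted snippet presumably resembles the upstream docstring example:
>>> index = ['Firefox', 'Chrome', 'Safari', 'IE10', 'Konqueror']
>>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301],
...                    'response_time': [0.04, 0.02, 0.07, 0.08, 1.0]},
...                   index=index)
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10',
...              'Chrome']
>>> df.reindex(new_index)
               http_status  response_time
Safari               404.0           0.07
Iceweasel              NaN            NaN
Comodo Dragon          NaN            NaN
IE10                 404.0           0.08
Chrome               200.0           0.02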
We can fill in the missing values by passing a value to the keyword fill_value. Because the index is
not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the
NaN values.
To further illustrate the filling functionality in reindex, we will create a dataframe with a monotonically
increasing index (for example, a sequence of dates).
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by
default filled with NaN. If desired, we can fill in the missing values using one of several options.
For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an argument to
the method keyword.
Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will not be
filled by any of the value propagation schemes. This is because filling while reindexing does not look at
dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN
values present in the original dataframe, use the fillna() method.
See the user guide for more.
pandas.Series.reindex_like
Series or DataFrame Same type as caller, but with changed indices on each axis.
See also:
Notes
Examples
>>> df1
temp_celsius temp_fahrenheit windspeed
2014-02-12 24.3 75.7 high
2014-02-13 31.0 87.8 high
2014-02-14 22.0 71.6 medium
2014-02-15 35.0 95.0 medium
>>> df2
temp_celsius windspeed
2014-02-12 28.0 low
2014-02-13 30.0 low
2014-02-15 35.1 medium
>>> df2.reindex_like(df1)
temp_celsius temp_fahrenheit windspeed
2014-02-12 28.0 NaN low
2014-02-13 30.0 NaN low
2014-02-14 NaN NaN NaN
2014-02-15 35.1 NaN medium
pandas.Series.rename
Examples
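The example block is empty in this rendering; a sketch matching the upstream docstring:
>>> s = pd.Series([1, 2, 3])
>>> s.rename("my_name")  # scalar, changes Series.name
0    1
1    2
2    3
Name: my_name, dtype: int64
>>> s.rename(lambda x: x ** 2)  # function, changes labels
0    1
1    2
4    3
dtype: int64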
pandas.Series.rename_axis
Notes
Examples
Series
DataFrame
MultiIndex
>>> df.rename_axis(columns=str.upper)
LIMBS num_legs num_arms
type name
mammal dog 4 0
cat 4 0
monkey 2 2
pandas.Series.reorder_levels
Series.reorder_levels(order)
Rearrange index levels using input order.
May not drop or duplicate levels.
Parameters
order [list of int representing new level order] Reference level by number or key.
Returns
type of caller (new object)
pandas.Series.repeat
Series.repeat(repeats, axis=None)
Repeat elements of a Series.
Returns a new Series where each element of the current Series is repeated consecutively a given number
of times.
Parameters
repeats [int or array of ints] The number of repetitions for each element. This should be
a non-negative integer. Repeating 0 times will return an empty Series.
axis [None] Must be None. Has no effect but is accepted for compatibility with numpy.
Returns
Series Newly created Series with repeated elements.
See also:
Examples
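The example block is empty in this rendering; a minimal sketch:
>>> s = pd.Series(['a', 'b', 'c'])
>>> s.repeat(2)
0    a
0    a
1    b
1    b
2    c
2    c
dtype: object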
pandas.Series.replace
‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter should be
None.
– For a DataFrame a dict can specify that different values should be replaced in
different columns. For example, {'a': 1, 'b': 'z'} looks for the value
1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these values with
whatever is specified in value. The value parameter should not be None in this
case. You can treat this as a special case of passing two lists except that you are
specifying the column to search in.
– For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}}, are
read as follows: look in column ‘a’ for the value ‘b’ and replace it with NaN.
The value parameter should be None to use a nested dict in this way. You can
nest regular expressions as well. Note that column names (the top-level dictio-
nary keys in a nested dictionary) cannot be regular expressions.
• None:
– This means that the regex argument must be a string, compiled regular expres-
sion, or list, dict, ndarray or Series of such elements. If value is also None then
this must be a nested dictionary or Series.
See the examples section for examples of each of these.
value [scalar, dict, list, str, regex, default None] Value to replace any values matching
to_replace with. For a DataFrame a dict of values can be used to specify which
value to use for each column (columns not in the dict will not be filled). Regular
expressions, strings and lists or dicts of such objects are also allowed.
inplace [bool, default False] If True, in place. Note: this will modify any other views on
this object (e.g. a column from a DataFrame). Returns the caller if this is True.
limit [int or None, default None] Maximum size gap to forward or backward fill.
regex [bool or same types as to_replace, default False] Whether to interpret to_replace
and/or value as regular expressions. If this is True then to_replace must be a string.
Alternatively, this could be a regular expression or a list, dict, or array of regular
expressions in which case to_replace must be None.
method [{‘pad’, ‘ffill’, ‘bfill’, None}] The method to use for replacement when to_replace is a scalar, list or tuple and value is None.
Returns
Series or None Object after replacement or None if inplace=True.
Raises
AssertionError
• If regex is not a bool and to_replace is not None.
TypeError
• If to_replace is not a scalar, array-like, dict, or None
• If to_replace is a dict and value is not a list, dict, ndarray, or Series
• If to_replace is None and regex is not compilable into a regular expression or is a
list, dict, ndarray, or Series.
• When replacing multiple bool or datetime64 objects and the arguments to to_replace do not match the type of the value being replaced
ValueError
• If a list or an ndarray is passed to to_replace and value but they are not the
same length.
See also:
Notes
• Regex substitution is performed under the hood with re.sub. The rules for substitution for re.
sub are the same.
• Regular expressions will only substitute on strings, meaning you cannot provide, for example, a
regular expression matching floating point numbers and expect the columns in your frame that have
a numeric dtype to be matched. However, if those floating point numbers are strings, then you can
do this.
• This method has a lot of options. You are encouraged to experiment and play with this method to
gain intuition about how it works.
• When a dict is used as the to_replace value, the dict's keys play the role of to_replace and the dict's values play the role of the value parameter.
Examples
List-like `to_replace`
dict-like `to_replace`
>>> df.replace({0: 10, 1: 100})
A B C
0 10 5 a
1 100 6 b
2 2 7 c
3 3 8 d
4 4 9 e
When a dict is used as the to_replace value, the dict's values take the place of the value parameter; s.replace({'a': None}) is therefore equivalent to s.replace(to_replace={'a': None}, value=None, method=None):
When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter (default ‘pad’) to do the replacement. This is why the ‘a’ values are replaced by 10 in rows 1 and 2 and by ‘b’ in row 4 in this case. The command s.replace('a', None) is actually equivalent to s.replace(to_replace='a', value=None, method='pad'):
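As a minimal sketch of that padding behaviour (the Series below is an assumed example, not one defined earlier in this docstring):
>>> import pandas as pd
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
>>> s.replace('a', None)  # scalar to_replace, value=None -> method='pad'
0    10
1    10
2    10
3     b
4     b
dtype: object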
pandas.Series.resample
Notes
Examples
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum()
2000-01-01 00:00:00 3
2000-01-01 00:03:00 12
2000-01-01 00:06:00 21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the
left. Please note that the value in the bucket used as the label is not included in the bucket, which it labels.
For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the
summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if
it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval
as illustrated in the example below this one.
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
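A sketch of that right-closed, right-labelled variant, assuming the nine-value, minute-frequency series the earlier examples appear to be built on:
>>> import pandas as pd
>>> series = pd.Series(range(9),
...                    index=pd.date_range('1/1/2000', periods=9, freq='T'))
>>> series.resample('3T', label='right', closed='right').sum()
2000-01-01 00:00:00     0
2000-01-01 00:03:00     6
2000-01-01 00:06:00    15
2000-01-01 00:09:00    15
Freq: 3T, dtype: int64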
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 0
2000-01-01 00:01:00 1
2000-01-01 00:01:30 1
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 1
2000-01-01 00:01:00 1
2000-01-01 00:01:30 2
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or
end of rule.
Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.
Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.
For DataFrame objects, the keyword on can be used to specify the column instead of the index for resampling.
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling
needs to take place.
If you want to adjust the start of the bins based on a fixed timestamp:
>>> ts.resample('17min').sum()
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, dtype: int64
If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:
To replace the use of the deprecated base argument, you can now use offset, in this example it is equivalent
to have base=2:
pandas.Series.reset_index
Examples
>>> s.reset_index()
idx foo
0 a 1
1 b 2
2 c 3
3 d 4
>>> s.reset_index(name='values')
idx values
0 a 1
1 b 2
2 c 3
3 d 4
>>> s.reset_index(drop=True)
0 1
1 2
2 3
3 4
Name: foo, dtype: int64
To update the Series in place, without generating a new one set inplace to True. Note that it also requires
drop=True.
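A sketch of that in-place call, reconstructing the Series the examples above appear to use:
>>> s = pd.Series([1, 2, 3, 4], name='foo',
...               index=pd.Index(['a', 'b', 'c', 'd'], name='idx'))
>>> s.reset_index(inplace=True, drop=True)
>>> s
0    1
1    2
2    3
3    4
Name: foo, dtype: int64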
>>> s2.reset_index(level='a')
a foo
b
one bar 0
two bar 1
one baz 2
two baz 3
If level is not set, all levels are removed from the Index.
>>> s2.reset_index()
a b foo
0 bar one 0
1 bar two 1
2 baz one 2
3 baz two 3
pandas.Series.rfloordiv
Series.floordiv Element-wise Integer division, see Python documentation for more details.
Examples
pandas.Series.rmod
Examples
pandas.Series.rmul
Examples
pandas.Series.rolling
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the
window by setting center=True.
To learn more about the offsets & frequency strings, please see this link.
If win_type=None, all points are evenly weighted; otherwise, win_type can accept a string of any
scipy.signal window function.
Certain Scipy window types require additional parameters to be passed in the aggregation function. The
additional parameters must match the keywords specified in the Scipy window type method signature.
Please see the third example below on how to add the additional parameters.
Examples
Rolling sum with a window length of 2, using the ‘triang’ window type.
Rolling sum with a window length of 2, using the ‘gaussian’ window type (note how we need to specify
std).
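A sketch of those weighted-window calls, assuming the single-column frame the following examples appear to use (scipy must be installed for win_type):
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df.rolling(2, win_type='triang').sum()
     B
0  NaN
1  0.5
2  1.5
3  NaN
4  NaN
>>> df.rolling(2, win_type='gaussian').sum(std=3)
          B
0       NaN
1  0.986207
2  2.958621
3       NaN
4       NaN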
Rolling sum with a window length of 2, min_periods defaults to the window length.
>>> df.rolling(2).sum()
B
0 NaN
1 1.0
2 3.0
3 NaN
4 NaN
>>> df
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
In contrast to an integer rolling window, this will roll a variable-length window corresponding to the time period. The default for min_periods is 1.
>>> df.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
pandas.Series.round
Examples
pandas.Series.rpow
Series.pow Element-wise Exponential power, see Python documentation for more details.
Examples
pandas.Series.rsub
Examples
pandas.Series.rtruediv
Series.truediv Element-wise Floating division, see Python documentation for more details.
Examples
pandas.Series.sample
Returns
Series or DataFrame A new object of same type as caller containing n items randomly
sampled from the caller object.
See also:
Notes
Examples
Extract 3 random elements from the Series df['num_legs']: Note that we use random_state to
ensure the reproducibility of the examples.
An upsampled sample of the DataFrame with replacement: note that the replace parameter has to be True when the frac parameter is greater than 1.
Using a DataFrame column as weights. Rows with larger value in the num_specimen_seen column are
more likely to be sampled.
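A sketch of the sampling calls described above, assuming a small frame with a num_specimen_seen column (the rows drawn are fixed here by random_state):
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64
>>> df.sample(n=2, weights='num_specimen_seen', random_state=1)
        num_legs  num_wings  num_specimen_seen
falcon         2          2                 10
fish           0          0                  8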
pandas.Series.searchsorted
Note: The Series must be monotonically sorted, otherwise wrong locations will likely be returned.
Pandas does not check this for you.
Parameters
value [array_like] Values to insert into self.
side [{‘left’, ‘right’}, optional] If ‘left’, the index of the first suitable location found is
given. If ‘right’, return the last such index. If there is no suitable index, return either
0 or N (where N is the length of self ).
sorter [1-D array_like, optional] Optional array of integer indices that sort self into ascending order. They are typically the result of np.argsort.
Returns
int or array of int A scalar or array of insertion points with the same shape as value.
Changed in version 0.24.0: If value is a scalar, an int is now always returned. Previously, scalar inputs returned a 1-item array for Series and Categorical.
See also:
Notes
Examples
>>> ser.searchsorted(4)
3
>>> ser.searchsorted('3/14/2000')
3
>>> ser.searchsorted('bread')
1
If the values are not monotonically sorted, wrong locations may be returned:
>>> ser = pd.Series([2, 1, 3])
>>> ser
0 2
1    1
2    3
dtype: int64
>>> ser.searchsorted(1)
0 # wrong result, correct would be 1
pandas.Series.sem
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
pandas.Series.set_axis
See also:
Examples
pandas.Series.set_flags
Notes
This method returns a new object that’s a view on the same data as the input. Mutating the input or the
output values will be reflected in the other.
This method is intended to be used in method chains.
“Flags” differ from “metadata”. Flags reflect properties of the pandas object (the Series or DataFrame).
Metadata refer to properties of the dataset, and should be stored in DataFrame.attrs.
Examples
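A minimal sketch of a chained set_flags call (allows_duplicate_labels is the flag assumed here):
>>> df = pd.DataFrame({"A": [1, 2]})
>>> df.flags.allows_duplicate_labels
True
>>> df2 = df.set_flags(allows_duplicate_labels=False)
>>> df2.flags.allows_duplicate_labels
False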
pandas.Series.shift
Examples
>>> df.shift(periods=3)
Col1 Col2 Col3
2020-01-01 NaN NaN NaN
2020-01-02 NaN NaN NaN
2020-01-03 NaN NaN NaN
2020-01-04 10.0 13.0 17.0
2020-01-05 20.0 23.0 27.0
pandas.Series.skew
pandas.Series.slice_shift
Series.slice_shift(periods=1, axis=0)
Equivalent to shift without copying data. The shifted data will not include the dropped periods and the
shifted axis will be smaller than the original.
Deprecated since version 1.2.0: slice_shift is deprecated, use DataFrame/Series.shift instead.
Parameters
periods [int] Number of periods to move, can be positive or negative.
Returns
shifted [same type as caller]
Notes
While the slice_shift is faster than shift, you may pay for it later during alignment.
pandas.Series.sort_index
ascending [bool or list of bools, default True] Sort ascending vs. descending. When the
index is a MultiIndex the sort direction can be controlled for each level individually.
inplace [bool, default False] If True, perform operation in-place.
kind [{‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’] Choice of sorting algorithm. See also numpy.sort() for more information. ‘mergesort’ is the only stable algorithm. For DataFrames, this option is only applied when sorting on a single column or label.
na_position [{‘first’, ‘last’}, default ‘last’] If ‘first’ puts NaNs at the beginning, ‘last’
puts NaNs at the end. Not implemented for MultiIndex.
sort_remaining [bool, default True] If True and sorting by level and index is multilevel,
sort by other levels too (in order) after sorting by specified level.
ignore_index [bool, default False] If True, the resulting axis will be labeled 0, 1, . . . , n - 1.
New in version 1.0.0.
key [callable, optional] If not None, apply the key function to the index values before
sorting. This is similar to the key argument in the builtin sorted() function, with
the notable difference that this key function should be vectorized. It should expect an
Index and return an Index of the same shape.
New in version 1.1.0.
Returns
Series or None The original Series sorted by the labels or None if inplace=True.
See also:
Examples
Sort Descending
>>> s.sort_index(ascending=False)
4 d
3 a
2 b
1 c
dtype: object
Sort Inplace
>>> s.sort_index(inplace=True)
>>> s
1 c
2 b
3 a
4 d
dtype: object
By default NaNs are put at the end, but use na_position to place them at the beginning
>>> s = pd.Series(['a', 'b', 'c', 'd'], index=[3, 2, 1, np.nan])
>>> s.sort_index(na_position='first')
NaN d
1.0 c
2.0 b
3.0 a
dtype: object
pandas.Series.sort_values
Examples
>>> s.sort_values(ascending=True)
1 1.0
2 3.0
4 5.0
3 10.0
0 NaN
dtype: float64
>>> s.sort_values(ascending=False)
3 10.0
4 5.0
2 3.0
1 1.0
0 NaN
dtype: float64
>>> s.sort_values(na_position='first')
0 NaN
1 1.0
2 3.0
4 5.0
3 10.0
dtype: float64
>>> s.sort_values()
3 a
1 b
4 c
2 d
0 z
dtype: object
Sort using a key function. Your key function will be given the Series of values and should return an
array-like.
NumPy ufuncs work well here. For example, we can sort by the sin of the value
More complicated user-defined functions can be used, as long as they expect a Series and return an array-
like
pandas.Series.sparse
Series.sparse()
Accessor for sparse-dtype specific methods and attributes, e.g. converting to and from scipy sparse matrices and inspecting density or fill_value.
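A minimal sketch of the accessor, assuming a Series backed by a sparse integer dtype:
>>> s = pd.Series([0, 0, 1, 2], dtype="Sparse[int]")
>>> s.sparse.density
0.5
>>> s.sparse.fill_value
0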
pandas.Series.squeeze
Series.squeeze(axis=None)
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or
a single row are squeezed to a Series. Otherwise the object is unchanged.
This method is most useful when you don’t know if your object is a Series or DataFrame, but you do
know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’, None}, default None] A specific axis to squeeze. By
default, all length-1 axes are squeezed.
Returns
DataFrame, Series, or scalar The projection after squeezing axis or all the axes.
See also:
Examples
>>> even_primes.squeeze()
2
Squeezing objects with more than one value in every axis does nothing:
>>> odd_primes.squeeze()
1 3
2 5
3 7
dtype: int64
Squeezing a DataFrame sliced down to a single column along the columns axis produces a Series:
>>> df_a.squeeze('columns')
0 1
1 3
Name: a, dtype: int64
Squeezing a single-row, single-column DataFrame along the rows produces a one-element Series; squeezing it again (or with no axis) produces a scalar:
>>> df_0a.squeeze('rows')
a 1
Name: 0, dtype: int64
>>> df_0a.squeeze()
1
pandas.Series.std
numeric_only [bool, default None] Include only float, int, boolean columns. If None,
will attempt to use everything, then use only numeric data. Not implemented for
Series.
Returns
scalar or Series (if level specified)
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
pandas.Series.str
Series.str()
Vectorized string functions for Series and Index.
NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods,
with some inspiration from R’s stringr package.
Examples
>>> s = pd.Series(["A_Str_Series"])
>>> s
0 A_Str_Series
dtype: object
>>> s.str.split("_")
0 [A, Str, Series]
dtype: object
pandas.Series.sub
Returns
Series The result of the operation.
See also:
Series.rsub Reverse of the Subtraction operator, see Python documentation for more details.
Examples
pandas.Series.subtract
Series.rsub Reverse of the Subtraction operator, see Python documentation for more details.
Examples
pandas.Series.sum
Examples
>>> s.sum()
14
>>> s.sum(level='blooded')
blooded
warm 6
cold 8
Name: legs, dtype: int64
>>> s.sum(level=0)
blooded
warm 6
cold 8
Name: legs, dtype: int64
By default, the sum of an empty or all-NA Series is 0. This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty series to be NaN, pass min_count=1.
>>> pd.Series([]).sum(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).sum()
0.0
>>> pd.Series([np.nan]).sum(min_count=1)
nan
pandas.Series.swapaxes
pandas.Series.swaplevel
pandas.Series.tail
Series.tail(n=5)
Return the last n rows.
This function returns last n rows from the object based on position. It is useful for quickly verifying data,
for example, after sorting or appending rows.
For negative values of n, this function returns all rows except the first n rows, equivalent to df[n:].
Parameters
n [int, default 5] Number of rows to select.
Returns
type of caller The last n rows of the caller object.
See also:
Examples
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.tail(3)
animal
6 shark
7 whale
8 zebra
>>> df.tail(-3)
animal
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
pandas.Series.take
axis [{0 or ‘index’, 1 or ‘columns’, None}, default 0] The axis on which to select elements. 0 means that we are selecting rows, 1 means that we are selecting columns.
is_copy [bool] Before pandas 1.0, is_copy=False can be specified to ensure that the
return value is an actual copy. Starting with pandas 1.0, take always returns a copy,
and the keyword is therefore deprecated.
Deprecated since version 1.0.0.
**kwargs For compatibility with numpy.take(). Has no effect on the output.
Returns
taken [same type as caller] An array-like containing the elements taken from the object.
See also:
Examples
We may take elements using negative integers for positive indices, starting from the end of the object, just
like with Python lists.
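A sketch with an assumed object-dtype Series:
>>> ser = pd.Series(['falcon', 'parrot', 'lion', 'monkey'])
>>> ser.take([1, 3])
1    parrot
3    monkey
dtype: object
>>> ser.take([-1, -2])
3    monkey
2      lion
dtype: object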
pandas.Series.to_clipboard
Notes
Examples
>>> df.to_clipboard(sep=',')
... # Wrote the following to the system clipboard:
... # ,A,B,C
... # 0,1,2,3
... # 1,4,5,6
We can omit the index by passing the keyword index and setting it to false.
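A sketch of that call, assuming the same three-column frame as above and a working clipboard backend:
>>> df.to_clipboard(sep=',', index=False)
... # Wrote the following to the system clipboard:
... # A,B,C
... # 1,2,3
... # 4,5,6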
pandas.Series.to_csv
Changed in version 1.0.0: May now be a dict with key ‘method’ as compression mode
and other entries as additional compression options if compression mode is ‘zip’.
Changed in version 1.1.0: Passing compression options as keys in dict is supported
for compression modes ‘gzip’ and ‘bz2’ as well as ‘zip’.
Changed in version 1.2.0: Compression is supported for binary file objects.
Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to
gzip.open instead of gzip.GzipFile which prevented setting mtime.
quoting [optional constant from csv module] Defaults to csv.QUOTE_MINIMAL.
If you have set a float_format then floats are converted to strings and thus
csv.QUOTE_NONNUMERIC will treat them as non-numeric.
quotechar [str, default ‘"’] String of length 1. Character used to quote fields.
line_terminator [str, optional] The newline character or character sequence to use in the output file. Defaults to os.linesep, which depends on the OS in which this method is called (e.g. ‘\n’ for Linux, ‘\r\n’ for Windows).
Changed in version 0.24.0.
chunksize [int or None] Rows to write at a time.
date_format [str, default None] Format string for datetime objects.
doublequote [bool, default True] Control quoting of quotechar inside a field.
escapechar [str, default None] String of length 1. Character used to escape sep and
quotechar when appropriate.
decimal [str, default ‘.’] Character recognized as decimal separator. E.g. use ‘,’ for
European data.
errors [str, default ‘strict’] Specifies how encoding and decoding errors are to be handled.
See the errors argument for open() for a full list of options.
New in version 1.1.0.
storage_options [dict, optional] Extra options that make sense for a particular storage
connection, e.g. host, port, username, password, etc., if using a URL that will be
parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing
this argument with a non-fsspec URL. See the fsspec and backend storage implemen-
tation docs for the set of allowed keys and values.
New in version 1.2.0.
Returns
None or str If path_or_buf is None, returns the resulting csv format as a string. Other-
wise returns None.
See also:
Examples
pandas.Series.to_dict
Series.to_dict(into=<class 'dict'>)
Convert Series to {label -> value} dict or dict-like object.
Parameters
into [class, default dict] The collections.abc.Mapping subclass to use as the return object.
Can be the actual class or an empty instance of the mapping type you want. If you
want a collections.defaultdict, you must pass it initialized.
Returns
collections.abc.Mapping Key-value representation of Series.
Examples
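A short sketch, assuming a plain integer Series:
>>> s = pd.Series([1, 2, 3, 4])
>>> s.to_dict()
{0: 1, 1: 2, 2: 3, 3: 4}
>>> from collections import OrderedDict, defaultdict
>>> s.to_dict(OrderedDict)
OrderedDict([(0, 1), (1, 2), (2, 3), (3, 4)])
>>> dd = defaultdict(list)
>>> s.to_dict(dd)
defaultdict(<class 'list'>, {0: 1, 1: 2, 2: 3, 3: 4})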
pandas.Series.to_excel
Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it
is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already
exists will result in the contents of the existing file being erased.
Parameters
excel_writer [path-like, file-like, or ExcelWriter object] File path or existing Excel-
Writer.
sheet_name [str, default ‘Sheet1’] Name of sheet which will contain DataFrame.
na_rep [str, default ‘’] Missing data representation.
float_format [str, optional] Format string for floating point numbers. For example
float_format="%.2f" will format 0.1234 to 0.12.
columns [sequence or list of str, optional] Columns to write.
header [bool or list of str, default True] Write out the column names. If a list of string is
given it is assumed to be aliases for the column names.
index [bool, default True] Write row names (index).
index_label [str or sequence, optional] Column label for index column(s) if desired. If
not specified, and header and index are True, then the index names are used. A
sequence should be given if the DataFrame uses MultiIndex.
startrow [int, default 0] Upper left cell row to dump data frame.
startcol [int, default 0] Upper left cell column to dump data frame.
engine [str, optional] Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set
this via the options io.excel.xlsx.writer, io.excel.xls.writer, and
io.excel.xlsm.writer.
Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the
xlwt engine will be removed in a future version of pandas.
merge_cells [bool, default True] Write MultiIndex and Hierarchical Rows as merged
cells.
encoding [str, optional] Encoding of the resulting excel file. Only necessary for xlwt,
other writers support unicode natively.
inf_rep [str, default ‘inf’] Representation for infinity (there is no native representation for
infinity in Excel).
verbose [bool, default True] Display more information in the error logs.
freeze_panes [tuple of int (length 2), optional] Specifies the one-based bottommost row
and rightmost column that is to be frozen.
storage_options [dict, optional] Extra options that make sense for a particular storage
connection, e.g. host, port, username, password, etc., if using a URL that will be
parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if providing
this argument with a non-fsspec URL. See the fsspec and backend storage implemen-
tation docs for the set of allowed keys and values.
New in version 1.2.0.
See also:
Notes
For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.
Once a workbook has been saved it is not possible to write further data without rewriting the whole workbook.
Examples
>>> df1.to_excel("output.xlsx",
... sheet_name='Sheet_name_1')
If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter
object:
To set the library that is used to write the Excel file, you can pass the engine keyword (the default engine
is automatically chosen depending on the file extension):
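A sketch of both points, assuming df1 and df2 are existing DataFrames and that openpyxl or xlsxwriter is installed:
>>> with pd.ExcelWriter('output.xlsx') as writer:
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')
>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')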
pandas.Series.to_frame
Series.to_frame(name=None)
Convert Series to DataFrame.
Parameters
name [object, default None] The passed name should substitute for the series name (if it
has one).
Returns
Examples
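A minimal sketch, assuming a named string Series:
>>> s = pd.Series(["a", "b", "c"], name="vals")
>>> s.to_frame()
  vals
0    a
1    b
2    c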
pandas.Series.to_hdf
• ‘table’: Table format. Write as a PyTables Table structure which may perform
worse but allow more flexible operations like searching / selecting subsets of the
data.
• If None, pd.get_option(‘io.hdf.default_format’) is checked, followed by fallback to
“fixed”
errors [str, default ‘strict’] Specifies how encoding and decoding errors are to be handled.
See the errors argument for open() for a full list of options.
encoding [str, default “UTF-8”]
min_itemsize [dict or int, optional] Map column names to minimum string sizes for
columns.
nan_rep [Any, optional] How to represent null values as str. Not allowed with append=True.
data_columns [list of columns or True, optional] List of columns to create as indexed
data columns for on-disk queries, or True to use all columns. By default only the
axes of the object are indexed. See Query via data columns. Applicable only to
format=’table’.
See also:
Examples
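The clean-up lines below assume a file written along these lines (a sketch; PyTables must be installed):
>>> import pandas as pd
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},
...                   index=['a', 'b', 'c'])
>>> df.to_hdf('data.h5', key='df', mode='w')
>>> pd.read_hdf('data.h5', 'df')
   A  B
a  1  4
b  2  5
c  3  6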
>>> import os
>>> os.remove('data.h5')
pandas.Series.to_json
Notes
The behavior of indent=0 varies from the stdlib, which does not indent the output but does insert
newlines. Currently, indent=0 and the default indent=None are equivalent in pandas, though this
may change in a future release.
orient='table' contains a ‘pandas_version’ field under ‘schema’. This stores the version of pandas
used in the latest revision of the schema.
Examples
Encoding/decoding a DataFrame using 'records' formatted JSON. Note that index labels are not preserved with this encoding.
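A sketch of that orientation, with an assumed two-by-two frame:
>>> df = pd.DataFrame([["a", "b"], ["c", "d"]],
...                   index=["row 1", "row 2"],
...                   columns=["col 1", "col 2"])
>>> df.to_json(orient="records")
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'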
pandas.Series.to_latex
Examples
pandas.Series.to_list
Series.to_list()
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period).
Returns
list
See also:
numpy.ndarray.tolist Return the array as an a.ndim-levels deep nested list of Python scalars.
pandas.Series.to_markdown
Returns
str Series in Markdown-friendly format.
Notes
Examples
>>> print(s.to_markdown(tablefmt="grid"))
+----+----------+
| | animal |
+====+==========+
| 0 | elk |
+----+----------+
| 1 | pig |
+----+----------+
| 2 | dog |
+----+----------+
| 3 | quetzal |
+----+----------+
pandas.Series.to_numpy
**kwargs Additional keywords passed through to the to_numpy method of the underlying array (for extension arrays).
New in version 1.0.0.
Returns
numpy.ndarray
See also:
Notes
The returned array will be the same up to equality (values equal in self will be equal in the returned array;
likewise for values that are not equal). When self contains an ExtensionArray, the dtype may be different.
For example, for a category-dtype Series, to_numpy() will return a NumPy array and the categorical
dtype will be lost.
For NumPy dtypes, this will be a reference to the actual data stored in this Series or Index (assuming
copy=False). Modifying the result in place will modify the data stored in the Series or Index (not that
we recommend doing that).
For extension types, to_numpy() may require copying data and coercing the result to a NumPy type
(possibly object), which may be expensive. When you need a no-copy reference to the underlying data,
Series.array should be used instead.
This table lays out the different dtypes and default return types of to_numpy() for various dtypes within
pandas.
Examples
Specify the dtype to control how datetime-aware data is represented. Use dtype=object to return an
ndarray of pandas Timestamp objects, each with the correct tz.
>>> ser = pd.Series(pd.date_range('2000', periods=2, tz="CET"))
>>> ser.to_numpy(dtype=object)
array([Timestamp('2000-01-01 00:00:00+0100', tz='CET', freq='D'),
Timestamp('2000-01-02 00:00:00+0100', tz='CET', freq='D')],
dtype=object)
>>> ser.to_numpy(dtype="datetime64[ns]")
...
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00...'],
dtype='datetime64[ns]')
pandas.Series.to_period
Series.to_period(freq=None, copy=True)
Convert Series from DatetimeIndex to PeriodIndex.
Parameters
freq [str, default None] Frequency associated with the PeriodIndex.
copy [bool, default True] Whether or not to return a copy.
Returns
Series Series with index converted to PeriodIndex.
pandas.Series.to_pickle
read_pickle Load pickled pandas object (or any object) from file.
Examples
>>> import os
>>> os.remove("./dummy.pkl")
pandas.Series.to_sql
Notes
Timezone aware datetime columns will be written as Timestamp with timezone type with
SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone unaware
timestamps local to the original timezone.
New in version 0.24.0.
References
[1], [2]
Examples
This is allowed to support operations that require that the same DBAPI connection is used for the entire
operation.
Specify the dtype (especially useful for integers with missing values). Notice that while pandas is forced
to store the data as floating point, the database supports nullable integers. When fetching the data with
Python, we get back integer scalars.
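A sketch of that nullable-integer round trip, assuming SQLAlchemy with an in-memory SQLite engine:
>>> from sqlalchemy import create_engine
>>> from sqlalchemy.types import Integer
>>> engine = create_engine('sqlite://', echo=False)
>>> df = pd.DataFrame({"A": [1, None, 2]})
>>> df.to_sql('integers', con=engine, index=False,
...           dtype={"A": Integer()})
>>> engine.execute("SELECT * FROM integers").fetchall()
[(1,), (None,), (2,)]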
pandas.Series.to_string
pandas.Series.to_timestamp
pandas.Series.to_xarray
Series.to_xarray()
Return an xarray object from the pandas object.
Returns
xarray.DataArray or xarray.Dataset Data in the pandas structure converted to
Dataset if the object is a DataFrame, or a DataArray if the object is a Series.
See also:
Notes
Examples
>>> df.to_xarray()
<xarray.Dataset>
Dimensions: (index: 4)
Coordinates:
* index (index) int64 0 1 2 3
Data variables:
name (index) object 'falcon' 'parrot' 'lion' 'monkey'
class (index) object 'bird' 'bird' 'mammal' 'mammal'
max_speed (index) float64 389.0 24.0 80.5 nan
num_legs (index) int64 2 2 4 4
>>> df['max_speed'].to_xarray()
<xarray.DataArray 'max_speed' (index: 4)>
array([389. , 24. , 80.5, nan])
Coordinates:
* index (index) int64 0 1 2 3
>>> df_multiindex
speed
date animal
2018-01-01 falcon 350
parrot 18
2018-01-02 falcon 361
parrot 15
>>> df_multiindex.to_xarray()
<xarray.Dataset>
Dimensions: (animal: 2, date: 2)
Coordinates:
* date (date) datetime64[ns] 2018-01-01 2018-01-02
* animal (animal) object 'falcon' 'parrot'
Data variables:
speed (date, animal) int64 350 18 361 15
pandas.Series.tolist
Series.tolist()
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Timestamp/Timedelta/Interval/Period).
Returns
list
See also:
numpy.ndarray.tolist Return the array as an a.ndim-levels deep nested list of Python scalars.
pandas.Series.transform
Examples
Even though the resulting Series must have the same length as the input Series, it is possible to provide
several input functions:
>>> s = pd.Series(range(3))
>>> s
0 0
1 1
2 2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
sqrt exp
0 0.000000 1.000000
1 1.000000 2.718282
2 1.414214 7.389056
>>> df = pd.DataFrame({
... "c": [1, 1, 1, 2, 2, 2, 2],
... "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
c type
0 1 m
1 1 n
2 1 o
3 2 m
4 2 m
5 2 n
6 2 n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
pandas.Series.transpose
Series.transpose(*args, **kwargs)
Return the transpose, which is by definition self.
Returns
Series
pandas.Series.truediv
Series.rtruediv Reverse of the Floating division operator, see Python documentation for more
details.
Examples
pandas.Series.truncate
Notes
If the index being truncated contains only datetime values, before and after may be specified as strings
instead of Timestamps.
Examples
>>> df.truncate(before=pd.Timestamp('2016-01-05'),
... after=pd.Timestamp('2016-01-10')).tail()
A
2016-01-09 23:59:56 1
2016-01-09 23:59:57 1
2016-01-09 23:59:58 1
2016-01-09 23:59:59 1
2016-01-10 00:00:00 1
Because the index is a DatetimeIndex containing only dates, we can specify before and after as strings.
They will be coerced to Timestamps before truncation.
>>> df.truncate('2016-01-05', '2016-01-10').tail()
A
2016-01-09 23:59:56 1
2016-01-09 23:59:57 1
2016-01-09 23:59:58 1
2016-01-09 23:59:59 1
2016-01-10 00:00:00 1
Note that truncate assumes a 0 value for any unspecified time component (midnight). This differs
from partial string slicing, which returns any partially matching dates.
pandas.Series.tshift
Notes
If freq is not specified then tries to use the freq or inferred_freq attributes of the index. If neither of those
attributes exist, a ValueError is thrown
pandas.Series.tz_convert
pandas.Series.tz_localize
Examples
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
>>> s = pd.Series(range(7),
... index=pd.DatetimeIndex(['2018-10-28 01:30:00',
... '2018-10-28 02:00:00',
... '2018-10-28 02:30:00',
... '2018-10-28 02:00:00',
... '2018-10-28 02:30:00',
... '2018-10-28 03:00:00',
... '2018-10-28 03:30:00']))
>>> s.tz_localize('CET', ambiguous='infer')
2018-10-28 01:30:00+02:00 0
2018-10-28 02:00:00+02:00 1
2018-10-28 02:30:00+02:00 2
2018-10-28 02:00:00+01:00 3
2018-10-28 02:30:00+01:00 4
2018-10-28 03:00:00+01:00 5
2018-10-28 03:30:00+01:00 6
dtype: int64
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous
parameter to set the DST explicitly
>>> s = pd.Series(range(3),
... index=pd.DatetimeIndex(['2018-10-28 01:20:00',
... '2018-10-28 02:36:00',
... '2018-10-28 03:46:00']))
>>> s.tz_localize('CET', ambiguous=np.array([True, True, False]))
2018-10-28 01:20:00+02:00 0
2018-10-28 02:36:00+02:00 1
2018-10-28 03:46:00+01:00 2
dtype: int64
If the DST transition causes nonexistent times, you can shift these dates forward or backward with a
timedelta object or ‘shift_forward’ or ‘shift_backward’.
>>> s = pd.Series(range(2),
... index=pd.DatetimeIndex(['2015-03-29 02:30:00',
... '2015-03-29 03:30:00']))
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
2015-03-29 03:00:00+02:00 0
2015-03-29 03:30:00+02:00 1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
2015-03-29 01:59:59.999999999+01:00 0
2015-03-29 03:30:00+02:00 1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
2015-03-29 03:30:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
pandas.Series.unique
Series.unique()
Return unique values of Series object.
Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
Returns
ndarray or ExtensionArray The unique values returned as a NumPy array. See
Notes.
See also:
Notes
Returns the unique values as a NumPy array. In case of an extension-array backed Series, a new
ExtensionArray of that type with just the unique values is returned. This includes
• Categorical
• Period
• Datetime with Timezone
• Interval
• Sparse
• IntegerNA
See Examples section.
Examples
>>> pd.Series(pd.Categorical(list('baabc'))).unique()
['b', 'a', 'c']
Categories (3, object): ['b', 'a', 'c']
pandas.Series.unstack
Series.unstack(level=-1, fill_value=None)
Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
Parameters
level [int, str, or list of these, default last level] Level(s) to unstack, can pass level name.
fill_value [scalar value, default None] Value to use when replacing NaN values.
Returns
DataFrame Unstacked Series.
Examples
>>> s.unstack(level=-1)
a b
one 1 2
two 3 4
>>> s.unstack(level=0)
one two
a 1 3
b 2 4
pandas.Series.update
Series.update(other)
Modify Series in place using values from passed Series.
Uses non-NA values from passed Series to make updates. Aligns on index.
Parameters
other [Series, or object coercible into Series]
Examples
If other contains NaNs the corresponding values are not updated in the original Series.
other can also be a non-Series object type that is coercible into a Series
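A minimal sketch of the NaN behaviour noted above, with an assumed integer Series:
>>> import numpy as np
>>> s = pd.Series([1, 2, 3])
>>> s.update(pd.Series([4, np.nan, 6]))
>>> s
0    4
1    2
2    6
dtype: int64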
pandas.Series.value_counts
Examples
With normalize set to True, returns the relative frequency by dividing all values by the sum of values.
>>> s = pd.Series([3, 1, 2, 3, 4, np.nan])
>>> s.value_counts(normalize=True)
3.0 0.4
bins
Bins can be useful for going from a continuous variable to a categorical variable; instead of counting unique occurrences of values, divide the index into the specified number of half-open bins.
>>> s.value_counts(bins=3)
(0.996, 2.0] 2
(2.0, 3.0] 2
(3.0, 4.0] 1
dtype: int64
dropna
With dropna set to False we can also see NaN index values.
>>> s.value_counts(dropna=False)
3.0 2
2.0 1
NaN 1
4.0 1
1.0 1
dtype: int64
pandas.Series.var
Notes
To have the same behaviour as numpy.var, use ddof=0 (instead of the default ddof=1)
pandas.Series.view
Series.view(dtype=None)
Create a new view of the Series.
This function will return a new Series with a view of the same underlying values in memory, optionally
reinterpreted with a new data type. The new data type must preserve the same size in bytes as to not cause
index misalignment.
Parameters
dtype [data type] Data type object or one of their string representations.
Returns
Series A new Series object as a view of the same data in memory.
See also:
numpy.ndarray.view Equivalent numpy function to create a new view of the same data in memory.
Notes
Series are instantiated with dtype=float64 by default. While numpy.ndarray.view() will return a view with the same data type as the original array, Series.view() (without a specified dtype) will try using float64 and may fail if the original data type size in bytes is not the same.
Examples
The 8 bit signed integer representation of -1 is 0b11111111, but the same bytes represent 255 if read as
an 8 bit unsigned integer:
>>> us = s.view('uint8')
>>> us
0 254
1 255
2 0
3 1
4 2
dtype: uint8
pandas.Series.where
Notes
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if
cond is True the element is used; otherwise the corresponding element from the DataFrame other is
used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m, df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
>>> s.mask(s > 0)
0 0.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
pandas.Series.xs
Notes
Examples
>>> df.xs('mammal')
num_legs num_wings
animal locomotion
cat walks 4 0
dog walks 4 0
bat flies 2 2
3.3.2 Attributes
Axes
pandas.Series.empty
property Series.empty
Indicator whether the Series/DataFrame is empty.
True if the Series/DataFrame is entirely empty (no items), meaning any of the axes are of length 0.
Returns
bool If DataFrame is empty, return True, if not return False.
See also:
Series.dropna Return series without null values.
DataFrame.dropna Return DataFrame with labels on given axis omitted where (all or any) data are missing.
Notes
If DataFrame contains only NaNs, it is still not considered empty. See the example below.
Examples
If we only have NaNs in our DataFrame, it is not considered empty! We will need to drop the NaNs to make the
DataFrame empty:
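A sketch of that sequence, with a single-NaN frame assumed to match the description:
>>> import numpy as np
>>> df = pd.DataFrame({'A': [np.nan]})
>>> df
    A
0 NaN
>>> df.empty
False
>>> df.dropna().empty
True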
3.3.3 Conversion
pandas.Series.__array__
Series.__array__(dtype=None)
Return the values as a NumPy array.
Users should not call this directly. Rather, it is invoked by numpy.array() and numpy.asarray().
Parameters
dtype [str or numpy.dtype, optional] The dtype to use for the resulting NumPy array. By
default, the dtype is inferred from the data.
Returns
numpy.ndarray The values in the series converted to a numpy.ndarray with the specified dtype.
See also:
array Create a new array from data.
Series.array Zero-copy view to the array backing the Series.
Series.to_numpy Series method for similar behavior.
Examples
Or the values may be localized to UTC and the tzinfo discarded with dtype='datetime64[ns]'
Series.get(key[, default]) Get item from object for given key (ex: DataFrame column).
Series.at Access a single value for a row/column label pair.
Series.iat Access a single value for a row/column pair by integer position.
Series.loc Access a group of rows and columns by label(s) or a boolean array.
Series.iloc Purely integer-location based indexing for selection by position.
Series.__iter__() Return an iterator of the values.
pandas.Series.__iter__
Series.__iter__()
Return an iterator of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Times-
tamp/Timedelta/Interval/Period)
Returns
iterator
For more information on .at, .iat, .loc, and .iloc, see the indexing documentation.
Series.add(other[, level, fill_value, axis]) Return Addition of series and other, element-wise (binary operator add).
Series.sub(other[, level, fill_value, axis]) Return Subtraction of series and other, element-wise (binary operator sub).
Series.mul(other[, level, fill_value, axis]) Return Multiplication of series and other, element-wise (binary operator mul).
Series.div(other[, level, fill_value, axis]) Return Floating division of series and other, element-wise (binary operator truediv).
Series.truediv(other[, level, fill_value, axis]) Return Floating division of series and other, element-wise (binary operator truediv).
Series.floordiv(other[, level, fill_value, axis]) Return Integer division of series and other, element-wise (binary operator floordiv).
Series.mod(other[, level, fill_value, axis]) Return Modulo of series and other, element-wise (binary operator mod).
Series.pow(other[, level, fill_value, axis]) Return Exponential power of series and other, element-wise (binary operator pow).
Series.radd(other[, level, fill_value, axis]) Return Addition of series and other, element-wise (binary operator radd).
Series.rsub(other[, level, fill_value, axis]) Return Subtraction of series and other, element-wise (binary operator rsub).
Series.rmul(other[, level, fill_value, axis]) Return Multiplication of series and other, element-wise (binary operator rmul).
Series.rdiv(other[, level, fill_value, axis]) Return Floating division of series and other, element-wise (binary operator rtruediv).
Series.rtruediv(other[, level, fill_value, axis]) Return Floating division of series and other, element-wise (binary operator rtruediv).
Series.align(other[, join, axis, level, . . . ]) Align two objects on their axes with the specified join method.
Series.drop([labels, axis, index, columns, . . . ]) Return Series with specified index labels removed.
Series.droplevel(level[, axis]) Return DataFrame with requested index / column level(s) removed.
Series.drop_duplicates([keep, inplace]) Return Series with duplicate values removed.
Series.duplicated([keep]) Indicate duplicate Series values.
Series.equals(other) Test whether two objects contain the same elements.
Series.first(offset) Select initial periods of time series data based on a date offset.
Series.head([n]) Return the first n rows.
Series.idxmax([axis, skipna]) Return the row label of the maximum value.
Series.idxmin([axis, skipna]) Return the row label of the minimum value.
Series.isin(values) Whether elements in Series are contained in values.
Series.last(offset) Select final periods of time series data based on a date offset.
Series.reindex([index]) Conform Series to new index with optional filling logic.
Series.reindex_like(other[, method, copy, . . . ]) Return an object with matching indices as other object.
Series.rename([index, axis, copy, inplace, . . . ]) Alter Series index labels or name.
Series.rename_axis([mapper, index, columns, . . . ]) Set the name of the axis for the index or columns.
Series.reset_index([level, drop, name, inplace]) Generate a new DataFrame or Series with the index reset.
Series.sample([n, frac, replace, weights, . . . ]) Return a random sample of items from an axis of object.
Series.set_axis(labels[, axis, inplace]) Assign desired index to given axis.
Series.take(indices[, axis, is_copy]) Return the elements in the given positional indices along an axis.
Series.tail([n]) Return the last n rows.
Series.truncate([before, after, axis, copy]) Truncate a Series or DataFrame before and after some index value.
Series.where(cond[, other, inplace, axis, . . . ]) Replace values where the condition is False.
Series.mask(cond[, other, inplace, axis, . . . ]) Replace values where the condition is True.
Series.add_prefix(prefix) Prefix labels with string prefix.
Series.add_suffix(suffix) Suffix labels with string suffix.
Series.filter([items, like, regex, axis]) Subset the dataframe rows or columns according to the specified index labels.
Series.argsort([axis, kind, order]) Return the integer indices that would sort the Series values.
Series.argmin([axis, skipna]) Return int position of the smallest value in the Series.
Series.argmax([axis, skipna]) Return int position of the largest value in the Series.
Series.reorder_levels(order) Rearrange index levels using input order.
Series.sort_values([axis, ascending, . . . ]) Sort by the values.
Series.sort_index([axis, level, ascending, . . . ]) Sort Series by index labels.
Series.swaplevel([i, j, copy]) Swap levels i and j in a MultiIndex.
Series.unstack([level, fill_value]) Unstack, also known as pivot, Series with MultiIndex to produce DataFrame.
Series.explode([ignore_index]) Transform each element of a list-like to a row.
Series.searchsorted(value[, side, sorter]) Find indices where elements should be inserted to maintain order.
Series.ravel([order]) Return the flattened underlying data as an ndarray.
Series.repeat(repeats[, axis]) Repeat elements of a Series.
Series.squeeze([axis]) Squeeze 1 dimensional axis objects into scalars.
Series.view([dtype]) Create a new view of the Series.
3.3.13 Accessors
pandas provides dtype-specific methods under various accessors. These are separate namespaces within Series that
only apply to specific data types.
Datetimelike properties
Series.dt can be used to access the values of the series as datetimelike and return several properties. These can be
accessed like Series.dt.<property>.
Datetime properties
pandas.Series.dt.date
Series.dt.date
Returns numpy array of python datetime.date objects (namely, the date part of Timestamps without timezone
information).
pandas.Series.dt.time
Series.dt.time
Returns numpy array of datetime.time. The time part of the Timestamps.
pandas.Series.dt.timetz
Series.dt.timetz
Returns numpy array of datetime.time also containing timezone information. The time part of the Timestamps.
pandas.Series.dt.year
Series.dt.year
The year of the datetime.
Examples
pandas.Series.dt.month
Series.dt.month
The month as January=1, December=12.
Examples
pandas.Series.dt.day
Series.dt.day
The day of the datetime.
Examples
pandas.Series.dt.hour
Series.dt.hour
The hours of the datetime.
Examples
pandas.Series.dt.minute
Series.dt.minute
The minutes of the datetime.
Examples
pandas.Series.dt.second
Series.dt.second
The seconds of the datetime.
Examples
pandas.Series.dt.microsecond
Series.dt.microsecond
The microseconds of the datetime.
Examples
pandas.Series.dt.nanosecond
Series.dt.nanosecond
The nanoseconds of the datetime.
Examples
pandas.Series.dt.week
Series.dt.week
The week ordinal of the year.
Deprecated since version 1.1.0.
Series.dt.weekofyear and Series.dt.week have been deprecated. Please use Series.dt.isocalendar().week instead.
pandas.Series.dt.weekofyear
Series.dt.weekofyear
The week ordinal of the year.
Deprecated since version 1.1.0.
Series.dt.weekofyear and Series.dt.week have been deprecated. Please use Series.dt.isocalendar().week instead.
pandas.Series.dt.dayofweek
Series.dt.dayofweek
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available both on Series with datetime values (using the dt accessor) and on DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
pandas.Series.dt.day_of_week
Series.dt.day_of_week
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available both on Series with datetime values (using the dt accessor) and on DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
pandas.Series.dt.weekday
Series.dt.weekday
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available both on Series with datetime values (using the dt accessor) and on DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
pandas.Series.dt.dayofyear
Series.dt.dayofyear
The ordinal day of the year.
pandas.Series.dt.day_of_year
Series.dt.day_of_year
The ordinal day of the year.
pandas.Series.dt.quarter
Series.dt.quarter
The quarter of the date.
pandas.Series.dt.is_month_start
Series.dt.is_month_start
Indicates whether the date is the first day of the month.
Returns
Series or array For Series, returns a Series with boolean values. For DatetimeIndex, returns
a boolean array.
See also:
is_month_start Return a boolean indicating whether the date is the first day of the month.
is_month_end Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
>>> s
0 2018-02-27
1 2018-02-28
2 2018-03-01
dtype: datetime64[ns]
>>> s.dt.is_month_start
0 False
1 False
2 True
dtype: bool
>>> s.dt.is_month_end
0 False
1 True
2 False
dtype: bool
pandas.Series.dt.is_month_end
Series.dt.is_month_end
Indicates whether the date is the last day of the month.
Returns
Series or array For Series, returns a Series with boolean values. For DatetimeIndex, returns
a boolean array.
See also:
is_month_start Return a boolean indicating whether the date is the first day of the month.
is_month_end Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
>>> s
0 2018-02-27
1 2018-02-28
2 2018-03-01
dtype: datetime64[ns]
>>> s.dt.is_month_start
0 False
1 False
2 True
dtype: bool
>>> s.dt.is_month_end
0    False
1     True
2    False
dtype: bool
pandas.Series.dt.is_quarter_start
Series.dt.is_quarter_start
Indicator for whether the date is the first day of a quarter.
Returns
is_quarter_start [Series or DatetimeIndex] The same type as the original data with boolean
values. Series will have the same name and index. DatetimeIndex will have the same
name.
See also:
quarter Return the quarter of the date.
is_quarter_end Similar property for indicating the quarter end.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> idx.is_quarter_start
array([False, False, True, False])
pandas.Series.dt.is_quarter_end
Series.dt.is_quarter_end
Indicator for whether the date is the last day of a quarter.
Returns
is_quarter_end [Series or DatetimeIndex] The same type as the original data with boolean
values. Series will have the same name and index. DatetimeIndex will have the same
name.
See also:
quarter Return the quarter of the date.
is_quarter_start Similar property indicating the quarter start.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> idx.is_quarter_end
array([False, True, False, False])
pandas.Series.dt.is_year_start
Series.dt.is_year_start
Indicate whether the date is the first day of a year.
Returns
Series or DatetimeIndex The same type as the original data with boolean values. Series
will have the same name and index. DatetimeIndex will have the same name.
See also:
is_year_end Similar property indicating the last day of the year.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> dates.dt.is_year_start
0 False
1 False
2 True
dtype: bool
>>> idx.is_year_start
array([False, False, True])
pandas.Series.dt.is_year_end
Series.dt.is_year_end
Indicate whether the date is the last day of the year.
Returns
Series or DatetimeIndex The same type as the original data with boolean values. Series
will have the same name and index. DatetimeIndex will have the same name.
See also:
is_year_start Similar property indicating the start of the year.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> dates.dt.is_year_end
0 False
1 True
2 False
dtype: bool
>>> idx.is_year_end
array([False, True, False])
pandas.Series.dt.is_leap_year
Series.dt.is_leap_year
Boolean indicator if the date belongs to a leap year.
A leap year is a year which has 366 days (instead of 365), including the 29th of February as an intercalary day. Leap years are years which are multiples of four, with the exception of years divisible by 100 but not by 400.
Returns
Series or ndarray Booleans indicating if dates belong to a leap year.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
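For instance, a sketch on an assumed year-end DatetimeIndex:
>>> idx = pd.date_range("2012-01-01", "2015-01-01", freq="Y")
>>> idx
DatetimeIndex(['2012-12-31', '2013-12-31', '2014-12-31'],
              dtype='datetime64[ns]', freq='A-DEC')
>>> idx.is_leap_year
array([ True, False, False])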
pandas.Series.dt.daysinmonth
Series.dt.daysinmonth
The number of days in the month.
pandas.Series.dt.days_in_month
Series.dt.days_in_month
The number of days in the month.
pandas.Series.dt.tz
Series.dt.tz
Return timezone, if any.
Returns
datetime.tzinfo, pytz.tzinfo.BaseTZInfo, dateutil.tz.tz.tzfile, or None Returns None
when the array is tz-naive.
pandas.Series.dt.freq
Series.dt.freq
Datetime methods
pandas.Series.dt.to_period
Series.dt.to_period(*args, **kwargs)
Cast to PeriodArray/Index at a particular frequency.
Converts DatetimeArray/Index to PeriodArray/Index.
Parameters
freq [str or Offset, optional] One of pandas’ offset strings or an Offset object. Will be in-
ferred by default.
Returns
PeriodArray/Index
Raises
ValueError When converting a DatetimeArray/Index with non-regular values, so that a fre-
quency cannot be inferred.
See also:
PeriodIndex Immutable ndarray holding ordinal values.
DatetimeIndex.to_pydatetime Return DatetimeIndex as object.
Examples
pandas.Series.dt.to_pydatetime
Series.dt.to_pydatetime()
Return the data as an array of native Python datetime objects.
Timezone information is retained if present.
Warning: Python’s datetime uses microsecond resolution, which is lower than pandas (nanosecond). The
values are truncated.
Returns
numpy.ndarray Object dtype array containing native Python datetime objects.
See also:
datetime.datetime Standard library value for a datetime.
Examples
>>> s.dt.to_pydatetime()
array([datetime.datetime(2018, 3, 10, 0, 0),
datetime.datetime(2018, 3, 11, 0, 0)], dtype=object)
>>> s.dt.to_pydatetime()
array([datetime.datetime(2018, 3, 10, 0, 0),
datetime.datetime(2018, 3, 10, 0, 0)], dtype=object)
pandas.Series.dt.tz_localize
Series.dt.tz_localize(*args, **kwargs)
Localize tz-naive Datetime Array/Index to tz-aware Datetime Array/Index.
This method takes a time zone (tz) naive Datetime Array/Index object and makes it time zone aware. It does not move the time to another time zone. Passing tz=None instead removes the time zone information while preserving the local time, switching from a time zone aware object back to a time zone naive one.
Parameters
tz [str, pytz.timezone, dateutil.tz.tzfile or None] Time zone to convert timestamps to. Passing
None will remove the time zone information preserving local time.
ambiguous [‘infer’, ‘NaT’, bool array, default ‘raise’] When clocks moved backward due to
DST, ambiguous times may arise. For example in Central European Time (UTC+01),
when going from 03:00 DST to 02:00 non-DST, 02:30:00 local time occurs both at
00:30:00 UTC and at 01:30:00 UTC. In such a situation, the ambiguous parameter
dictates how ambiguous times should be handled.
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False signifies a non-DST time (note
that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
nonexistent [‘shift_forward’, ‘shift_backward, ‘NaT’, timedelta, default ‘raise’] A nonex-
istent time does not exist in a particular timezone where clocks moved forward due to
DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing time
• ‘shift_backward’ will shift the nonexistent time backward to the closest existing
time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
Same type as self Array/Index converted to the specified time zone.
Raises
TypeError If the Datetime Array/Index is tz-aware and tz is not None.
See also:
DatetimeIndex.tz_convert Convert tz-aware DatetimeIndex from one time zone to another.
Examples
With the tz=None, we can remove the time zone information while keeping the local time (not converted to
UTC):
>>> tz_aware.tz_localize(None)
DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
'2018-03-03 09:00:00'],
dtype='datetime64[ns]', freq=None)
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous
parameter to set the DST explicitly
If the DST transition causes nonexistent times, you can shift these dates forward or backward with a timedelta object or ‘shift_forward’ or ‘shift_backward’.
pandas.Series.dt.tz_convert
Series.dt.tz_convert(*args, **kwargs)
Convert tz-aware Datetime Array/Index from one time zone to another.
Parameters
tz [str, pytz.timezone, dateutil.tz.tzfile or None] Time zone for time. Corresponding times-
tamps would be converted to this time zone of the Datetime Array/Index. A tz of None
will convert to UTC and remove the timezone information.
Returns
Array or Index
Raises
TypeError If Datetime Array/Index is tz-naive.
See also:
DatetimeIndex.tz A timezone that has a variable offset from UTC.
DatetimeIndex.tz_localize Localize tz-naive DatetimeIndex to a given time zone, or remove time-
zone from a tz-aware DatetimeIndex.
Examples
With the tz parameter, we can change the DatetimeIndex to other time zones:
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
'2014-08-01 10:00:00+02:00',
'2014-08-01 11:00:00+02:00'],
dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert('US/Central')
DatetimeIndex(['2014-08-01 02:00:00-05:00',
'2014-08-01 03:00:00-05:00',
'2014-08-01 04:00:00-05:00'],
dtype='datetime64[ns, US/Central]', freq='H')
With the tz=None, we can remove the timezone (after converting to UTC if necessary):
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
'2014-08-01 10:00:00+02:00',
'2014-08-01 11:00:00+02:00'],
dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert(None)
DatetimeIndex(['2014-08-01 07:00:00',
'2014-08-01 08:00:00',
'2014-08-01 09:00:00'],
dtype='datetime64[ns]', freq='H')
pandas.Series.dt.normalize
Series.dt.normalize(*args, **kwargs)
Convert times to midnight.
The time component of the date-time is converted to midnight, i.e. 00:00:00. This is useful when the time does not matter. The length is unaltered. The time zones are unaffected.
This method is available on Series with datetime values under the .dt accessor, and directly on Datetime
Array/Index.
Returns
DatetimeArray, DatetimeIndex or Series The same type as the original data. Series will
have the same name and index. DatetimeIndex will have the same name.
See also:
floor Floor the datetimes to the specified freq.
ceil Ceil the datetimes to the specified freq.
round Round the datetimes to the specified freq.
Examples
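A minimal sketch, using an illustrative timezone-aware DatetimeIndex:
>>> idx = pd.date_range(start='2014-08-01 10:00', freq='H',
...                     periods=3, tz='Asia/Calcutta')
>>> idx
DatetimeIndex(['2014-08-01 10:00:00+05:30',
               '2014-08-01 11:00:00+05:30',
               '2014-08-01 12:00:00+05:30'],
              dtype='datetime64[ns, Asia/Calcutta]', freq='H')
>>> idx.normalize()
DatetimeIndex(['2014-08-01 00:00:00+05:30',
               '2014-08-01 00:00:00+05:30',
               '2014-08-01 00:00:00+05:30'],
              dtype='datetime64[ns, Asia/Calcutta]', freq=None)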
pandas.Series.dt.strftime
Series.dt.strftime(*args, **kwargs)
Convert to Index using specified date_format.
Return an Index of formatted strings specified by date_format, which supports the same string format as the Python standard library. Details of the string format can be found in the Python string format documentation.
Parameters
date_format [str] Date format string (e.g. “%Y-%m-%d”).
Returns
ndarray NumPy ndarray of formatted strings.
See also:
to_datetime Convert the given argument to datetime.
DatetimeIndex.normalize Return DatetimeIndex with times set to midnight.
DatetimeIndex.round Round the DatetimeIndex to the specified freq.
DatetimeIndex.floor Floor the DatetimeIndex to the specified freq.
Examples
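A minimal sketch, using an illustrative second-frequency range:
>>> rng = pd.date_range(pd.Timestamp("2018-03-10 09:00"),
...                     periods=3, freq='s')
>>> rng.strftime('%B %d, %Y, %r')
Index(['March 10, 2018, 09:00:00 AM', 'March 10, 2018, 09:00:01 AM',
       'March 10, 2018, 09:00:02 AM'],
      dtype='object')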
pandas.Series.dt.round
Series.dt.round(*args, **kwargs)
Perform round operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to round the index to. Must be a fixed frequency
like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq
values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIndex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST time
(note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’] A nonex-
istent time does not exist in a particular timezone where clocks moved forward due to
DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing time
• ‘shift_backward’ will shift the nonexistent time backward to the closest existing
time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a DatetimeIndex or
TimedeltaIndex, or a Series with the same index for a Series.
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
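A sketch of a DatetimeIndex example consistent with the Series output below (assuming a minute-frequency range rng; the setup is illustrative):
>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
>>> rng
DatetimeIndex(['2018-01-01 11:59:00', '2018-01-01 12:00:00',
               '2018-01-01 12:01:00'],
              dtype='datetime64[ns]', freq='T')
>>> rng.round("H")
DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
               '2018-01-01 12:00:00'],
              dtype='datetime64[ns]', freq=None)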
Series
>>> pd.Series(rng).dt.round("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.Series.dt.floor
Series.dt.floor(*args, **kwargs)
Perform floor operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to floor the index to. Must be a fixed frequency like
‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq
values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIndex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST time
(note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’] A nonex-
istent time does not exist in a particular timezone where clocks moved forward due to
DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing time
• ‘shift_backward’ will shift the nonexistent time backward to the closest existing
time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a DatetimeIndex or
TimedeltaIndex, or a Series with the same index for a Series.
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.floor("H")
0 2018-01-01 11:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.Series.dt.ceil
Series.dt.ceil(*args, **kwargs)
Perform ceil operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to ceil the index to. Must be a fixed frequency like
‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible freq
values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIndex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST time
(note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’] A nonex-
istent time does not exist in a particular timezone where clocks moved forward due to
DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing time
• ‘shift_backward’ will shift the nonexistent time backward to the closest existing
time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise a NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a DatetimeIndex or
TimedeltaIndex, or a Series with the same index for a Series.
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.ceil("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 13:00:00
dtype: datetime64[ns]
pandas.Series.dt.month_name
Series.dt.month_name(*args, **kwargs)
Return the month names of the DatetimeIndex with specified locale.
Parameters
locale [str, optional] Locale determining the language in which to return the month name.
Default is English locale.
Returns
Index Index of month names.
Examples
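A minimal sketch, using an illustrative month-end Series:
>>> s = pd.Series(pd.date_range(start='2018-01', freq='M', periods=3))
>>> s
0   2018-01-31
1   2018-02-28
2   2018-03-31
dtype: datetime64[ns]
>>> s.dt.month_name()
0     January
1    February
2       March
dtype: object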
pandas.Series.dt.day_name
Series.dt.day_name(*args, **kwargs)
Return the day names of the DatetimeIndex with specified locale.
Parameters
locale [str, optional] Locale determining the language in which to return the day name. De-
fault is English locale.
Returns
Index Index of day names.
Examples
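A minimal sketch, using an illustrative daily Series:
>>> s = pd.Series(pd.date_range(start='2018-01-01', freq='D', periods=3))
>>> s
0   2018-01-01
1   2018-01-02
2   2018-01-03
dtype: datetime64[ns]
>>> s.dt.day_name()
0       Monday
1      Tuesday
2    Wednesday
dtype: object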
Period properties
Series.dt.qyear
Series.dt.start_time
Series.dt.end_time
pandas.Series.dt.qyear
Series.dt.qyear
pandas.Series.dt.start_time
Series.dt.start_time
pandas.Series.dt.end_time
Series.dt.end_time
Timedelta properties
pandas.Series.dt.days
Series.dt.days
Number of days for each element.
pandas.Series.dt.seconds
Series.dt.seconds
Number of seconds (>= 0 and less than 1 day) for each element.
pandas.Series.dt.microseconds
Series.dt.microseconds
Number of microseconds (>= 0 and less than 1 second) for each element.
pandas.Series.dt.nanoseconds
Series.dt.nanoseconds
Number of nanoseconds (>= 0 and less than 1 microsecond) for each element.
pandas.Series.dt.components
Series.dt.components
Return a DataFrame of the components of the Timedeltas.
Returns
DataFrame
Examples
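A minimal sketch, using an illustrative single-element timedelta Series:
>>> s = pd.Series(pd.to_timedelta(['1 days 02:03:04.000005']))
>>> s.dt.components
   days  hours  minutes  seconds  milliseconds  microseconds  nanoseconds
0     1      2        3        4             0             5            0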
Timedelta methods
pandas.Series.dt.to_pytimedelta
Series.dt.to_pytimedelta()
Return an array of native datetime.timedelta objects.
Python’s standard datetime library uses a different representation for timedeltas. This method converts a Series of pandas Timedeltas to datetime.timedelta format with the same length as the original Series.
Returns
numpy.ndarray 1D array containing data with datetime.timedelta type.
See also:
datetime.timedelta A duration expressing the difference between two date, time, or datetime instances.
Examples
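The examples below assume a Series s of daily timedeltas; a sketch of a setup consistent with the outputs (illustrative):
>>> import numpy as np
>>> s = pd.Series(pd.to_timedelta(np.arange(5), unit='d'))
>>> s
0   0 days
1   1 days
2   2 days
3   3 days
4   4 days
dtype: timedelta64[ns]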
>>> s.dt.to_pytimedelta()
array([datetime.timedelta(0), datetime.timedelta(days=1),
datetime.timedelta(days=2), datetime.timedelta(days=3),
datetime.timedelta(days=4)], dtype=object)
pandas.Series.dt.total_seconds
Series.dt.total_seconds(*args, **kwargs)
Return total duration of each element expressed in seconds.
This method is available directly on TimedeltaArray, TimedeltaIndex and on Series containing timedelta values
under the .dt namespace.
Returns
seconds [ndarray, Float64Index, or Series] When the calling object is a TimedeltaArray, the
return type is ndarray. When the calling object is a TimedeltaIndex, the return type is
a Float64Index. When the calling object is a Series, the return type is Series of type
float64 whose index is the same as the original.
See also:
datetime.timedelta.total_seconds Standard library version of this method.
TimedeltaIndex.components Return a DataFrame with components of each Timedelta.
Examples
Series
>>> s.dt.total_seconds()
0 0.0
1 86400.0
2 172800.0
3 259200.0
4 345600.0
dtype: float64
TimedeltaIndex
>>> idx.total_seconds()
Float64Index([0.0, 86400.0, 172800.0, 259200.00000000003, 345600.0],
dtype='float64')
String handling
Series.str can be used to access the values of the series as strings and apply several methods to it. These can be
accessed like Series.str.<function/property>.
pandas.Series.str.capitalize
Series.str.capitalize()
Convert strings in the Series/Index to be capitalized.
Equivalent to str.capitalize().
Returns
Series or Index of object
See also:
Series.str.lower Converts all characters to lowercase.
Series.str.upper Converts all characters to uppercase.
Series.str.title Converts first character of each word to uppercase and remaining to lowercase.
Series.str.capitalize Converts first character to uppercase and remaining to lowercase.
Series.str.swapcase Converts uppercase to lowercase and lowercase to uppercase.
Series.str.casefold Removes all case distinctions in the string.
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.casefold
Series.str.casefold()
Convert strings in the Series/Index to be casefolded.
New in version 0.25.0.
Equivalent to str.casefold().
Returns
Series or Index of object
See also:
Series.str.lower Converts all characters to lowercase.
Series.str.upper Converts all characters to uppercase.
Series.str.title Converts first character of each word to uppercase and remaining to lowercase.
Series.str.capitalize Converts first character to uppercase and remaining to lowercase.
Series.str.swapcase Converts uppercase to lowercase and lowercase to uppercase.
Series.str.casefold Removes all case distinctions in the string.
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.cat
Examples
When not passing others, all values are concatenated into a single string:
>>> s = pd.Series(['a', 'b', np.nan, 'd'])
>>> s.str.cat(sep=' ')
'a b d'
By default, NA values in the Series are ignored. Using na_rep, they can be given a representation:
>>> s.str.cat(sep=' ', na_rep='?')
'a b ? d'
If others is specified, corresponding values are concatenated with the separator. Result will be a Series of strings.
>>> s.str.cat(['A', 'B', 'C', 'D'], sep=',')
0 a,A
1 b,B
2 NaN
3 d,D
dtype: object
Missing values will remain missing in the result, but can again be represented using na_rep
>>> s.str.cat(['A', 'B', 'C', 'D'], sep=',', na_rep='-')
0 a,A
1 b,B
2 -,C
3 d,D
dtype: object
Series with different indexes can be aligned before concatenation. The join-keyword works as in other methods.
>>> t = pd.Series(['d', 'a', 'e', 'c'], index=[3, 0, 4, 2])
>>> s.str.cat(t, join='left', na_rep='-')
0 aa
1 b-
2 -c
3 dd
dtype: object
>>>
>>> s.str.cat(t, join='outer', na_rep='-')
0 aa
1 b-
2 -c
3 dd
4 -e
dtype: object
>>> s.str.cat(t, join='inner', na_rep='-')
0 aa
2 -c
3 dd
dtype: object
>>>
>>> s.str.cat(t, join='right', na_rep='-')
3 dd
0 aa
4 -e
2 -c
dtype: object
pandas.Series.str.center
pandas.Series.str.contains
Examples
Specifying na to be False instead of NaN replaces NaN values with False. If the Series or Index does not contain NaN values, the resultant dtype will be bool; otherwise, it will be object.
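The examples below use a Series s1 assumed to be defined earlier on the page; a plausible definition consistent with the output (illustrative):
>>> import numpy as np
>>> s1 = pd.Series(['Mouse', 'dog', 'house and parrot', '23', np.nan])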
>>> import re
>>> s1.str.contains('PARROT', flags=re.IGNORECASE, regex=True)
0 False
1 False
2 True
3 False
4 NaN
dtype: object
Ensure pat is not a literal pattern when regex is set to True. Note that in the following example one might expect only s2[1] and s2[3] to return True. However, ‘.0’ as a regex matches any character followed by a 0.
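A sketch of the case described above (the Series s2 is illustrative):
>>> s2 = pd.Series(['40', '40.0', '41', '41.0', '35'])
>>> s2.str.contains('.0', regex=True)
0     True
1     True
2    False
3     True
4    False
dtype: bool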
pandas.Series.str.count
Series.str.count(pat, flags=0)
Count occurrences of pattern in each string of the Series/Index.
This function is used to count the number of times a particular regex pattern is repeated in each of the string
elements of the Series.
Parameters
pat [str] Valid regular expression.
flags [int, default 0, meaning no flags] Flags for the re module. For a complete list, see the re module documentation.
**kwargs For compatibility with other string methods. Not used.
Returns
Series or Index Same type as the calling object containing the integer counts.
See also:
re Standard library module for regular expressions.
str.count Standard library version, without regular expression support.
Notes
Some characters need to be escaped when passing in pat. For example, ‘$’ has a special meaning in regex and must be escaped when finding this literal character.
Examples
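A minimal sketch, counting a literal character (the Series is illustrative):
>>> import numpy as np
>>> s = pd.Series(['A', 'B', 'Aaba', 'Baca', np.nan, 'CABA', 'cat'])
>>> s.str.count('a')
0    0.0
1    0.0
2    2.0
3    2.0
4    NaN
5    0.0
6    1.0
dtype: float64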
pandas.Series.str.decode
Series.str.decode(encoding, errors='strict')
Decode character string in the Series/Index using indicated encoding.
Equivalent to str.decode() in Python 2 and bytes.decode() in Python 3.
Parameters
encoding [str]
errors [str, optional]
Returns
Series or Index
pandas.Series.str.encode
Series.str.encode(encoding, errors='strict')
Encode character string in the Series/Index using indicated encoding.
Equivalent to str.encode().
Parameters
encoding [str]
errors [str, optional]
Returns
encoded [Series/Index of objects]
pandas.Series.str.endswith
Series.str.endswith(pat, na=None)
Test if the end of each string element matches a pattern.
Equivalent to str.endswith().
Parameters
pat [str] Character sequence. Regular expressions are not accepted.
na [object, default NaN] Object shown if element tested is not a string. The default depends
on dtype of the array. For object-dtype, numpy.nan is used. For StringDtype,
pandas.NA is used.
Returns
Series or Index of bool A Series of booleans indicating whether the given pattern matches
the end of each string element.
See also:
str.endswith Python standard library string method.
Series.str.startswith Same as endswith, but tests the start of string.
Series.str.contains Tests if string element contains a pattern.
Examples
>>> s.str.endswith('t')
0 True
1 False
2 False
3 NaN
dtype: object
pandas.Series.str.extract
Examples
A pattern with two groups will return a DataFrame with two columns. Non-matches will be NaN.
>>> s = pd.Series(['a1', 'b2', 'c3'])
>>> s.str.extract(r'([ab])(\d)')
0 1
0 a 1
1 b 2
2 NaN NaN
>>> s.str.extract(r'(?P<letter>[ab])(?P<digit>\d)')
letter digit
0 a 1
1 b 2
2 NaN NaN
A pattern with one group will return a DataFrame with one column if expand=True.
pandas.Series.str.extractall
Series.str.extractall(pat, flags=0)
Extract capture groups in the regex pat as columns in DataFrame.
For each subject string in the Series, extract groups from all matches of regular expression pat. When each
subject string in the Series has exactly one match, extractall(pat).xs(0, level=’match’) is the same as extract(pat).
Parameters
pat [str] Regular expression pattern with capturing groups.
flags [int, default 0 (no flags)] A re module flag, for example re.IGNORECASE. These
allow you to modify regular expression matching for things like case, spaces, etc. Multiple
flags can be combined with the bitwise OR operator, for example re.IGNORECASE
| re.MULTILINE.
Returns
DataFrame A DataFrame with one row for each match, and one column for each group.
Its rows have a MultiIndex with first levels that come from the subject Series.
The last level is named ‘match’ and indexes the matches in each item of the Series.
Any capture group names in regular expression pat will be used for column names;
otherwise capture group numbers will be used.
See also:
extract Returns first match only (not all matches).
Examples
A pattern with one group will return a DataFrame with one column. Indices with no matches will not appear in
the result.
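A sketch of the one-group case, assuming an illustrative Series consistent with the outputs that follow:
>>> s = pd.Series(["a1a2", "b1", "c1"], index=["A", "B", "C"])
>>> s.str.extractall(r"[ab](\d)")
        0
match
A 0     1
  1     2
B 0     1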
Capture group names are used for column names of the result.
>>> s.str.extractall(r"[ab](?P<digit>\d)")
digit
match
A 0 1
1 2
B 0 1
A pattern with two groups will return a DataFrame with two columns.
>>> s.str.extractall(r"(?P<letter>[ab])(?P<digit>\d)")
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
>>> s.str.extractall(r"(?P<letter>[ab])?(?P<digit>\d)")
letter digit
match
A 0 a 1
1 a 2
B 0 b 1
C 0 NaN 1
pandas.Series.str.find
pandas.Series.str.findall
Series.str.findall(pat, flags=0)
Find all occurrences of pattern or regular expression in the Series/Index.
Equivalent to applying re.findall() to all the elements in the Series/Index.
Parameters
pat [str] Pattern or regular expression.
flags [int, default 0] Flags from re module, e.g. re.IGNORECASE (default is 0, which means
no flags).
Returns
Series/Index of lists of strings All non-overlapping matches of pattern or regular expres-
sion in each string of this Series/Index.
See also:
count Count occurrences of pattern or regular expression in each string of the Series/Index.
extractall For each string in the Series, extract groups from all matches of regular expression and return a
DataFrame with one row for each match and one column for each group.
re.findall The equivalent re function to all non-overlapping matches of pattern or regular expression in
string, as a list of strings.
Examples
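The examples below use a Series s assumed to be defined earlier; a definition consistent with the outputs (illustrative):
>>> s = pd.Series(['Lion', 'Monkey', 'Rabbit'])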
>>> s.str.findall('Monkey')
0 []
1 [Monkey]
2 []
dtype: object
On the other hand, the search for the pattern ‘MONKEY’ doesn’t return any match:
>>> s.str.findall('MONKEY')
0 []
1 []
2 []
dtype: object
Flags can be added to the pattern or regular expression. For instance, to find the pattern ‘MONKEY’ ignoring
the case:
>>> import re
>>> s.str.findall('MONKEY', flags=re.IGNORECASE)
0 []
1 [Monkey]
2 []
dtype: object
When the pattern matches more than one string in the Series, all matches are returned:
>>> s.str.findall('on')
0 [on]
1 [on]
2 []
dtype: object
Regular expressions are supported too. For instance, the search for all the strings ending with the word ‘on’ is
shown next:
>>> s.str.findall('on$')
0 [on]
1 []
2 []
dtype: object
If the pattern is found more than once in the same string, then a list of multiple strings is returned:
>>> s.str.findall('b')
0 []
1 []
2 [b, b]
dtype: object
pandas.Series.str.get
Series.str.get(i)
Extract element from each component at specified position.
Extract element from lists, tuples, or strings in each element in the Series/Index.
Parameters
i [int] Position of element to extract.
Returns
Series or Index
Examples
>>> s = pd.Series(["String",
... (1, 2, 3),
... ["a", "b", "c"],
... 123,
... -456,
... {1: "Hello", "2": "World"}])
>>> s
0 String
1 (1, 2, 3)
2 [a, b, c]
3 123
4 -456
5 {1: 'Hello', '2': 'World'}
dtype: object
>>> s.str.get(1)
0 t
1 2
2 b
3 NaN
4 NaN
5 Hello
dtype: object
>>> s.str.get(-1)
0 g
1 3
2 c
3 NaN
4 NaN
5 None
dtype: object
pandas.Series.str.index
pandas.Series.str.join
Series.str.join(sep)
Join lists contained as elements in the Series/Index with passed delimiter.
If the elements of a Series are lists themselves, join the content of these lists using the delimiter passed to the
function. This function is equivalent to str.join().
Parameters
sep [str] Delimiter to use between list entries.
Returns
Series/Index: object The list entries concatenated by intervening occurrences of the delim-
iter.
Raises
AttributeError If the supplied Series contains neither strings nor lists.
See also:
str.join Standard library version of this method.
Series.str.split Split strings around given separator/delimiter.
Notes
If any of the list items is not a string object, the result of the join will be NaN.
Examples
Join all lists using ‘-’. Lists containing objects of types other than str will produce NaN.
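The example assumes a Series of lists in which only the first element contains exclusively strings; an illustrative definition consistent with the output:
>>> import numpy as np
>>> s = pd.Series([['lion', 'elephant', 'zebra'],
...                [1.1, 2.2, 3.3],
...                ['cat', np.nan, 'dog'],
...                ['cow', 4.5, 'goat'],
...                ['duck', ['swan', 'fish'], 'guppy']])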
>>> s.str.join('-')
0 lion-elephant-zebra
1 NaN
2 NaN
3 NaN
4 NaN
dtype: object
pandas.Series.str.len
Series.str.len()
Compute the length of each element in the Series/Index.
The element may be a sequence (such as a string, tuple or list) or a collection (such as a dictionary).
Returns
Series or Index of int A Series or Index of integer values indicating the length of each ele-
ment in the Series or Index.
See also:
str.len Python built-in function returning the length of an object.
Series.size Returns the length of the Series.
Examples
Returns the length (number of characters) in a string. Returns the number of entries for dictionaries, lists or
tuples.
>>> s = pd.Series(['dog',
... '',
... 5,
... {'foo' : 'bar'},
... [2, 3, 5, 7],
... ('one', 'two', 'three')])
>>> s
0 dog
1
2 5
3 {'foo': 'bar'}
4 [2, 3, 5, 7]
5 (one, two, three)
dtype: object
>>> s.str.len()
0 3.0
1 0.0
2 NaN
3 1.0
4 4.0
5 3.0
dtype: float64
pandas.Series.str.ljust
pandas.Series.str.lower
Series.str.lower()
Convert strings in the Series/Index to lowercase.
Equivalent to str.lower().
Returns
Series or Index of object
See also:
Series.str.lower Converts all characters to lowercase.
Series.str.upper Converts all characters to uppercase.
Series.str.title Converts first character of each word to uppercase and remaining to lowercase.
Series.str.capitalize Converts first character to uppercase and remaining to lowercase.
Series.str.swapcase Converts uppercase to lowercase and lowercase to uppercase.
Series.str.casefold Removes all case distinctions in the string.
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.lstrip
Series.str.lstrip(to_strip=None)
Remove leading characters.
Strip whitespaces (including newlines) or a set of specified characters from each string in the Series/Index from
left side. Equivalent to str.lstrip().
Parameters
to_strip [str or None, default None] Specifying the set of characters to be removed. All
combinations of this set of characters will be stripped. If None then whitespaces are
removed.
Returns
Series or Index of object
See also:
Series.str.strip Remove leading and trailing characters in Series/Index.
Series.str.lstrip Remove leading characters in Series/Index.
Series.str.rstrip Remove trailing characters in Series/Index.
Examples
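The examples below use a Series s assumed to be defined earlier; an illustrative definition consistent with the outputs:
>>> import numpy as np
>>> s = pd.Series(['1. Ant.  ', '2. Bee!\n', '3. Cat?\t', np.nan])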
>>> s.str.strip()
0 1. Ant.
1 2. Bee!
2 3. Cat?
3 NaN
dtype: object
>>> s.str.lstrip('123.')
0 Ant.
1 Bee!\n
2 Cat?\t
3 NaN
dtype: object
pandas.Series.str.match
pandas.Series.str.normalize
Series.str.normalize(form)
Return the Unicode normal form for the strings in the Series/Index.
For more information on the forms, see unicodedata.normalize().
Parameters
form [{‘NFC’, ‘NFKC’, ‘NFD’, ‘NFKD’}] Unicode form.
Returns
normalized [Series/Index of objects]
pandas.Series.str.pad
Examples
>>> s.str.pad(width=10)
0 caribou
1 tiger
dtype: object
pandas.Series.str.partition
Examples
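The examples below use a Series s and an Index idx assumed to be defined earlier; illustrative definitions consistent with the outputs:
>>> s = pd.Series(['Linda van der Berg', 'George Pitt-Rivers'])
>>> idx = pd.Index(['X 123', 'Y 999'])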
>>> s.str.partition()
        0  1             2
0   Linda     van der Berg
1  George      Pitt-Rivers
>>> s.str.rpartition()
               0  1            2
0  Linda van der           Berg
1         George    Pitt-Rivers
>>> s.str.partition('-')
                    0  1        2
0  Linda van der Berg
1        George Pitt  -   Rivers
>>> idx.str.partition()
MultiIndex([('X', ' ', '123'),
('Y', ' ', '999')],
)
>>> idx.str.partition(expand=False)
Index([('X', ' ', '123'), ('Y', ' ', '999')], dtype='object')
pandas.Series.str.repeat
Series.str.repeat(repeats)
Duplicate each string in the Series or Index.
Parameters
repeats [int or sequence of int] Same value for all (int) or different value per (sequence).
Returns
Series or Index of object Series or Index of repeated string objects specified by input pa-
rameter repeats.
Examples
>>> s.str.repeat(repeats=2)
0 aa
1 bb
2 cc
dtype: object
pandas.Series.str.replace
Notes
When pat is a compiled regex, all flags should be included in the compiled regex. Use of case, flags, or
regex=False with a compiled regex will raise an error.
Examples
When pat is a string and regex is True (the default), the given pat is compiled as a regex. When repl is a string, it replaces matching regex patterns as with re.sub(); NaN values in the Series are left as is. When pat is a string and regex is False, every pat is replaced with repl as with str.replace(). When repl is a callable, it is called on every pat using re.sub(); the callable should expect one positional argument (a regex match object) and return a string. To get the idea:
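A sketch with an illustrative callable that upper-cases each match (the data is made up):
>>> import numpy as np
>>> pd.Series(['foo', 'fuz', np.nan]).str.replace(
...     'f', lambda m: m.group(0).upper(), regex=True)
0    Foo
1    Fuz
2    NaN
dtype: object
A compiled regex with embedded flags can also be used, as in the following example.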
>>> import re
>>> regex_pat = re.compile(r'FUZ', flags=re.IGNORECASE)
>>> pd.Series(['foo', 'fuz', np.nan]).str.replace(regex_pat, 'bar')
0 foo
1 bar
2 NaN
dtype: object
pandas.Series.str.rfind
pandas.Series.str.rindex
pandas.Series.str.rjust
pandas.Series.str.rpartition
Examples
>>> s.str.partition()
        0  1             2
0   Linda     van der Berg
1  George      Pitt-Rivers
>>> s.str.rpartition()
               0  1            2
0  Linda van der           Berg
1         George    Pitt-Rivers
>>> s.str.partition('-')
                    0  1        2
0  Linda van der Berg
1        George Pitt  -   Rivers
pandas.Series.str.rstrip
Series.str.rstrip(to_strip=None)
Remove trailing characters.
Strip whitespaces (including newlines) or a set of specified characters from each string in the Series/Index from
right side. Equivalent to str.rstrip().
Parameters
to_strip [str or None, default None] Specifying the set of characters to be removed. All
combinations of this set of characters will be stripped. If None then whitespaces are
removed.
Returns
Series or Index of object
See also:
Series.str.strip Remove leading and trailing characters in Series/Index.
Series.str.lstrip Remove leading characters in Series/Index.
Series.str.rstrip Remove trailing characters in Series/Index.
Examples
>>> s.str.strip()
0 1. Ant.
1 2. Bee!
2 3. Cat?
3 NaN
dtype: object
>>> s.str.lstrip('123.')
0 Ant.
1 Bee!\n
2 Cat?\t
3 NaN
dtype: object
pandas.Series.str.slice
Examples
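The examples below use a Series s assumed to be defined earlier; an illustrative definition consistent with the outputs:
>>> s = pd.Series(["koala", "fox", "chameleon"])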
>>> s.str.slice(start=1)
0 oala
1 ox
2 hameleon
dtype: object
>>> s.str.slice(start=-1)
0 a
1 x
2 n
dtype: object
>>> s.str.slice(stop=2)
0 ko
1 fo
2 ch
dtype: object
>>> s.str.slice(step=2)
0 kaa
1 fx
2 caeen
dtype: object
pandas.Series.str.slice_replace
Examples
Specify just start, meaning replace start until the end of the string with repl.
Specify just stop, meaning the start of the string to stop is replaced with repl, and the rest of the string is included.
Specify start and stop, meaning the slice from start to stop is replaced with repl. Everything before or after start
and stop is included as is.
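A minimal sketch of the first and third cases, using an illustrative Series:
>>> s = pd.Series(['a', 'ab', 'abc', 'abdc', 'abcde'])
>>> s.str.slice_replace(1, repl='X')
0    aX
1    aX
2    aX
3    aX
4    aX
dtype: object
>>> s.str.slice_replace(start=1, stop=3, repl='X')
0      aX
1      aX
2      aX
3     aXc
4    aXde
dtype: object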
pandas.Series.str.split
expand [bool, default False] Expand the split strings into separate columns.
• If True, return DataFrame/MultiIndex expanding dimensionality.
• If False, return Series/Index, containing lists of strings.
Returns
Series, Index, DataFrame or MultiIndex Type matches caller unless expand=True (see
Notes).
See also:
Series.str.split Split strings around given separator/delimiter.
Series.str.rsplit Splits string around given separator/delimiter, starting from the right.
Series.str.join Join lists contained as elements in the Series/Index with passed delimiter.
str.split Standard library version for split.
str.rsplit Standard library version for rsplit.
Notes
Examples
>>> s = pd.Series(
... [
... "this is a regular sentence",
... "https://docs.python.org/3/tutorial/index.html",
... np.nan
... ]
... )
>>> s
0 this is a regular sentence
1 https://docs.python.org/3/tutorial/index.html
2 NaN
dtype: object
>>> s.str.split()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
Without the n parameter, the outputs of rsplit and split are identical.
>>> s.str.rsplit()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
The n parameter can be used to limit the number of splits on the delimiter. The outputs of split and rsplit are
different.
>>> s.str.split(n=2)
0 [this, is, a regular sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
>>> s.str.rsplit(n=2)
0 [this is a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
>>> s.str.split(pat="/")
0 [this is a regular sentence]
1 [https:, , docs.python.org, 3, tutorial, index...
2 NaN
dtype: object
When using expand=True, the split elements will expand out into separate columns. If NaN is present, it is
propagated throughout the columns during the split.
>>> s.str.split(expand=True)
0 1 2 3 4
0 this is a regular sentence
1 https://docs.python.org/3/tutorial/index.html None None None None
2 NaN NaN NaN NaN NaN
For slightly more complex use cases like splitting the html document name from a url, a combination of param-
eter settings can be used.
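A sketch of such a combination, splitting on the last '/' only (the parameter values are illustrative):
>>> s.str.rsplit("/", n=1, expand=True)
                                    0           1
0          this is a regular sentence        None
1  https://docs.python.org/3/tutorial  index.html
2                                 NaN         NaN
Remember to escape special characters when explicitly using regular expressions, as in the following example.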
>>> s = pd.Series(["1+1=2"])
>>> s
0 1+1=2
dtype: object
>>> s.str.split(r"\+|=", expand=True)
0 1 2
0 1 1 2
pandas.Series.str.rsplit
Notes
Examples
>>> s = pd.Series(
... [
... "this is a regular sentence",
... "https://docs.python.org/3/tutorial/index.html",
... np.nan
... ]
... )
>>> s
0 this is a regular sentence
1 https://docs.python.org/3/tutorial/index.html
2 NaN
dtype: object
>>> s.str.split()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
Without the n parameter, the outputs of rsplit and split are identical.
>>> s.str.rsplit()
0 [this, is, a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
The n parameter can be used to limit the number of splits on the delimiter. The outputs of split and rsplit are
different.
>>> s.str.split(n=2)
0 [this, is, a regular sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
>>> s.str.rsplit(n=2)
0 [this is a, regular, sentence]
1 [https://docs.python.org/3/tutorial/index.html]
2 NaN
dtype: object
>>> s.str.split(pat="/")
0 [this is a regular sentence]
1 [https:, , docs.python.org, 3, tutorial, index...
2 NaN
dtype: object
When using expand=True, the split elements will expand out into separate columns. If NaN is present, it is
propagated throughout the columns during the split.
>>> s.str.split(expand=True)
0 1 2 3 4
0 this is a regular sentence
1 https://docs.python.org/3/tutorial/index.html None None None None
2 NaN NaN NaN NaN NaN
For slightly more complex use cases, like splitting the html document name from a url, a combination of parameter settings can be used. Remember to escape special characters when explicitly using regular expressions, as in the following example.
>>> s = pd.Series(["1+1=2"])
>>> s
0 1+1=2
dtype: object
>>> s.str.split(r"\+|=", expand=True)
0 1 2
0 1 1 2
pandas.Series.str.startswith
Series.str.startswith(pat, na=None)
Test if the start of each string element matches a pattern.
Equivalent to str.startswith().
Parameters
pat [str] Character sequence. Regular expressions are not accepted.
na [object, default NaN] Object shown if element tested is not a string. The default depends
on dtype of the array. For object-dtype, numpy.nan is used. For StringDtype,
pandas.NA is used.
Returns
Series or Index of bool A Series of booleans indicating whether the given pattern matches
the start of each string element.
See also:
str.startswith Python standard library string method.
Series.str.endswith Same as startswith, but tests the end of string.
Series.str.contains Tests if string element contains a pattern.
Examples
>>> s.str.startswith('b')
0 True
1 False
2 False
3 NaN
dtype: object
pandas.Series.str.strip
Series.str.strip(to_strip=None)
Remove leading and trailing characters.
Strip whitespaces (including newlines) or a set of specified characters from each string in the Series/Index from
left and right sides. Equivalent to str.strip().
Parameters
to_strip [str or None, default None] Specifying the set of characters to be removed. All
combinations of this set of characters will be stripped. If None then whitespaces are
removed.
Returns
Series or Index of object
See also:
Series.str.strip Remove leading and trailing characters in Series/Index.
Series.str.lstrip Remove leading characters in Series/Index.
Series.str.rstrip Remove trailing characters in Series/Index.
Examples
>>> s.str.strip()
0 1. Ant.
1 2. Bee!
2 3. Cat?
3 NaN
dtype: object
>>> s.str.lstrip('123.')
0 Ant.
1 Bee!\n
2 Cat?\t
3 NaN
dtype: object
pandas.Series.str.swapcase
Series.str.swapcase()
Convert strings in the Series/Index to be swapcased.
Equivalent to str.swapcase().
Returns
Series or Index of object
See also:
Series.str.lower Converts all characters to lowercase.
Series.str.upper Converts all characters to uppercase.
Series.str.title Converts first character of each word to uppercase and remaining to lowercase.
Series.str.capitalize Converts first character to uppercase and remaining to lowercase.
Series.str.swapcase Converts uppercase to lowercase and lowercase to uppercase.
Series.str.casefold Removes all case distinctions in the string.
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.title
Series.str.title()
Convert strings in the Series/Index to titlecase.
Equivalent to str.title().
Returns
Series or Index of object
See also:
Series.str.lower Converts all characters to lowercase.
Series.str.upper Converts all characters to uppercase.
Series.str.title Converts first character of each word to uppercase and remaining to lowercase.
Series.str.capitalize Converts first character to uppercase and remaining to lowercase.
Series.str.swapcase Converts uppercase to lowercase and lowercase to uppercase.
Series.str.casefold Removes all case distinctions in the string.
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.translate
Series.str.translate(table)
Map all characters in the string through the given mapping table.
Equivalent to standard str.translate().
Parameters
table [dict] Table is a mapping of Unicode ordinals to Unicode ordinals, strings, or None.
Unmapped characters are left untouched. Characters mapped to None are deleted.
str.maketrans() is a helper function for making translation tables.
Returns
Series or Index
pandas.Series.str.upper
Series.str.upper()
Convert strings in the Series/Index to uppercase.
Equivalent to str.upper().
Returns
Series or Index of object
See also:
Series.str.lower Converts all characters to lowercase.
Series.str.upper Converts all characters to uppercase.
Series.str.title Converts first character of each word to uppercase and remaining to lowercase.
Series.str.capitalize Converts first character to uppercase and remaining to lowercase.
Series.str.swapcase Converts uppercase to lowercase and lowercase to uppercase.
Series.str.casefold Removes all case distinctions in the string.
Examples
>>> s.str.lower()
0 lower
1 capitals
2 this is a sentence
3 swapcase
dtype: object
>>> s.str.upper()
0 LOWER
1 CAPITALS
2 THIS IS A SENTENCE
3 SWAPCASE
dtype: object
>>> s.str.title()
0 Lower
1 Capitals
2 This Is A Sentence
3 Swapcase
dtype: object
>>> s.str.capitalize()
0 Lower
1 Capitals
2 This is a sentence
3 Swapcase
dtype: object
>>> s.str.swapcase()
0 LOWER
1 capitals
2 THIS IS A SENTENCE
3 sWaPcAsE
dtype: object
pandas.Series.str.wrap
Series.str.wrap(width, **kwargs)
Wrap strings in Series/Index at specified line width.
This method has the same keyword parameters and defaults as textwrap.TextWrapper.
Parameters
width [int] Maximum line width.
expand_tabs [bool, optional] If True, tab characters will be expanded to spaces (default:
True).
replace_whitespace [bool, optional] If True, each whitespace character (as defined by
string.whitespace) remaining after tab expansion will be replaced by a single space
(default: True).
drop_whitespace [bool, optional] If True, whitespace that, after wrapping, happens to end
up at the beginning or end of a line is dropped (default: True).
break_long_words [bool, optional] If True, then words longer than width will be broken in
order to ensure that no lines are longer than width. If it is false, long words will not be
broken, and some lines may be longer than width (default: True).
break_on_hyphens [bool, optional] If True, wrapping will occur preferably on whitespace
and right after hyphens in compound words, as it is customary in English. If false, only
whitespaces will be considered as potentially good places for line breaks, but you need
to set break_long_words to false if you want truly insecable words (default: True).
Returns
Series or Index
Notes
Internally, this method uses a textwrap.TextWrapper instance with default settings. To achieve behavior
matching R’s stringr library str_wrap function, use the arguments:
• expand_tabs = False
• replace_whitespace = True
• drop_whitespace = True
• break_long_words = False
• break_on_hyphens = False
Examples
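A minimal sketch, using an illustrative Series (the result is shown as a list so the inserted newlines are visible):
>>> s = pd.Series(['line to be wrapped', 'another line to be wrapped'])
>>> s.str.wrap(12).tolist()
['line to be\nwrapped', 'another line\nto be\nwrapped']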
pandas.Series.str.zfill
Series.str.zfill(width)
Pad strings in the Series/Index by prepending ‘0’ characters.
Strings in the Series/Index are padded with ‘0’ characters on the left of the string to reach a total string length
width. Strings in the Series/Index with length greater or equal to width are unchanged.
Parameters
width [int] Minimum length of resulting string; strings with length less than width will be
prepended with ‘0’ characters.
Returns
Series/Index of objects.
See also:
Series.str.rjust Fills the left side of strings with an arbitrary character.
Series.str.ljust Fills the right side of strings with an arbitrary character.
Series.str.pad Fills the specified sides of strings with an arbitrary character.
Series.str.center Fills both sides of strings with an arbitrary character.
Notes
Differs from str.zfill() which has special handling for ‘+’/’-‘ in the string.
Examples
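The example below uses a Series s assumed to be defined earlier; an illustrative definition consistent with the output and the note that follows:
>>> import numpy as np
>>> s = pd.Series(['-1', '1', '1000', 10, np.nan])
>>> s
0      -1
1       1
2    1000
3      10
4     NaN
dtype: object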
Note that 10 and NaN are not strings, therefore they are converted to NaN. The minus sign in '-1' is treated
as a regular character and the zero is added to the left of it (str.zfill() would have moved it to the left).
1000 remains unchanged as it is longer than width.
>>> s.str.zfill(3)
0 0-1
1 001
2 1000
3 NaN
4 NaN
dtype: object
pandas.Series.str.isalnum
Series.str.isalnum()
Check whether all characters in each string are alphanumeric.
This is equivalent to running the Python string method str.isalnum() for each element of the Series/Index.
If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
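The shared examples in this and the following str.is* sections use Series s1, s3 and s5 assumed to be defined earlier; illustrative definitions consistent with the outputs:
>>> s1 = pd.Series(['one', 'one1', '1', ''])
>>> s3 = pd.Series(['23', '³', '⅕', ''])
>>> s5 = pd.Series(['leopard', 'Golden Eagle', 'SNAKE', ''])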
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isalpha
Series.str.isalpha()
Check whether all characters in each string are alphabetic.
This is equivalent to running the Python string method str.isalpha() for each element of the Series/Index.
If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0 False
1 False
2 False
dtype: bool
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isdigit
Series.str.isdigit()
Check whether all characters in each string are digits.
This is equivalent to running the Python string method str.isdigit() for each element of the Series/Index.
If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0 False
1 False
2 False
dtype: bool
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isspace
Series.str.isspace()
Check whether all characters in each string are whitespace.
This is equivalent to running the Python string method str.isspace() for each element of the Series/Index.
If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0 False
1 False
2 False
dtype: bool
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.islower
Series.str.islower()
Check whether all characters in each string are lowercase.
This is equivalent to running the Python string method str.islower() for each element of the Series/Index.
If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0 False
1 False
2 False
dtype: bool
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isupper
Series.str.isupper()
Check whether all characters in each string are uppercase.
This is equivalent to running the Python string method str.isupper() for each element of the Series/Index.
If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0 False
1 False
2 False
dtype: bool
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.istitle
Series.str.istitle()
Check whether all characters in each string are titlecase.
This is equivalent to running the Python string method str.istitle() for each element of the Series/Index.
If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0 False
1 False
2 False
dtype: bool
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (only the first letter of each word is capitalized). Words are assumed to be any sequence of non-numeric characters separated by whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isnumeric
Series.str.isnumeric()
Check whether all characters in each string are numeric.
This is equivalent to running the Python string method str.isnumeric() for each element of the Se-
ries/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0 False
1 False
2 False
dtype: bool
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s3.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s3.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (whether only the first letter of
each word is capitalized). Words are taken to be any sequence of non-numeric characters separated by
whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.isdecimal
Series.str.isdecimal()
Check whether all characters in each string are decimal.
This is equivalent to running the Python string method str.isdecimal() for each element of the Se-
ries/Index. If a string has zero characters, False is returned for that check.
Returns
Series or Index of bool Series or Index of boolean values with the same length as the origi-
nal Series/Index.
See also:
Series.str.isalpha Check whether all characters are alphabetic.
Series.str.isnumeric Check whether all characters are numeric.
Series.str.isalnum Check whether all characters are alphanumeric.
Series.str.isdigit Check whether all characters are digits.
Series.str.isdecimal Check whether all characters are decimal.
Series.str.isspace Check whether all characters are whitespace.
Series.str.islower Check whether all characters are lowercase.
Series.str.isupper Check whether all characters are uppercase.
Series.str.istitle Check whether all characters are titlecase.
Examples
>>> s1.str.isalpha()
0 True
1 False
2 False
3 False
dtype: bool
>>> s1.str.isnumeric()
0 False
1 False
2 True
3 False
dtype: bool
>>> s1.str.isalnum()
0 True
1 True
2 True
3 False
dtype: bool
Note that checks against characters mixed with any additional punctuation or whitespace will evaluate to false
for an alphanumeric check.
>>> s2 = pd.Series(['A B', '1.5', '3,000'])
>>> s2.str.isalnum()
0 False
1 False
2 False
dtype: bool
The s3.str.isdecimal method checks for characters used to form numbers in base 10.
>>> s3.str.isdecimal()
0 True
1 False
2 False
3 False
dtype: bool
The s3.str.isdigit method is the same as s3.str.isdecimal but also includes special digits, like
superscripted and subscripted digits in unicode.
>>> s3.str.isdigit()
0 True
1 True
2 False
3 False
dtype: bool
The s3.str.isnumeric method is the same as s3.str.isdigit but also includes other characters that
can represent quantities such as unicode fractions.
>>> s3.str.isnumeric()
0 True
1 True
2 True
3 False
dtype: bool
>>> s5.str.islower()
0 True
1 False
2 False
3 False
dtype: bool
>>> s5.str.isupper()
0 False
1 False
2 True
3 False
dtype: bool
The s5.str.istitle method checks whether all words are in title case (whether only the first letter of
each word is capitalized). Words are taken to be any sequence of non-numeric characters separated by
whitespace characters.
>>> s5.str.istitle()
0 False
1 True
2 False
3 False
dtype: bool
pandas.Series.str.get_dummies
Series.str.get_dummies(sep='|')
Return DataFrame of dummy/indicator variables for Series.
Each string in Series is split by sep and returned as a DataFrame of dummy/indicator variables.
Parameters
sep [str, default “|”] String to split on.
Returns
DataFrame Dummy variables corresponding to values of the Series.
See also:
get_dummies Convert categorical variable into dummy/indicator variables.
Examples
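A minimal sketch of the behaviour, using an illustrative input series:
>>> pd.Series(['a|b', 'a', 'a|c']).str.get_dummies()
   a  b  c
0  1  1  0
1  1  0  0
2  1  0  1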
Categorical accessor
Categorical-dtype specific methods and attributes are available under the Series.cat accessor.
pandas.Series.cat.categories
Series.cat.categories
The categories of this categorical.
Setting assigns new values to each category (effectively a rename of each individual category).
The assigned value has to be a list-like object. All items must be unique and the number of items in the new
categories must be the same as the number of items in the old categories.
Assigning to categories is an inplace operation!
Raises
ValueError If the new categories do not validate as categories, or if the number of new
categories is unequal to the number of old categories.
See also:
rename_categories Rename categories.
reorder_categories Reorder categories.
add_categories Add new categories.
remove_categories Remove the specified categories.
remove_unused_categories Remove categories which are not used.
set_categories Set the categories to the specified ones.
pandas.Series.cat.ordered
Series.cat.ordered
Whether the categories have an ordered relationship.
pandas.Series.cat.codes
Series.cat.codes
Return Series of codes as well as the index.
pandas.Series.cat.rename_categories
Series.cat.rename_categories(*args, **kwargs)
Rename categories.
Parameters
new_categories [list-like, dict-like or callable] New categories which will replace old cate-
gories.
• list-like: all items must be unique and the number of items in the new categories
must match the existing number of categories.
• dict-like: specifies a mapping from old categories to new. Categories not contained
in the mapping are passed through and extra categories in the mapping are ignored.
• callable : a callable that is called on all items in the old categories and whose return
values comprise the new categories.
inplace [bool, default False] Whether or not to rename the categories inplace or return a
copy of this categorical with renamed categories.
Returns
cat [Categorical or None] Categorical with renamed categories or None if inplace=True.
Raises
ValueError If new categories are list-like and do not have the same number of items as
the current categories, or do not validate as categories.
See also:
reorder_categories Reorder categories.
add_categories Add new categories.
remove_categories Remove the specified categories.
remove_unused_categories Remove categories which are not used.
set_categories Set the categories to the specified ones.
Examples
For dict-like new_categories, extra keys are ignored and categories not in the dictionary are passed through:
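A short sketch of that dict-like behaviour (the series c below is constructed for illustration):
>>> c = pd.Series(['a', 'a', 'b'], dtype='category')
>>> c.cat.rename_categories({'a': 'A', 'c': 'C'}).cat.categories
Index(['A', 'b'], dtype='object')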
pandas.Series.cat.reorder_categories
Series.cat.reorder_categories(*args, **kwargs)
Reorder categories as specified in new_categories.
new_categories need to include all old categories and no new category items.
Parameters
new_categories [Index-like] The categories in new order.
ordered [bool, optional] Whether or not the categorical is treated as an ordered categorical.
If not given, do not change the ordered information.
inplace [bool, default False] Whether or not to reorder the categories inplace or return a copy
of this categorical with reordered categories.
Returns
cat [Categorical or None] Categorical with reordered categories or None if inplace=True.
Raises
ValueError If the new categories do not contain all old category items or any new ones
See also:
rename_categories Rename categories.
add_categories Add new categories.
remove_categories Remove the specified categories.
remove_unused_categories Remove categories which are not used.
set_categories Set the categories to the specified ones.
pandas.Series.cat.add_categories
Series.cat.add_categories(*args, **kwargs)
Add new categories.
new_categories will be included at the last/highest place in the categories and will be unused directly after this
call.
Parameters
new_categories [category or list-like of category] The new categories to be included.
inplace [bool, default False] Whether or not to add the categories inplace or return a copy of
this categorical with added categories.
Returns
cat [Categorical or None] Categorical with new categories added or None if
inplace=True.
Raises
ValueError If the new categories include old categories or do not validate as categories
See also:
rename_categories Rename categories.
pandas.Series.cat.remove_categories
Series.cat.remove_categories(*args, **kwargs)
Remove the specified categories.
removals must be included in the old categories. Values which were in the removed categories will be set to
NaN.
Parameters
removals [category or list of categories] The categories which should be removed.
inplace [bool, default False] Whether or not to remove the categories inplace or return a
copy of this categorical with removed categories.
Returns
cat [Categorical or None] Categorical with removed categories or None if inplace=True.
Raises
ValueError If the removals are not contained in the categories
See also:
rename_categories Rename categories.
reorder_categories Reorder categories.
add_categories Add new categories.
remove_unused_categories Remove categories which are not used.
set_categories Set the categories to the specified ones.
pandas.Series.cat.remove_unused_categories
Series.cat.remove_unused_categories(*args, **kwargs)
Remove categories which are not used.
Parameters
inplace [bool, default False] Whether or not to drop unused categories inplace or return a
copy of this categorical with unused categories dropped.
Deprecated since version 1.2.0.
Returns
cat [Categorical or None] Categorical with unused categories dropped or None if
inplace=True.
See also:
rename_categories Rename categories.
reorder_categories Reorder categories.
add_categories Add new categories.
remove_categories Remove the specified categories.
set_categories Set the categories to the specified ones.
pandas.Series.cat.set_categories
Series.cat.set_categories(*args, **kwargs)
Set the categories to the specified new_categories.
new_categories can include new categories (which will result in unused categories) or remove old categories
(which results in values set to NaN). If rename==True, the categories will simply be renamed (fewer or more
items than in the old categories will result in values set to NaN or in unused categories, respectively).
This method can be used to perform more than one action of adding, removing, and reordering simultaneously
and is therefore faster than performing the individual steps via the more specialised methods.
On the other hand, this method does not do any checks (e.g., whether the old categories are included in the new
categories on a reorder), which can result in surprising changes, for example when using special string dtypes,
which do not consider an S1 string equal to a single-character Python string.
Parameters
new_categories [Index-like] The categories in new order.
ordered [bool, default False] Whether or not the categorical is treated as an ordered categorical.
If not given, do not change the ordered information.
rename [bool, default False] Whether or not the new_categories should be considered as a
rename of the old categories or as reordered categories.
inplace [bool, default False] Whether or not to reorder the categories in-place or return a
copy of this categorical with reordered categories.
Returns
Categorical with reordered categories or None if inplace=True.
Raises
ValueError If new_categories does not validate as categories
See also:
rename_categories Rename categories.
reorder_categories Reorder categories.
add_categories Add new categories.
remove_categories Remove the specified categories.
remove_unused_categories Remove categories which are not used.
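A minimal sketch of the behaviour described above (the series s is illustrative; 'a' is dropped from the categories and its values become NaN):
>>> s = pd.Series(['a', 'b', 'c'], dtype='category')
>>> s.cat.set_categories(['b', 'c', 'd']).cat.categories
Index(['b', 'c', 'd'], dtype='object')
>>> s.cat.set_categories(['b', 'c', 'd']).isna()
0     True
1    False
2    False
dtype: bool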
pandas.Series.cat.as_ordered
Series.cat.as_ordered(*args, **kwargs)
Set the Categorical to be ordered.
Parameters
inplace [bool, default False] Whether or not to set the ordered attribute in-place or return a
copy of this categorical with ordered set to True.
Returns
Categorical or None Ordered Categorical or None if inplace=True.
pandas.Series.cat.as_unordered
Series.cat.as_unordered(*args, **kwargs)
Set the Categorical to be unordered.
Parameters
inplace [bool, default False] Whether or not to set the ordered attribute in-place or return a
copy of this categorical with ordered set to False.
Returns
Categorical or None Unordered Categorical or None if inplace=True.
Sparse accessor
Sparse-dtype specific methods and attributes are provided under the Series.sparse accessor.
pandas.Series.sparse.npoints
Series.sparse.npoints
The number of non-fill_value points.
Examples
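A minimal sketch (integer sparse data uses a default fill_value of 0, so only the two non-zero entries are stored):
>>> s = pd.Series([0, 0, 1, 0, 2], dtype="Sparse[int]")
>>> s.sparse.npoints
2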
pandas.Series.sparse.density
Series.sparse.density
The percent of non-fill_value points, as a decimal.
Examples
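A minimal sketch with the same illustrative series: two stored points out of five gives a density of 0.4.
>>> s = pd.Series([0, 0, 1, 0, 2], dtype="Sparse[int]")
>>> s.sparse.density
0.4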
pandas.Series.sparse.fill_value
Series.sparse.fill_value
Elements in data that are fill_value are not stored.
For memory savings, this should be the most common value in the array.
pandas.Series.sparse.sp_values
Series.sparse.sp_values
An ndarray containing the non-fill_value values.
Examples
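A minimal sketch with the same illustrative series; only the non-fill_value entries are returned.
>>> s = pd.Series([0, 0, 1, 0, 2], dtype="Sparse[int]")
>>> s.sparse.sp_values
array([1, 2])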
pandas.Series.sparse.from_coo
Examples
>>> from scipy import sparse
>>> A = sparse.coo_matrix(
... ([3.0, 1.0, 2.0], ([1, 0, 0], [0, 2, 3])), shape=(3, 4)
... )
>>> A
<3x4 sparse matrix of type '<class 'numpy.float64'>'
with 3 stored elements in COOrdinate format>
>>> A.todense()
matrix([[0., 0., 1., 2.],
[3., 0., 0., 0.],
[0., 0., 0., 0.]])
>>> ss = pd.Series.sparse.from_coo(A)
>>> ss
0 2 1.0
3 2.0
1 0 3.0
dtype: Sparse[float64, nan]
pandas.Series.sparse.to_coo
Examples
>>> ss = s.astype("Sparse")
>>> ss
A B C D
1 2 a 0 3.0
1 NaN
1 b 0 1.0
1 3.0
2 1 b 0 NaN
1 NaN
dtype: Sparse[float64, nan]
>>> rows
[(1, 1), (1, 2), (2, 1)]
>>> columns
[('a', 0), ('a', 1), ('b', 0), ('b', 1)]
Flags
Flags refer to attributes of the pandas object. Properties of the dataset (like the date it was recorded, the URL it was
accessed from, etc.) should be stored in Series.attrs.
pandas.Flags
Notes
>>> df = pd.DataFrame()
>>> df.flags
<Flags(allows_duplicate_labels=True)>
>>> df.flags.allows_duplicate_labels = False
>>> df.flags
<Flags(allows_duplicate_labels=False)>
Attributes
pandas.Flags.allows_duplicate_labels
property Flags.allows_duplicate_labels
Whether this object allows duplicate labels.
Setting allows_duplicate_labels=False ensures that the index (and columns of a DataFrame)
are unique. Most methods that accept and return a Series or DataFrame will propagate the value of
allows_duplicate_labels.
See Duplicate Labels for more.
See also:
Examples
Metadata
3.3.14 Plotting
Series.plot is both a callable method and a namespace attribute for specific plotting methods of the form
Series.plot.<kind>.
pandas.Series.plot.area
Examples
>>> df = pd.DataFrame({
... 'sales': [3, 2, 3, 9, 10, 6],
... 'signups': [5, 5, 6, 12, 14, 13],
... 'visits': [20, 42, 28, 62, 81, 50],
... }, index=pd.date_range(start='2018/01/01', end='2018/07/01',
... freq='M'))
>>> ax = df.plot.area()
Area plots are stacked by default. To produce an unstacked plot, pass stacked=False:
>>> ax = df.plot.area(stacked=False)
>>> ax = df.plot.area(y='sales')
>>> df = pd.DataFrame({
... 'sales': [3, 2, 3],
... 'visits': [20, 42, 28],
... 'day': [1, 2, 3],
... })
>>> ax = df.plot.area(x='day')
pandas.Series.plot.bar
Examples
Basic plot.
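A minimal sketch of the basic call, with illustrative data:
>>> df = pd.DataFrame({'lab': ['A', 'B', 'C'], 'val': [10, 30, 20]})
>>> ax = df.plot.bar(x='lab', y='val', rot=0)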
Plot a whole dataframe to a bar plot. Each column is assigned a distinct color, and each row is nested in a group
along the horizontal axis.
>>> ax = df.plot.bar(stacked=True)
Instead of nesting, the figure can be split by column with subplots=True. In this case, a numpy.ndarray
of matplotlib.axes.Axes is returned.
If you don’t like the default colours, you can specify how you’d like each column to be colored.
pandas.Series.plot.barh
• A single color string referred to by name, RGB or RGBA code, for instance
‘red’ or ‘#a98d19’.
• A sequence of color strings referred to by name, RGB or RGBA code, which
will be used for each column recursively. For instance, with [‘green’, ‘yellow’] each
column’s bar will be filled in green or yellow, alternately.
• A dict of the form {column name: color}, so that each column will be colored
accordingly. For example, if your columns are called a and b, then passing
{‘a’: ‘green’, ‘b’: ‘red’} will color bars for column a in green and bars for
column b in red.
New in version 1.1.0.
**kwargs Additional keyword arguments are documented in DataFrame.plot().
Returns
matplotlib.axes.Axes or np.ndarray of them An ndarray is returned with one
matplotlib.axes.Axes per column when subplots=True.
See also:
DataFrame.plot.bar Vertical bar plot.
DataFrame.plot Make plots of DataFrame using matplotlib.
matplotlib.axes.Axes.bar Plot a vertical bar plot using matplotlib.
Examples
Basic example
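A minimal sketch of the basic call, with illustrative data (the stacked call below assumes a multi-column frame):
>>> df = pd.DataFrame({'lab': ['A', 'B', 'C'], 'val': [10, 30, 20]})
>>> ax = df.plot.barh(x='lab', y='val')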
>>> ax = df.plot.barh(stacked=True)
pandas.Series.plot.box
Series.plot.box(by=None, **kwargs)
Make a box plot of the DataFrame columns.
A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box
extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from
the edges of the box to show the range of the data. The position of the whiskers is set by default to 1.5*IQR (IQR
= Q3 - Q1) from the edges of the box. Outlier points are those past the end of the whiskers.
For further details see Wikipedia’s entry for boxplot.
A consideration when using this chart is that the box and the whiskers can overlap, which is very common when
plotting small sets of data.
Parameters
by [str or sequence] Column in the DataFrame to group by.
**kwargs Additional keywords are documented in DataFrame.plot().
Returns
matplotlib.axes.Axes or numpy.ndarray of them
See also:
DataFrame.boxplot Another method to draw a box plot.
Series.plot.box Draw a box plot from a Series object.
matplotlib.pyplot.boxplot Draw a box plot in matplotlib.
Examples
Draw a box plot from a DataFrame with four columns of randomly generated data.
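A minimal sketch of such a plot, using randomly generated data:
>>> data = np.random.randn(25, 4)
>>> df = pd.DataFrame(data, columns=list('ABCD'))
>>> ax = df.plot.box()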
pandas.Series.plot.density
Returns
matplotlib.axes.Axes or numpy.ndarray of them
See also:
scipy.stats.gaussian_kde Representation of a kernel-density estimate using Gaussian kernels. This
is the function used internally to estimate the PDF.
Examples
Given a Series of points randomly sampled from an unknown distribution, estimate its PDF using KDE with
automatic bandwidth determination and plot the results, evaluating them at 1000 equally spaced points (default):
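The calls below assume a small illustrative series such as:
>>> s = pd.Series([1, 2, 2.5, 3, 3.5, 4, 5])
>>> ax = s.plot.kde()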
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large
bandwidth value may result in under-fitting:
>>> ax = s.plot.kde(bw_method=0.3)
>>> ax = s.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
>>> df = pd.DataFrame({
... 'x': [1, 2, 2.5, 3, 3.5, 4, 5],
... 'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> ax = df.plot.kde()
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large
bandwidth value may result in under-fitting:
>>> ax = df.plot.kde(bw_method=0.3)
>>> ax = df.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
pandas.Series.plot.hist
Examples
When we roll a die 6000 times, we expect each value to occur around 1000 times. But when we roll two dice
and sum the result, the distribution is going to be quite different. A histogram illustrates those distributions.
>>> df = pd.DataFrame(
... np.random.randint(1, 7, 6000),
... columns = ['one'])
>>> df['two'] = df['one'] + np.random.randint(1, 7, 6000)
>>> ax = df.plot.hist(bins=12, alpha=0.5)
pandas.Series.plot.kde
Examples
Given a Series of points randomly sampled from an unknown distribution, estimate its PDF using KDE with
automatic bandwidth determination and plot the results, evaluating them at 1000 equally spaced points (default):
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large
bandwidth value may result in under-fitting:
>>> ax = s.plot.kde(bw_method=0.3)
>>> ax = s.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
>>> df = pd.DataFrame({
... 'x': [1, 2, 2.5, 3, 3.5, 4, 5],
... 'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> ax = df.plot.kde()
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large
bandwidth value may result in under-fitting:
>>> ax = df.plot.kde(bw_method=0.3)
>>> ax = df.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
pandas.Series.plot.line
Examples
The following example shows the populations for some animals over the years.
>>> df = pd.DataFrame({
... 'pig': [20, 18, 489, 675, 1776],
... 'horse': [4, 25, 281, 600, 1900]
... }, index=[1990, 1997, 2003, 2009, 2014])
>>> lines = df.plot.line()
Let’s repeat the same example, but specifying colors for each column (in this case, for each animal).
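A sketch of one way to do this, assuming the dict form of the color keyword (available for pandas plots since 1.1.0) and illustrative color names:
>>> lines = df.plot.line(color={'pig': 'pink', 'horse': 'brown'})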
pandas.Series.plot.pie
Series.plot.pie(**kwargs)
Generate a pie plot.
A pie plot is a proportional representation of the numerical data in a column. This function wraps
matplotlib.pyplot.pie() for the specified column. If no column reference is passed and
subplots=True a pie plot is drawn for each numerical column independently.
Parameters
y [int or label, optional] Label or position of the column to plot. If not provided,
subplots=True argument must be passed.
**kwargs Keyword arguments to pass on to DataFrame.plot().
Returns
matplotlib.axes.Axes or np.ndarray of them A NumPy array is returned when subplots is
True.
See also:
Series.plot.pie Generate a pie plot for a Series.
DataFrame.plot Make plots of a DataFrame.
Examples
In the example below we have a DataFrame with information about the planets’ mass and radius. We pass the
‘mass’ column to the pie function to get a pie plot.
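A minimal sketch with illustrative planetary data:
>>> df = pd.DataFrame({'mass': [0.330, 4.87, 5.97],
...                    'radius': [2439.7, 6051.8, 6378.1]},
...                   index=['Mercury', 'Venus', 'Earth'])
>>> plot = df.plot.pie(y='mass', figsize=(5, 5))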
Series.hist([by, ax, grid, xlabelsize, ...]) Draw histogram of the input series using matplotlib.
3.4 DataFrame
3.4.1 Constructor
pandas.DataFrame
dtype [dtype, default None] Data type to force. Only a single dtype is allowed. If None,
infer.
copy [bool, default False] Copy data from inputs. Only affects DataFrame / 2d ndarray input.
See also:
DataFrame.from_records Constructor from tuples, also record arrays.
DataFrame.from_dict From dicts of Series, arrays, or dicts.
read_csv Read a comma-separated values (csv) file into DataFrame.
read_table Read general delimited file into DataFrame.
read_clipboard Read text from clipboard into DataFrame.
Examples
>>> df.dtypes
col1 int64
col2 int64
dtype: object
Attributes
pandas.DataFrame.at
property DataFrame.at
Access a single value for a row/column label pair.
Similar to loc, in that both provide label-based lookups. Use at if you only need to get or set a single
value in a DataFrame or Series.
Raises
KeyError If ‘label’ does not exist in DataFrame.
See also:
Examples
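The calls below assume a frame like the following (reconstructed for context, so that df.loc[5].at['B'] returns 4):
>>> df = pd.DataFrame([[0, 2, 3], [0, 4, 1], [10, 20, 30]],
...                   index=[4, 5, 6], columns=['A', 'B', 'C'])
>>> df.at[4, 'B']
2
>>> df.at[4, 'B'] = 10
>>> df.at[4, 'B']
10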
>>> df.loc[5].at['B']
4
pandas.DataFrame.attrs
property DataFrame.attrs
Dictionary of global attributes of this dataset.
See also:
pandas.DataFrame.axes
property DataFrame.axes
Return a list representing the axes of the DataFrame.
It has the row axis labels and column axis labels as the only members. They are returned in that order.
Examples
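A minimal sketch with an illustrative two-column frame:
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.axes
[RangeIndex(start=0, stop=2, step=1), Index(['col1', 'col2'],
dtype='object')]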
pandas.DataFrame.columns
DataFrame.columns: pandas.core.indexes.base.Index
The column labels of the DataFrame.
pandas.DataFrame.dtypes
property DataFrame.dtypes
Return the dtypes in the DataFrame.
This returns a Series with the data type of each column. The result’s index is the original DataFrame’s
columns. Columns with mixed types are stored with the object dtype. See the User Guide for more.
Returns
pandas.Series The data type of each column.
Examples
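A minimal sketch with one column per common dtype:
>>> df = pd.DataFrame({'float': [1.0],
...                    'int': [1],
...                    'datetime': [pd.Timestamp('20180310')],
...                    'string': ['foo']})
>>> df.dtypes
float              float64
int                  int64
datetime    datetime64[ns]
string              object
dtype: object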
pandas.DataFrame.empty
property DataFrame.empty
Indicator whether DataFrame is empty.
True if DataFrame is entirely empty (no items), meaning any of the axes are of length 0.
Returns
bool If DataFrame is empty, return True, if not return False.
See also:
Notes
If DataFrame contains only NaNs, it is still not considered empty. See the example below.
Examples
If we only have NaNs in our DataFrame, it is not considered empty! We will need to drop the NaNs to
make the DataFrame empty:
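A minimal sketch of both cases (illustrative frames):
>>> df_empty = pd.DataFrame({'A': []})
>>> df_empty.empty
True
>>> df = pd.DataFrame({'A': [np.nan]})
>>> df.empty
False
>>> df.dropna().empty
True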
pandas.DataFrame.flags
property DataFrame.flags
Get the properties associated with this pandas object.
The available flags are
• Flags.allows_duplicate_labels
See also:
Notes
“Flags” differ from “metadata”. Flags reflect properties of the pandas object (the Series or DataFrame).
Metadata refer to properties of the dataset, and should be stored in DataFrame.attrs.
Examples
>>> df.flags.allows_duplicate_labels
True
>>> df.flags.allows_duplicate_labels = False
>>> df.flags["allows_duplicate_labels"]
False
>>> df.flags["allows_duplicate_labels"] = True
pandas.DataFrame.iat
property DataFrame.iat
Access a single value for a row/column pair by integer position.
Similar to iloc, in that both provide integer-based lookups. Use iat if you only need to get or set a
single value in a DataFrame or Series.
Raises
IndexError When integer position is out of bounds.
See also:
Examples
>>> df.iat[1, 2]
1
>>> df.iat[1, 2] = 10
>>> df.iat[1, 2]
10
>>> df.loc[0].iat[1]
2
pandas.DataFrame.iloc
property DataFrame.iloc
Purely integer-location based indexing for selection by position.
.iloc[] is primarily integer position based (from 0 to length-1 of the axis), but may also be used
with a boolean array.
Allowed inputs are:
• An integer, e.g. 5.
• A list or array of integers, e.g. [4, 3, 0].
• A slice object with ints, e.g. 1:7.
• A boolean array.
• A callable function with one argument (the calling Series or DataFrame) and that returns valid
output for indexing (one of the above). This is useful in method chains, when you don’t have a
reference to the calling object, but would like to base your selection on some value.
.iloc will raise IndexError if a requested indexer is out-of-bounds, except slice indexers which
allow out-of-bounds indexing (this conforms with python/numpy slice semantics).
See more at Selection by Position.
See also:
Examples
>>> type(df.iloc[0])
<class 'pandas.core.series.Series'>
>>> df.iloc[0]
a 1
b 2
c 3
d 4
Name: 0, dtype: int64
>>> df.iloc[[0]]
a b c d
0 1 2 3 4
>>> type(df.iloc[[0]])
<class 'pandas.core.frame.DataFrame'>
>>> df.iloc[:3]
a b c d
0 1 2 3 4
1 100 200 300 400
2 1000 2000 3000 4000
With a callable, useful in method chains. The x passed to the lambda is the DataFrame being sliced.
This selects the rows whose index label is even.
>>> df.iloc[0, 1]
2
pandas.DataFrame.index
DataFrame.index: pandas.core.indexes.base.Index
The index (row labels) of the DataFrame.
pandas.DataFrame.loc
property DataFrame.loc
Access a group of rows and columns by label(s) or a boolean array.
.loc[] is primarily label based, but may also be used with a boolean array.
Allowed inputs are:
• A single label, e.g. 5 or 'a', (note that 5 is interpreted as a label of the index, and never as an
integer position along the index).
• A list or array of labels, e.g. ['a', 'b', 'c'].
• A slice object with labels, e.g. 'a':'f'.
Warning: Note that contrary to usual python slices, both the start and the stop are included
• A boolean array of the same length as the axis being sliced, e.g. [True, False, True].
• An alignable boolean Series. The index of the key will be aligned before masking.
• An alignable Index. The Index of the returned selection will be the input.
• A callable function with one argument (the calling Series or DataFrame) and that returns valid
output for indexing (one of the above)
See more at Selection by Label.
Raises
KeyError If any items are not found.
IndexingError If an indexed key is passed and its index is unalignable to the frame
index.
See also:
Examples
Getting values
>>> df.loc['viper']
max_speed 4
shield 5
Name: viper, dtype: int64
Slice with labels for row and single label for column. As mentioned above, note that both the start and
stop of the slice are included.
Setting values
Set value for all items matching the list of labels
>>> df.loc['cobra'] = 10
>>> df
max_speed shield
cobra 10 10
viper 4 50
sidewinder 7 50
Slice with integer labels for rows. As mentioned above, note that both the start and stop of the slice are
included.
>>> df.loc[7:9]
max_speed shield
7 1 2
8 4 5
9 7 8
>>> df.loc['cobra']
max_speed shield
mark i 12 2
mark ii 0 4
Single label for row and column. Similar to passing in a tuple, this returns a Series.
Single tuple for the index with a single label for the column
pandas.DataFrame.ndim
property DataFrame.ndim
Return an int representing the number of axes / array dimensions.
Return 1 if Series. Otherwise return 2 if DataFrame.
See also:
Examples
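A minimal sketch:
>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.ndim
1
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.ndim
2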
pandas.DataFrame.shape
property DataFrame.shape
Return a tuple representing the dimensionality of the DataFrame.
See also:
Examples
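A minimal sketch:
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.shape
(2, 2)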
pandas.DataFrame.size
property DataFrame.size
Return an int representing the number of elements in this object.
Return the number of rows if Series. Otherwise return the number of rows times number of columns if
DataFrame.
See also:
Examples
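A minimal sketch:
>>> s = pd.Series({'a': 1, 'b': 2, 'c': 3})
>>> s.size
3
>>> df = pd.DataFrame({'col1': [1, 2], 'col2': [3, 4]})
>>> df.size
4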
pandas.DataFrame.style
property DataFrame.style
Returns a Styler object.
Contains methods for building a styled HTML representation of the DataFrame.
See also:
io.formats.style.Styler Helps style a DataFrame or Series according to the data with HTML
and CSS.
pandas.DataFrame.values
property DataFrame.values
Return a Numpy representation of the DataFrame.
Only the values in the DataFrame will be returned, the axes labels will be removed.
Returns
numpy.ndarray The values of the DataFrame.
See also:
Notes
The dtype will be a lower-common-denominator dtype (implicit upcasting); that is to say if the dtypes
(even of numeric types) are mixed, the one that accommodates all will be chosen. Use this with care if
you are not dealing with the blocks.
e.g. If the dtypes are float16 and float32, dtype will be upcast to float32. If dtypes are int32 and uint8,
dtype will be upcast to int32. By numpy.find_common_type() convention, mixing int64 and uint64
will result in a float64 dtype.
Examples
A DataFrame where all columns are the same type (e.g., int64) results in an array of the same type.
A DataFrame with mixed type columns (e.g., str/object, int64, float32) results in an ndarray of the broadest
type that accommodates these mixed types (e.g., object).
Methods
pandas.DataFrame.abs
DataFrame.abs()
Return a Series/DataFrame with absolute numeric value of each element.
This function only applies to elements that are all numeric.
Returns
abs Series/DataFrame containing the absolute value of each element.
See also:
Notes
For complex inputs a + bj, e.g. 1.2 + 1j, the absolute value is √(a² + b²).
Examples
Select rows with data closest to certain value using argsort (from StackOverflow).
>>> df = pd.DataFrame({
... 'a': [4, 5, 6, 7],
... 'b': [10, 20, 30, 40],
... 'c': [100, 50, -30, -50]
... })
>>> df
a b c
0 4 10 100
1 5 20 50
2 6 30 -30
3 7 40 -50
>>> df.loc[(df.c - 43).abs().argsort()]
a b c
1 5 20 50
0 4 10 100
2 6 30 -30
3 7 40 -50
pandas.DataFrame.add
Notes
Examples
Add a scalar using the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.add_prefix
DataFrame.add_prefix(prefix)
Prefix labels with string prefix.
For Series, the row labels are prefixed. For DataFrame, the column labels are prefixed.
Parameters
prefix [str] The string to add before each label.
Returns
Series or DataFrame New Series or DataFrame with updated labels.
See also:
Examples
>>> s.add_prefix('item_')
item_0 1
item_1 2
item_2 3
item_3 4
dtype: int64
>>> df.add_prefix('col_')
col_A col_B
0 1 3
1 2 4
2 3 5
3 4 6
pandas.DataFrame.add_suffix
DataFrame.add_suffix(suffix)
Suffix labels with string suffix.
For Series, the row labels are suffixed. For DataFrame, the column labels are suffixed.
Parameters
suffix [str] The string to add after each label.
Returns
Series or DataFrame New Series or DataFrame with updated labels.
See also:
Examples
>>> s.add_suffix('_item')
0_item 1
1_item 2
2_item 3
3_item 4
dtype: int64
>>> df.add_suffix('_col')
A_col B_col
0 1 3
1 2 4
2 3 5
3 4 6
pandas.DataFrame.agg
Notes
Examples
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
pandas.DataFrame.aggregate
Notes
Examples
Aggregate different functions over the columns and rename the index of the resulting DataFrame.
pandas.DataFrame.align
pandas.DataFrame.all
• 1 / ‘columns’ : reduce the columns, return a Series whose index is the original
index.
• None : reduce all axes, return a scalar.
bool_only [bool, default None] Include only boolean columns. If None, will attempt
to use everything, then use only boolean data. Not implemented for Series.
skipna [bool, default True] Exclude NA/null values. If the entire row/column is NA
and skipna is True, then the result will be True, as for an empty row/column. If
skipna is False, then NA are treated as True, because these are not equal to zero.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical), count
along a particular level, collapsing into a Series.
**kwargs [any, default None] Additional keywords have no effect but might be ac-
cepted for compatibility with NumPy.
Returns
Series or DataFrame If level is specified, then a DataFrame is returned; otherwise, a Series
is returned.
See also:
Examples
Series
DataFrames
Create a dataframe from a dictionary.
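One dictionary consistent with the outputs below (col2 contains a False):
>>> df = pd.DataFrame({'col1': [True, True], 'col2': [True, False]})
>>> df
   col1   col2
0  True   True
1  True  False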
>>> df.all()
col1 True
col2 False
dtype: bool
>>> df.all(axis='columns')
0 True
1 False
dtype: bool
>>> df.all(axis=None)
False
pandas.DataFrame.any
Examples
Series
For Series input, the output is a scalar indicating whether any element is True.
>>> pd.Series([False, False]).any()
False
>>> pd.Series([True, False]).any()
True
>>> pd.Series([]).any()
False
>>> pd.Series([np.nan]).any()
False
>>> pd.Series([np.nan]).any(skipna=False)
True
DataFrame
Whether each column contains at least one True element (the default).
>>> df = pd.DataFrame({"A": [1, 2], "B": [0, 2], "C": [0, 0]})
>>> df
A B C
0 1 0 0
1 2 2 0
>>> df.any()
A True
B True
C False
dtype: bool
>>> df.any(axis='columns')
0 True
1 True
dtype: bool
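The next call assumes a different frame, one whose second row has no truthy values, for example:
>>> df = pd.DataFrame({"A": [True, False], "B": [1, 0]})
>>> df
       A  B
0   True  1
1  False  0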
>>> df.any(axis='columns')
0 True
1 False
dtype: bool
>>> df.any(axis=None)
True
>>> pd.DataFrame([]).any()
Series([], dtype: bool)
pandas.DataFrame.append
Notes
If a list of dict/series is passed and the keys are all contained in the DataFrame’s index, the order of the
columns in the resulting DataFrame will be unchanged.
Iteratively appending rows to a DataFrame can be more computationally intensive than a single concate-
nate. A better solution is to append those rows to a list and then concatenate the list with the original
DataFrame all at once.
Examples
The following, while not recommended ways of generating DataFrames, show two approaches to building
a DataFrame from multiple data sources.
Less efficient:
>>> df = pd.DataFrame(columns=['A'])
>>> for i in range(5):
... df = df.append({'A': i}, ignore_index=True)
>>> df
A
0 0
1 1
2 2
3 3
4 4
More efficient:
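A sketch of the single-concatenation approach described above:
>>> pd.concat([pd.DataFrame([i], columns=['A']) for i in range(5)],
...           ignore_index=True)
   A
0  0
1  1
2  2
3  3
4  4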
pandas.DataFrame.apply
Examples
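The calls below assume a small frame like the following (reconstructed for context):
>>> df = pd.DataFrame([[4, 9]] * 3, columns=['A', 'B'])
>>> df
   A  B
0  4  9
1  4  9
2  4  9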
Using a numpy universal function (in this case the same as np.sqrt(df)):
>>> df.apply(np.sqrt)
A B
0 2.0 3.0
1 2.0 3.0
2 2.0 3.0
Returning a Series inside the function is similar to passing result_type='expand'. The resulting
column names will be the Series index.
Passing result_type='broadcast' will ensure the same shape result, whether list-like or scalar is
returned by the function, and broadcast it along the axis. The resulting column names will be the originals.
pandas.DataFrame.applymap
DataFrame.applymap(func, na_action=None)
Apply a function to a Dataframe elementwise.
This method applies a function that accepts and returns a scalar to every element of a DataFrame.
Parameters
func [callable] Python function, returns a single value from a single value.
na_action [{None, ‘ignore’}, default None] If ‘ignore’, propagate NaN values, without
passing them to func.
New in version 1.2.
Returns
DataFrame Transformed DataFrame.
See also:
Examples
Note that a vectorized version of func often exists, which will be much faster. You could square each
number elementwise.
>>> df ** 2
0 1
0 1.000000 4.494400
1 11.262736 20.857489
pandas.DataFrame.asfreq
Notes
To learn more about the frequency strings, please see this link.
Examples
>>> df.asfreq(freq='30S')
s
2000-01-01 00:00:00 0.0
pandas.DataFrame.asof
DataFrame.asof(where, subset=None)
Return the last row(s) without any NaNs before where.
The last row (for each element in where, if list) without any NaN is taken. In case of a DataFrame, the
last row without NaN is taken, considering only the subset of columns (if not None).
If there is no good value, NaN is returned for a Series, or a Series of NaN values for a DataFrame.
Parameters
where [date or array-like of dates] Date(s) before which the last row(s) are returned.
subset [str or array-like of str, default None] For DataFrame, if not None, only use
these columns to check for NaNs.
Returns
scalar, Series, or DataFrame The return can be:
• scalar : when self is a Series and where is a scalar
• Series: when self is a Series and where is an array-like, or when self is a
DataFrame and where is a scalar
• DataFrame : when self is a DataFrame and where is an array-like
Notes
Examples
>>> s.asof(20)
2.0
For a sequence where, a Series is returned. The first value is NaN, because the first element of where is
before the first index value.
Missing values are not considered. The following is 2.0, not NaN, even though NaN is at the index
location for 30.
>>> s.asof(30)
2.0
pandas.DataFrame.assign
DataFrame.assign(**kwargs)
Assign new columns to a DataFrame.
Returns a new object with all original columns in addition to new ones. Existing columns that are re-
assigned will be overwritten.
Parameters
**kwargs [dict of {str: callable or Series}] The column names are keywords. If the
values are callable, they are computed on the DataFrame and assigned to the new
columns. The callable must not change the input DataFrame (though pandas doesn't
check it). If the values are not callable, (e.g. a Series, scalar, or array), they are
simply assigned.
Returns
DataFrame A new DataFrame with the new columns in addition to all the existing
columns.
Notes
Assigning multiple columns within the same assign is possible. Later items in ‘**kwargs’ may refer to
newly created or modified columns in ‘df’; items are computed and assigned into ‘df’ in order.
Examples
Alternatively, the same behavior can be achieved by directly referencing an existing Series or sequence:
You can create multiple columns within the same assign where one of the columns depends on another
one defined within the same assign:
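A minimal sketch of both points, with an illustrative temperature frame (temp_f is derived from temp_c, and temp_k from the just-created temp_f):
>>> df = pd.DataFrame({'temp_c': [17.0, 25.0]},
...                   index=['Portland', 'Berkeley'])
>>> df.assign(temp_f=lambda x: x['temp_c'] * 9 / 5 + 32,
...           temp_k=lambda x: (x['temp_f'] + 459.67) * 5 / 9)
          temp_c  temp_f  temp_k
Portland    17.0    62.6  290.15
Berkeley    25.0    77.0  298.15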
pandas.DataFrame.astype
Examples
Create a DataFrame:
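The call below assumes a frame like the following (reconstructed for context):
>>> d = {'col1': [1, 2], 'col2': [3, 4]}
>>> df = pd.DataFrame(data=d)
>>> df.dtypes
col1    int64
col2    int64
dtype: object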
>>> df.astype('int32').dtypes
col1 int32
col2 int32
dtype: object
Create a series:
>>> ser.astype('category')
0 1
1 2
dtype: category
Categories (2, int64): [1, 2]
Note that using copy=False and changing data on a new pandas object may propagate changes:
Datetimes are localized to UTC first before converting to the specified timezone:
pandas.DataFrame.at_time
Examples
>>> ts.at_time('12:00')
A
2018-04-09 12:00:00 2
2018-04-10 12:00:00 4
pandas.DataFrame.backfill
pandas.DataFrame.between_time
See also:
Examples
You get the times that are not between two times by setting start_time later than end_time:
pandas.DataFrame.bfill
pandas.DataFrame.bool
DataFrame.bool()
Return the bool of a single element Series or DataFrame.
This must be a boolean scalar value, either True or False. It will raise a ValueError if the Series or
DataFrame does not have exactly 1 element, or that element is not boolean (integer values 0 and 1 will
also raise an exception).
Returns
bool The value in the Series or DataFrame.
See also:
Examples
The method will only work for single element objects with a boolean value:
>>> pd.Series([True]).bool()
True
>>> pd.Series([False]).bool()
False
pandas.DataFrame.boxplot
return_type [{‘axes’, ‘dict’, ‘both’} or None, default ‘axes’] The kind of object to
return. The default is axes.
• ‘axes’ returns the matplotlib axes the boxplot is drawn on.
• ‘dict’ returns a dictionary whose values are the matplotlib Lines of the boxplot.
• ‘both’ returns a namedtuple with the axes and dict.
• when grouping with by, a Series mapping columns to return_type is re-
turned.
If return_type is None, a NumPy array of axes with the same shape as
layout is returned.
backend [str, default None] Backend to use instead of the backend specified in the
option plotting.backend. For instance, ‘matplotlib’. Alternatively, to
specify the plotting.backend for the whole session, set pd.options.
plotting.backend.
New in version 1.0.0.
**kwargs All other plotting keyword arguments to be passed to matplotlib.
pyplot.boxplot().
Returns
result See Notes.
See also:
Notes
Examples
Boxplots can be created for every column in the dataframe by df.boxplot() or indicating the columns
to be used:
>>> np.random.seed(1234)
>>> df = pd.DataFrame(np.random.randn(10, 4),
... columns=['Col1', 'Col2', 'Col3', 'Col4'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])
Boxplots of variables distributions grouped by the values of a third variable can be created using the option
by. For instance:
A list of strings (i.e. ['X', 'Y']) can be passed to boxplot in order to group the data by combination
of the variables in the x-axis:
Additional formatting can be done to the boxplot, like suppressing the grid (grid=False), rotating the
labels in the x-axis (i.e. rot=45) or changing the fontsize (i.e. fontsize=15):
The parameter return_type can be used to select the type of element returned by boxplot. When
return_type='axes' is selected, the matplotlib axes on which the boxplot is drawn are returned:
If return_type is None, a NumPy array of axes with the same shape as layout is returned:
pandas.DataFrame.clip
Examples
>>> data = {'col_0': [9, -3, 0, -1, 5], 'col_1': [-2, -7, 6, 8, -5]}
>>> df = pd.DataFrame(data)
>>> df
col_0 col_1
0 9 -2
1 -3 -7
2 0 6
3 -1 8
4 5 -5
Clips using specific lower and upper thresholds per column element:
>>> t = pd.Series([2, -4, -1, 6, 3])
>>> t
0 2
1 -4
2 -1
3 6
4 3
dtype: int64
pandas.DataFrame.combine
overwrite [bool, default True] If True, columns in self that do not exist in other will
be overwritten with NaNs.
Returns
DataFrame Combination of the provided DataFrames.
See also:
Examples
Using fill_value fills Nones prior to passing the column to the merge function.
However, if the same element in both dataframes is None, that None is preserved
Example that demonstrates the use of overwrite and behavior when the axis differ between the dataframes.
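The call below assumes a first frame df1 and a combining function take_smaller along these lines (reconstructed to be consistent with the output that follows):
>>> df1 = pd.DataFrame({'A': [0, 0], 'B': [4, 4]})
>>> take_smaller = lambda s1, s2: s1 if s1.sum() < s2.sum() else s2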
>>> df2 = pd.DataFrame({'B': [3, 3], 'C': [1, 1], }, index=[1, 2])
>>> df2.combine(df1, take_smaller)
A B C
0 0.0 NaN NaN
1 0.0 3.0 NaN
2 NaN 3.0 NaN
pandas.DataFrame.combine_first
DataFrame.combine_first(other)
Update null elements with value in the same location in other.
Combine two DataFrame objects by filling null values in one DataFrame with non-null values from other
DataFrame. The row and column indexes of the resulting DataFrame will be the union of the two.
Parameters
other [DataFrame] Provided DataFrame to use to fill null values.
Returns
DataFrame
See also:
Examples
Null values still persist if the location of that null value does not exist in other
pandas.DataFrame.compare
Notes
Examples
>>> df = pd.DataFrame(
... {
... "col1": ["a", "a", "b", "b", "a"],
... "col2": [1.0, 2.0, 3.0, np.nan, 5.0],
... "col3": [1.0, 2.0, 3.0, 4.0, 5.0]
... },
... columns=["col1", "col2", "col3"],
... )
>>> df
col1 col2 col3
0 a 1.0 1.0
1 a 2.0 2.0
2 b 3.0 3.0
3 b NaN 4.0
4 a 5.0 5.0
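The comparison below assumes a second frame df2 that differs from df in two cells (reconstructed to be consistent with the output that follows):
>>> df2 = df.copy()
>>> df2.loc[0, 'col1'] = 'c'
>>> df2.loc[2, 'col3'] = 4.0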
>>> df.compare(df2)
col1 col3
self other self other
0 a c NaN NaN
2 NaN NaN 3.0 4.0
Keep all original rows and columns and also all original values
pandas.DataFrame.convert_dtypes
Notes
By default, convert_dtypes will attempt to convert a Series (or each Series in a DataFrame)
to dtypes that support pd.NA. By using the options convert_string, convert_integer,
convert_boolean and convert_floating, it is possible to turn off individual conversions to
StringDtype, the integer extension types, BooleanDtype or floating extension types, respectively.
For object-dtyped columns, if infer_objects is True, use the inference rules as during normal
Series/DataFrame construction. Then, if possible, convert to StringDtype, BooleanDtype or an
appropriate integer or floating extension type, otherwise leave as object.
If the dtype is integer, convert to an appropriate integer extension type.
If the dtype is numeric, and consists of all integers, convert to an appropriate integer extension type.
Otherwise, convert to an appropriate floating extension type.
Changed in version 1.2: Starting with pandas 1.2, this method also converts float columns to the nullable
floating extension type.
In the future, as new dtypes are added that support pd.NA, the results of this method will change to
support those new dtypes.
Examples
>>> df = pd.DataFrame(
... {
... "a": pd.Series([1, 2, 3], dtype=np.dtype("int32")),
... "b": pd.Series(["x", "y", "z"], dtype=np.dtype("O")),
... "c": pd.Series([True, False, np.nan], dtype=np.dtype("O")),
... "d": pd.Series(["h", "i", np.nan], dtype=np.dtype("O")),
... "e": pd.Series([10, np.nan, 20], dtype=np.dtype("float")),
... "f": pd.Series([np.nan, 100.5, 200], dtype=np.dtype("float")),
... }
... )
>>> df
a b c d e f
0 1 x True h 10.0 NaN
1 2 y False i NaN 100.5
2 3 z NaN NaN 20.0 200.0
>>> df.dtypes
a int32
b object
c object
d object
e float64
f float64
dtype: object
>>> dfn.dtypes
a Int32
b string
c boolean
d string
e Int64
f Float64
dtype: object
>>> s.convert_dtypes()
0 a
1 b
2 <NA>
dtype: string
pandas.DataFrame.copy
DataFrame.copy(deep=True)
Make a copy of this object’s indices and data.
When deep=True (default), a new object will be created with a copy of the calling object’s data and
indices. Modifications to the data or indices of the copy will not be reflected in the original object (see
notes below).
When deep=False, a new object will be created without copying the calling object’s data or index
(only references to the data and index are copied). Any changes to the data of the original will be reflected
in the shallow copy (and vice versa).
Parameters
deep [bool, default True] Make a deep copy, including a copy of the data and the in-
dices. With deep=False neither the indices nor the data are copied.
Returns
copy [Series or DataFrame] Object type matches caller.
Notes
When deep=True, data is copied but actual Python objects will not be copied recursively, only the
reference to the object. This is in contrast to copy.deepcopy in the Standard Library, which recursively
copies object data (see examples below).
While Index objects are copied when deep=True, the underlying numpy array is not copied for per-
formance reasons. Since Index is immutable, the underlying data can be safely shared and a copy is not
needed.
Examples
>>> s is shallow
False
>>> s.values is shallow.values and s.index is shallow.index
True
>>> s is deep
False
>>> s.values is deep.values or s.index is deep.index
False
Updates to the data shared by the shallow copy and the original are reflected in both; the deep copy remains unchanged.
>>> s[0] = 3
>>> shallow[1] = 4
>>> s
a 3
b 4
dtype: int64
Note that when copying an object containing Python objects, a deep copy will copy the data, but will not
do so recursively. Updating a nested data object will be reflected in the deep copy.
pandas.DataFrame.corr
DataFrame.corr(method='pearson', min_periods=1)
Compute pairwise correlation of columns, excluding NA/null values.
Parameters
method [{‘pearson’, ‘kendall’, ‘spearman’} or callable] Method of correlation:
• pearson : standard correlation coefficient
• kendall : Kendall Tau correlation coefficient
• spearman : Spearman rank correlation
• callable: callable with input two 1d ndarrays and returning a float. Note
that the returned matrix from corr will have 1 along the diagonals and will
be symmetric regardless of the callable’s behavior.
New in version 0.24.0.
min_periods [int, optional] Minimum number of observations required per pair of
columns to have a valid result. Currently only available for Pearson and Spear-
man correlation.
Returns
DataFrame Correlation matrix.
See also:
Examples
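A minimal sketch with illustrative, perfectly correlated columns:
>>> df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [2, 4, 6, 8]})
>>> df.corr()
     a    b
a  1.0  1.0
b  1.0  1.0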
pandas.DataFrame.corrwith
pandas.DataFrame.count
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] If 0 or ‘index’ counts are generated for
each column. If 1 or ‘columns’ counts are generated for each row.
level [int or str, optional] If the axis is a MultiIndex (hierarchical), count along a par-
ticular level, collapsing into a DataFrame. A str specifies the level name.
numeric_only [bool, default False] Include only float, int or boolean data.
Returns
Series or DataFrame For each column/row the number of non-NA/null entries. If
level is specified returns a DataFrame.
See also:
Examples
>>> df = pd.DataFrame({"Person":
... ["John", "Myla", "Lewis", "John", "Myla"],
... "Age": [24., np.nan, 21., 33, 26],
... "Single": [False, True, True, True, False]})
>>> df
Person Age Single
0 John 24.0 False
1 Myla NaN True
2 Lewis 21.0 True
3 John 33.0 True
4 Myla 26.0 False
>>> df.count()
Person 5
Age 4
Single 5
dtype: int64
>>> df.count(axis='columns')
0 3
1 2
2 3
3 3
4 3
dtype: int64
pandas.DataFrame.cov
DataFrame.cov(min_periods=None, ddof=1)
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the co-
variance matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias
from missing values.) A threshold can be set for the minimum number of observations for each value
created. Comparisons with observations below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to understand the relationship between
different measures across time.
Parameters
min_periods [int, optional] Minimum number of observations required per pair of
columns to have a valid result.
ddof [int, default 1] Delta degrees of freedom. The divisor used in calculations is N -
ddof, where N represents the number of elements.
New in version 1.1.0.
Returns
DataFrame The covariance matrix of the series of the DataFrame.
See also:
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the re-
turned covariance matrix will be an unbiased estimate of the variance and covariance between the member
Series.
However, for many applications this estimate may not be acceptable because the estimate covariance
matrix is not guaranteed to be positive semi-definite. This could lead to estimate correlations having
absolute values which are greater than one, and/or a non-invertible covariance matrix. See Estimation of
covariance matrices for more details.
Examples
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
... columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
a b c d e
a 0.998438 -0.020161 0.059277 -0.008943 0.014144
b -0.020161 1.059352 -0.008543 -0.024738 0.009826
c 0.059277 -0.008543 1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486 0.921297 -0.013692
e 0.014144 0.009826 -0.000271 -0.013692 0.977795
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
... columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
a b c
a 0.316741 NaN -0.150812
b NaN 1.248003 0.191417
c -0.150812 0.191417 0.895202
pandas.DataFrame.cummax
Examples
Series
>>> s.cummax()
0 2.0
1 NaN
2 5.0
3 5.0
4 5.0
dtype: float64
>>> s.cummax(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the maximum in each column. This is equivalent to axis=None
or axis='index'.
>>> df.cummax()
A B
0 2.0 1.0
1 3.0 NaN
2 3.0 1.0
To iterate over columns and find the maximum in each row, use axis=1
>>> df.cummax(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 1.0
pandas.DataFrame.cummin
Examples
Series
>>> s = pd.Series([2, np.nan, 5, -1, 0])
>>> s
0 2.0
1 NaN
2 5.0
3 -1.0
4 0.0
dtype: float64
DataFrame
>>> df = pd.DataFrame([[2.0, 1.0],
... [3.0, np.nan],
... [1.0, 0.0]],
... columns=list('AB'))
>>> df
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0
By default, iterates over rows and finds the minimum in each column. This is equivalent to axis=None
or axis='index'.
>>> df.cummin()
A B
0 2.0 1.0
1 2.0 NaN
2 1.0 0.0
To iterate over columns and find the minimum in each row, use axis=1
>>> df.cummin(axis=1)
A B
0 2.0 1.0
1 3.0 NaN
2 1.0 0.0
pandas.DataFrame.cumprod
Examples
Series
>>> s.cumprod()
0 2.0
1 NaN
2 10.0
3 -10.0
4 -0.0
dtype: float64
>>> s.cumprod(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the product in each column. This is equivalent to axis=None or
axis='index'.
>>> df.cumprod()
A B
0 2.0 1.0
1 6.0 NaN
2 6.0 0.0
To iterate over columns and find the product in each row, use axis=1
>>> df.cumprod(axis=1)
A B
0 2.0 2.0
1 3.0 NaN
2 1.0 0.0
pandas.DataFrame.cumsum
Examples
Series
>>> s.cumsum()
0 2.0
1 NaN
2 7.0
3 6.0
4 6.0
dtype: float64
>>> s.cumsum(skipna=False)
0 2.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
DataFrame
By default, iterates over rows and finds the sum in each column. This is equivalent to axis=None or
axis='index'.
>>> df.cumsum()
A B
0 2.0 1.0
1 5.0 NaN
2 6.0 1.0
To iterate over columns and find the sum in each row, use axis=1
>>> df.cumsum(axis=1)
A B
0 2.0 3.0
1 3.0 NaN
2 1.0 1.0
pandas.DataFrame.describe
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and
upper percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile
is the same as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and
freq. The top is the most common value. The freq is the most common value’s frequency. Times-
tamps also include the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen
from among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric
columns. If the dataframe consists only of object and categorical data without any numeric columns,
the default is to return an analysis of both the object and categorical columns. If include='all' is
provided as an option, the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed
for the output. The parameters are ignored when analyzing a Series.
Examples
>>> s = pd.Series([
... np.datetime64("2000-01-01"),
... np.datetime64("2010-01-01"),
... np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count 3
mean 2006-09-01 08:00:00
min 2000-01-01 00:00:00
25% 2004-12-31 12:00:00
50% 2010-01-01 00:00:00
75% 2010-01-01 00:00:00
max 2010-01-01 00:00:00
dtype: object
>>> df.describe(include='all')
categorical numeric object
count 3 3.0 3
unique 3 NaN 3
top f NaN a
freq 1 NaN 1
mean NaN 2.0 NaN
std NaN 1.0 NaN
min NaN 1.0 NaN
25% NaN 1.5 NaN
50% NaN 2.0 NaN
75% NaN 2.5 NaN
max NaN 3.0 NaN
>>> df.numeric.describe()
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
Name: numeric, dtype: float64
>>> df.describe(include=[np.number])
numeric
count 3.0
mean 2.0
std 1.0
min 1.0
25% 1.5
50% 2.0
75% 2.5
max 3.0
>>> df.describe(include=[object])
object
count 3
unique 3
top a
freq 1
>>> df.describe(include=['category'])
categorical
count 3
unique 3
top d
freq 1
>>> df.describe(exclude=[np.number])
categorical object
count 3 3
unique 3 3
top f a
freq 1 1
>>> df.describe(exclude=[object])
categorical numeric
count 3 3.0
unique 3 NaN
top f NaN
freq 1 NaN
mean NaN 2.0
std NaN 1.0
min NaN 1.0
25% NaN 1.5
50% NaN 2.0
75% NaN 2.5
max NaN 3.0
pandas.DataFrame.diff
DataFrame.diff(periods=1, axis=0)
First discrete difference of element.
Calculates the difference of a Dataframe element compared with another element in the Dataframe (default
is element in previous row).
Parameters
periods [int, default 1] Periods to shift for calculating difference, accepts negative val-
ues.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Take difference over rows (0) or
columns (1).
Returns
DataFrame First differences of the DataFrame.
See also:
Notes
For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calcu-
lated according to current dtype in Dataframe, however dtype of the result is always float64.
Examples
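The outputs below assume a DataFrame like the following (a sketch of the missing setup):
>>> df = pd.DataFrame({'a': [1, 2, 3, 4, 5, 6],
...                    'b': [1, 1, 2, 3, 5, 8],
...                    'c': [1, 4, 9, 16, 25, 36]})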
>>> df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
>>> df.diff(periods=-1)
a b c
0 -1.0 0.0 -3.0
1 -1.0 -1.0 -5.0
2 -1.0 -1.0 -7.0
3 -1.0 -2.0 -9.0
4 -1.0 -3.0 -11.0
5 NaN NaN NaN
pandas.DataFrame.div
Notes
Examples
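The arithmetic examples below (and those for divide, floordiv, mod, mul, multiply and pow further down) assume a DataFrame and a second operand roughly like this (a sketch of the setup, not shown here):
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])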
Add a scalar using the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.divide
Notes
Examples
Add a scalar using the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.dot
DataFrame.dot(other)
Compute the matrix multiplication between the DataFrame and other.
This method computes the matrix product between the DataFrame and the values of an other Series,
DataFrame or a numpy array.
It can also be called using self @ other in Python >= 3.5.
Parameters
other [Series, DataFrame or array-like] The other object to compute the matrix product
with.
Returns
Series or DataFrame If other is a Series, return the matrix product between self and
other as a Series. If other is a DataFrame or a numpy.array, return the matrix
product of self and other in a DataFrame or a np.array.
See also:
Notes
The dimensions of DataFrame and other must be compatible in order to compute the matrix multiplication.
In addition, the column names of DataFrame and the index of other must contain the same values, as they
will be aligned prior to the multiplication.
The dot method for Series computes the inner product, instead of the matrix product here.
Examples
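The matrix-product example below assumes a small DataFrame such as (a sketch of the setup):
>>> df = pd.DataFrame([[0, 1, -2, -1], [1, 1, 1, 1]])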
>>> arr = np.array([[0, 1], [1, 2], [-1, -1], [2, 0]])
>>> df.dot(arr)
0 1
0 1 4
1 2 2
Note how shuffling of the objects does not change the result.
pandas.DataFrame.drop
DataFrame.dropna Return DataFrame with labels on given axis omitted where (all or any) data are
missing.
DataFrame.drop_duplicates Return DataFrame with duplicate rows removed, optionally only
considering certain columns.
Series.drop Return Series with specified index labels removed.
Examples
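The example assumes a 3x4 DataFrame along these lines (a sketch of the setup):
>>> df = pd.DataFrame(np.arange(12).reshape(3, 4),
...                   columns=['A', 'B', 'C', 'D'])
>>> df
   A  B   C   D
0  0  1   2   3
1  4  5   6   7
2  8  9  10  11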
Drop columns
>>> df.drop(['B', 'C'], axis=1)
A D
0 0 3
1 4 7
2 8 11
pandas.DataFrame.drop_duplicates
Examples
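A minimal sketch, assuming the same brand/style/rating DataFrame used in the duplicated examples further down:
>>> df = pd.DataFrame({
...     'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
...     'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
...     'rating': [4, 4, 3.5, 15, 5]
... })
>>> df.drop_duplicates()
     brand style  rating
0  Yum Yum   cup     4.0
2  Indomie   cup     3.5
3  Indomie  pack    15.0
4  Indomie  pack     5.0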
pandas.DataFrame.droplevel
DataFrame.droplevel(level, axis=0)
Return DataFrame with requested index / column level(s) removed.
New in version 0.24.0.
Parameters
level [int, str, or list-like] If a string is given, must be the name of a level. If list-like,
elements must be names or positional indexes of levels.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Axis along which the level(s) is re-
moved:
• 0 or ‘index’: remove level(s) from the row index.
• 1 or ‘columns’: remove level(s) from the columns.
Returns
DataFrame DataFrame with requested index / column level(s) removed.
Examples
>>> df = pd.DataFrame([
... [1, 2, 3, 4],
... [5, 6, 7, 8],
... [9, 10, 11, 12]
... ]).set_index([0, 1]).rename_axis(['a', 'b'])
>>> df.columns = pd.MultiIndex.from_tuples([
...     ('c', 'e'), ('d', 'f')
... ], names=['level_1', 'level_2'])
>>> df
level_1 c d
level_2 e f
a b
1 2 3 4
5 6 7 8
9 10 11 12
>>> df.droplevel('a')
level_1 c d
level_2 e f
b
2 3 4
6 7 8
10 11 12
pandas.DataFrame.dropna
Changed in version 1.0.0: Pass tuple or list to drop on multiple axes. Only a single
axis is allowed.
how [{‘any’, ‘all’}, default ‘any’] Determine if row or column is removed from
DataFrame, when we have at least one NA or all NA.
• ‘any’ : If any NA values are present, drop that row or column.
• ‘all’ : If all values are NA, drop that row or column.
thresh [int, optional] Require that many non-NA values.
subset [array-like, optional] Labels along other axis to consider, e.g. if you are drop-
ping rows these would be a list of columns to include.
inplace [bool, default False] If True, do operation inplace and return None.
Returns
DataFrame or None DataFrame with NA entries dropped from it or None if
inplace=True.
See also:
Examples
>>> df.dropna()
name toy born
1 Batman Batmobile 1940-04-25
>>> df.dropna(axis='columns')
name
0 Alfred
1 Batman
2 Catwoman
>>> df.dropna(how='all')
name toy born
0 Alfred NaN NaT
1 Batman Batmobile 1940-04-25
2 Catwoman Bullwhip NaT
pandas.DataFrame.duplicated
DataFrame.duplicated(subset=None, keep='first')
Return boolean Series denoting duplicate rows.
Considering certain columns is optional.
Parameters
subset [column label or sequence of labels, optional] Only consider certain columns
for identifying duplicates, by default use all of the columns.
keep [{‘first’, ‘last’, False}, default ‘first’] Determines which duplicates (if any) to
mark.
• first : Mark duplicates as True except for the first occurrence.
• last : Mark duplicates as True except for the last occurrence.
• False : Mark all duplicates as True.
Returns
Series Boolean series indicating duplicate rows.
See also:
Examples
>>> df = pd.DataFrame({
... 'brand': ['Yum Yum', 'Yum Yum', 'Indomie', 'Indomie', 'Indomie'],
... 'style': ['cup', 'cup', 'cup', 'pack', 'pack'],
... 'rating': [4, 4, 3.5, 15, 5]
... })
>>> df
brand style rating
0 Yum Yum cup 4.0
1 Yum Yum cup 4.0
2 Indomie cup 3.5
3 Indomie pack 15.0
4 Indomie pack 5.0
By default, for each set of duplicated values, the first occurrence is set to False and all others to True.
>>> df.duplicated()
0 False
1 True
2 False
3 False
4 False
dtype: bool
By using ‘last’, the last occurrence of each set of duplicated values is set to False and all others to True.
>>> df.duplicated(keep='last')
0 True
1 False
2 False
3 False
4 False
dtype: bool
>>> df.duplicated(keep=False)
0 True
1 True
2 False
3 False
4 False
dtype: bool
>>> df.duplicated(subset=['brand'])
0 False
1 True
2 False
3 True
4 True
dtype: bool
pandas.DataFrame.eq
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
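The comparison examples below (and those for ge, gt, le, lt and ne further down) assume a DataFrame and an other operand roughly like this (a sketch of the setup, not shown here):
>>> df = pd.DataFrame({'cost': [250, 150, 100],
...                    'revenue': [100, 250, 300]},
...                   index=['A', 'B', 'C'])
>>> other = pd.DataFrame({'revenue': [300, 250, 100, 150]},
...                      index=['A', 'B', 'C', 'D'])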
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
cost revenue
A True True
B True False
C False True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in
other:
>>> df == [250, 100]
cost revenue
A True True
B False False
C False False
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.equals
DataFrame.equals(other)
Test whether two objects contain the same elements.
This function allows two Series or DataFrames to be compared against each other to see if they have the
same shape and elements. NaNs in the same location are considered equal.
The row/column index does not need to have the same type, as long as the values are considered equal.
Corresponding columns must be of the same dtype.
Parameters
other [Series or DataFrame] The other Series or DataFrame to be compared with the
first.
Returns
bool True if all elements are the same in both objects, False otherwise.
See also:
Series.eq Compare two Series objects of the same length and return a Series where each element is
True if the element in each Series is equal, False otherwise.
DataFrame.eq Compare two DataFrame objects of the same shape and return a DataFrame where
each element is True if the respective element in each DataFrame is equal, False otherwise.
testing.assert_series_equal Raises an AssertionError if left and right are not equal. Provides
an easy interface to ignore inequality in dtypes, indexes and precision among others.
testing.assert_frame_equal Like assert_series_equal, but targets DataFrames.
numpy.array_equal Return True if two arrays have the same shape and elements, False otherwise.
Examples
DataFrames df and exactly_equal have the same types and values for their elements and column labels,
which will return True.
DataFrames df and different_column_type have the same element types and values, but have different
types for the column labels, which will still return True.
DataFrames df and different_data_type have different types for the same values for their elements, and
will return False even though their column labels are the same values and types.
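A minimal sketch of the comparisons described above (assumed data, for illustration only):
>>> df = pd.DataFrame({1: [10], 2: [20]})
>>> exactly_equal = pd.DataFrame({1: [10], 2: [20]})
>>> df.equals(exactly_equal)
True
>>> different_data_type = pd.DataFrame({1: [10.0], 2: [20.0]})
>>> df.equals(different_data_type)
False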
pandas.DataFrame.eval
inplace [bool, default False] If the expression contains an assignment, whether to per-
form the operation inplace and mutate the existing DataFrame. Otherwise, a new
DataFrame is returned.
**kwargs See the documentation for eval() for complete details on the keyword
arguments accepted by query().
Returns
ndarray, scalar, pandas object, or None The result of the evaluation or None if
inplace=True.
See also:
Notes
For more details see the API documentation for eval(). For detailed examples see enhancing perfor-
mance with eval.
Examples
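The output below assumes a DataFrame like this (a sketch of the setup):
>>> df = pd.DataFrame({'A': range(1, 6), 'B': range(10, 0, -2)})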
>>> df.eval(
... '''
... C = A + B
... D = A - B
... '''
... )
A B C D
0 1 10 11 -9
1 2 8 10 -6
2 3 6 9 -3
3 4 4 8 0
4 5 2 7 3
pandas.DataFrame.ewm
adjust [bool, default True] Divide by decaying adjustment factor in beginning peri-
ods to account for imbalance in relative weightings (viewing EWMA as a moving
average).
• When adjust=True (default), the EW function is calculated using weights
$w_i = (1 - \alpha)^i$. For example, the EW moving average of the series
$[x_0, x_1, \dots, x_t]$ would be:
$$y_t = \frac{x_t + (1 - \alpha) x_{t-1} + (1 - \alpha)^2 x_{t-2} + \dots + (1 - \alpha)^t x_0}{1 + (1 - \alpha) + (1 - \alpha)^2 + \dots + (1 - \alpha)^t}$$
• When adjust=False, the exponentially weighted function is calculated recursively:
$$y_0 = x_0$$
$$y_t = (1 - \alpha) y_{t-1} + \alpha x_t$$
ignore_na [bool, default False] Ignore missing values when calculating weights; spec-
ify True to reproduce pre-0.15.0 behavior.
• When ignore_na=False (default), weights are based on absolute positions. For example,
the weights of $x_0$ and $x_2$ used in calculating the final weighted average of
$[x_0, \text{None}, x_2]$ are $(1 - \alpha)^2$ and $1$ if adjust=True, and
$(1 - \alpha)^2$ and $\alpha$ if adjust=False.
• When ignore_na=True (reproducing pre-0.15.0 behavior), weights are based on relative
positions. For example, the weights of $x_0$ and $x_2$ used in calculating the final
weighted average of $[x_0, \text{None}, x_2]$ are $1 - \alpha$ and $1$ if adjust=True,
and $1 - \alpha$ and $\alpha$ if adjust=False.
axis [{0, 1}, default 0] The axis to use. The value 0 identifies the rows, and 1 identifies
the columns.
times [str, np.ndarray, Series, default None] New in version 1.1.0.
Times corresponding to the observations. Must be monotonically increasing and
datetime64[ns] dtype.
If str, the name of the column in the DataFrame representing the times.
If 1-D array like, a sequence with the same shape as the observations.
Only applicable to mean().
Returns
DataFrame A Window sub-classed for the particular operation.
See also:
Notes
Examples
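The example assumes a single-column DataFrame along these lines (a sketch of the setup):
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})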
>>> df.ewm(com=0.5).mean()
B
0 0.000000
1 0.750000
2 1.615385
3 1.615385
4 3.670213
pandas.DataFrame.expanding
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the
window by setting center=True.
Examples
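The example assumes the same single-column DataFrame as in the ewm example above (a sketch):
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})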
>>> df.expanding(2).sum()
B
0 NaN
1 1.0
2 3.0
3 3.0
4 7.0
pandas.DataFrame.explode
DataFrame.explode(column, ignore_index=False)
Transform each element of a list-like to a row, replicating index values.
New in version 0.25.0.
Parameters
column [str or tuple] Column to explode.
ignore_index [bool, default False] If True, the resulting index will be labeled 0, 1, . . . ,
n - 1.
New in version 1.1.0.
Returns
DataFrame Exploded lists to rows of the subset columns; index will be duplicated for
these rows.
Raises
ValueError If columns of the frame are not unique.
See also:
Notes
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype
of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result
in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when
exploding sets.
Examples
>>> df = pd.DataFrame({'A': [[1, 2, 3], 'foo', [], [3, 4]], 'B': 1})
>>> df
A B
0 [1, 2, 3] 1
1 foo 1
2 [] 1
3 [3, 4] 1
>>> df.explode('A')
A B
0 1 1
0 2 1
0 3 1
1 foo 1
2 NaN 1
3 3 1
3 4 1
pandas.DataFrame.ffill
pandas.DataFrame.fillna
inplace [bool, default False] If True, fill in-place. Note: this will modify any other
views on this object (e.g., a no-copy slice for a column in a DataFrame).
limit [int, default None] If method is specified, this is the maximum number of con-
secutive NaN values to forward/backward fill. In other words, if there is a gap
with more than this number of consecutive NaNs, it will only be partially filled.
If method is not specified, this is the maximum number of entries along the entire
axis where NaNs will be filled. Must be greater than 0 if not None.
downcast [dict, default is None] A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g.
float64 to int64 if possible).
Returns
DataFrame or None Object with missing values filled or None if inplace=True.
See also:
Examples
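The outputs below assume a DataFrame with missing values like this (a sketch of the setup):
>>> df = pd.DataFrame([[np.nan, 2, np.nan, 0],
...                    [3, 4, np.nan, 1],
...                    [np.nan, np.nan, np.nan, 5],
...                    [np.nan, 3, np.nan, 4]],
...                   columns=list('ABCD'))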
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
1 3.0 4.0 0.0 1
2 0.0 0.0 0.0 5
3 0.0 3.0 0.0 4
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace all NaN elements in column ‘A’, ‘B’, ‘C’, and ‘D’, with 0, 1, 2, and 3 respectively.
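A sketch of that per-column fill, assuming the DataFrame above:
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4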
pandas.DataFrame.filter
Notes
The items, like, and regex parameters are enforced to be mutually exclusive.
axis defaults to the info axis that is used when indexing with [].
Examples
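A minimal sketch of items/like selection (assumed data, for illustration only):
>>> df = pd.DataFrame(np.array(([1, 2, 3], [4, 5, 6])),
...                   index=['mouse', 'rabbit'],
...                   columns=['one', 'two', 'three'])
>>> df.filter(items=['one', 'three'])
         one  three
mouse      1      3
rabbit     4      6
>>> df.filter(like='bbi', axis=0)
        one  two  three
rabbit    4    5      6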
pandas.DataFrame.first
DataFrame.first(offset)
Select initial periods of time series data based on a date offset.
When having a DataFrame with dates as index, this function can select the first few rows based on a date
offset.
Parameters
offset [str, DateOffset or dateutil.relativedelta] The offset length of the data that will be
selected. For instance, ‘1M’ will display all the rows having their index within the
first month.
Returns
Series or DataFrame A subset of the caller.
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
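The example below (and the last example further down) assumes a time-indexed frame like this (a sketch of the setup):
>>> i = pd.date_range('2018-04-09', periods=4, freq='2D')
>>> ts = pd.DataFrame({'A': [1, 2, 3, 4]}, index=i)
>>> ts
            A
2018-04-09  1
2018-04-11  2
2018-04-13  3
2018-04-15  4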
>>> ts.first('3D')
A
2018-04-09 1
2018-04-11 2
Notice that the data for the first 3 calendar days were returned, not the first 3 days observed in the
dataset, and therefore data for 2018-04-13 was not returned.
pandas.DataFrame.first_valid_index
DataFrame.first_valid_index()
Return index for first non-NA/null value.
Returns
scalar [type of index]
Notes
If all elements are non-NA/null, returns None. Also returns None for empty Series/DataFrame.
pandas.DataFrame.floordiv
fill_value [float or None, default None] Fill existing missing (NaN) values, and any
new element needed for successful DataFrame alignment, with this value before
computation. If data in both corresponding DataFrame locations is missing the
result will be missing.
Returns
DataFrame Result of the arithmetic operation.
See also:
Notes
Examples
Add a scalar using the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.from_dict
Examples
>>> data = {'col_1': [3, 2, 1, 0], 'col_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data)
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
>>> data = {'row_1': [3, 2, 1, 0], 'row_2': ['a', 'b', 'c', 'd']}
>>> pd.DataFrame.from_dict(data, orient='index')
0 1 2 3
row_1 3 2 1 0
row_2 a b c d
When using the ‘index’ orientation, the column names can be specified manually:
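A sketch of that call, reusing the data dict above:
>>> pd.DataFrame.from_dict(data, orient='index',
...                        columns=['A', 'B', 'C', 'D'])
       A  B  C  D
row_1  3  2  1  0
row_2  a  b  c  d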
pandas.DataFrame.from_records
See also:
Examples
>>> data = np.array([(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')],
... dtype=[('col_1', 'i4'), ('col_2', 'U1')])
>>> pd.DataFrame.from_records(data)
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
>>> data = [(3, 'a'), (2, 'b'), (1, 'c'), (0, 'd')]
>>> pd.DataFrame.from_records(data, columns=['col_1', 'col_2'])
col_1 col_2
0 3 a
1 2 b
2 1 c
3 0 d
pandas.DataFrame.ge
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
cost revenue
A True True
B True False
C False True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in
other:
>>> df == [250, 100]
cost revenue
A True True
B False False
C False False
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.get
DataFrame.get(key, default=None)
Get item from object for given key (ex: DataFrame column).
Returns default value if not found.
Parameters
key [object]
Returns
value [same type as items contained in object]
pandas.DataFrame.groupby
as_index [bool, default True] For aggregated output, return object with group labels
as the index. Only relevant for DataFrame input. as_index=False is effectively
“SQL-style” grouped output.
sort [bool, default True] Sort group keys. Get better performance by turning this off.
Note this does not influence the order of observations within each group. Groupby
preserves the order of rows within each group.
group_keys [bool, default True] When calling apply, add group keys to index to iden-
tify pieces.
squeeze [bool, default False] Reduce the dimensionality of the return type if possible,
otherwise return a consistent type.
Deprecated since version 1.1.0.
observed [bool, default False] This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers. If False: show all
values for categorical groupers.
dropna [bool, default True] If True, and if group keys contain NA values, NA values
together with row/column will be dropped. If False, NA values will also be treated
as the key in groups
New in version 1.1.0.
Returns
DataFrameGroupBy Returns a groupby object that contains information about the
groups.
See also:
resample Convenience method for frequency conversion and resampling of time series.
Notes
Examples
Hierarchical Indexes
We can groupby different levels of a hierarchical index using the level parameter:
We can also choose to include NA in group keys or not by setting the dropna parameter; the default setting is
True:
>>> l = [[1, 2, 3], [1, None, 4], [2, 1, 3], [1, 2, 2]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by=["b"]).sum()
a c
b
1.0 2 3
2.0 2 5
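A sketch of the dropna=False variant on the same frame, where the NA key is kept as a group:
>>> df.groupby(by=["b"], dropna=False).sum()
       a  c
b
1.0    2  3
2.0    2  5
NaN    1  4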
>>> l = [["a", 12, 12], [None, 12.3, 33.], ["b", 12.3, 123], ["a", 1, 1]]
>>> df = pd.DataFrame(l, columns=["a", "b", "c"])
>>> df.groupby(by="a").sum()
b c
a
a 13.0 13.0
b 12.3 123.0
pandas.DataFrame.gt
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
>>> df != pd.Series([100, 250], index=["cost", "revenue"])
cost revenue
A True True
B True False
C False True
When comparing to an arbitrary sequence, the number of columns must match the number of elements in
other:
>>> df == [250, 100]
cost revenue
A True True
B False False
C False False
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.head
DataFrame.head(n=5)
Return the first n rows.
This function returns the first n rows for the object based on position. It is useful for quickly testing if
your object has the right type of data in it.
For negative values of n, this function returns all rows except the last n rows, equivalent to df[:-n].
Parameters
n [int, default 5] Number of rows to select.
Returns
same type as caller The first n rows of the caller object.
See also:
Examples
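The outputs below assume a single-column DataFrame like this (a sketch of the setup):
>>> df = pd.DataFrame({'animal': ['alligator', 'bee', 'falcon', 'lion',
...                               'monkey', 'parrot', 'shark', 'whale',
...                               'zebra']})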
>>> df.head()
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
>>> df.head(3)
animal
0 alligator
1 bee
2 falcon
>>> df.head(-3)
animal
0 alligator
1 bee
2 falcon
3 lion
4 monkey
5 parrot
pandas.DataFrame.hist
Examples
This example draws a histogram based on the length and width of some animals, displayed in three bins
>>> df = pd.DataFrame({
... 'length': [1.5, 0.5, 1.2, 0.9, 3],
... 'width': [0.7, 0.2, 0.15, 0.2, 1.1]
... }, index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.hist(bins=3)
pandas.DataFrame.idxmax
DataFrame.idxmax(axis=0, skipna=True)
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The axis to use. 0 or ‘index’ for row-
wise, 1 or ‘columns’ for column-wise.
skipna [bool, default True] Exclude NA/null values. If an entire row/column is NA,
the result will be NA.
Returns
Series Indexes of maxima along the specified axis.
Raises
ValueError
• If the row/column is empty
See also:
Notes
Examples
>>> df
consumption co2_emissions
Pork 10.51 37.20
Wheat Products 103.11 19.66
Beef 55.48 1712.00
By default, it returns the index for the maximum value in each column.
>>> df.idxmax()
consumption Wheat Products
co2_emissions Beef
dtype: object
To return the index for the maximum value in each row, use axis="columns".
>>> df.idxmax(axis="columns")
Pork co2_emissions
Wheat Products consumption
Beef co2_emissions
dtype: object
pandas.DataFrame.idxmin
DataFrame.idxmin(axis=0, skipna=True)
Return index of first occurrence of minimum over requested axis.
NA/null values are excluded.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The axis to use. 0 or ‘index’ for row-
wise, 1 or ‘columns’ for column-wise.
skipna [bool, default True] Exclude NA/null values. If an entire row/column is NA,
the result will be NA.
Returns
Series Indexes of minima along the specified axis.
Raises
ValueError
• If the row/column is empty
See also:
Notes
Examples
>>> df
consumption co2_emissions
Pork 10.51 37.20
Wheat Products 103.11 19.66
Beef 55.48 1712.00
By default, it returns the index for the minimum value in each column.
>>> df.idxmin()
consumption Pork
co2_emissions Wheat Products
dtype: object
To return the index for the minimum value in each row, use axis="columns".
>>> df.idxmin(axis="columns")
Pork consumption
Wheat Products co2_emissions
Beef consumption
dtype: object
pandas.DataFrame.infer_objects
DataFrame.infer_objects()
Attempt to infer better dtypes for object columns.
Attempts soft conversion of object-dtyped columns, leaving non-object and unconvertible columns un-
changed. The inference rules are the same as during normal Series/DataFrame construction.
Returns
converted [same type as input object]
See also:
Examples
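The example assumes an object column produced by slicing off a string row (a sketch of the setup):
>>> df = pd.DataFrame({"A": ["a", 1, 2, 3]})
>>> df = df.iloc[1:]
>>> df
   A
1  1
2  2
3  3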
>>> df.dtypes
A object
dtype: object
>>> df.infer_objects().dtypes
A int64
dtype: object
pandas.DataFrame.info
Examples
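The outputs below assume a small DataFrame roughly like this (a sketch of the setup; the exact text values do not matter):
>>> int_values = [1, 2, 3, 4, 5]
>>> text_values = ['alpha', 'beta', 'gamma', 'delta', 'epsilon']
>>> float_values = [0.0, 0.25, 0.5, 0.75, 1.0]
>>> df = pd.DataFrame({"int_col": int_values, "text_col": text_values,
...                    "float_col": float_values})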
>>> df.info(verbose=True)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 int_col 5 non-null int64
1 text_col 5 non-null object
2 float_col 5 non-null float64
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Prints a summary of columns count and its dtypes but not per column information:
>>> df.info(verbose=False)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Columns: 3 entries, int_col to float_col
dtypes: float64(1), int64(1), object(1)
memory usage: 248.0+ bytes
Pipe output of DataFrame.info to buffer instead of sys.stdout, get buffer content and writes to a text file:
>>> import io
>>> buffer = io.StringIO()
>>> df.info(buf=buffer)
>>> s = buffer.getvalue()
>>> with open("df_info.txt", "w",
... encoding="utf-8") as f:
... f.write(s)
260
The memory_usage parameter allows deep introspection mode, specially useful for big DataFrames and
fine-tune memory optimization:
>>> df.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000000 entries, 0 to 999999
Data columns (total 3 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 column_1 1000000 non-null object
1 column_2 1000000 non-null object
2 column_3 1000000 non-null object
dtypes: object(3)
memory usage: 165.9 MB
pandas.DataFrame.insert
pandas.DataFrame.interpolate
• ‘time’: Works on daily and higher resolution data to interpolate given length
of interval.
• ‘index’, ‘values’: use the actual numerical values of the index.
• ‘pad’: Fill in NaNs using existing values.
• ‘nearest’, ‘zero’, ‘slinear’, ‘quadratic’, ‘cubic’, ‘spline’, ‘barycen-
tric’, ‘polynomial’: Passed to scipy.interpolate.interp1d. These meth-
ods use the numerical values of the index. Both ‘polynomial’ and
‘spline’ require that you also specify an order (int), e.g. df.
interpolate(method='polynomial', order=5).
• ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’, ‘akima’, ‘cubicspline’:
Wrappers around the SciPy interpolation methods of similar names. See Notes.
• ‘from_derivatives’: Refers to scipy.interpolate.BPoly.from_derivatives which
replaces ‘piecewise_polynomial’ interpolation method in scipy 0.18.
axis [{{0 or ‘index’, 1 or ‘columns’, None}}, default None] Axis to interpolate along.
limit [int, optional] Maximum number of consecutive NaNs to fill. Must be greater
than 0.
inplace [bool, default False] Update the data in place if possible.
limit_direction [{{‘forward’, ‘backward’, ‘both’}}, Optional] Consecutive NaNs will
be filled in this direction.
If limit is specified:
• If ‘method’ is ‘pad’ or ‘ffill’, ‘limit_direction’ must be ‘forward’.
• If ‘method’ is ‘backfill’ or ‘bfill’, ‘limit_direction’ must be ‘backwards’.
If ‘limit’ is not specified:
• If ‘method’ is ‘backfill’ or ‘bfill’, the default is ‘backward’
• else the default is ‘forward’
Changed in version 1.1.0: raises ValueError if limit_direction is ‘forward’ or ‘both’
and method is ‘backfill’ or ‘bfill’. raises ValueError if limit_direction is ‘backward’
or ‘both’ and method is ‘pad’ or ‘ffill’.
limit_area [{{None, ‘inside’, ‘outside’}}, default None] If limit is specified, consecu-
tive NaNs will be filled with this restriction.
• None: No fill restriction.
• ‘inside’: Only fill NaNs surrounded by valid values (interpolate).
• ‘outside’: Only fill NaNs outside valid values (extrapolate).
downcast [optional, ‘infer’ or None, defaults to None] Downcast dtypes if possible.
``**kwargs`` [optional] Keyword arguments to pass on to the interpolating function.
Returns
Series or DataFrame or None Returns the same object type as the caller, interpolated
at some or all NaN values or None if inplace=True.
See also:
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the
respective SciPy implementations of similar names. These use the actual numerical values of the index.
For more information on their behavior, see the SciPy documentation and SciPy tutorial.
Examples
Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.
Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’ methods
require that you also specify an order (int).
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use
for interpolation. Note how the first entry in column ‘b’ remains NaN, because there is no entry before it
to use for interpolation.
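A sketch of the column-wise linear fill just described (assumed data, for illustration only):
>>> df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
...                    (np.nan, 2.0, np.nan, np.nan),
...                    (2.0, 3.0, np.nan, 9.0),
...                    (np.nan, 4.0, -4.0, 16.0)],
...                   columns=list('abcd'))
>>> df.interpolate(method='linear', limit_direction='forward', axis=0)
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0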
pandas.DataFrame.isin
DataFrame.isin(values)
Whether each element in the DataFrame is contained in values.
Parameters
values [iterable, Series, DataFrame or dict] The result will only be true at a location if
all the labels match. If values is a Series, that’s the index. If values is a dict, the
keys must be the column names, which must match. If values is a DataFrame, then
both the index and column labels must match.
Returns
DataFrame DataFrame of booleans showing whether each element in the DataFrame
is contained in values.
See also:
Examples
When values is a list check whether every value in the DataFrame is present in the list (which animals
have 0 or 2 legs or wings)
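A minimal sketch of the list case (assumed data, for illustration only):
>>> df = pd.DataFrame({'num_legs': [2, 4], 'num_wings': [2, 0]},
...                   index=['falcon', 'dog'])
>>> df.isin([0, 2])
        num_legs  num_wings
falcon      True       True
dog        False       True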
When values is a dict, we can pass values to check for each column separately:
When values is a Series or DataFrame the index and column must match. Note that ‘falcon’ does not
match based on the number of legs in df2.
pandas.DataFrame.isna
DataFrame.isna()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.
NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty
strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.
use_inf_as_na = True).
Returns
DataFrame Mask of bool values for each element in DataFrame that indicates whether
an element is an NA value.
See also:
Examples
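The outputs below (here and in isnull) assume data along these lines (a sketch of the setup):
>>> df = pd.DataFrame(dict(age=[5, 6, np.nan],
...                        born=[pd.NaT, pd.Timestamp('1939-05-27'),
...                              pd.Timestamp('1940-04-25')],
...                        name=['Alfred', 'Batman', ''],
...                        toy=[None, 'Batmobile', 'Joker']))
>>> ser = pd.Series([5, 6, np.nan])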
>>> df.isna()
age born name toy
0 False True False True
1 False False False False
2 True False False False
>>> ser.isna()
0 False
1 False
2 True
dtype: bool
pandas.DataFrame.isnull
DataFrame.isnull()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None or numpy.
NaN, get mapped to True values. Everything else gets mapped to False values. Characters such as empty
strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.
use_inf_as_na = True).
Returns
DataFrame Mask of bool values for each element in DataFrame that indicates whether
an element is an NA value.
See also:
Examples
>>> df.isna()
age born name toy
0 False True False True
1 False False False False
2 True False False False
>>> ser.isna()
0 False
1 False
2 True
dtype: bool
pandas.DataFrame.items
DataFrame.items()
Iterate over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
Yields
label [object] The column names for the DataFrame being iterated over.
content [Series] The column entries belonging to each label, as a Series.
See also:
Examples
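A minimal sketch of iterating over columns (assumed data, for illustration only):
>>> df = pd.DataFrame({'species': ['bear', 'bear', 'marsupial'],
...                    'population': [1864, 22000, 80000]},
...                   index=['panda', 'polar', 'koala'])
>>> for label, content in df.items():
...     print(f'label: {label}')
...     print(content)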
pandas.DataFrame.iteritems
DataFrame.iteritems()
Iterate over (column name, Series) pairs.
Iterates over the DataFrame columns, returning a tuple with the column name and the content as a Series.
Yields
label [object] The column names for the DataFrame being iterated over.
content [Series] The column entries belonging to each label, as a Series.
See also:
Examples
pandas.DataFrame.iterrows
DataFrame.iterrows()
Iterate over DataFrame rows as (index, Series) pairs.
Yields
index [label or tuple of label] The index of the row. A tuple for a MultiIndex.
data [Series] The data of the row as a Series.
See also:
Notes
1. Because iterrows returns a Series for each row, it does not preserve dtypes across the rows
(dtypes are preserved across columns for DataFrames). For example, see the sketch after these notes.
To preserve dtypes while iterating over the rows, it is better to use itertuples() which returns
namedtuples of the values and which is generally faster than iterrows.
2. You should never modify something you are iterating over. This is not guaranteed to work in all
cases. Depending on the data types, the iterator returns a copy and not a view, and writing to it will
have no effect.
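A sketch of the dtype point from note 1 (assumed data): an int column is upcast once a float appears in the same row.
>>> df = pd.DataFrame([[1, 1.5]], columns=['int', 'float'])
>>> row = next(df.iterrows())[1]
>>> row
int      1.0
float    1.5
Name: 0, dtype: float64
>>> print(row['int'].dtype)
float64
>>> print(df['int'].dtype)
int64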
pandas.DataFrame.itertuples
DataFrame.itertuples(index=True, name='Pandas')
Iterate over DataFrame rows as namedtuples.
Parameters
index [bool, default True] If True, return the index as the first element of the tuple.
name [str or None, default “Pandas”] The name of the returned namedtuples or None
to return regular tuples.
Returns
iterator An object to iterate over namedtuples for each row in the DataFrame with the
first field possibly being the index and following fields being the column values.
See also:
Notes
The column names will be renamed to positional names if they are invalid Python identifiers, repeated,
or start with an underscore. On python versions < 3.7 regular tuples are returned for DataFrames with a
large number of columns (>254).
Examples
By setting the index parameter to False we can remove the index as the first element of the tuple:
With the name parameter set we set a custom name for the yielded namedtuples:
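A minimal sketch covering the default call and the index/name variants described above (assumed data):
>>> df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},
...                   index=['dog', 'hawk'])
>>> for row in df.itertuples():
...     print(row)
Pandas(Index='dog', num_legs=4, num_wings=0)
Pandas(Index='hawk', num_legs=2, num_wings=2)
>>> for row in df.itertuples(index=False):
...     print(row)
Pandas(num_legs=4, num_wings=0)
Pandas(num_legs=2, num_wings=2)
>>> for row in df.itertuples(name='Animal'):
...     print(row)
Animal(Index='dog', num_legs=4, num_wings=0)
Animal(Index='hawk', num_legs=2, num_wings=2)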
pandas.DataFrame.join
how [{‘left’, ‘right’, ‘outer’, ‘inner’}, default ‘left’] How to handle the operation of the
two objects.
• left: use calling frame’s index (or column if on is specified)
• right: use other’s index.
• outer: form union of calling frame’s index (or column if on is specified) with
other’s index, and sort it lexicographically.
• inner: form intersection of calling frame’s index (or column if on is specified)
with other’s index, preserving the order of the calling’s one.
lsuffix [str, default ‘’] Suffix to use from left frame’s overlapping columns.
rsuffix [str, default ‘’] Suffix to use from right frame’s overlapping columns.
sort [bool, default False] Order result DataFrame lexicographically by the join key. If
False, the order of the join key depends on the join type (how keyword).
Returns
DataFrame A dataframe containing columns from both the caller and other.
See also:
Notes
Parameters on, lsuffix, and rsuffix are not supported when passing a list of DataFrame objects.
Support for specifying index levels as the on parameter was added in version 0.23.0.
Examples
>>> df
key A
0 K0 A0
1 K1 A1
2 K2 A2
3 K3 A3
4 K4 A4
5 K5 A5
>>> other
key B
0 K0 B0
1 K1 B1
2 K2 B2
If we want to join using the key columns, we need to set key to be the index in both df and other. The
joined DataFrame will have key as its index.
>>> df.set_index('key').join(other.set_index('key'))
A B
key
K0 A0 B0
K1 A1 B1
K2 A2 B2
K3 A3 NaN
K4 A4 NaN
K5 A5 NaN
Another option to join using the key columns is to use the on parameter. DataFrame.join always uses
other’s index but we can use any column in df. This method preserves the original DataFrame’s index in
the result.
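A sketch of the on-based join just described, using the same df and other:
>>> df.join(other.set_index('key'), on='key')
  key   A    B
0  K0  A0   B0
1  K1  A1   B1
2  K2  A2   B2
3  K3  A3  NaN
4  K4  A4  NaN
5  K5  A5  NaN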
pandas.DataFrame.keys
DataFrame.keys()
Get the ‘info axis’ (see Indexing for more).
This is index for Series, columns for DataFrame.
Returns
Index Info axis.
pandas.DataFrame.kurt
pandas.DataFrame.kurtosis
pandas.DataFrame.last
DataFrame.last(offset)
Select final periods of time series data based on a date offset.
When having a DataFrame with dates as index, this function can select the last few rows based on a date
offset.
Parameters
offset [str, DateOffset, dateutil.relativedelta] The offset length of the data that will be
selected. For instance, ‘3D’ will display all the rows having their index within the
last 3 days.
Returns
Series or DataFrame A subset of the caller.
Raises
TypeError If the index is not a DatetimeIndex
See also:
Examples
>>> ts.last('3D')
A
2018-04-13 3
2018-04-15 4
Notice that the data for the last 3 calendar days were returned, not the last 3 observed days in the dataset,
and therefore data for 2018-04-11 was not returned.
pandas.DataFrame.last_valid_index
DataFrame.last_valid_index()
Return index for last non-NA/null value.
Returns
scalar [type of index]
Notes
If all elements are non-NA/null, returns None. Also returns None for empty Series/DataFrame.
pandas.DataFrame.le
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of elements in
other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.lookup
DataFrame.lookup(row_labels, col_labels)
Label-based “fancy indexing” function for DataFrame. Given equal-length arrays of row and column
labels, return an array of the values corresponding to each (row, col) pair.
Deprecated since version 1.2.0: DataFrame.lookup is deprecated, use DataFrame.melt and DataFrame.loc
instead. For an example see lookup() in the user guide.
Parameters
row_labels [sequence] The row labels to use for lookup.
col_labels [sequence] The column labels to use for lookup.
Returns
numpy.ndarray The found values.
pandas.DataFrame.lt
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of elements in
other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.mad
pandas.DataFrame.mask
Notes
The mask method is an application of the if-then idiom. For each element in the calling DataFrame, if
cond is False the element is used; otherwise the corresponding element from the DataFrame other
is used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m,
df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the mask documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
>>> s.mask(s > 0)
0 0.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
pandas.DataFrame.max
Examples
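The outputs here (and in the min examples further down) assume a MultiIndexed Series along these lines (a sketch of the setup):
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)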
>>> s.max()
8
>>> s.max(level='blooded')
blooded
warm 4
cold 8
Name: legs, dtype: int64
>>> s.max(level=0)
blooded
warm 4
cold 8
Name: legs, dtype: int64
pandas.DataFrame.mean
Returns
Series or DataFrame (if level specified)
pandas.DataFrame.median
pandas.DataFrame.melt
Examples
pandas.DataFrame.memory_usage
DataFrame.memory_usage(index=True, deep=False)
Return the memory usage of each column in bytes.
The memory usage can optionally include the contribution of the index and elements of object dtype.
This value is displayed in DataFrame.info by default. This can be suppressed by setting pandas.
options.display.memory_usage to False.
Parameters
index [bool, default True] Specifies whether to include the memory usage of the
DataFrame’s index in returned Series. If index=True, the memory usage of
the index is the first item in the output.
deep [bool, default False] If True, introspect the data deeply by interrogating object
dtypes for system-level memory consumption, and include it in the returned values.
Returns
Series A Series whose index is the original column names and whose values are the
memory usage of each column in bytes.
See also:
Examples
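The outputs below assume a 5000-row frame with one column per dtype (a sketch of the setup):
>>> dtypes = ['int64', 'float64', 'complex128', 'object', 'bool']
>>> data = dict([(t, np.ones(shape=5000, dtype=int).astype(t))
...              for t in dtypes])
>>> df = pd.DataFrame(data)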
>>> df.memory_usage()
Index 128
int64 40000
float64 40000
complex128 80000
object 40000
bool 5000
dtype: int64
>>> df.memory_usage(index=False)
int64 40000
float64 40000
complex128 80000
object 40000
bool 5000
dtype: int64
>>> df.memory_usage(deep=True)
Index 128
int64 40000
float64 40000
complex128 80000
object 180000
bool 5000
dtype: int64
Use a Categorical for efficient storage of an object-dtype column with many repeated values.
>>> df['object'].astype('category').memory_usage(deep=True)
5244
pandas.DataFrame.merge
Notes
Support for specifying index levels as the on, left_on, and right_on parameters was added in version 0.23.0
Support for merging named Series objects was added in version 0.24.0
Examples
Merge df1 and df2 on the lkey and rkey columns. The value columns have the default suffixes, _x and _y,
appended.
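A sketch of the frames and the default-suffix merge just described (assumed data):
>>> df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [1, 2, 3, 5]})
>>> df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
...                     'value': [5, 6, 7, 8]})
>>> df1.merge(df2, left_on='lkey', right_on='rkey')
  lkey  value_x rkey  value_y
0  foo        1  foo        5
1  foo        1  foo        8
2  foo        5  foo        5
3  foo        5  foo        8
4  bar        2  bar        6
5  baz        3  baz        7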
Merge DataFrames df1 and df2 with specified left and right suffixes appended to any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey',
... suffixes=('_left', '_right'))
lkey value_left rkey value_right
0 foo 1 foo 5
1 foo 1 foo 8
2 foo 5 foo 5
3 foo 5 foo 8
4 bar 2 bar 6
5 baz 3 baz 7
Merge DataFrames df1 and df2, but raise an exception if the DataFrames have any overlapping columns.
>>> df1.merge(df2, left_on='lkey', right_on='rkey', suffixes=(False, False))
Traceback (most recent call last):
...
ValueError: columns overlap but no suffix specified:
Index(['value'], dtype='object')
pandas.DataFrame.min
Examples
>>> s.min()
0
>>> s.min(level='blooded')
blooded
warm 2
cold 0
Name: legs, dtype: int64
>>> s.min(level=0)
blooded
warm 2
cold 0
Name: legs, dtype: int64
pandas.DataFrame.mod
Returns
DataFrame Result of the arithmetic operation.
See also:
Notes
Examples
Add a scalar using the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.mode
Examples
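The outputs below assume a DataFrame along these lines (a sketch of the setup):
>>> df = pd.DataFrame([('bird', 2, 2),
...                    ('mammal', 4, np.nan),
...                    ('arthropod', 8, 0),
...                    ('bird', 2, np.nan)],
...                   index=('falcon', 'horse', 'spider', 'ostrich'),
...                   columns=('species', 'legs', 'wings'))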
By default, missing values are not considered, and the modes of wings are both 0 and 2. Because the
resulting DataFrame has two rows, the second row of species and legs contains NaN.
>>> df.mode()
species legs wings
0 bird 2.0 0.0
1 NaN NaN 2.0
Setting dropna=False, NaN values are considered and they can be the mode (as for wings).
>>> df.mode(dropna=False)
species legs wings
0 bird 2 NaN
Setting numeric_only=True, only the mode of numeric columns is computed, and columns of other
types are ignored.
>>> df.mode(numeric_only=True)
legs wings
0 2.0 0.0
1 NaN 2.0
To compute the mode over columns and not rows, use the axis parameter:
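A sketch of that row-wise mode on the same frame, restricted to numeric columns:
>>> df.mode(axis='columns', numeric_only=True)
           0    1
falcon   2.0  NaN
horse    4.0  NaN
spider   0.0  8.0
ostrich  2.0  NaN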
pandas.DataFrame.mul
other [scalar, sequence, Series, or DataFrame] Any single or multiple element data
structure, or list-like object.
axis [{0 or ‘index’, 1 or ‘columns’}] Whether to compare by the index (0 or ‘index’)
or columns (1 or ‘columns’). For Series input, axis to match Series index on.
level [int or label] Broadcast across a level, matching Index values on the passed Mul-
tiIndex level.
fill_value [float or None, default None] Fill existing missing (NaN) values, and any
new element needed for successful DataFrame alignment, with this value before
computation. If data in both corresponding DataFrame locations is missing the
result will be missing.
Returns
DataFrame Result of the arithmetic operation.
See also:
Notes
Examples
Add a scalar using the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.multiply
Notes
Examples
Add a scalar using the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.ne
Notes
Mismatched indices will be unioned together. NaN values are considered different (i.e. NaN != NaN).
Examples
>>> df == 100
cost revenue
A False True
B False False
C True False
>>> df.eq(100)
cost revenue
A False True
B False False
C True False
When other is a Series, the columns of a DataFrame are aligned with the index of other and broadcast:
When comparing to an arbitrary sequence, the number of columns must match the number of elements in
other:
>>> df.gt(other)
cost revenue
A False False
B False False
C False True
D False False
pandas.DataFrame.nlargest
Notes
This function cannot be used with all column types. For example, when specifying columns with object
or category dtypes, TypeError is raised.
Examples
In the following example, we will use nlargest to select the three rows having the largest values in
column “population”.
To order by the largest values in column “population” and then “GDP”, we can specify multiple columns
like in the next example.
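A sketch of both calls; the same DataFrame is assumed by the nsmallest examples further down:
>>> df = pd.DataFrame({'population': [59000000, 65000000, 434000,
...                                   434000, 434000, 337000, 11300,
...                                   11300, 11300],
...                    'GDP': [1937894, 2583560, 12011, 4520, 12128,
...                            17036, 182, 38, 311],
...                    'alpha-2': ["IT", "FR", "MT", "MV", "BN",
...                                "IS", "NR", "TV", "AI"]},
...                   index=["Italy", "France", "Malta",
...                          "Maldives", "Brunei", "Iceland",
...                          "Nauru", "Tuvalu", "Anguilla"])
>>> df.nlargest(3, 'population')
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Malta       434000    12011      MT
>>> df.nlargest(3, ['population', 'GDP'])
        population      GDP alpha-2
France    65000000  2583560      FR
Italy     59000000  1937894      IT
Brunei      434000    12128      BN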
pandas.DataFrame.notna
DataFrame.notna()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to
True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set
pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN,
get mapped to False values.
Returns
DataFrame Mask of bool values for each element in DataFrame that indicates whether
an element is not an NA value.
See also:
Examples
>>> df.notna()
age born name toy
0 True False True False
1 True True True True
2 False True True True
>>> ser.notna()
0 True
1 True
2 False
dtype: bool
pandas.DataFrame.notnull
DataFrame.notnull()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to
True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set
pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN,
get mapped to False values.
Returns
DataFrame Mask of bool values for each element in DataFrame that indicates whether
an element is not an NA value.
See also:
Examples
>>> df.notna()
age born name toy
0 True False True False
1 True True True True
2 False True True True
>>> ser.notna()
0 True
1 True
2 False
dtype: bool
pandas.DataFrame.nsmallest
Examples
In the following example, we will use nsmallest to select the three rows having the smallest values in
column “population”.
>>> df.nsmallest(3, 'population')
population GDP alpha-2
Tuvalu 11300 38 TV
Anguilla 11300 311 AI
Iceland 337000 17036 IS
To order by the smallest values in column “population” and then “GDP”, we can specify multiple columns
like in the next example.
>>> df.nsmallest(3, ['population', 'GDP'])
population GDP alpha-2
Tuvalu 11300 38 TV
Anguilla 11300 311 AI
Nauru 337000 182 NR
pandas.DataFrame.nunique
DataFrame.nunique(axis=0, dropna=True)
Count distinct observations over requested axis.
Return Series with number of distinct observations. Can ignore NaN values.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The axis to use. 0 or ‘index’ for row-
wise, 1 or ‘columns’ for column-wise.
dropna [bool, default True] Don’t include NaN in the counts.
Returns
Series
See also:
Examples
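The example assumes a small frame such as this (a sketch of the setup, including the column-wise default):
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
>>> df.nunique()
A    3
B    1
dtype: int64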
>>> df.nunique(axis=1)
0 1
1 2
2 2
dtype: int64
pandas.DataFrame.pad
pandas.DataFrame.pct_change
See also:
Examples
Series
>>> s = pd.Series([90, 91, 85])
>>> s
0 90
1 91
2 85
dtype: int64
>>> s.pct_change()
0 NaN
1 0.011111
2 -0.065934
dtype: float64
>>> s.pct_change(periods=2)
0 NaN
1 NaN
2 -0.055556
dtype: float64
See the percentage change in a Series where NAs are filled with the last valid observation carried forward to the next valid one.
>>> s = pd.Series([90, 91, None, 85])
>>> s
0 90.0
1 91.0
2 NaN
3 85.0
dtype: float64
>>> s.pct_change(fill_method='ffill')
0 NaN
1 0.011111
2 0.000000
3 -0.065934
dtype: float64
DataFrame
Percentage change in French franc, Deutsche Mark, and Italian lira from 1980-01-01 to 1980-03-01.
>>> df = pd.DataFrame({
... 'FR': [4.0405, 4.0963, 4.3149],
... 'GR': [1.7246, 1.7482, 1.8519],
... 'IT': [804.74, 810.01, 860.13]},
... index=['1980-01-01', '1980-02-01', '1980-03-01'])
>>> df.pct_change()
FR GR IT
1980-01-01 NaN NaN NaN
1980-02-01 0.013810 0.013684 0.006549
1980-03-01 0.053365 0.059318 0.061876
Percentage change in GOOG and APPL stock volume. This shows computing the percentage change between
columns.
>>> df = pd.DataFrame({
... '2016': [1769950, 30586265],
... '2015': [1500923, 40912316],
... '2014': [1371819, 41403351]},
... index=['GOOG', 'APPL'])
>>> df
2016 2015 2014
GOOG 1769950 1500923 1371819
APPL 30586265 40912316 41403351
>>> df.pct_change(axis='columns')
2016 2015 2014
GOOG NaN -0.151997 -0.086016
APPL NaN 0.337604 0.012002
pandas.DataFrame.pipe
Notes
Use .pipe when chaining together functions that expect Series, DataFrames or GroupBy objects. Instead
of writing
>>> func(g(h(df), arg1=a), arg2=b, arg3=c)
You can write
>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe(func, arg2=b, arg3=c)
... )
If you have a function that takes the data as (say) the second argument, pass a tuple indicating which
keyword expects the data. For example, suppose f takes its data as arg2:
>>> (df.pipe(h)
... .pipe(g, arg1=a)
... .pipe((func, 'arg2'), arg1=a, arg3=c)
... )
pandas.DataFrame.pivot
DataFrame.pivot_table Generalization of pivot that can handle duplicate values for one in-
dex/column pair.
DataFrame.unstack Pivot based on the index values instead of a column.
wide_to_long Wide panel to long format. Less flexible but more user-friendly than melt.
Notes
For finer-tuned control, see hierarchical indexing documentation along with the related stack/unstack
methods.
Examples
You could also assign a list of column names or a list of index names.
>>> df = pd.DataFrame({
... "lev1": [1, 1, 1, 2, 2, 2],
... "lev2": [1, 1, 2, 1, 1, 2],
... "lev3": [1, 2, 1, 2, 1, 2],
... "lev4": [1, 2, 3, 4, 5, 6],
... "values": [0, 1, 2, 3, 4, 5]})
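A sketch of pivoting with a list of column names on the frame above, followed by the (assumed) frame used in the duplicate-entry example below:
>>> df.pivot(index="lev1", columns=["lev2", "lev3"], values="values")
lev2    1         2
lev3    1    2    1    2
lev1
1     0.0  1.0  2.0  NaN
2     4.0  3.0  NaN  5.0
>>> df = pd.DataFrame({"foo": ['one', 'one', 'two', 'two'],
...                    "bar": ['A', 'A', 'B', 'C'],
...                    "baz": [1, 2, 3, 4]})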
Notice that the first two rows are the same for our index and columns arguments.
>>> df.pivot(index='foo', columns='bar', values='baz')
Traceback (most recent call last):
...
ValueError: Index contains duplicate entries, cannot reshape
pandas.DataFrame.pivot_table
index [column, Grouper, array, or list of the previous] If an array is passed, it must be
the same length as the data. The list can contain any of the other types (except list).
Keys to group by on the pivot table index. If an array is passed, it is used in the same
manner as column values.
columns [column, Grouper, array, or list of the previous] If an array is passed, it must
be the same length as the data. The list can contain any of the other types (except
list). Keys to group by on the pivot table column. If an array is passed, it is used in
the same manner as column values.
aggfunc [function, list of functions, dict, default numpy.mean] If list of functions
passed, the resulting pivot table will have hierarchical columns whose top level
are the function names (inferred from the function objects themselves) If dict is
passed, the key is column to aggregate and value is function or list of functions.
fill_value [scalar, default None] Value to replace missing values with (in the resulting
pivot table, after aggregation).
margins [bool, default False] Add all row / columns (e.g. for subtotal / grand totals).
dropna [bool, default True] Do not include columns whose entries are all NaN.
margins_name [str, default ‘All’] Name of the row / column that will contain the totals
when margins is True.
observed [bool, default False] This only applies if any of the groupers are Categoricals.
If True: only show observed values for categorical groupers. If False: show all
values for categorical groupers.
Changed in version 0.25.0.
Returns
DataFrame An Excel style pivot table.
See also:
Examples
The next example aggregates by taking the mean across multiple columns.
We can also calculate multiple types of aggregations for any given value column.
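A minimal sketch of an aggregation like those described (assumed data, for illustration only):
>>> df = pd.DataFrame({"A": ["foo", "foo", "foo", "foo", "foo",
...                          "bar", "bar", "bar", "bar"],
...                    "B": ["one", "one", "one", "two", "two",
...                          "one", "one", "two", "two"],
...                    "C": ["small", "large", "large", "small",
...                          "small", "large", "small", "small",
...                          "large"],
...                    "D": [1, 2, 2, 3, 3, 4, 5, 6, 7]})
>>> df.pivot_table(values='D', index=['A', 'B'], columns=['C'],
...                aggfunc=np.sum)
C        large  small
A   B
bar one    4.0    5.0
    two    7.0    6.0
foo one    4.0    1.0
    two    NaN    6.0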
pandas.DataFrame.plot
DataFrame.plot(*args, **kwargs)
Make plots of Series or DataFrame.
Uses the backend specified by the option plotting.backend. By default, matplotlib is used.
Parameters
data [Series or DataFrame] The object for which the method is called.
x [label or position, default None] Only used if data is a DataFrame.
y [label, position or list of label, positions, default None] Allows plotting of one column
versus another. Only used if data is a DataFrame.
kind [str] The kind of plot to produce:
• ‘line’ : line plot (default)
• ‘bar’ : vertical bar plot
• ‘barh’ : horizontal bar plot
• ‘hist’ : histogram
• ‘box’ : boxplot
• ‘kde’ : Kernel Density Estimation plot
• ‘density’ : same as ‘kde’
• ‘area’ : area plot
• ‘pie’ : pie plot
• ‘scatter’ : scatter plot
• ‘hexbin’ : hexbin plot.
ax [matplotlib axes object, default None] An axes of the current figure.
subplots [bool, default False] Make separate subplots for each column.
sharex [bool, default True if ax is None else False] In case subplots=True, share x
axis and set some x axis labels to invisible; defaults to True if ax is None otherwise
False if an ax is passed in; Be aware, that passing in both an ax and sharex=True
will alter all x axis labels for all axis in a figure.
sharey [bool, default False] In case subplots=True, share y axis and set some y
axis labels to invisible.
layout [tuple, optional] (rows, columns) for the layout of subplots.
figsize [a tuple (width, height) in inches] Size of a figure object.
use_index [bool, default True] Use index as ticks for x axis.
title [str or list] Title to use for the plot. If a string is passed, print the string at the top
of the figure. If a list is passed and subplots is True, print each item in the list above
the corresponding subplot.
grid [bool, default None (matlab style default)] Axis grid lines.
legend [bool or {‘reverse’}] Place legend on axis subplots.
style [list or dict] The matplotlib line style per column.
logx [bool or ‘sym’, default False] Use log scaling or symlog scaling on x axis.
Changed in version 0.25.0.
logy [bool or ‘sym’, default False] Use log scaling or symlog scaling on y axis.
Changed in version 0.25.0.
loglog [bool or ‘sym’, default False] Use log scaling or symlog scaling on both x and y axes.
Changed in version 0.25.0.
xticks [sequence] Values to use for the xticks.
yticks [sequence] Values to use for the yticks.
xlim [2-tuple/list] Set the x limits of the current axes.
ylim [2-tuple/list] Set the y limits of the current axes.
xlabel [label, optional] Name to use for the xlabel on x-axis. Default uses index name
as xlabel, or the x-column name for planar plots.
New in version 1.1.0.
Changed in version 1.2.0: Now applicable to planar plots (scatter, hexbin).
ylabel [label, optional] Name to use for the ylabel on y-axis. Default will show no
ylabel, or the y-column name for planar plots.
New in version 1.1.0.
Changed in version 1.2.0: Now applicable to planar plots (scatter, hexbin).
rot [int, default None] Rotation for ticks (xticks for vertical, yticks for horizontal plots).
fontsize [int, default None] Font size for xticks and yticks.
colormap [str or matplotlib colormap object, default None] Colormap to select colors
from. If string, load colormap with that name from matplotlib.
colorbar [bool, optional] If True, plot colorbar (only relevant for ‘scatter’ and ‘hexbin’
plots).
position [float] Specify relative alignments for bar plot layout. From 0 (left/bottom-
end) to 1 (right/top-end). Default is 0.5 (center).
table [bool, Series or DataFrame, default False] If True, draw a table using the data in
the DataFrame and the data will be transposed to meet matplotlib’s default layout.
If a Series or DataFrame is passed, use passed data to draw a table.
yerr [DataFrame, Series, array-like, dict and str] See Plotting with Error Bars for de-
tail.
xerr [DataFrame, Series, array-like, dict and str] Equivalent to yerr.
stacked [bool, default False in line and bar plots, and True in area plot] If True, create
stacked plot.
sort_columns [bool, default False] Sort column names to determine plot ordering.
secondary_y [bool or sequence, default False] Whether to plot on the secondary y-axis;
if a list/tuple, which columns to plot on the secondary y-axis.
mark_right [bool, default True] When using a secondary_y axis, automatically mark
the column labels with “(right)” in the legend.
include_bool [bool, default is False] If True, boolean values can be plotted.
backend [str, default None] Backend to use instead of the backend specified in the
option plotting.backend. For instance, ‘matplotlib’. Alternatively, to
specify the plotting.backend for the whole session, set pd.options.
plotting.backend.
New in version 1.0.0.
**kwargs Options to pass to matplotlib plotting method.
Returns
matplotlib.axes.Axes or numpy.ndarray of them If the backend is not the
default matplotlib one, the return value will be the object returned by the back-
end.
Notes
pandas.DataFrame.pop
DataFrame.pop(item)
Return item and drop from frame. Raise KeyError if not found.
Parameters
item [label] Label of column to be popped.
Returns
Series
Examples
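The DataFrame popped from below is not reproduced in this excerpt; a construction consistent with
the output shown is:
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=('name', 'class', 'max_speed'))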
>>> df.pop('class')
0 bird
1 bird
2 mammal
3 mammal
Name: class, dtype: object
>>> df
name max_speed
0 falcon 389.0
1 parrot 24.0
2 lion 80.5
3 monkey NaN
pandas.DataFrame.pow
Notes
Examples
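The df and other frames used in this and the following arithmetic examples (radd, rdiv, rfloordiv,
rmod, rmul, rpow, rsub, rtruediv, sub, subtract, truediv) are not shown in this excerpt; a
construction consistent with the output is:
>>> df = pd.DataFrame({'angles': [0, 3, 4],
...                    'degrees': [360, 180, 360]},
...                   index=['circle', 'triangle', 'rectangle'])
>>> other = pd.DataFrame({'angles': [0, 3, 4]},
...                      index=['circle', 'triangle', 'rectangle'])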
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.prod
Examples
>>> pd.Series([]).prod()
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
pandas.DataFrame.product
Examples
>>> pd.Series([]).prod()
1.0
>>> pd.Series([]).prod(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).prod()
1.0
>>> pd.Series([np.nan]).prod(min_count=1)
nan
pandas.DataFrame.quantile
See also:
Examples
Specifying numeric_only=False will also compute the quantile of datetime and timedelta data.
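A minimal sketch of that behaviour, with illustrative column names and values:
>>> df = pd.DataFrame({'A': [1, 2],
...                    'B': [pd.Timestamp('2010'),
...                          pd.Timestamp('2011')],
...                    'C': [pd.Timedelta('1 days'),
...                          pd.Timedelta('2 days')]})
>>> df.quantile(0.5, numeric_only=False)
A                    1.5
B    2010-07-02 12:00:00
C        1 days 12:00:00
Name: 0.5, dtype: object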
pandas.DataFrame.query
New in version 1.0.0: Expanding functionality of backtick quoting for more than
only spaces.
inplace [bool] Whether the query should modify the data in place or return a modified
copy.
**kwargs See the documentation for eval() for complete details on the keyword
arguments accepted by DataFrame.query().
Returns
DataFrame or None DataFrame resulting from the provided query expression or None
if inplace=True.
See also:
Notes
The result of the evaluation of this expression is first passed to DataFrame.loc and if that fails be-
cause of a multidimensional key (e.g., a DataFrame) then the result will be passed to DataFrame.
__getitem__().
This method uses the top-level eval() function to evaluate the passed query.
The query() method uses a slightly modified Python syntax by default. For example, the & and |
(bitwise) operators have the precedence of their boolean cousins, and and or. This is syntactically valid
Python; however, the semantics are different.
You can change the semantics of the expression by passing the keyword argument parser='python'.
This enforces the same semantics as evaluation in Python space. Likewise, you can pass
engine='python' to evaluate an expression using Python itself as a backend. This is not recom-
mended as it is inefficient compared to using numexpr as the engine.
The DataFrame.index and DataFrame.columns attributes of the DataFrame instance are
placed in the query namespace by default, which allows you to treat both the index and columns of
the frame as a column in the frame. The identifier index is used for the frame index; you can also use
the name of the index to identify it in a query. Please note that Python keywords may not be used as
identifiers.
For further details and examples see the query documentation in indexing.
Backtick quoted variables
Backtick quoted variables are parsed as literal Python code and are converted internally to a valid Python
identifier. This can lead to the following problems.
During parsing a number of disallowed characters inside the backtick quoted string are replaced by strings
that are allowed as a Python identifier. These characters include all operators in Python, the space charac-
ter, the question mark, the exclamation mark, the dollar sign, and the euro sign. For other characters that
fall outside the ASCII range (U+0001..U+007F) and those that are not further specified in PEP 3131, the
query parser will raise an error. This excludes whitespace different than the space character, but also the
hashtag (as it is used for comments) and the backtick itself (backtick can also not be escaped).
In a special case, quotes that make a pair around a backtick can confuse the parser. For example, `it's`
> `that's` will raise an error, as it forms a quoted string ('s > `that') with a backtick inside.
Examples
For columns with spaces in their name, you can use backtick quoting.
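A short sketch, with an illustrative column name containing a space:
>>> df = pd.DataFrame({'A': range(1, 6),
...                    'B': range(10, 0, -2),
...                    'C C': range(10, 5, -1)})
>>> df.query('B == `C C`')
   A   B  C C
0  1  10   10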
pandas.DataFrame.radd
fill_value [float or None, default None] Fill existing missing (NaN) values, and any
new element needed for successful DataFrame alignment, with this value before
computation. If data in both corresponding DataFrame locations is missing the
result will be missing.
Returns
DataFrame Result of the arithmetic operation.
See also:
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rank
pct [bool, default False] Whether or not to display the returned rankings in percentile
form.
Returns
same type as caller Return a Series or DataFrame with data ranks as values.
See also:
Examples
The following example shows how the method behaves with the above parameters:
• default_rank: this is the default behaviour obtained without using any parameter.
• max_rank: setting method = 'max' the records that have the same values are ranked using the
highest rank (e.g.: since ‘cat’ and ‘dog’ are both in the 2nd and 3rd position, rank 3 is assigned.)
• NA_bottom: choosing na_option = 'bottom', if there are records with NaN values they are
placed at the bottom of the ranking.
• pct_rank: when setting pct = True, the ranking is expressed as percentile rank.
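A sketch of the setup these bullet points refer to, with illustrative data:
>>> df = pd.DataFrame(data={'Animal': ['cat', 'penguin', 'dog',
...                                    'spider', 'snake'],
...                         'Number_legs': [4, 2, 4, 8, np.nan]})
>>> df['default_rank'] = df['Number_legs'].rank()
>>> df['max_rank'] = df['Number_legs'].rank(method='max')
>>> df['NA_bottom'] = df['Number_legs'].rank(na_option='bottom')
>>> df['pct_rank'] = df['Number_legs'].rank(pct=True)
>>> df
    Animal  Number_legs  default_rank  max_rank  NA_bottom  pct_rank
0      cat          4.0           2.5       3.0        2.5     0.625
1  penguin          2.0           1.0       1.0        1.0     0.250
2      dog          4.0           2.5       3.0        2.5     0.625
3   spider          8.0           4.0       4.0        4.0     1.000
4    snake          NaN           NaN       NaN        5.0       NaN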
pandas.DataFrame.rdiv
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.reindex
Examples
Create a new index and reindex the dataframe. By default values in the new index that do not have
corresponding records in the dataframe are assigned NaN.
We can fill in the missing values by passing a value to the keyword fill_value. Because the index is
not monotonically increasing or decreasing, we cannot use arguments to the keyword method to fill the
NaN values.
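A sketch of these first two steps, with illustrative data; new index labels that are missing from the
original frame receive the fill_value:
>>> df = pd.DataFrame({'http_status': [200, 200, 404, 404, 301]},
...                   index=['Firefox', 'Chrome', 'Safari', 'IE10',
...                          'Konqueror'])
>>> new_index = ['Safari', 'Iceweasel', 'Comodo Dragon', 'IE10', 'Chrome']
>>> df.reindex(new_index, fill_value=0)
               http_status
Safari                 404
Iceweasel                0
Comodo Dragon            0
IE10                   404
Chrome                 200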
To further illustrate the filling functionality in reindex, we will create a dataframe with a monotonically
increasing index (for example, a sequence of dates).
The index entries that did not have a value in the original data frame (for example, ‘2009-12-29’) are by
default filled with NaN. If desired, we can fill in the missing values using one of several options.
For example, to back-propagate the last valid value to fill the NaN values, pass bfill as an argument to
the method keyword.
Please note that the NaN value present in the original dataframe (at index value 2010-01-03) will not be
filled by any of the value propagation schemes. This is because filling while reindexing does not look at
dataframe values, but only compares the original and desired indexes. If you do want to fill in the NaN
values present in the original dataframe, use the fillna() method.
See the user guide for more.
pandas.DataFrame.reindex_like
Series or DataFrame Same type as caller, but with changed indices on each axis.
See also:
Notes
Examples
>>> df1
temp_celsius temp_fahrenheit windspeed
2014-02-12 24.3 75.7 high
2014-02-13 31.0 87.8 high
2014-02-14 22.0 71.6 medium
2014-02-15 35.0 95.0 medium
>>> df2
temp_celsius windspeed
2014-02-12 28.0 low
2014-02-13 30.0 low
2014-02-15 35.1 medium
>>> df2.reindex_like(df1)
temp_celsius temp_fahrenheit windspeed
2014-02-12 28.0 NaN low
2014-02-13 30.0 NaN low
2014-02-14 NaN NaN NaN
2014-02-15 35.1 NaN medium
pandas.DataFrame.rename
Examples
>>> df.index
RangeIndex(start=0, stop=3, step=1)
>>> df.rename(index=str).index
Index(['0', '1', '2'], dtype='object')
pandas.DataFrame.rename_axis
Notes
Examples
Series
DataFrame
MultiIndex
>>> df.rename_axis(columns=str.upper)
LIMBS num_legs num_arms
type name
mammal dog 4 0
cat 4 0
monkey 2 2
pandas.DataFrame.reorder_levels
DataFrame.reorder_levels(order, axis=0)
Rearrange index levels using input order. May not drop or duplicate levels.
Parameters
order [list of int or list of str] List representing new level order. Reference level by
number (position) or by key (label).
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Where to reorder levels.
Returns
DataFrame
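A minimal sketch with an illustrative two-level index; reorder_levels changes the level order without
sorting the rows:
>>> idx = pd.MultiIndex.from_arrays([['a', 'a', 'b'], [1, 2, 1]],
...                                 names=['letter', 'number'])
>>> df = pd.DataFrame({'value': [10, 20, 30]}, index=idx)
>>> df.reorder_levels(['number', 'letter']).index.names
FrozenList(['number', 'letter'])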
pandas.DataFrame.replace
• dict:
– Dicts can be used to specify different replacement values for different exist-
ing values. For example, {'a': 'b', 'y': 'z'} replaces the value
‘a’ with ‘b’ and ‘y’ with ‘z’. To use a dict in this way the value parameter
should be None.
– For a DataFrame a dict can specify that different values should be replaced
in different columns. For example, {'a': 1, 'b': 'z'} looks for the
value 1 in column ‘a’ and the value ‘z’ in column ‘b’ and replaces these
values with whatever is specified in value. The value parameter should not
be None in this case. You can treat this as a special case of passing two lists
except that you are specifying the column to search in.
– For a DataFrame nested dictionaries, e.g., {'a': {'b': np.nan}},
are read as follows: look in column ‘a’ for the value ‘b’ and replace it with
NaN. The value parameter should be None to use a nested dict in this way.
You can nest regular expressions as well. Note that column names (the top-
level dictionary keys in a nested dictionary) cannot be regular expressions.
• None:
– This means that the regex argument must be a string, compiled regular ex-
pression, or list, dict, ndarray or Series of such elements. If value is also
None then this must be a nested dictionary or Series.
See the examples section for examples of each of these.
value [scalar, dict, list, str, regex, default None] Value to replace any values matching
to_replace with. For a DataFrame a dict of values can be used to specify which
value to use for each column (columns not in the dict will not be filled). Regular
expressions, strings and lists or dicts of such objects are also allowed.
inplace [bool, default False] If True, performs the operation in place. Note: this will modify any other views
on this object (e.g. a column from a DataFrame). Returns the caller if this is True.
limit [int or None, default None] Maximum size gap to forward or backward fill.
regex [bool or same types as to_replace, default False] Whether to interpret to_replace
and/or value as regular expressions. If this is True then to_replace must be a
string. Alternatively, this could be a regular expression or a list, dict, or array of
regular expressions in which case to_replace must be None.
method [{‘pad’, ‘ffill’, ‘bfill’, None}] The method to use for replacement, when
to_replace is a scalar, list or tuple and value is None.
Returns
DataFrame or None Object after replacement or None if inplace=True.
Raises
AssertionError
• If regex is not a bool and to_replace is not None.
TypeError
• If to_replace is not a scalar, array-like, dict, or None
• If to_replace is a dict and value is not a list, dict, ndarray, or
Series
Notes
• Regex substitution is performed under the hood with re.sub. The rules for substitution for re.
sub are the same.
• Regular expressions will only substitute on strings, meaning you cannot provide, for example, a
regular expression matching floating point numbers and expect the columns in your frame that have
a numeric dtype to be matched. However, if those floating point numbers are strings, then you can
do this.
• This method has a lot of options. You are encouraged to experiment and play with this method to
gain intuition about how it works.
• When a dict is used as the to_replace value, the key(s) in the dict are the to_replace part and
the value(s) in the dict are the value parameter.
Examples
List-like `to_replace`
dict-like `to_replace`
When one uses a dict as the to_replace value, it is as if the value(s) in the dict are equal to the value param-
eter. s.replace({'a': None}) is equivalent to s.replace(to_replace={'a': None},
value=None, method=None):
When value=None and to_replace is a scalar, list or tuple, replace uses the method parameter (default
‘pad’) to do the replacement. So this is why the ‘a’ values are being replaced by 10 in rows 1 and 2
and ‘b’ in row 4 in this case. The command s.replace('a', None) is actually equivalent to s.
replace(to_replace='a', value=None, method='pad'):
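The series used in this description is not shown in the excerpt; a sketch consistent with it (the ‘a’ values
in rows 1 and 2 are padded with 10, and the one in row 4 with ‘b’):
>>> s = pd.Series([10, 'a', 'a', 'b', 'a'])
>>> s.replace('a', None)
0    10
1    10
2    10
3     b
4     b
dtype: object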
pandas.DataFrame.resample
origin [{‘epoch’, ‘start’, ‘start_day’}, Timestamp or str, default ‘start_day’] The times-
tamp on which to adjust the grouping. The timezone of origin must match the
timezone of the index. If a timestamp is not used, these values are also supported:
• ‘epoch’: origin is 1970-01-01
• ‘start’: origin is the first value of the timeseries
• ‘start_day’: origin is the first day at midnight of the timeseries
New in version 1.1.0.
offset [Timedelta or str, default is None] An offset timedelta added to the origin.
New in version 1.1.0.
Returns
Resampler object
See also:
Notes
Examples
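The minute-frequency series used in the downsampling and upsampling examples is not reproduced in this
excerpt; a construction consistent with the output shown is:
>>> index = pd.date_range('1/1/2000', periods=9, freq='T')
>>> series = pd.Series(range(9), index=index)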
Downsample the series into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> series.resample('3T').sum()
2000-01-01 00:00:00 3
2000-01-01 00:03:00 12
2000-01-01 00:06:00 21
Freq: 3T, dtype: int64
Downsample the series into 3 minute bins as above, but label each bin using the right edge instead of the
left. Please note that the value in the bucket used as the label is not included in the bucket it labels.
For example, in the original series the bucket 2000-01-01 00:03:00 contains the value 3, but the
summed value in the resampled bucket with the label 2000-01-01 00:03:00 does not include 3 (if
it did, the summed value would be 6, not 3). To include this value close the right side of the bin interval
as illustrated in the example below this one.
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
Upsample the series into 30 second bins and fill the NaN values using the pad method.
>>> series.resample('30S').pad()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 0
2000-01-01 00:01:00 1
2000-01-01 00:01:30 1
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
Upsample the series into 30 second bins and fill the NaN values using the bfill method.
>>> series.resample('30S').bfill()[0:5]
2000-01-01 00:00:00 0
2000-01-01 00:00:30 1
2000-01-01 00:01:00 1
2000-01-01 00:01:30 2
2000-01-01 00:02:00 2
Freq: 30S, dtype: int64
For a Series with a PeriodIndex, the keyword convention can be used to control whether to use the start or
end of rule.
Resample a year by quarter using ‘start’ convention. Values are assigned to the first quarter of the period.
Resample quarters by month using ‘end’ convention. Values are assigned to the last month of the period.
For DataFrame objects, the keyword on can be used to specify the column instead of the index for resam-
pling.
For a DataFrame with MultiIndex, the keyword level can be used to specify on which level the resampling
needs to take place.
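The series ts used in the remaining examples is likewise not shown here; one construction that reproduces
the sums below is:
>>> start, end = '2000-10-01 23:30:00', '2000-10-02 00:30:00'
>>> rng = pd.date_range(start, end, freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)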
If you want to adjust the start of the bins based on a fixed timestamp:
>>> ts.resample('17min').sum()
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, dtype: int64
If you want to adjust the start of the bins with an offset Timedelta, the two following lines are equivalent:
>>> ts.resample('17min', origin='start').sum()
2000-10-01 23:30:00 9
2000-10-01 23:47:00 21
2000-10-02 00:04:00 54
2000-10-02 00:21:00 24
Freq: 17T, dtype: int64
To replace the use of the deprecated base argument, you can now use offset; in this example it is equivalent
to base=2:
>>> ts.resample('17min', offset='2min').sum()
2000-10-01 23:16:00 0
2000-10-01 23:33:00 9
pandas.DataFrame.reset_index
Examples
When we reset the index, the old index is added as a column, and a new sequential index is used:
>>> df.reset_index()
index class max_speed
0 falcon bird 389.0
1 parrot bird 24.0
2 lion mammal 80.5
3 monkey mammal NaN
We can use the drop parameter to avoid the old index being added as a column:
>>> df.reset_index(drop=True)
class max_speed
0 bird 389.0
1 bird 24.0
2 mammal 80.5
3 mammal NaN
>>> df.reset_index(level='class')
class speed species
max type
name
falcon bird 389.0 fly
parrot bird 24.0 fly
lion mammal 80.5 run
monkey mammal NaN jump
If we are not dropping the index, by default, it is placed in the top level. We can place it in another level:
When the index is inserted under another level, we can specify under which one with the parameter
col_fill:
pandas.DataFrame.rfloordiv
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df - [1, 2]
angles degrees
circle -1 358
triangle 2 178
rectangle 3 358
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rmod
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.div(10)
angles degrees
circle 0.0 36.0
triangle 0.3 18.0
rectangle 0.4 36.0
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rmul
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rolling
Notes
By default, the result is set to the right edge of the window. This can be changed to the center of the
window by setting center=True.
To learn more about the offsets & frequency strings, please see this link.
If win_type=None, all points are evenly weighted; otherwise, win_type can accept a string of any
scipy.signal window function.
Certain Scipy window types require additional parameters to be passed in the aggregation function. The
additional parameters must match the keywords specified in the Scipy window type method signature.
Please see the third example below on how to add the additional parameters.
Examples
Rolling sum with a window length of 2, using the ‘triang’ window type.
Rolling sum with a window length of 2, using the ‘gaussian’ window type (note how we need to specify
std).
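The DataFrame and the weighted-window calls are not reproduced in this excerpt; a sketch of the setup and
the ‘triang’ case, consistent with the plain rolling sum shown further below:
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> df.rolling(2, win_type='triang').sum()
     B
0  NaN
1  0.5
2  1.5
3  NaN
4  NaN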
Rolling sum with a window length of 2, min_periods defaults to the window length.
>>> df.rolling(2).sum()
B
0 NaN
1 1.0
2 3.0
3 NaN
4 NaN
>>> df
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 2.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
In contrast to an integer rolling window, this will roll a variable length window corresponding to the time
period. The default for min_periods is 1.
>>> df.rolling('2s').sum()
B
2013-01-01 09:00:00 0.0
2013-01-01 09:00:02 1.0
2013-01-01 09:00:03 3.0
2013-01-01 09:00:05 NaN
2013-01-01 09:00:06 4.0
pandas.DataFrame.round
Examples
By providing an integer each column is rounded to the same number of decimal places
>>> df.round(1)
dogs cats
0 0.2 0.3
1 0.0 0.7
2 0.7 0.0
3 0.2 0.2
With a dict, the number of places for specific columns can be specified with the column names as key and
the number of decimal places as value
>>> df.round({'dogs': 1, 'cats': 0})
dogs cats
0 0.2 0.0
1 0.0 1.0
2   0.7   0.0
3   0.2   0.0
Using a Series, the number of places for specific columns can be specified with the column names as
index and the number of decimal places as value
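A sketch of that usage; the DataFrame construction here is an assumption chosen to match the rounded
output shown above:
>>> df = pd.DataFrame([(.21, .32), (.01, .67), (.66, .03), (.21, .18)],
...                   columns=['dogs', 'cats'])
>>> decimals = pd.Series([0, 1], index=['cats', 'dogs'])
>>> df.round(decimals)
   dogs  cats
0   0.2   0.0
1   0.0   1.0
2   0.7   0.0
3   0.2   0.0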
pandas.DataFrame.rpow
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rsub
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.rtruediv
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.sample
Notes
Examples
Extract 3 random elements from the Series df[‘num_legs’]. Note that we use random_state to
ensure the reproducibility of the examples.
An upsampled sample of the DataFrame with replacement. Note that the replace parameter has to be True
when the frac parameter is greater than 1.
Using a DataFrame column as weights. Rows with larger value in the num_specimen_seen column are
more likely to be sampled.
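The sampled DataFrame is not shown in this excerpt; a sketch of the first case with illustrative data (the
rows drawn depend on the random_state):
>>> df = pd.DataFrame({'num_legs': [2, 4, 8, 0],
...                    'num_wings': [2, 0, 0, 0],
...                    'num_specimen_seen': [10, 2, 1, 8]},
...                   index=['falcon', 'dog', 'spider', 'fish'])
>>> df['num_legs'].sample(n=3, random_state=1)
fish      0
spider    8
falcon    2
Name: num_legs, dtype: int64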
pandas.DataFrame.select_dtypes
DataFrame.select_dtypes(include=None, exclude=None)
Return a subset of the DataFrame’s columns based on the column dtypes.
Parameters
include, exclude [scalar or list-like] A selection of dtypes or strings to be in-
cluded/excluded. At least one of these parameters must be supplied.
Returns
DataFrame The subset of the frame including the dtypes in include and excluding
the dtypes in exclude.
Raises
ValueError
• If both of include and exclude are empty
• If include and exclude have overlapping elements
• If any kind of string dtype is passed in.
See also:
Notes
Examples
>>> df.select_dtypes(include='bool')
b
0 True
1 False
2 True
3 False
4 True
5 False
>>> df.select_dtypes(include=['float64'])
c
0 1.0
1 2.0
2 1.0
3 2.0
4 1.0
5 2.0
>>> df.select_dtypes(exclude=['int64'])
b c
0 True 1.0
1 False 2.0
2 True 1.0
3 False 2.0
4 True 1.0
5 False 2.0
pandas.DataFrame.sem
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
pandas.DataFrame.set_axis
Examples
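A minimal sketch, with illustrative data, relabelling the index:
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
>>> df.set_axis(['a', 'b', 'c'], axis='index')
   A  B
a  1  4
b  2  5
c  3  6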
pandas.DataFrame.set_flags
Notes
This method returns a new object that’s a view on the same data as the input. Mutating the input or the
output values will be reflected in the other.
This method is intended to be used in method chains.
“Flags” differ from “metadata”. Flags reflect properties of the pandas object (the Series or DataFrame).
Metadata refer to properties of the dataset, and should be stored in DataFrame.attrs.
Examples
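A minimal sketch, showing the allows_duplicate_labels flag:
>>> df = pd.DataFrame({'A': [1, 2]})
>>> df.flags.allows_duplicate_labels
True
>>> df2 = df.set_flags(allows_duplicate_labels=False)
>>> df2.flags.allows_duplicate_labels
False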
pandas.DataFrame.set_index
inplace [bool, default False] If True, modifies the DataFrame in place (do not create a
new object).
verify_integrity [bool, default False] Check the new index for duplicates. Otherwise
defer the check until necessary. Setting to False will improve the performance of
this method.
Returns
DataFrame or None Changed row labels or None if inplace=True.
See also:
Examples
>>> df.set_index('month')
year sale
month
1 2012 55
4 2014 40
7 2013 84
10 2014 31
pandas.DataFrame.shift
Examples
>>> df.shift(periods=3)
Col1 Col2 Col3
2020-01-01 NaN NaN NaN
2020-01-02 NaN NaN NaN
2020-01-03 NaN NaN NaN
2020-01-04 10.0 13.0 17.0
2020-01-05 20.0 23.0 27.0
pandas.DataFrame.skew
pandas.DataFrame.slice_shift
DataFrame.slice_shift(periods=1, axis=0)
Equivalent to shift without copying data. The shifted data will not include the dropped periods and the
shifted axis will be smaller than the original.
Deprecated since version 1.2.0: slice_shift is deprecated, use DataFrame/Series.shift instead.
Parameters
periods [int] Number of periods to move, can be positive or negative.
Returns
shifted [same type as caller]
Notes
While the slice_shift is faster than shift, you may pay for it later during alignment.
pandas.DataFrame.sort_index
level [int or level name or list of ints or list of level names] If not None, sort on values
in specified index level(s).
ascending [bool or list of bools, default True] Sort ascending vs. descending. When the
index is a MultiIndex the sort direction can be controlled for each level individually.
inplace [bool, default False] If True, perform operation in-place.
kind [{‘quicksort’, ‘mergesort’, ‘heapsort’}, default ‘quicksort’] Choice of sorting al-
gorithm. See also numpy.sort() for more information. mergesort is the only
stable algorithm. For DataFrames, this option is only applied when sorting on a
single column or label.
na_position [{‘first’, ‘last’}, default ‘last’] Puts NaNs at the beginning if first; last puts
NaNs at the end. Not implemented for MultiIndex.
sort_remaining [bool, default True] If True and sorting by level and index is multi-
level, sort by other levels too (in order) after sorting by specified level.
ignore_index [bool, default False] If True, the resulting axis will be labeled 0, 1, . . . ,
n - 1.
New in version 1.0.0.
key [callable, optional] If not None, apply the key function to the index values before
sorting. This is similar to the key argument in the builtin sorted() function, with
the notable difference that this key function should be vectorized. It should expect
an Index and return an Index of the same shape. For MultiIndex inputs, the key
is applied per level.
New in version 1.1.0.
Returns
DataFrame or None The original DataFrame sorted by the labels or None if
inplace=True.
See also:
Examples
>>> df.sort_index(ascending=False)
A
234 3
150 5
100 1
29 2
1 4
A key function can be specified which is applied to the index before sorting. For a MultiIndex this is
applied to each level separately.
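A short sketch of a key function, lower-casing a string index before sorting:
>>> df = pd.DataFrame({'a': [1, 2, 3, 4]}, index=['A', 'b', 'C', 'd'])
>>> df.sort_index(key=lambda x: x.str.lower())
   a
A  1
b  2
C  3
d  4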
pandas.DataFrame.sort_values
Examples
>>> df = pd.DataFrame({
... 'col1': ['A', 'A', 'B', np.nan, 'D', 'C'],
... 'col2': [2, 1, 9, 8, 7, 4],
... 'col3': [0, 1, 9, 4, 2, 3],
... 'col4': ['a', 'B', 'c', 'D', 'e', 'F']
... })
>>> df
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
3 NaN 8 4 D
4 D 7 2 e
5 C 4 3 F
Sort by col1
>>> df.sort_values(by=['col1'])
col1 col2 col3 col4
0 A 2 0 a
1 A 1 1 B
2 B 9 9 c
5 C 4 3 F
4 D 7 2 e
3 NaN 8 4 D
Sort Descending
>>> df.sort_values(by='col1', ascending=False)
col1 col2 col3 col4
4 D 7 2 e
5 C 4 3 F
2 B 9 9 c
Natural sort with the key argument, using the natsort package (https://github.com/SethMMorton/natsort).
>>> df = pd.DataFrame({
... "time": ['0hr', '128hr', '72hr', '48hr', '96hr'],
... "value": [10, 20, 30, 40, 50]
... })
>>> df
time value
0 0hr 10
1 128hr 20
2 72hr 30
3 48hr 40
4 96hr 50
>>> from natsort import index_natsorted
>>> df.sort_values(
... by="time",
... key=lambda x: np.argsort(index_natsorted(df["time"]))
... )
time value
0 0hr 10
3 48hr 40
2 72hr 30
4 96hr 50
1 128hr 20
pandas.DataFrame.sparse
DataFrame.sparse()
DataFrame accessor for sparse data.
New in version 0.25.0.
pandas.DataFrame.squeeze
DataFrame.squeeze(axis=None)
Squeeze 1 dimensional axis objects into scalars.
Series or DataFrames with a single element are squeezed to a scalar. DataFrames with a single column or
a single row are squeezed to a Series. Otherwise the object is unchanged.
This method is most useful when you don’t know if your object is a Series or DataFrame, but you do
know it has just a single column. In that case you can safely call squeeze to ensure you have a Series.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’, None}, default None] A specific axis to squeeze.
By default, all length-1 axes are squeezed.
Returns
DataFrame, Series, or scalar The projection after squeezing axis or all the axes.
See also:
Examples
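The objects squeezed below are not constructed in this excerpt; one set of definitions consistent with the
output shown is:
>>> primes = pd.Series([2, 3, 5, 7])
>>> even_primes = primes[primes % 2 == 0]
>>> odd_primes = primes[primes % 2 == 1]
>>> df = pd.DataFrame([[1, 2], [3, 4]], columns=['a', 'b'])
>>> df_a = df[['a']]
>>> df_0a = df_a.loc[[0], ['a']]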
>>> even_primes.squeeze()
2
Squeezing objects with more than one value in every axis does nothing:
>>> odd_primes.squeeze()
1 3
2 5
3 7
dtype: int64
Squeezing a single-column DataFrame down the columns produces a Series:
>>> df_a.squeeze('columns')
0 1
1 3
Name: a, dtype: int64
Squeezing the rows of a single-row, single-column DataFrame produces a Series; squeezing it again (or
squeezing both axes at once) reduces it to a scalar:
>>> df_0a.squeeze('rows')
a 1
Name: 0, dtype: int64
>>> df_0a.squeeze()
1
pandas.DataFrame.stack
DataFrame.stack(level=-1, dropna=True)
Stack the prescribed level(s) from columns to index.
Return a reshaped DataFrame or Series having a multi-level index with one or more new inner-most levels
compared to the current DataFrame. The new inner-most levels are created by pivoting the columns of
the current dataframe:
• if the columns have a single level, the output is a Series;
• if the columns have multiple levels, the new index level(s) is (are) taken from the prescribed level(s)
and the output is a DataFrame.
Parameters
level [int, str, list, default -1] Level(s) to stack from the column axis onto the index
axis, defined as one index or label, or a list of indices or labels.
dropna [bool, default True] Whether to drop rows in the resulting Frame/Series with
missing values. Stacking a column level onto the index axis can create combina-
tions of index and column values that are missing from the original dataframe. See
Examples section.
Returns
DataFrame or Series Stacked dataframe or series.
See also:
DataFrame.unstack Unstack prescribed level(s) from index axis onto column axis.
DataFrame.pivot Reshape dataframe from long format to wide format.
DataFrame.pivot_table Create a spreadsheet-style pivot table as a DataFrame.
Notes
The function is named by analogy with a collection of books being reorganized from being side by side
on a horizontal position (the columns of the dataframe) to being stacked vertically on top of each other
(in the index of the dataframe).
Examples
>>> df_single_level_cols
weight height
cat 0 1
dog 2 3
>>> df_single_level_cols.stack()
cat  weight    0
     height    1
dog  weight    2
     height    3
dtype: int64
Missing values
>>> multicol2 = pd.MultiIndex.from_tuples([('weight', 'kg'),
... ('height', 'm')])
>>> df_multi_level_cols2 = pd.DataFrame([[1.0, 2.0], [3.0, 4.0]],
... index=['cat', 'dog'],
... columns=multicol2)
It is common to have missing values when stacking a dataframe with multi-level columns, as the stacked
dataframe typically has more values than the original dataframe. Missing values are filled with NaNs:
>>> df_multi_level_cols2
weight height
kg m
cat 1.0 2.0
dog 3.0 4.0
>>> df_multi_level_cols2.stack()
height weight
cat kg NaN 1.0
m 2.0 NaN
dog kg NaN 3.0
m 4.0 NaN
Note that rows where all values are missing are dropped by default but this behaviour can be controlled
via the dropna keyword parameter:
>>> df_multi_level_cols3
weight height
kg m
cat NaN 1.0
dog 2.0 3.0
>>> df_multi_level_cols3.stack(dropna=False)
height weight
cat kg NaN NaN
m 1.0 NaN
dog kg NaN 2.0
m 3.0 NaN
>>> df_multi_level_cols3.stack(dropna=True)
height weight
cat m 1.0 NaN
dog kg NaN 2.0
m 3.0 NaN
pandas.DataFrame.std
numeric_only [bool, default None] Include only float, int, boolean columns. If None,
will attempt to use everything, then use only numeric data. Not implemented for
Series.
Returns
Series or DataFrame (if level specified)
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
pandas.DataFrame.sub
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.subtract
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.sum
Examples
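The series summed below is not shown in this excerpt; a construction consistent with the output is:
>>> idx = pd.MultiIndex.from_arrays([
...     ['warm', 'warm', 'cold', 'cold'],
...     ['dog', 'falcon', 'fish', 'spider']],
...     names=['blooded', 'animal'])
>>> s = pd.Series([4, 2, 0, 8], name='legs', index=idx)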
>>> s.sum()
14
>>> s.sum(level='blooded')
blooded
warm 6
cold 8
Name: legs, dtype: int64
>>> s.sum(level=0)
blooded
warm 6
cold 8
Name: legs, dtype: int64
This can be controlled with the min_count parameter. For example, if you’d like the sum of an empty
series to be NaN, pass min_count=1.
>>> pd.Series([]).sum(min_count=1)
nan
Thanks to the skipna parameter, min_count handles all-NA and empty series identically.
>>> pd.Series([np.nan]).sum()
0.0
>>> pd.Series([np.nan]).sum(min_count=1)
nan
pandas.DataFrame.swapaxes
pandas.DataFrame.swaplevel
pandas.DataFrame.tail
DataFrame.tail(n=5)
Return the last n rows.
This function returns the last n rows from the object based on position. It is useful for quickly verifying data,
for example, after sorting or appending rows.
For negative values of n, this function returns all rows except the first n rows, equivalent to df[n:].
Parameters
n [int, default 5] Number of rows to select.
Returns
type of caller The last n rows of the caller object.
See also:
Examples
>>> df.tail()
animal
4 monkey
5 parrot
6 shark
7 whale
8 zebra
>>> df.tail(3)
animal
6 shark
7 whale
8 zebra
>>> df.tail(-3)
animal
3 lion
4 monkey
5 parrot
6 shark
7 whale
8 zebra
pandas.DataFrame.take
Examples
We may take elements using negative integers for positive indices, starting from the end of the object, just
like with Python lists.
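A sketch of that behaviour, with illustrative data and a non-sequential index to make the positional nature
of take visible:
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5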
pandas.DataFrame.to_clipboard
Notes
Examples
>>> df.to_clipboard(sep=',')
... # Wrote the following to the system clipboard:
... # ,A,B,C
... # 0,1,2,3
... # 1,4,5,6
We can omit the index by passing the keyword index and setting it to False.
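A sketch of the same call without the index:
>>> df.to_clipboard(sep=',', index=False)
... # Wrote the following to the system clipboard:
... # A,B,C
... # 1,2,3
... # 4,5,6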
pandas.DataFrame.to_csv
path_or_buf [str or file handle, default None] File path or object; if None is provided
the result is returned as a string. If a non-binary file object is passed, it should
be opened with newline='', disabling universal newlines. If a binary file object is
passed, mode might need to contain a ‘b’.
Changed in version 0.24.0: Was previously named “path” for Series.
Changed in version 1.2.0: Support for binary file objects was introduced.
sep [str, default ‘,’] String of length 1. Field delimiter for the output file.
na_rep [str, default ‘’] Missing data representation.
float_format [str, default None] Format string for floating point numbers.
columns [sequence, optional] Columns to write.
header [bool or list of str, default True] Write out the column names. If a list of strings
is given it is assumed to be aliases for the column names.
Changed in version 0.24.0: Previously defaulted to False for Series.
index [bool, default True] Write row names (index).
index_label [str or sequence, or False, default None] Column label for index column(s)
if desired. If None is given, and header and index are True, then the index names
are used. A sequence should be given if the object uses MultiIndex. If False do not
print fields for index names. Use index_label=False for easier importing in R.
mode [str] Python write mode, default ‘w’.
encoding [str, optional] A string representing the encoding to use in the output file,
defaults to ‘utf-8’. encoding is not supported if path_or_buf is a non-binary file
object.
compression [str or dict, default ‘infer’] If str, represents compression mode. If dict,
value at ‘method’ is the compression mode. Compression mode may be any of the
following possible values: {‘infer’, ‘gzip’, ‘bz2’, ‘zip’, ‘xz’, None}. If compres-
sion mode is ‘infer’ and path_or_buf is path-like, then detect compression mode
from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’ or ‘.xz’. (otherwise no compres-
sion). If dict given and mode is one of {‘zip’, ‘gzip’, ‘bz2’}, or inferred as one of
the above, other entries passed as additional compression options.
Changed in version 1.0.0: May now be a dict with key ‘method’ as compression
mode and other entries as additional compression options if compression mode is
‘zip’.
Changed in version 1.1.0: Passing compression options as keys in dict is supported
for compression modes ‘gzip’ and ‘bz2’ as well as ‘zip’.
Changed in version 1.2.0: Compression is supported for binary file objects.
Changed in version 1.2.0: Previous versions forwarded dict entries for ‘gzip’ to
gzip.open instead of gzip.GzipFile which prevented setting mtime.
quoting [optional constant from csv module] Defaults to csv.QUOTE_MINIMAL.
If you have set a float_format then floats are converted to strings and thus
csv.QUOTE_NONNUMERIC will treat them as non-numeric.
quotechar [str, default ‘"’] String of length 1. Character used to quote fields.
line_terminator [str, optional] The newline character or character sequence to use in
the output file. Defaults to os.linesep, which depends on the OS in which this
method is called (e.g. '\n' for Linux, '\r\n' for Windows).
Examples
pandas.DataFrame.to_dict
Examples
>>> df.to_dict('split')
{'index': ['row1', 'row2'], 'columns': ['col1', 'col2'],
'data': [[1, 0.5], [2, 0.75]]}
>>> df.to_dict('records')
[{'col1': 1, 'col2': 0.5}, {'col1': 2, 'col2': 0.75}]
>>> df.to_dict('index')
{'row1': {'col1': 1, 'col2': 0.5}, 'row2': {'col1': 2, 'col2': 0.75}}
>>> dd = defaultdict(list)
>>> df.to_dict('records', into=dd)
[defaultdict(<class 'list'>, {'col1': 1, 'col2': 0.5}),
defaultdict(<class 'list'>, {'col1': 2, 'col2': 0.75})]
pandas.DataFrame.to_excel
Notes
For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.
Once a workbook has been saved it is not possible to write further data without rewriting the whole work-
book.
Examples
>>> df1.to_excel("output.xlsx",
... sheet_name='Sheet_name_1')
If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter
object:
To set the library that is used to write the Excel file, you can pass the engine keyword (the default engine
is automatically chosen depending on the file extension):
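A sketch of both cases; df1 and df2 stand for any DataFrames to be written:
>>> with pd.ExcelWriter('output.xlsx') as writer:
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')
>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')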
pandas.DataFrame.to_feather
DataFrame.to_feather(path, **kwargs)
Write a DataFrame to the binary Feather format.
Parameters
path [str or file-like object] If a string, it will be used as Root Directory path.
**kwargs Additional keywords passed to pyarrow.feather.
write_feather(). Starting with pyarrow 0.17, this includes the compression,
compression_level, chunksize and version keywords.
New in version 1.1.0.
pandas.DataFrame.to_gbq
pandas.DataFrame.to_hdf
errors [str, default ‘strict’] Specifies how encoding and decoding errors are to be han-
dled. See the errors argument for open() for a full list of options.
encoding [str, default “UTF-8”]
min_itemsize [dict or int, optional] Map column names to minimum string sizes for
columns.
nan_rep [Any, optional] How to represent null values as str. Not allowed with ap-
pend=True.
data_columns [list of columns or True, optional] List of columns to create as indexed
data columns for on-disk queries, or True to use all columns. By default only the
axes of the object are indexed. See Query via data columns. Applicable only to
format=’table’.
See also:
Examples
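A sketch of a round trip (the PyTables dependency is assumed to be installed); the file created here is the
one removed below:
>>> df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]},
...                   index=['a', 'b', 'c'])
>>> df.to_hdf('data.h5', key='df', mode='w')
>>> pd.read_hdf('data.h5', 'df')
   A  B
a  1  4
b  2  5
c  3  6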
>>> import os
>>> os.remove('data.h5')
pandas.DataFrame.to_html
• match-parent
• initial
• unset.
max_rows [int, optional] Maximum number of rows to display in the console.
min_rows [int, optional] The number of rows to display in the console in a truncated
repr (when number of rows is above max_rows).
max_cols [int, optional] Maximum number of columns to display in the console.
show_dimensions [bool, default False] Display DataFrame dimensions (number of
rows by number of columns).
decimal [str, default ‘.’] Character recognized as decimal separator, e.g. ‘,’ in Europe.
bold_rows [bool, default True] Make the row labels bold in the output.
classes [str or list or tuple, default None] CSS class(es) to apply to the resulting html
table.
escape [bool, default True] Convert the characters <, >, and & to HTML-safe se-
quences.
notebook [{True, False}, default False] Whether the generated HTML is for IPython
Notebook.
border [int] A border=border attribute is included in the opening <table> tag.
Default pd.options.display.html.border.
encoding [str, default “utf-8”] Set character encoding.
New in version 1.0.
table_id [str, optional] A css id is included in the opening <table> tag if specified.
render_links [bool, default False] Convert URLs to HTML links.
New in version 0.24.0.
Returns
str or None If buf is None, returns the result as a string. Otherwise returns None.
See also:
pandas.DataFrame.to_json
storage_options [dict, optional] Extra options that make sense for a particular storage
connection, e.g. host, port, username, password, etc., if using a URL that will be
parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if pro-
viding this argument with a non-fsspec URL. See the fsspec and backend storage
implementation docs for the set of allowed keys and values.
New in version 1.2.0.
Returns
None or str If path_or_buf is None, returns the resulting json format as a string. Oth-
erwise returns None.
See also:
Notes
The behavior of indent=0 varies from the stdlib, which does not indent the output but does insert
newlines. Currently, indent=0 and the default indent=None are equivalent in pandas, though this
may change in a future release.
orient='table' contains a ‘pandas_version’ field under ‘schema’. This stores the version of pandas
used in the latest revision of the schema.
Examples
Encoding/decoding a Dataframe using 'records' formatted JSON. Note that index labels are not pre-
served with this encoding.
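A minimal sketch, with illustrative labels; note that the row labels ‘row 1’ and ‘row 2’ do not appear in
the output:
>>> df = pd.DataFrame([['a', 'b'], ['c', 'd']],
...                   index=['row 1', 'row 2'],
...                   columns=['col 1', 'col 2'])
>>> df.to_json(orient='records')
'[{"col 1":"a","col 2":"b"},{"col 1":"c","col 2":"d"}]'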
pandas.DataFrame.to_latex
encoding [str, optional] A string representing the encoding to use in the output file,
defaults to ‘utf-8’.
decimal [str, default ‘.’] Character recognized as decimal separator, e.g. ‘,’ in Europe.
multicolumn [bool, default True] Use multicolumn to enhance MultiIndex columns.
The default will be read from the config module.
multicolumn_format [str, default ‘l’] The alignment for multicolumns, similar to col-
umn_format The default will be read from the config module.
multirow [bool, default False] Use multirow to enhance MultiIndex rows. Requires
adding a \usepackage{multirow} to your LaTeX preamble. Will print centered
labels (instead of top-aligned) across the contained rows, separating groups via
clines. The default will be read from the pandas config module.
caption [str or tuple, optional] Tuple (full_caption, short_caption), which results in
\caption[short_caption]{full_caption}; if a single string is passed,
no short caption will be set.
New in version 1.0.0.
Changed in version 1.2.0: Optionally allow caption to be a tuple
(full_caption, short_caption).
label [str, optional] The LaTeX label to be placed inside \label{} in the output.
This is used with \ref{} in the main .tex file.
New in version 1.0.0.
position [str, optional] The LaTeX positional argument for tables, to be placed after
\begin{} in the output.
New in version 1.2.0.
Returns
str or None If buf is None, returns the result as a string. Otherwise returns None.
See also:
Examples
pandas.DataFrame.to_markdown
Notes
Examples
>>> print(s.to_markdown(tablefmt="grid"))
+----+----------+
| | animal |
+====+==========+
| 0 | elk |
+----+----------+
| 1 | pig |
+----+----------+
| 2 | dog |
pandas.DataFrame.to_numpy
Examples
With heterogeneous data, the lowest common type will have to be used.
For a mix of numeric and non-numeric types, the output array will have object dtype.
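A minimal sketch of the first case, mixing int64 and float64 columns, which are upcast to a common
float64 array:
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.5]})
>>> df.to_numpy()
array([[1. , 3. ],
       [2. , 4.5]])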
pandas.DataFrame.to_parquet
Notes
Examples
If you want to get a buffer to the parquet content you can use a io.BytesIO object, as long as you don’t
use partition_cols, which creates multiple files.
>>> import io
>>> f = io.BytesIO()
>>> df.to_parquet(f)
>>> f.seek(0)
0
>>> content = f.read()
pandas.DataFrame.to_period
pandas.DataFrame.to_pickle
read_pickle Load pickled pandas object (or any object) from file.
DataFrame.to_hdf Write DataFrame to an HDF5 file.
DataFrame.to_sql Write DataFrame to a SQL database.
DataFrame.to_parquet Write a DataFrame to the binary parquet format.
Examples
>>> import os
>>> os.remove("./dummy.pkl")
pandas.DataFrame.to_records
Examples
If the DataFrame index has no label then the recarray field name is set to ‘index’. If the index has a label
then this is used as the field name:
>>> df.to_records(index=False)
rec.array([(1, 0.5 ), (2, 0.75)],
dtype=[('A', '<i8'), ('B', '<f8')])
>>> df.to_records(index_dtypes="<S2")
rec.array([(b'a', 1, 0.5 ), (b'b', 2, 0.75)],
dtype=[('I', 'S2'), ('A', '<i8'), ('B', '<f8')])
pandas.DataFrame.to_sql
index [bool, default True] Write DataFrame index as a column. Uses index_label as
the column name in the table.
index_label [str or sequence, default None] Column label for index column(s). If None
is given (default) and index is True, then the index names are used. A sequence
should be given if the DataFrame uses MultiIndex.
chunksize [int, optional] Specify the number of rows in each batch to be written at a
time. By default, all rows will be written at once.
dtype [dict or scalar, optional] Specifying the datatype for columns. If a dictio-
nary is used, the keys should be the column names and the values should be the
SQLAlchemy types or strings for the sqlite3 legacy mode. If a scalar is provided,
it will be applied to all columns.
method [{None, ‘multi’, callable}, optional] Controls the SQL insertion clause used:
• None : Uses standard SQL INSERT clause (one per row).
• ‘multi’: Pass multiple values in a single INSERT clause.
• callable with signature (pd_table, conn, keys, data_iter).
Details and a sample callable implementation can be found in the section insert
method.
New in version 0.24.0.
Raises
ValueError When the table already exists and if_exists is ‘fail’ (the default).
See also:
Notes
Timezone aware datetime columns will be written as Timestamp with timezone type with
SQLAlchemy if supported by the database. Otherwise, the datetimes will be stored as timezone unaware
timestamps local to the original timezone.
New in version 0.24.0.
References
[1], [2]
Examples
This is allowed to support operations that require that the same DBAPI connection is used for the entire
operation.
Specify the dtype (especially useful for integers with missing values). Notice that while pandas is forced
to store the data as floating point, the database supports nullable integers. When fetching the data with
Python, we get back integer scalars.
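A sketch of the dtype case using an in-memory SQLite engine (SQLAlchemy is assumed to be installed):
>>> from sqlalchemy import create_engine
>>> from sqlalchemy.types import Integer
>>> engine = create_engine('sqlite://', echo=False)
>>> df = pd.DataFrame({'A': [1, None, 2]})
>>> df.to_sql('integers', con=engine, index=False,
...           dtype={'A': Integer()})
>>> engine.execute('SELECT * FROM integers').fetchall()
[(1,), (None,), (2,)]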
pandas.DataFrame.to_stata
‘xz’, None}. If compression mode is ‘infer’ and fname is path-like, then detect
compression from the following extensions: ‘.gz’, ‘.bz2’, ‘.zip’, or ‘.xz’ (otherwise
no compression). If dict and compression mode is one of {‘zip’, ‘gzip’, ‘bz2’},
or inferred as one of the above, other entries passed as additional compression op-
tions.
New in version 1.1.0.
storage_options [dict, optional] Extra options that make sense for a particular storage
connection, e.g. host, port, username, password, etc., if using a URL that will be
parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if pro-
viding this argument with a non-fsspec URL. See the fsspec and backend storage
implementation docs for the set of allowed keys and values.
New in version 1.2.0.
Raises
NotImplementedError
• If datetimes contain timezone information
• Column dtype is not representable in Stata
ValueError
• Columns listed in convert_dates are neither datetime64[ns] nor date-
time.datetime
• Column listed in convert_dates is not in DataFrame
• Categorical label contains more than 32,000 characters
See also:
Examples
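A minimal sketch, writing an illustrative DataFrame to a Stata .dta file:
>>> df = pd.DataFrame({'animal': ['falcon', 'parrot', 'falcon', 'parrot'],
...                    'speed': [350, 18, 361, 15]})
>>> df.to_stata('animals.dta')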
pandas.DataFrame.to_string
buf [str, Path or StringIO-like, optional, default None] Buffer to write to. If None, the
output is returned as a string.
columns [sequence, optional, default None] The subset of columns to write. Writes all
columns by default.
col_space [int, list or dict of int, optional] The minimum width of each column.
header [bool or sequence, optional] Write out the column names. If a list of strings is
given, it is assumed to be aliases for the column names.
index [bool, optional, default True] Whether to print index (row) labels.
na_rep [str, optional, default ‘NaN’] String representation of NaN to use.
formatters [list, tuple or dict of one-param. functions, optional] Formatter functions to
apply to columns’ elements by position or name. The result of each function must
be a unicode string. List/tuple must be of length equal to the number of columns.
float_format [one-parameter function, optional, default None] Formatter function to
apply to columns’ elements if they are floats. This function must return a unicode
string and will be applied only to the non-NaN elements, with NaN being handled
by na_rep.
Changed in version 1.2.0.
sparsify [bool, optional, default True] Set to False for a DataFrame with a hierarchical
index to print every multiindex key at each row.
index_names [bool, optional, default True] Prints the names of the indexes.
justify [str, default None] How to justify the column labels. If None uses the option
from the print configuration (controlled by set_option), ‘right’ out of the box. Valid
values are
• left
• right
• center
• justify
• justify-all
• start
• end
• inherit
• match-parent
• initial
• unset.
max_rows [int, optional] Maximum number of rows to display in the console.
min_rows [int, optional] The number of rows to display in the console in a truncated
repr (when number of rows is above max_rows).
max_cols [int, optional] Maximum number of columns to display in the console.
show_dimensions [bool, default False] Display DataFrame dimensions (number of
rows by number of columns).
decimal [str, default ‘.’] Character recognized as decimal separator, e.g. ‘,’ in Europe.
line_width [int, optional] Width to wrap a line in characters.
max_colwidth [int, optional] Max width to truncate each column in characters. By
default, no limit.
New in version 1.0.0.
encoding [str, default “utf-8”] Set character encoding.
New in version 1.0.
Returns
str or None If buf is None, returns the result as a string. Otherwise returns None.
See also:
Examples
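A minimal sketch, with illustrative data:
>>> df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
>>> print(df.to_string())
   col1  col2
0     1     4
1     2     5
2     3     6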
pandas.DataFrame.to_timestamp
pandas.DataFrame.to_xarray
DataFrame.to_xarray()
Return an xarray object from the pandas object.
Returns
xarray.DataArray or xarray.Dataset Data in the pandas structure converted to
Dataset if the object is a DataFrame, or a DataArray if the object is a Series.
See also:
Notes
Examples
>>> df.to_xarray()
<xarray.Dataset>
Dimensions: (index: 4)
Coordinates:
* index (index) int64 0 1 2 3
Data variables:
name (index) object 'falcon' 'parrot' 'lion' 'monkey'
class (index) object 'bird' 'bird' 'mammal' 'mammal'
max_speed (index) float64 389.0 24.0 80.5 nan
num_legs (index) int64 2 2 4 4
>>> df['max_speed'].to_xarray()
<xarray.DataArray 'max_speed' (index: 4)>
array([389. , 24. , 80.5, nan])
Coordinates:
* index (index) int64 0 1 2 3
>>> df_multiindex
speed
date animal
2018-01-01 falcon 350
parrot 18
2018-01-02 falcon 361
parrot 15
>>> df_multiindex.to_xarray()
<xarray.Dataset>
Dimensions: (animal: 2, date: 2)
Coordinates:
* date (date) datetime64[ns] 2018-01-01 2018-01-02
* animal (animal) object 'falcon' 'parrot'
Data variables:
speed (date, animal) int64 350 18 361 15
pandas.DataFrame.transform
See also:
Examples
Even though the resulting DataFrame must have the same length as the input DataFrame, it is possible to
provide several input functions:
>>> s = pd.Series(range(3))
>>> s
0 0
1 1
2 2
dtype: int64
>>> s.transform([np.sqrt, np.exp])
sqrt exp
0 0.000000 1.000000
1 1.000000 2.718282
2 1.414214 7.389056
>>> df = pd.DataFrame({
... "c": [1, 1, 1, 2, 2, 2, 2],
... "type": ["m", "n", "o", "m", "m", "n", "n"]
... })
>>> df
c type
0 1 m
1 1 n
2 1 o
3 2 m
4 2 m
5 2 n
6 2 n
>>> df['size'] = df.groupby('c')['type'].transform(len)
>>> df
c type size
0 1 m 3
1 1 n 3
2 1 o 3
3 2 m 4
4 2 m 4
5 2 n 4
6 2 n 4
pandas.DataFrame.transpose
DataFrame.transpose(*args, copy=False)
Transpose index and columns.
Reflect the DataFrame over its main diagonal by writing rows as columns and vice-versa. The property T
is an accessor to the method transpose().
Parameters
*args [tuple, optional] Accepted for compatibility with NumPy.
copy [bool, default False] Whether to copy the data after transposing, even for
DataFrames with a single dtype.
Note that a copy is always required for mixed dtype DataFrames, or for DataFrames
with any extension types.
Returns
DataFrame The transposed DataFrame.
See also:
Notes
Transposing a DataFrame with mixed dtypes will result in a homogeneous DataFrame with the object
dtype. In such a case, a copy of the data is always made.
Examples
When the dtype is homogeneous in the original DataFrame, we get a transposed DataFrame with the same
dtype:
>>> df1.dtypes
col1 int64
col2 int64
dtype: object
>>> df1_transposed.dtypes
0 int64
1 int64
dtype: object
When the DataFrame has mixed dtypes, we get a transposed DataFrame with the object dtype:
>>> df2.dtypes
name object
score float64
employed bool
kids int64
dtype: object
>>> df2_transposed.dtypes
0 object
1 object
dtype: object
pandas.DataFrame.truediv
Notes
Examples
Add a scalar with the operator version, which returns the same results.
>>> df + 1
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.add(1)
angles degrees
circle 1 361
triangle 4 181
rectangle 5 361
>>> df.rdiv(10)
angles degrees
circle inf 0.027778
triangle 3.333333 0.055556
rectangle 2.500000 0.027778
>>> df * other
angles degrees
circle 0 NaN
triangle 9 NaN
rectangle 16 NaN
pandas.DataFrame.truncate
Notes
If the index being truncated contains only datetime values, before and after may be specified as strings
instead of Timestamps.
Examples
>>> df.truncate(before=pd.Timestamp('2016-01-05'),
... after=pd.Timestamp('2016-01-10')).tail()
A
2016-01-09 23:59:56 1
2016-01-09 23:59:57 1
2016-01-09 23:59:58 1
2016-01-09 23:59:59 1
2016-01-10 00:00:00 1
Because the index is a DatetimeIndex containing only dates, we can specify before and after as strings.
They will be coerced to Timestamps before truncation.
Note that truncate assumes a 0 value for any unspecified time component (midnight). This differs
from partial string slicing, which returns any partially matching dates.
pandas.DataFrame.tshift
Notes
If freq is not specified, this method tries to use the freq or inferred_freq attributes of the index. If neither of
those attributes exists, a ValueError is thrown.
pandas.DataFrame.tz_convert
pandas.DataFrame.tz_localize
Examples
Be careful with DST changes. When there is sequential data, pandas can infer the DST time:
>>> s = pd.Series(range(7),
... index=pd.DatetimeIndex(['2018-10-28 01:30:00',
... '2018-10-28 02:00:00',
... '2018-10-28 02:30:00',
... '2018-10-28 02:00:00',
... '2018-10-28 02:30:00',
... '2018-10-28 03:00:00',
... '2018-10-28 03:30:00']))
>>> s.tz_localize('CET', ambiguous='infer')
2018-10-28 01:30:00+02:00 0
2018-10-28 02:00:00+02:00 1
2018-10-28 02:30:00+02:00 2
2018-10-28 02:00:00+01:00 3
2018-10-28 02:30:00+01:00 4
2018-10-28 03:00:00+01:00 5
2018-10-28 03:30:00+01:00 6
dtype: int64
In some cases, inferring the DST is impossible. In such cases, you can pass an ndarray to the ambiguous
parameter to set the DST explicitly:
>>> s = pd.Series(range(3),
... index=pd.DatetimeIndex(['2018-10-28 01:20:00',
... '2018-10-28 02:36:00',
... '2018-10-28 03:46:00']))
>>> s.tz_localize('CET', ambiguous=np.array([True, True, False]))
2018-10-28 01:20:00+02:00 0
2018-10-28 02:36:00+02:00 1
2018-10-28 03:46:00+01:00 2
dtype: int64
If the DST transition causes nonexistent times, you can shift these dates forward or backward with a
timedelta object or ‘shift_forward’ or ‘shift_backward’.
>>> s = pd.Series(range(2),
... index=pd.DatetimeIndex(['2015-03-29 02:30:00',
... '2015-03-29 03:30:00']))
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
2015-03-29 03:00:00+02:00 0
2015-03-29 03:30:00+02:00 1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent='shift_backward')
2015-03-29 01:59:59.999999999+01:00 0
2015-03-29 03:30:00+02:00 1
dtype: int64
>>> s.tz_localize('Europe/Warsaw', nonexistent=pd.Timedelta('1H'))
2015-03-29 03:30:00+02:00    0
2015-03-29 03:30:00+02:00    1
dtype: int64
pandas.DataFrame.unstack
DataFrame.unstack(level=-1, fill_value=None)
Pivot a level of the (necessarily hierarchical) index labels.
Returns a DataFrame having a new level of column labels whose inner-most level consists of the pivoted
index labels.
If the index is not a MultiIndex, the output will be a Series (the analogue of stack when the columns are
not a MultiIndex).
Parameters
level [int, str, or list of these, default -1 (last level)] Level(s) of index to unstack, can
pass level name.
fill_value [int, str or dict] Replace NaN with this value if the unstack produces missing
values.
Returns
Series or DataFrame
See also:
Examples
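The Series s used below has a two-level MultiIndex (an assumed construction consistent with the outputs shown):
>>> index = pd.MultiIndex.from_tuples([('one', 'a'), ('one', 'b'),
...                                    ('two', 'a'), ('two', 'b')])
>>> s = pd.Series(np.arange(1.0, 5.0), index=index)  # illustrative data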
>>> s.unstack(level=-1)
a b
one 1.0 2.0
two 3.0 4.0
>>> s.unstack(level=0)
one two
a 1.0 3.0
b 2.0 4.0
>>> df = s.unstack(level=0)
>>> df.unstack()
one a 1.0
b 2.0
two a 3.0
b 4.0
dtype: float64
pandas.DataFrame.update
Examples
The DataFrame’s length does not increase as a result of the update, only values at matching index/column
labels are updated.
>>> df = pd.DataFrame({'A': ['a', 'b', 'c'],
... 'B': ['x', 'y', 'z']})
>>> new_df = pd.DataFrame({'B': ['d', 'e', 'f', 'g', 'h', 'i']})
>>> df.update(new_df)
>>> df
A B
0 a d
1 b e
2 c f
If other contains NaNs the corresponding values are not updated in the original dataframe.
>>> df = pd.DataFrame({'A': [1, 2, 3],
... 'B': [400, 500, 600]})
>>> new_df = pd.DataFrame({'B': [4, np.nan, 6]})
>>> df.update(new_df)
>>> df
A B
0 1 4.0
1 2 500.0
2 3 6.0
pandas.DataFrame.value_counts
Notes
The returned Series will have a MultiIndex with one level per input column. By default, rows that contain
any NA values are omitted from the result. By default, the resulting Series will be in descending order so
that the first element is the most frequently-occurring row.
Examples
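The frame used below can be constructed, for example, as (values assumed, consistent with the counts shown):
>>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
...                    'num_wings': [2, 0, 0, 0]},
...                   index=['falcon', 'dog', 'cat', 'ant'])  # illustrative data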
>>> df.value_counts()
num_legs num_wings
4 0 2
2 2 1
6 0 1
dtype: int64
>>> df.value_counts(sort=False)
num_legs num_wings
2 2 1
4 0 2
6 0 1
dtype: int64
>>> df.value_counts(ascending=True)
num_legs num_wings
2 2 1
6 0 1
4 0 2
dtype: int64
>>> df.value_counts(normalize=True)
num_legs num_wings
4 0 0.50
2 2 0.25
6 0 0.25
dtype: float64
pandas.DataFrame.var
Notes
To have the same behaviour as numpy.std, use ddof=0 (instead of the default ddof=1)
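For example (illustrative values, not taken from the original reference):
>>> df = pd.DataFrame({'a': [1, 2, 3, 4]})  # assumed sample data
>>> df.var()
a    1.666667
dtype: float64
>>> df.var(ddof=0)
a    1.25
dtype: float64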
pandas.DataFrame.where
Notes
The where method is an application of the if-then idiom. For each element in the calling DataFrame, if
cond is True the element is used; otherwise the corresponding element from the DataFrame other is
used.
The signature for DataFrame.where() differs from numpy.where(). Roughly df1.where(m,
df2) is equivalent to np.where(m, df1, df2).
For further details and examples see the where documentation in indexing.
Examples
>>> s = pd.Series(range(5))
>>> s.where(s > 0)
0 NaN
1 1.0
2 2.0
3 3.0
4 4.0
dtype: float64
>>> s.mask(s > 0)
0 0.0
1 NaN
2 NaN
3 NaN
4 NaN
dtype: float64
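A DataFrame-level illustration of the same if-then idiom (frame values assumed):
>>> df = pd.DataFrame(np.arange(10).reshape(-1, 2), columns=['A', 'B'])  # illustrative data
>>> m = df % 3 == 0
>>> df.where(m, -df)
   A  B
0  0 -1
1 -2  3
2 -4 -5
3  6 -7
4 -8  9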
pandas.DataFrame.xs
Notes
Examples
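The frame used below carries a three-level MultiIndex (construction assumed, consistent with the output shown):
>>> d = {'num_legs': [4, 4, 2, 2],
...      'num_wings': [0, 0, 2, 2],
...      'class': ['mammal', 'mammal', 'mammal', 'bird'],
...      'animal': ['cat', 'dog', 'bat', 'penguin'],
...      'locomotion': ['walks', 'walks', 'flies', 'walks']}  # illustrative data
>>> df = pd.DataFrame(data=d).set_index(['class', 'animal', 'locomotion'])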
>>> df.xs('mammal')
num_legs num_wings
animal locomotion
cat walks 4 0
dog walks 4 0
bat flies 2 2
Axes
3.4.3 Conversion
pandas.DataFrame.__iter__
DataFrame.__iter__()
Iterate over info axis.
Returns
iterator Info axis as iterator.
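For example (a minimal frame, assumed for illustration), iterating over a DataFrame yields its column labels:
>>> df = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})  # illustrative data
>>> list(df)
['A', 'B']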
For more information on .at, .iat, .loc, and .iloc, see the indexing documentation.
DataFrame.add(other[, axis, level, fill_value])      Get Addition of dataframe and other, element-wise (binary operator add).
DataFrame.sub(other[, axis, level, fill_value])      Get Subtraction of dataframe and other, element-wise (binary operator sub).
DataFrame.mul(other[, axis, level, fill_value])      Get Multiplication of dataframe and other, element-wise (binary operator mul).
DataFrame.div(other[, axis, level, fill_value])      Get Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.truediv(other[, axis, level, ...])         Get Floating division of dataframe and other, element-wise (binary operator truediv).
DataFrame.floordiv(other[, axis, level, ...])        Get Integer division of dataframe and other, element-wise (binary operator floordiv).
DataFrame.mod(other[, axis, level, fill_value])      Get Modulo of dataframe and other, element-wise (binary operator mod).
DataFrame.pow(other[, axis, level, fill_value])      Get Exponential power of dataframe and other, element-wise (binary operator pow).
DataFrame.dot(other)                                 Compute the matrix multiplication between the DataFrame and other.
DataFrame.radd(other[, axis, level, fill_value])     Get Addition of dataframe and other, element-wise (binary operator radd).
pandas.DataFrame.T
property DataFrame.T
3.4.13 Flags
Flags refer to attributes of the pandas object. Properties of the dataset (like the date it was recorded, the URL it was
accessed from, etc.) should be stored in DataFrame.attrs.
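A small illustration (names and values assumed) of where such metadata belongs, alongside the flags attribute:
>>> df = pd.DataFrame({'A': [1, 2]})
>>> df.attrs['source_url'] = 'https://example.com/data.csv'  # hypothetical dataset property
>>> df.flags.allows_duplicate_labels
True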
3.4.14 Metadata
3.4.15 Plotting
DataFrame.plot is both a callable method and a namespace attribute for specific plotting methods of the form
DataFrame.plot.<kind>.
pandas.DataFrame.plot.area
See also:
DataFrame.plot Make plots of DataFrame using matplotlib / pylab.
Examples
>>> df = pd.DataFrame({
... 'sales': [3, 2, 3, 9, 10, 6],
... 'signups': [5, 5, 6, 12, 14, 13],
... 'visits': [20, 42, 28, 62, 81, 50],
... }, index=pd.date_range(start='2018/01/01', end='2018/07/01',
... freq='M'))
>>> ax = df.plot.area()
Area plots are stacked by default. To produce an unstacked plot, pass stacked=False:
>>> ax = df.plot.area(stacked=False)
>>> ax = df.plot.area(y='sales')
>>> df = pd.DataFrame({
... 'sales': [3, 2, 3],
... 'visits': [20, 42, 28],
... 'day': [1, 2, 3],
... })
>>> ax = df.plot.area(x='day')
pandas.DataFrame.plot.bar
• A single color string referred to by name, RGB or RGBA code, for instance
‘red’ or ‘#a98d19’.
• A sequence of color strings referred to by name, RGB or RGBA code, which
will be used for each column recursively. For instance [‘green’, ‘yellow’] each
column’s bar will be filled in green or yellow, alternatively.
• A dict of the form {column name: color}, so that each column will be colored
accordingly. For example, if your columns are called a and b, then passing
{‘a’: ‘green’, ‘b’: ‘red’} will color bars for column a in green and bars for
column b in red.
New in version 1.1.0.
**kwargs Additional keyword arguments are documented in DataFrame.plot().
Returns
matplotlib.axes.Axes or np.ndarray of them An ndarray is returned with one
matplotlib.axes.Axes per column when subplots=True.
See also:
DataFrame.plot.barh Horizontal bar plot.
DataFrame.plot Make plots of a DataFrame.
matplotlib.pyplot.bar Make a bar plot with matplotlib.
Examples
Basic plot.
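A minimal single-column example (the frame below is illustrative; the exact data in the original reference may differ):
>>> df = pd.DataFrame({'lab': ['A', 'B', 'C'], 'val': [10, 30, 20]})  # assumed data
>>> ax = df.plot.bar(x='lab', y='val', rot=0)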
Plot a whole dataframe to a bar plot. Each column is assigned a distinct color, and each row is nested in a group
along the horizontal axis.
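A multi-column frame (values assumed for illustration) can be plotted directly; the stacked variant below reuses the same frame:
>>> df = pd.DataFrame({'speed': [0.1, 17.5, 40, 48],
...                    'lifespan': [2, 8, 70, 1.5]},
...                   index=['snail', 'pig', 'elephant', 'rabbit'])  # assumed data
>>> ax = df.plot.bar(rot=0)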
>>> ax = df.plot.bar(stacked=True)
Instead of nesting, the figure can be split by column with subplots=True. In this case, a numpy.ndarray
of matplotlib.axes.Axes is returned.
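For example, continuing with the frame above:
>>> axes = df.plot.bar(rot=0, subplots=True)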
If you don’t like the default colours, you can specify how you’d like each column to be colored.
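For instance, a dict mapping column names to colors (color names chosen for illustration; dict-valued color is supported since version 1.1.0):
>>> ax = df.plot.bar(color={'speed': 'red', 'lifespan': 'green'}, rot=0)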
pandas.DataFrame.plot.barh
x [label or position, optional] Allows plotting of one column versus another. If not specified,
the index of the DataFrame is used.
y [label or position, optional] Allows plotting of one column versus another. If not specified,
all numerical columns are used.
color [str, array_like, or dict, optional] The color for each of the DataFrame’s columns.
Possible values are:
• A single color string referred to by name, RGB or RGBA code, for instance
‘red’ or ‘#a98d19’.
• A sequence of color strings referred to by name, RGB or RGBA code, which
will be used for each column recursively. For instance [‘green’, ‘yellow’] each
column’s bar will be filled in green or yellow, alternatively.
• A dict of the form {column name: color}, so that each column will be colored
accordingly. For example, if your columns are called a and b, then passing
{‘a’: ‘green’, ‘b’: ‘red’} will color bars for column a in green and bars for
column b in red.
New in version 1.1.0.
**kwargs Additional keyword arguments are documented in DataFrame.plot().
Returns
matplotlib.axes.Axes or np.ndarray of them An ndarray is returned with one
matplotlib.axes.Axes per column when subplots=True.
See also:
DataFrame.plot.bar Vertical bar plot.
DataFrame.plot Make plots of DataFrame using matplotlib.
matplotlib.axes.Axes.bar Plot a vertical bar plot using matplotlib.
Examples
Basic example
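A small frame (values assumed for illustration) plotted as horizontal bars; the stacked variant below reuses the same frame:
>>> df = pd.DataFrame({'speed': [0.1, 17.5, 40, 48],
...                    'lifespan': [2, 8, 70, 1.5]},
...                   index=['snail', 'pig', 'elephant', 'rabbit'])  # assumed data
>>> ax = df.plot.barh()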
>>> ax = df.plot.barh(stacked=True)
pandas.DataFrame.plot.box
DataFrame.plot.box(by=None, **kwargs)
Make a box plot of the DataFrame columns.
A box plot is a method for graphically depicting groups of numerical data through their quartiles. The box
extends from the Q1 to Q3 quartile values of the data, with a line at the median (Q2). The whiskers extend from
the edges of box to show the range of the data. The position of the whiskers is set by default to 1.5*IQR (IQR
= Q3 - Q1) from the edges of the box. Outlier points are those past the end of the whiskers.
For further details see Wikipedia’s entry for boxplot.
A consideration when using this chart is that the box and the whiskers can overlap, which is very common when
plotting small sets of data.
Parameters
by [str or sequence] Column in the DataFrame to group by.
**kwargs Additional keywords are documented in DataFrame.plot().
Returns
matplotlib.axes.Axes or numpy.ndarray of them
See also:
DataFrame.boxplot Another method to draw a box plot.
Series.plot.box Draw a box plot from a Series object.
matplotlib.pyplot.boxplot Draw a box plot in matplotlib.
Examples
Draw a box plot from a DataFrame with four columns of randomly generated data.
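For instance (random data, so the exact figure will vary):
>>> data = np.random.randn(25, 4)
>>> df = pd.DataFrame(data, columns=list('ABCD'))
>>> ax = df.plot.box()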
pandas.DataFrame.plot.density
Returns
matplotlib.axes.Axes or numpy.ndarray of them
See also:
scipy.stats.gaussian_kde Representation of a kernel-density estimate using Gaussian kernels. This
is the function used internally to estimate the PDF.
Examples
Given a Series of points randomly sampled from an unknown distribution, estimate its PDF using KDE with
automatic bandwidth determination and plot the results, evaluating them at 1000 equally spaced points (default):
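A small Series of sample points (values assumed for illustration):
>>> s = pd.Series([1, 2, 2.5, 3, 3.5, 4, 5])  # assumed sample
>>> ax = s.plot.kde()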
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large
bandwidth value may result in under-fitting:
>>> ax = s.plot.kde(bw_method=0.3)
>>> ax = s.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
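For instance (evaluation points chosen for illustration):
>>> ax = s.plot.kde(ind=[1, 2, 3, 4, 5])
For a DataFrame, the same interface applies column-wise: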
>>> df = pd.DataFrame({
... 'x': [1, 2, 2.5, 3, 3.5, 4, 5],
... 'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> ax = df.plot.kde()
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large
bandwidth value may result in under-fitting:
>>> ax = df.plot.kde(bw_method=0.3)
>>> ax = df.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
pandas.DataFrame.plot.hexbin
Examples
The following examples are generated with random data from a normal distribution.
>>> n = 10000
>>> df = pd.DataFrame({'x': np.random.randn(n),
... 'y': np.random.randn(n)})
>>> ax = df.plot.hexbin(x='x', y='y', gridsize=20)
The next example uses C and np.sum as reduce_C_function. Note that the ‘observations’ values range from 1 to 5,
but the resulting plot shows values up to more than 25. This is because of the reduce_C_function.
>>> n = 500
>>> df = pd.DataFrame({
... 'coord_x': np.random.uniform(-3, 3, size=n),
... 'coord_y': np.random.uniform(30, 50, size=n),
... 'observations': np.random.randint(1,5, size=n)
... })
>>> ax = df.plot.hexbin(x='coord_x',
...                     y='coord_y',
...                     C='observations',
...                     reduce_C_function=np.sum)
pandas.DataFrame.plot.hist
See also:
DataFrame.hist Draw histograms per DataFrame’s Series.
Series.hist Draw a histogram with Series’ data.
Examples
When we roll a die 6000 times, we expect to get each value around 1000 times. But when we roll two dice
and sum the results, the distribution is going to be quite different. A histogram illustrates those distributions.
>>> df = pd.DataFrame(
... np.random.randint(1, 7, 6000),
... columns = ['one'])
>>> df['two'] = df['one'] + np.random.randint(1, 7, 6000)
>>> ax = df.plot.hist(bins=12, alpha=0.5)
pandas.DataFrame.plot.kde
Examples
Given a Series of points randomly sampled from an unknown distribution, estimate its PDF using KDE with
automatic bandwidth determination and plot the results, evaluating them at 1000 equally spaced points (default):
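A small Series of sample points (values assumed for illustration):
>>> s = pd.Series([1, 2, 2.5, 3, 3.5, 4, 5])  # assumed sample
>>> ax = s.plot.kde()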
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large
bandwidth value may result in under-fitting:
>>> ax = s.plot.kde(bw_method=0.3)
>>> ax = s.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
>>> df = pd.DataFrame({
... 'x': [1, 2, 2.5, 3, 3.5, 4, 5],
... 'y': [4, 4, 4.5, 5, 5.5, 6, 6],
... })
>>> ax = df.plot.kde()
A scalar bandwidth can be specified. Using a small bandwidth value can lead to over-fitting, while using a large
bandwidth value may result in under-fitting:
>>> ax = df.plot.kde(bw_method=0.3)
>>> ax = df.plot.kde(bw_method=3)
Finally, the ind parameter determines the evaluation points for the plot of the estimated PDF:
pandas.DataFrame.plot.line
color [str, array_like, or dict, optional] The color for each of the DataFrame’s columns.
Possible values are:
• A single color string referred to by name, RGB or RGBA code, for instance
‘red’ or ‘#a98d19’.
• A sequence of color strings referred to by name, RGB or RGBA code, which
will be used for each column recursively. For instance [‘green’,’yellow’] each
column’s line will be filled in green or yellow, alternatively.
• A dict of the form {column name: color}, so that each column will be colored
accordingly. For example, if your columns are called a and b, then passing
{‘a’: ‘green’, ‘b’: ‘red’} will color lines for column a in green and lines for
column b in red.
New in version 1.1.0.
**kwargs Additional keyword arguments are documented in DataFrame.plot().
Returns
matplotlib.axes.Axes or np.ndarray of them An ndarray is returned with one
matplotlib.axes.Axes per column when subplots=True.
See also:
matplotlib.pyplot.plot Plot y versus x as lines and/or markers.
Examples
The following example shows the populations for some animals over the years.
>>> df = pd.DataFrame({
... 'pig': [20, 18, 489, 675, 1776],
... 'horse': [4, 25, 281, 600, 1900]
... }, index=[1990, 1997, 2003, 2009, 2014])
>>> lines = df.plot.line()
Let’s repeat the same example, but specifying colors for each column (in this case, for each animal).
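For instance, passing a dict of colors keyed by column (color names chosen for illustration; dict-valued color is supported since version 1.1.0):
>>> lines = df.plot.line(color={'pig': 'brown', 'horse': 'black'})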
pandas.DataFrame.plot.pie
DataFrame.plot.pie(**kwargs)
Generate a pie plot.
A pie plot is a proportional representation of the numerical data in a column. This function wraps
matplotlib.pyplot.pie() for the specified column. If no column reference is passed and
subplots=True a pie plot is drawn for each numerical column independently.
Parameters
y [int or label, optional] Label or position of the column to plot. If not provided,
subplots=True argument must be passed.
**kwargs Keyword arguments to pass on to DataFrame.plot().
Returns
matplotlib.axes.Axes or np.ndarray of them A NumPy array is returned when subplots is
True.
See also:
Series.plot.pie Generate a pie plot for a Series.
DataFrame.plot Make plots of a DataFrame.
Examples
In the example below we have a DataFrame with information about the planets’ mass and radius. We pass the
‘mass’ column to the pie function to get a pie plot.
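For instance (planet values assumed for illustration):
>>> df = pd.DataFrame({'mass': [0.330, 4.87, 5.97],
...                    'radius': [2439.7, 6051.8, 6378.1]},
...                   index=['Mercury', 'Venus', 'Earth'])  # assumed data
>>> plot = df.plot.pie(y='mass', figsize=(5, 5))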
pandas.DataFrame.plot.scatter
• A sequence of scalars, which will be used for each point’s size recursively. For
instance, when passing [2,14] all points size will be either 2 or 14, alternatively.
Changed in version 1.1.0.
c [str, int or array_like, optional] The color of each point. Possible values are:
• A single color string referred to by name, RGB or RGBA code, for instance ‘red’
or ‘#a98d19’.
• A sequence of color strings referred to by name, RGB or RGBA code, which will
be used for each point’s color recursively. For instance [‘green’,’yellow’] all points
will be filled in green or yellow, alternatively.
• A column name or position whose values will be used to color the marker points
according to a colormap.
**kwargs Keyword arguments to pass on to DataFrame.plot().
Returns
matplotlib.axes.Axes or numpy.ndarray of them
See also:
matplotlib.pyplot.scatter Scatter plot using multiple input data formats.
Examples
Let’s see how to draw a scatter plot using coordinates from the values in a DataFrame’s columns.
>>> df = pd.DataFrame([[5.1, 3.5, 0], [4.9, 3.0, 0], [7.0, 3.2, 1],
... [6.4, 3.2, 1], [5.9, 3.0, 2]],
... columns=['length', 'width', 'species'])
>>> ax1 = df.plot.scatter(x='length',
... y='width',
... c='DarkBlue')
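The c parameter can also name a column, in which case its values are mapped through a colormap, for example:
>>> ax2 = df.plot.scatter(x='length',
...                       y='width',
...                       c='species',
...                       colormap='viridis')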
Sparse-dtype specific methods and attributes are provided under the DataFrame.sparse accessor.
pandas.DataFrame.sparse.density
DataFrame.sparse.density
Ratio of non-sparse points to total (dense) data points.
pandas.DataFrame.sparse.from_spmatrix
Examples
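A minimal sketch (assuming scipy is installed), which also illustrates the density accessor above:
>>> import scipy.sparse
>>> mat = scipy.sparse.eye(3)  # sparse identity matrix
>>> df = pd.DataFrame.sparse.from_spmatrix(mat)
>>> df.sparse.density
0.3333333333333333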
pandas.DataFrame.sparse.to_coo
DataFrame.sparse.to_coo()
Return the contents of the frame as a sparse SciPy COO matrix.
New in version 0.25.0.
Returns
coo_matrix [scipy.sparse.spmatrix] If the caller is heterogeneous and contains booleans or
objects, the result will be of dtype=object. See Notes.
Notes
The dtype will be the lowest-common-denominator type (implicit upcasting); that is to say if the dtypes (even
of numeric types) are mixed, the one that accommodates all will be chosen.
For example, if the dtypes are float16 and float32, dtype will be upcast to float32. By numpy.find_common_type
convention, mixing int64 and uint64 will result in a float64 dtype.
pandas.DataFrame.sparse.to_dense
DataFrame.sparse.to_dense()
Convert a DataFrame with sparse values to dense.
New in version 0.25.0.
Returns
DataFrame A DataFrame with the same values stored as dense arrays.
Examples
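For example (a single sparse column, values assumed):
>>> df = pd.DataFrame({"A": pd.arrays.SparseArray([0, 1, 0])})  # illustrative data
>>> df.sparse.to_dense()
   A
0  0
1  1
2  0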
For most data types, pandas uses NumPy arrays as the concrete objects contained within an Index, Series, or
DataFrame.
For some data types, pandas extends NumPy’s type system. String aliases for these types can be found at dtypes.
pandas and third-party libraries can extend NumPy’s type system (see Extension types). The top-level array()
method can be used to create a new array, which may be stored in a Series, Index, or as a column in a DataFrame.
3.5.1 pandas.array
2. Otherwise, pandas will attempt to infer the dtype from the data.
Note that when data is a NumPy array, data.dtype is not used for inferring the array
type. This is because NumPy cannot represent all the types of data that can be held in
extension arrays.
Currently, pandas will infer an extension dtype for sequences of
For all other cases, NumPy’s usual inference rules will be used.
Changed in version 1.0.0: Pandas infers nullable-integer dtype for integer data, string
dtype for string data, and nullable-boolean dtype for boolean data.
Changed in version 1.2.0: Pandas now also infers nullable-floating dtype for float-like
input data
copy [bool, default True] Whether to copy the data, even if not necessary. Depending on the
type of data, creating the new array may require copying data, even if copy=False.
Returns
ExtensionArray The newly created array.
Raises
ValueError When data is not 1-dimensional.
See also:
numpy.array Construct a NumPy array.
Series Construct a pandas Series.
Index Construct a pandas Index.
arrays.PandasArray ExtensionArray wrapping a NumPy array.
Series.array Extract the array stored within a Series.
Notes
Omitting the dtype argument means pandas will attempt to infer the best array type from the values in the
data. As new array types are added by pandas and 3rd party libraries, the “best” array type may change. We
recommend specifying dtype to ensure that
1. the correct array type for the data is returned
2. the returned array type doesn’t change as new extension types are added by pandas and third-party libraries
Additionally, if the underlying memory representation of the returned array matters, we recommend specifying
the dtype as a concrete object rather than a string alias or allowing it to be inferred. For example, a future
version of pandas or a 3rd-party library may include a dedicated ExtensionArray for string data. In this event,
the following would no longer return an arrays.PandasArray backed by a NumPy array.
This would instead return the new ExtensionArray dedicated for string data. If you really need the new array to
be backed by a NumPy array, specify that in the dtype.
Examples
If a dtype is not specified, pandas will infer the best dtype from the values. See the description of dtype for the
types pandas infers.
As mentioned in the “Notes” section, new extension types may be added in the future (by pandas or 3rd party
libraries), causing the return value to no longer be an arrays.PandasArray. Specify the dtype as a NumPy
dtype if you need to ensure there’s no future change in behavior.
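For instance, string data is inferred as the nullable string dtype (a minimal illustration; values assumed):
>>> pd.array(["a", None, "c"])
<StringArray>
['a', <NA>, 'c']
Length: 3, dtype: string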
data must be 1-dimensional. A ValueError is raised when the input has the wrong dimensionality.
>>> pd.array(1)
Traceback (most recent call last):
...
ValueError: Cannot pass scalar '1' to 'pandas.array'.
NumPy cannot natively represent timezone-aware datetimes. pandas supports this with the arrays.
DatetimeArray extension array, which can hold timezone-naive or timezone-aware values.
Timestamp, a subclass of datetime.datetime, is pandas’ scalar type for timezone-naive or timezone-aware
datetime data.
Timestamp([ts_input, freq, tz, unit, year, ...])     Pandas replacement for python datetime.datetime object.
pandas.Timestamp
Notes
There are essentially three calling conventions for the constructor. The primary form accepts four parameters.
They can be passed by position or keyword.
The other two forms mimic the parameters from datetime.datetime. They can be passed by either posi-
tion or keyword, but not both mixed together.
Examples
>>> pd.Timestamp('2017-01-01T12')
Timestamp('2017-01-01 12:00:00')
This converts an int representing a Unix-epoch in units of seconds and for a particular timezone
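For instance (epoch value and timezone chosen for illustration):
>>> pd.Timestamp(1513393355, unit='s', tz='US/Pacific')
Timestamp('2017-12-15 19:02:35-0800', tz='US/Pacific')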
Using the other two forms that mimic the API for datetime.datetime:
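For example:
>>> pd.Timestamp(2017, 1, 1, 12)
Timestamp('2017-01-01 12:00:00')
>>> pd.Timestamp(year=2017, month=1, day=1, hour=12)
Timestamp('2017-01-01 12:00:00')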
Attributes
pandas.Timestamp.asm8
Timestamp.asm8
Return numpy datetime64 format in nanoseconds.
pandas.Timestamp.day_of_week
Timestamp.day_of_week
Return day of the week.
pandas.Timestamp.day_of_year
Timestamp.day_of_year
Return the day of the year.
pandas.Timestamp.dayofweek
Timestamp.dayofweek
Return day of the week.
pandas.Timestamp.dayofyear
Timestamp.dayofyear
Return the day of the year.
pandas.Timestamp.days_in_month
Timestamp.days_in_month
Return the number of days in the month.
pandas.Timestamp.daysinmonth
Timestamp.daysinmonth
Return the number of days in the month.
pandas.Timestamp.freqstr
property Timestamp.freqstr
Return the frequency of the Timestamp, if any, as a string.
pandas.Timestamp.is_leap_year
Timestamp.is_leap_year
Return True if year is a leap year.
pandas.Timestamp.is_month_end
Timestamp.is_month_end
Return True if date is last day of month.
pandas.Timestamp.is_month_start
Timestamp.is_month_start
Return True if date is first day of month.
pandas.Timestamp.is_quarter_end
Timestamp.is_quarter_end
Return True if date is last day of the quarter.
pandas.Timestamp.is_quarter_start
Timestamp.is_quarter_start
Return True if date is first day of the quarter.
pandas.Timestamp.is_year_end
Timestamp.is_year_end
Return True if date is last day of the year.
pandas.Timestamp.is_year_start
Timestamp.is_year_start
Return True if date is first day of the year.
pandas.Timestamp.quarter
Timestamp.quarter
Return the quarter of the year.
pandas.Timestamp.tz
property Timestamp.tz
Alias for tzinfo.
pandas.Timestamp.week
Timestamp.week
Return the week number of the year.
pandas.Timestamp.weekofyear
Timestamp.weekofyear
Return the week number of the year.
day
fold
freq
hour
microsecond
minute
month
nanosecond
second
tzinfo
value
year
Methods
pandas.Timestamp.astimezone
Timestamp.astimezone(tz)
Convert tz-aware Timestamp to another time zone.
Parameters
tz [str, pytz.timezone, dateutil.tz.tzfile or None] Time zone for time which Timestamp
will be converted to. None will remove timezone holding UTC time.
Returns
converted [Timestamp]
Raises
TypeError If Timestamp is tz-naive.
pandas.Timestamp.ceil
pandas.Timestamp.combine
pandas.Timestamp.ctime
Timestamp.ctime()
Return ctime() style string.
pandas.Timestamp.date
Timestamp.date()
Return date object with same year, month and day.
pandas.Timestamp.day_name
Timestamp.day_name()
Return the day name of the Timestamp with specified locale.
Parameters
locale [str, default None (English locale)] Locale determining the language in which to
return the day name.
Returns
str
pandas.Timestamp.dst
Timestamp.dst()
Return self.tzinfo.dst(self).
pandas.Timestamp.floor
pandas.Timestamp.fromisocalendar
Timestamp.fromisocalendar()
int, int, int -> Construct a date from the ISO year, week number and weekday.
This is the inverse of the date.isocalendar() function
pandas.Timestamp.fromisoformat
Timestamp.fromisoformat()
string -> datetime from datetime.isoformat() output
pandas.Timestamp.fromordinal
pandas.Timestamp.fromtimestamp
classmethod Timestamp.fromtimestamp(ts)
Transform timestamp[, tz] to tz’s local time from POSIX timestamp.
pandas.Timestamp.isocalendar
Timestamp.isocalendar()
Return a 3-tuple containing ISO year, week number, and weekday.
pandas.Timestamp.isoformat
Timestamp.isoformat()
[sep] -> string in ISO 8601 format, YYYY-MM-DDT[HH[:MM[:SS[.mmm[uuu]]]]][+HH:MM]. sep is
used to separate the year from the time, and defaults to ‘T’. timespec specifies what components of the time
to include (allowed values are ‘auto’, ‘hours’, ‘minutes’, ‘seconds’, ‘milliseconds’, and ‘microseconds’).
pandas.Timestamp.isoweekday
Timestamp.isoweekday()
Return the day of the week represented by the date. Monday == 1 . . . Sunday == 7
pandas.Timestamp.month_name
Timestamp.month_name()
Return the month name of the Timestamp with specified locale.
Parameters
locale [str, default None (English locale)] Locale determining the language in which to
return the month name.
Returns
str
pandas.Timestamp.normalize
Timestamp.normalize()
Normalize Timestamp to midnight, preserving tz information.
pandas.Timestamp.now
classmethod Timestamp.now(tz=None)
Return new Timestamp object representing current time local to tz.
Parameters
tz [str or timezone object, default None] Timezone to localize to.
pandas.Timestamp.replace
pandas.Timestamp.round
pandas.Timestamp.strftime
Timestamp.strftime(format)
Return a string representing the given POSIX timestamp controlled by an explicit format string.
Parameters
format [str] Format string to convert Timestamp to string. See strftime documentation for more
information on the format string: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior.
pandas.Timestamp.strptime
pandas.Timestamp.time
Timestamp.time()
Return time object with same time but with tzinfo=None.
pandas.Timestamp.timestamp
Timestamp.timestamp()
Return POSIX timestamp as float.
pandas.Timestamp.timetuple
Timestamp.timetuple()
Return time tuple, compatible with time.localtime().
pandas.Timestamp.timetz
Timestamp.timetz()
Return time object with same time and tzinfo.
pandas.Timestamp.to_datetime64
Timestamp.to_datetime64()
Return a numpy.datetime64 object with ‘ns’ precision.
pandas.Timestamp.to_julian_date
Timestamp.to_julian_date()
Convert TimeStamp to a Julian Date. 0 Julian date is noon January 1, 4713 BC.
pandas.Timestamp.to_numpy
Timestamp.to_numpy()
Convert the Timestamp to a NumPy datetime64.
New in version 0.25.0.
This is an alias method for Timestamp.to_datetime64(). The dtype and copy parameters are available here
only for compatibility. Their values will not affect the return value.
Returns
numpy.datetime64
See also:
pandas.Timestamp.to_period
Timestamp.to_period()
Return a Period of which this timestamp is an observation.
pandas.Timestamp.to_pydatetime
Timestamp.to_pydatetime()
Convert a Timestamp object to a native Python datetime object.
If warn=True, issue a warning if nanoseconds is nonzero.
pandas.Timestamp.today
pandas.Timestamp.toordinal
Timestamp.toordinal()
Return proleptic Gregorian ordinal. January 1 of year 1 is day 1.
pandas.Timestamp.tz_convert
Timestamp.tz_convert(tz)
Convert tz-aware Timestamp to another time zone.
Parameters
tz [str, pytz.timezone, dateutil.tz.tzfile or None] Time zone for time which Timestamp
will be converted to. None will remove timezone holding UTC time.
Returns
converted [Timestamp]
Raises
TypeError If Timestamp is tz-naive.
pandas.Timestamp.tz_localize
pandas.Timestamp.tzname
Timestamp.tzname()
Return self.tzinfo.tzname(self).
pandas.Timestamp.utcfromtimestamp
classmethod Timestamp.utcfromtimestamp(ts)
Construct a naive UTC datetime from a POSIX timestamp.
pandas.Timestamp.utcnow
classmethod Timestamp.utcnow()
Return a new Timestamp representing UTC day and time.
pandas.Timestamp.utcoffset
Timestamp.utcoffset()
Return self.tzinfo.utcoffset(self).
pandas.Timestamp.utctimetuple
Timestamp.utctimetuple()
Return UTC time tuple, compatible with time.localtime().
pandas.Timestamp.weekday
Timestamp.weekday()
Return the day of the week represented by the date. Monday == 0 . . . Sunday == 6
Properties
pandas.Timestamp.day
Timestamp.day
pandas.Timestamp.fold
Timestamp.fold
pandas.Timestamp.hour
Timestamp.hour
pandas.Timestamp.max
pandas.Timestamp.microsecond
Timestamp.microsecond
pandas.Timestamp.min
pandas.Timestamp.minute
Timestamp.minute
pandas.Timestamp.month
Timestamp.month
pandas.Timestamp.nanosecond
Timestamp.nanosecond
pandas.Timestamp.resolution
pandas.Timestamp.second
Timestamp.second
pandas.Timestamp.tzinfo
Timestamp.tzinfo
pandas.Timestamp.value
Timestamp.value
pandas.Timestamp.year
Timestamp.year
Methods
pandas.Timestamp.freq
Timestamp.freq
A collection of timestamps may be stored in an arrays.DatetimeArray. For timezone-aware data, the .dtype
of a DatetimeArray is a DatetimeTZDtype. For timezone-naive data, np.dtype("datetime64[ns]")
is used.
If the data are tz-aware, then every value in the array must have the same timezone.
pandas.arrays.DatetimeArray
Warning: DatetimeArray is currently experimental, and its API may change without warning. In particular,
DatetimeArray.dtype is expected to change to always be an instance of an ExtensionDtype subclass.
Parameters
values [Series, Index, DatetimeArray, ndarray] The datetime data.
For DatetimeArray values (or a Series or Index boxing one), dtype and freq will be
extracted from values.
dtype [numpy.dtype or DatetimeTZDtype] Note that the only NumPy dtype allowed is ‘datetime64[ns]’.
freq [str or Offset, optional] The frequency.
copy [bool, default False] Whether to copy the underlying array of values.
Attributes
None
Methods
None
pandas.DatetimeTZDtype
Examples
>>> pd.DatetimeTZDtype(tz='UTC')
datetime64[ns, UTC]
>>> pd.DatetimeTZDtype(tz='dateutil/US/Central')
datetime64[ns, tzfile('/usr/share/zoneinfo/US/Central')]
Attributes
pandas.DatetimeTZDtype.unit
property DatetimeTZDtype.unit
The precision of the datetime data.
pandas.DatetimeTZDtype.tz
property DatetimeTZDtype.tz
The timezone.
Methods
None
NumPy can natively represent timedeltas. pandas provides Timedelta for symmetry with Timestamp.
pandas.Timedelta
Notes
Attributes
pandas.Timedelta.asm8
Timedelta.asm8
Return a numpy timedelta64 array scalar view.
Provides access to the array scalar view (i.e. a combination of the value and the units) associated with
the numpy.timedelta64().view(), including a 64-bit integer representation of the timedelta in nanoseconds
(Python int compatible).
Returns
numpy timedelta64 array scalar view Array scalar view of the timedelta in nanoseconds.
Examples
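For example (Timedelta value chosen for illustration):
>>> td = pd.Timedelta('1 days 2 min 3 us 42 ns')  # assumed value
>>> td.asm8
numpy.timedelta64(86520000003042,'ns')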
pandas.Timedelta.components
Timedelta.components
Return a components namedtuple-like.
pandas.Timedelta.days
Timedelta.days
Number of days.
pandas.Timedelta.delta
Timedelta.delta
Return the timedelta in nanoseconds (ns), for internal compatibility.
Returns
int Timedelta in nanoseconds.
Examples
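For example (Timedelta value chosen for illustration):
>>> td = pd.Timedelta('1 days 42 ns')  # assumed value
>>> td.delta
86400000000042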
pandas.Timedelta.microseconds
Timedelta.microseconds
Number of microseconds (>= 0 and less than 1 second).
pandas.Timedelta.nanoseconds
Timedelta.nanoseconds
Return the number of nanoseconds (n), where 0 <= n < 1 microsecond.
Returns
int Number of nanoseconds.
See also:
Timedelta.components Return all attributes with assigned values (i.e. days, hours, minutes, seconds,
milliseconds, microseconds, nanoseconds).
Examples
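A Timedelta with a 42-nanosecond component (an assumed construction, consistent with the output below):
>>> td = pd.Timedelta('1 days 2 min 3 us 42 ns')  # assumed value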
>>> td.nanoseconds
42
pandas.Timedelta.resolution_string
Timedelta.resolution_string
Return a string representing the lowest timedelta resolution.
Each timedelta has a defined resolution that represents the lowest OR most granular level of precision.
Each level of resolution is represented by a short string as defined below:
Resolution: Return value
• Days: ‘D’
• Hours: ‘H’
• Minutes: ‘T’
• Seconds: ‘S’
• Milliseconds: ‘L’
• Microseconds: ‘U’
• Nanoseconds: ‘N’
Returns
str Timedelta resolution.
Examples
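For example (values chosen for illustration):
>>> pd.Timedelta('1 days 2 min 3 us 42 ns').resolution_string
'N'
>>> pd.Timedelta('2 min 3 s').resolution_string
'S'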
pandas.Timedelta.seconds
Timedelta.seconds
Number of seconds (>= 0 and less than 1 day).
freq
is_populated
value
Methods
pandas.Timedelta.ceil
Timedelta.ceil(freq)
Return a new Timedelta ceiled to this resolution.
Parameters
freq [str] Frequency string indicating the ceiling resolution.
pandas.Timedelta.floor
Timedelta.floor(freq)
Return a new Timedelta floored to this resolution.
Parameters
freq [str] Frequency string indicating the flooring resolution.
pandas.Timedelta.isoformat
Timedelta.isoformat()
Format Timedelta as ISO 8601 Duration like P[n]Y[n]M[n]DT[n]H[n]M[n]S, where the [n] s are
replaced by the values. See https://en.wikipedia.org/wiki/ISO_8601#Durations.
Returns
str
See also:
Timestamp.isoformat Function is used to convert the given Timestamp object into the ISO format.
Notes
The longest component is days, whose value may be larger than 365. Every component is always included,
even if its value is 0. Pandas uses nanosecond precision, so up to 9 decimal places may be included in the
seconds component. Trailing 0’s are removed from the seconds component after the decimal. We do not
0-pad components, so it’s ...T5H..., not ...T05H...
Examples
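The Timedelta td used below can be built, for instance, as (assumed construction, consistent with the output shown):
>>> td = pd.Timedelta(days=6, minutes=50, seconds=3,
...                   milliseconds=10, microseconds=10, nanoseconds=12)  # assumed value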
>>> td.isoformat()
'P6DT0H50M3.010010012S'
>>> pd.Timedelta(hours=1, seconds=10).isoformat()
'P0DT1H0M10S'
>>> pd.Timedelta(days=500.5).isoformat()
'P500DT12H0M0S'
pandas.Timedelta.round
Timedelta.round(freq)
Round the Timedelta to the specified resolution.
Parameters
freq [str] Frequency string indicating the rounding resolution.
Returns
a new Timedelta rounded to the given resolution of freq
Raises
ValueError if the freq cannot be converted
pandas.Timedelta.to_numpy
Timedelta.to_numpy()
Convert the Timedelta to a NumPy timedelta64.
New in version 0.25.0.
This is an alias method for Timedelta.to_timedelta64(). The dtype and copy parameters are available here
only for compatibility. Their values will not affect the return value.
Returns
numpy.timedelta64
See also:
pandas.Timedelta.to_pytimedelta
Timedelta.to_pytimedelta()
Convert a pandas Timedelta object into a python timedelta object.
Timedelta objects are internally saved as numpy timedelta64[ns] dtype. Use to_pytimedelta() to convert to
object dtype.
Returns
datetime.timedelta or numpy.array of datetime.timedelta
See also:
Notes
pandas.Timedelta.to_timedelta64
Timedelta.to_timedelta64()
Return a numpy.timedelta64 object with ‘ns’ precision.
pandas.Timedelta.total_seconds
Timedelta.total_seconds()
Total seconds in the duration.
pandas.Timedelta.view
Timedelta.view()
Array view compatibility.
Properties
pandas.Timedelta.freq
Timedelta.freq
pandas.Timedelta.is_populated
Timedelta.is_populated
pandas.Timedelta.max
pandas.Timedelta.min
pandas.Timedelta.resolution
pandas.Timedelta.value
Timedelta.value
Methods
pandas.arrays.TimedeltaArray
Warning: TimedeltaArray is currently experimental, and its API may change without warning. In particular,
TimedeltaArray.dtype is expected to change to be an instance of an ExtensionDtype subclass.
Parameters
values [array-like] The timedelta data.
dtype [numpy.dtype] Currently, only numpy.dtype("timedelta64[ns]") is accepted.
freq [Offset, optional]
copy [bool, default False] Whether to copy the underlying array of data.
Attributes
None
Methods
None
3.5.5 Period
pandas.Period
Attributes
pandas.Period.day
Period.day
Get day of the month that a Period falls on.
Returns
int
See also:
Examples
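For example (period value chosen for illustration):
>>> p = pd.Period('2018-03-11', freq='H')
>>> p.day
11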
pandas.Period.day_of_week
Period.day_of_week
Day of the week the period lies in, with Monday=0 and Sunday=6.
If the period frequency is lower than daily (e.g. hourly), and the period spans over multiple days, the day
at the start of the period is used.
If the frequency is higher than daily (e.g. monthly), the last day of the period is used.
Returns
int Day of the week.
See also:
Examples
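For example (period value chosen for illustration; 2017-12-31 was a Sunday):
>>> per = pd.Period('2017-12-31 22:00', 'H')
>>> per.day_of_week
6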
For periods that span over multiple days, the day at the beginning of the period is returned.
For periods with a frequency higher than days, the last day of the period is returned.
pandas.Period.day_of_year
Period.day_of_year
Return the day of the year.
This attribute returns the day of the year on which the particular date occurs. The return value ranges
from 1 to 365 for regular years and from 1 to 366 for leap years.
Returns
int The day of year.
See also:
Examples
pandas.Period.dayofweek
Period.dayofweek
Day of the week the period lies in, with Monday=0 and Sunday=6.
If the period frequency is lower than daily (e.g. hourly), and the period spans over multiple days, the day
at the start of the period is used.
If the frequency is higher than daily (e.g. monthly), the last day of the period is used.
Returns
int Day of the week.
See also:
Examples
For periods that span over multiple days, the day at the beginning of the period is returned.
For periods with a frequency higher than days, the last day of the period is returned.
pandas.Period.dayofyear
Period.dayofyear
Return the day of the year.
This attribute returns the day of the year on which the particular date occurs. The return value ranges
from 1 to 365 for regular years and from 1 to 366 for leap years.
Returns
int The day of year.
See also:
Examples
pandas.Period.days_in_month
Period.days_in_month
Get the total number of days in the month that this period falls on.
Returns
int
See also:
Examples
>>> p = pd.Period('2018-2-17')
>>> p.days_in_month
28
>>> pd.Period('2018-03-01').days_in_month
31
>>> p = pd.Period('2016-2-17')
>>> p.days_in_month
29
pandas.Period.daysinmonth
Period.daysinmonth
Get the total number of days of the month that the Period falls in.
Returns
int
See also:
Examples
pandas.Period.hour
Period.hour
Get the hour of the day component of the Period.
Returns
int The hour as an integer, between 0 and 23.
See also:
Examples
pandas.Period.minute
Period.minute
Get minute of the hour component of the Period.
Returns
int The minute as an integer, between 0 and 59.
See also:
Examples
pandas.Period.qyear
Period.qyear
Fiscal year the Period lies in according to its starting-quarter.
The year and the qyear of the period will be the same if the fiscal and calendar years are the same. When
they are not, the fiscal year can be different from the calendar year of the period.
Returns
int The fiscal year of the period.
See also:
Examples
If the natural and fiscal year are the same, qyear and year will be the same.
If the fiscal year starts in April (Q-MAR), the first quarter of 2018 will start in April 2017. year will then
be 2017, but qyear will be the fiscal year, 2018.
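For example (values chosen to illustrate the Q-MAR case just described):
>>> per = pd.Period('2018Q1', freq='Q-MAR')
>>> per.start_time
Timestamp('2017-04-01 00:00:00')
>>> per.qyear
2018
>>> per.year
2017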
pandas.Period.second
Period.second
Get the second component of the Period.
Returns
int The second of the Period (ranges from 0 to 59).
See also:
Examples
pandas.Period.start_time
Period.start_time
Get the Timestamp for the start of the period.
Returns
Timestamp
See also:
Examples
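The period used below can be built, for instance, as (assumed construction, consistent with the timestamps shown):
>>> period = pd.Period('2012-1-1', freq='D')  # assumed value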
>>> period.start_time
Timestamp('2012-01-01 00:00:00')
>>> period.end_time
Timestamp('2012-01-01 23:59:59.999999999')
pandas.Period.week
Period.week
Get the week of the year on the given Period.
Returns
int
See also:
Examples
pandas.Period.weekday
Period.weekday
Day of the week the period lies in, with Monday=0 and Sunday=6.
If the period frequency is lower than daily (e.g. hourly), and the period spans over multiple days, the day
at the start of the period is used.
If the frequency is higher than daily (e.g. monthly), the last day of the period is used.
Returns
int Day of the week.
See also:
Examples
For periods that span over multiple days, the day at the beginning of the period is returned.
For periods with a frequency higher than days, the last day of the period is returned.
end_time
freq
freqstr
is_leap_year
month
ordinal
quarter
weekofyear
year
Methods
pandas.Period.asfreq
Period.asfreq()
Convert Period to desired frequency, at the start or end of the interval.
Parameters
freq [str] The desired frequency.
how [{‘E’, ‘S’, ‘end’, ‘start’}, default ‘end’] Start or end of the timespan.
Returns
resampled [Period]
pandas.Period.strftime
Period.strftime()
Returns the string representation of the Period, depending on the selected fmt. fmt must be a
string containing one or several directives. The method recognizes the same directives as the
time.strftime() function of the standard Python distribution, as well as the specific additional directives
%f, %F, %q (formatting & docs originally from scikits.timeseries).
Notes
(1) The %f directive is the same as %y if the frequency is not quarterly. Otherwise, it corresponds to
the ‘fiscal’ year, as defined by the qyear attribute.
(2) The %F directive is the same as %Y if the frequency is not quarterly. Otherwise, it corresponds to
the ‘fiscal’ year, as defined by the qyear attribute.
(3) The %p directive only affects the output hour field if the %I directive is used to parse the hour.
(4) The range really is 0 to 61; this accounts for leap seconds and the (very rare) double leap seconds.
(5) The %U and %W directives are only used in calculations when the day of the week and the year are
specified.
Examples
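A brief illustration of the quarterly %F / %q directives (values assumed):
>>> a = pd.Period(freq='Q-JUL', year=2006, quarter=1)
>>> a.strftime('%F-Q%q')
'2006-Q1'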
pandas.Period.to_timestamp
Period.to_timestamp()
Return the Timestamp representation of the Period.
Uses the target frequency specified at the part of the period specified by how, which is either Start or
Finish.
Parameters
freq [str or DateOffset] Target frequency. Default is ‘D’ if self.freq is week or longer
and ‘S’ otherwise.
how [str, default ‘S’ (start)] One of ‘S’, ‘E’. Can be aliased as case insensitive ‘Start’,
‘Finish’, ‘Begin’, ‘End’.
Returns
Timestamp
now
Properties
pandas.Period.end_time
Period.end_time
pandas.Period.freq
Period.freq
pandas.Period.freqstr
Period.freqstr
pandas.Period.is_leap_year
Period.is_leap_year
pandas.Period.month
Period.month
pandas.Period.ordinal
Period.ordinal
pandas.Period.quarter
Period.quarter
pandas.Period.weekofyear
Period.weekofyear
pandas.Period.year
Period.year
Methods
pandas.Period.now
Period.now()
A collection of Period objects may be stored in an arrays.PeriodArray. Every period in a PeriodArray must
have the same freq.
arrays.PeriodArray(values[, dtype, freq, copy]) Pandas ExtensionArray for storing Period data.
pandas.arrays.PeriodArray
Notes
Attributes
None
Methods
None
pandas.PeriodDtype
class pandas.PeriodDtype(freq=None)
An ExtensionDtype for Period data.
This is not an actual numpy dtype, but a duck type.
Parameters
freq [str or DateOffset] The frequency of this PeriodDtype.
Examples
>>> pd.PeriodDtype(freq='D')
period[D]
>>> pd.PeriodDtype(freq=pd.offsets.MonthEnd())
period[M]
Attributes
pandas.PeriodDtype.freq
property PeriodDtype.freq
The frequency object of this PeriodDtype.
Methods
None
pandas.Interval
class pandas.Interval
Immutable object implementing an Interval, a bounded slice-like interval.
Parameters
left [orderable scalar] Left bound for the interval.
right [orderable scalar] Right bound for the interval.
closed [{‘right’, ‘left’, ‘both’, ‘neither’}, default ‘right’] Whether the interval is closed on
the left-side, right-side, both or neither. See the Notes for more detailed explanation.
See also:
IntervalIndex An Index of Interval objects that are all closed on the same side.
cut Convert continuous data into discrete bins (Categorical of Interval objects).
qcut Convert continuous data into bins (Categorical of Interval objects) based on quantiles.
Period Represents a period of time.
Notes
The parameters left and right must be of the same type; you must be able to compare them, and they must
satisfy left <= right.
A closed interval (in mathematics denoted by square brackets) contains its endpoints, i.e. the closed interval
[0, 5] is characterized by the conditions 0 <= x <= 5. This is what closed='both' stands for. An
open interval (in mathematics denoted by parentheses) does not contain its endpoints, i.e. the open interval (0,
5) is characterized by the conditions 0 < x < 5. This is what closed='neither' stands for. Intervals
can also be half-open or half-closed, i.e. [0, 5) is described by 0 <= x < 5 (closed='left') and (0,
5] is described by 0 < x <= 5 (closed='right').
Examples
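The interval iv tested below can be built as (assumed construction, consistent with the membership tests that follow):
>>> iv = pd.Interval(left=0, right=5)  # closed on the right by default
>>> iv
Interval(0, 5, closed='right')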
>>> 2.5 in iv
True
>>> 0 in iv
False
>>> 5 in iv
True
>>> 0.0001 in iv
True
>>> iv.length
5
You can operate with + and * over an Interval and the operation is applied to each of its bounds, so the result
depends on the type of the bound elements
>>> shifted_iv = iv + 3
>>> shifted_iv
Interval(3, 8, closed='right')
>>> extended_iv = iv * 10.0
>>> extended_iv
Interval(0.0, 50.0, closed='right')
Attributes
pandas.Interval.closed
Interval.closed
Whether the interval is closed on the left-side, right-side, both or neither.
pandas.Interval.closed_left
Interval.closed_left
Check if the interval is closed on the left side.
For the meaning of closed and open see Interval.
Returns
bool True if the Interval is closed on the left-side.
pandas.Interval.closed_right
Interval.closed_right
Check if the interval is closed on the right side.
For the meaning of closed and open see Interval.
Returns
bool True if the Interval is closed on the right-side.
pandas.Interval.is_empty
Interval.is_empty
Indicates if an interval is empty, meaning it contains no points.
New in version 0.25.0.
Returns
Examples
pandas.Interval.left
Interval.left
Left bound for the interval.
pandas.Interval.length
Interval.length
Return the length of the Interval.
pandas.Interval.mid
Interval.mid
Return the midpoint of the Interval.
pandas.Interval.open_left
Interval.open_left
Check if the interval is open on the left side.
For the meaning of closed and open see Interval.
Returns
bool True if the Interval is not closed on the left-side.
pandas.Interval.open_right
Interval.open_right
Check if the interval is open on the right side.
For the meaning of closed and open see Interval.
Returns
bool True if the Interval is not closed on the right-side.
pandas.Interval.right
Interval.right
Right bound for the interval.
Methods
pandas.Interval.overlaps
Interval.overlaps()
Check whether two Interval objects overlap.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have
an open endpoint in common do not overlap.
New in version 0.24.0.
Parameters
other [Interval] Interval to check against for an overlap.
Returns
bool True if the two intervals overlap.
See also:
Examples
>>> i1 = pd.Interval(0, 2)
>>> i2 = pd.Interval(1, 3)
>>> i1.overlaps(i2)
True
>>> i3 = pd.Interval(4, 5)
>>> i1.overlaps(i3)
False
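Intervals that share a closed endpoint overlap, while a shared open endpoint does not count (values assumed for illustration):
>>> i4 = pd.Interval(0, 1, closed='both')
>>> i5 = pd.Interval(1, 2, closed='both')
>>> i4.overlaps(i5)
True
>>> i6 = pd.Interval(1, 2, closed='neither')
>>> i4.overlaps(i6)
False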
Properties
arrays.IntervalArray(data[, closed, dtype, ...])     Pandas array for interval data that are closed on the same side.
pandas.arrays.IntervalArray
Notes
Examples
Attributes
pandas.arrays.IntervalArray.left
property IntervalArray.left
Return the left endpoints of each Interval in the IntervalArray as an Index.
pandas.arrays.IntervalArray.right
property IntervalArray.right
Return the right endpoints of each Interval in the IntervalArray as an Index.
pandas.arrays.IntervalArray.closed
property IntervalArray.closed
Whether the intervals are closed on the left-side, right-side, both or neither.
pandas.arrays.IntervalArray.mid
property IntervalArray.mid
Return the midpoint of each Interval in the IntervalArray as an Index.
pandas.arrays.IntervalArray.length
property IntervalArray.length
Return an Index with entries denoting the length of each Interval in the IntervalArray.
pandas.arrays.IntervalArray.is_empty
IntervalArray.is_empty
Indicates if an interval is empty, meaning it contains no points.
New in version 0.25.0.
Returns
bool or ndarray A boolean indicating if a scalar Interval is empty, or a boolean
ndarray positionally indicating if an Interval in an IntervalArray or
IntervalIndex is empty.
Examples
pandas.arrays.IntervalArray.is_non_overlapping_monotonic
property IntervalArray.is_non_overlapping_monotonic
Return True if the IntervalArray is non-overlapping (no Intervals share points) and is either monotonic
increasing or monotonic decreasing, else False.
Methods
from_arrays(left, right[, closed, copy, dtype])      Construct from two arrays defining the left and right bounds.
from_tuples(data[, closed, copy, dtype])             Construct an IntervalArray from an array-like of tuples.
from_breaks(breaks[, closed, copy, dtype])           Construct an IntervalArray from an array of splits.
contains(other)                                      Check elementwise if the Intervals contain the value.
overlaps(other)                                      Check elementwise if an Interval overlaps the values in the IntervalArray.
pandas.arrays.IntervalArray.from_arrays
Notes
Each element of left must be less than or equal to the right element at the same position. If an element is
missing, it must be missing in both left and right. A TypeError is raised when using an unsupported type
for left or right. At the moment, ‘category’, ‘object’, and ‘string’ subtypes are not supported.
pandas.arrays.IntervalArray.from_tuples
Examples
pandas.arrays.IntervalArray.from_breaks
Examples
pandas.arrays.IntervalArray.contains
IntervalArray.contains(other)
Check elementwise if the Intervals contain the value.
Return a boolean mask whether the value is contained in the Intervals of the IntervalArray.
New in version 0.25.0.
Parameters
other [scalar] The value to check whether it is contained in the Intervals.
Returns
boolean array
See also:
Examples
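The IntervalArray queried below can be built, for example, from tuples (assumed construction, consistent with the result shown):
>>> intervals = pd.arrays.IntervalArray.from_tuples([(0, 1), (1, 3), (2, 4)])  # illustrative data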
>>> intervals.contains(0.5)
array([ True, False, False])
pandas.arrays.IntervalArray.overlaps
IntervalArray.overlaps(other)
Check elementwise if an Interval overlaps the values in the IntervalArray.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have
an open endpoint in common do not overlap.
New in version 0.24.0.
Parameters
other [IntervalArray] Interval to check against for an overlap.
Returns
ndarray Boolean array positionally indicating where an overlap occurs.
See also:
Examples
pandas.arrays.IntervalArray.set_closed
IntervalArray.set_closed(closed)
Return an IntervalArray identical to the current one, but closed on the specified side.
New in version 0.24.0.
Parameters
closed [{‘left’, ‘right’, ‘both’, ‘neither’}] Whether the intervals are closed on the left-
side, right-side, both or neither.
Returns
new_index [IntervalArray]
Examples
pandas.arrays.IntervalArray.to_tuples
IntervalArray.to_tuples(na_tuple=True)
Return an ndarray of tuples of the form (left, right).
Parameters
na_tuple [bool, default True] Returns NA as a tuple if True, (nan, nan), or just as
the NA value itself if False, nan.
Returns
tuples: ndarray
pandas.IntervalDtype
class pandas.IntervalDtype(subtype=None)
An ExtensionDtype for Interval data.
This is not an actual numpy dtype, but a duck type.
Parameters
subtype [str, np.dtype] The dtype of the Interval bounds.
Examples
>>> pd.IntervalDtype(subtype='int64')
interval[int64]
Attributes
pandas.IntervalDtype.subtype
property IntervalDtype.subtype
The dtype of the Interval bounds.
Methods
None
numpy.ndarray cannot natively represent integer-data with missing values. pandas provides this through
arrays.IntegerArray.
pandas.arrays.IntegerArray
Warning: IntegerArray is currently experimental, and its API or internal implementation may change
without warning.
Examples
String aliases for the dtypes are also available. They are capitalized.
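For example, using the capitalized string alias (values assumed):
>>> pd.array([1, 2, None], dtype='Int64')
<IntegerArray>
[1, 2, <NA>]
Length: 3, dtype: Int64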
Attributes
None
Methods
None
pandas.Int8Dtype
class pandas.Int8Dtype
An ExtensionDtype for int8 integer data.
Changed in version 1.0.0: Now uses pandas.NA as its missing value, rather than numpy.nan.
Attributes
None
Methods
None
pandas.Int16Dtype
class pandas.Int16Dtype
An ExtensionDtype for int16 integer data.
Changed in version 1.0.0: Now uses pandas.NA as its missing value, rather than numpy.nan.
Attributes
None
Methods
None
pandas.Int32Dtype
class pandas.Int32Dtype
An ExtensionDtype for int32 integer data.
Changed in version 1.0.0: Now uses pandas.NA as its missing value, rather than numpy.nan.
Attributes
None
Methods
None
pandas.Int64Dtype
class pandas.Int64Dtype
An ExtensionDtype for int64 integer data.
Changed in version 1.0.0: Now uses pandas.NA as its missing value, rather than numpy.nan.
Attributes
None
Methods
None
pandas.UInt8Dtype
class pandas.UInt8Dtype
An ExtensionDtype for uint8 integer data.
Changed in version 1.0.0: Now uses pandas.NA as its missing value, rather than numpy.nan.
Attributes
None
Methods
None
pandas.UInt16Dtype
class pandas.UInt16Dtype
An ExtensionDtype for uint16 integer data.
Changed in version 1.0.0: Now uses pandas.NA as its missing value, rather than numpy.nan.
Attributes
None
Methods
None
pandas.UInt32Dtype
class pandas.UInt32Dtype
An ExtensionDtype for uint32 integer data.
Changed in version 1.0.0: Now uses pandas.NA as its missing value, rather than numpy.nan.
Attributes
None
Methods
None
pandas.UInt64Dtype
class pandas.UInt64Dtype
An ExtensionDtype for uint64 integer data.
Changed in version 1.0.0: Now uses pandas.NA as its missing value, rather than numpy.nan.
Attributes
None
Methods
None
pandas defines a custom data type for representing data that can take only a limited, fixed set of values. The dtype of
a Categorical can be described by a pandas.api.types.CategoricalDtype.
CategoricalDtype([categories, ordered])              Type for categorical data with the categories and orderedness.
pandas.CategoricalDtype
Notes
This class is useful for specifying the type of a Categorical independent of the values. See CategoricalDtype
for more.
Examples
An empty CategoricalDtype with a specific dtype can be created by providing an empty index, as follows:
>>> pd.CategoricalDtype(pd.DatetimeIndex([])).categories.dtype
dtype('<M8[ns]')
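A dtype with explicit categories and ordering can be built and inspected like this (values assumed):
>>> t = pd.CategoricalDtype(categories=['b', 'a'], ordered=True)
>>> t.categories
Index(['b', 'a'], dtype='object')
>>> t.ordered
True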
Attributes
pandas.CategoricalDtype.categories
property CategoricalDtype.categories
An Index containing the unique categories allowed.
pandas.CategoricalDtype.ordered
property CategoricalDtype.ordered
Whether the categories have an ordered relationship.
Methods
None
pandas.Categorical
categories [Index-like (unique), optional] The unique categories for this categorical. If not
given, the categories are assumed to be the unique values of values (sorted, if possible,
otherwise in the order in which they appear).
ordered [bool, default False] Whether or not this categorical is treated as an ordered categorical.
If True, the resulting categorical will be ordered. An ordered categorical respects,
when sorted, the order of its categories attribute (which in turn is the categories argu-
ment, if provided).
dtype [CategoricalDtype] An instance of CategoricalDtype to use for this categorical.
Raises
ValueError If the categories do not validate.
TypeError If an explicit ordered=True is given but no categories and the values are not
sortable.
See also:
CategoricalDtype Type for categorical data.
CategoricalIndex An Index with an underlying Categorical.
Notes
Examples
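The Categorical c used below, containing one missing value, can be built as (assumed construction, consistent with the codes shown):
>>> c = pd.Categorical([1, 2, 3, 1, 2, 3, np.nan])  # illustrative data
>>> c
[1, 2, 3, 1, 2, 3, NaN]
Categories (3, int64): [1, 2, 3]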
>>> c.codes
array([ 0, 1, 2, 0, 1, 2, -1], dtype=int8)
Ordered Categoricals can be sorted according to the custom order of the categories and can have a min and max
value.
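For instance (category order assumed for illustration):
>>> c = pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'], ordered=True,
...                    categories=['c', 'b', 'a'])
>>> c.min()
'c'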
Attributes
pandas.Categorical.categories
property Categorical.categories
The categories of this categorical.
Setting assigns new values to each category (effectively a rename of each individual category).
The assigned value has to be a list-like object. All items must be unique and the number of items in the
new categories must be the same as the number of items in the old categories.
Assigning to categories is an inplace operation!
Raises
ValueError If the new categories do not validate as categories or if the number of new
categories is unequal to the number of old categories
See also:
pandas.Categorical.codes
property Categorical.codes
The category codes of this categorical.
Codes are an array of integers which are the positions of the actual values in the categories array.
There is no setter, use the other categorical methods and the normal item setter to change values in the
categorical.
Returns
ndarray[int] A non-writable view of the codes array.
pandas.Categorical.ordered
property Categorical.ordered
Whether the categories have an ordered relationship.
pandas.Categorical.dtype
property Categorical.dtype
The CategoricalDtype for this instance.
Methods
from_codes(codes[, categories, ordered, dtype]) Make a Categorical type from codes and categories
or dtype.
__array__([dtype]) The numpy array interface.
pandas.Categorical.from_codes
Examples
pandas.Categorical.__array__
Categorical.__array__(dtype=None)
The numpy array interface.
Returns
numpy.array A numpy array of either the specified dtype or, if dtype==None (default),
the same dtype as categorical.categories.dtype.
The alternative Categorical.from_codes() constructor can be used when you have the categories and integer
codes already:
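A brief sketch of that constructor (the codes and categories are assumed for illustration):
>>> cat = pd.Categorical.from_codes([0, 1, 0], categories=["a", "b"])
>>> list(cat)
['a', 'b', 'a']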
np.asarray(categorical) works by implementing the array interface. Be aware, that this converts the Cate-
gorical back to a NumPy array, so categories and order information is not preserved!
A Categorical can be stored in a Series or DataFrame. To create a Series of dtype category, use cat =
s.astype(dtype) or Series(..., dtype=dtype) where dtype is either
• the string 'category'
• an instance of CategoricalDtype.
If the Series is of dtype CategoricalDtype, Series.cat can be used to change the categorical data. See
Categorical accessor for more.
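An illustrative sketch of both routes (the example data is assumed):
>>> s = pd.Series(["a", "b", "a"])
>>> s.astype("category").dtype
CategoricalDtype(categories=['a', 'b'], ordered=False)
>>> pd.Series(["a", "b", "a"], dtype=pd.CategoricalDtype(["b", "a"])).cat.categories
Index(['b', 'a'], dtype='object')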
Data where a single value is repeated many times (e.g. 0 or NaN) may be stored efficiently as an arrays.SparseArray.
pandas.arrays.SparseArray
data.dtype na_value
float np.nan
int 0
bool False
datetime64 pd.NaT
timedelta64 pd.NaT
The fill value is potentially specified in three ways. In order of precedence, these are
1. The fill_value argument
2. dtype.fill_value if fill_value is None and dtype is a SparseDtype
3. data.dtype.fill_value if fill_value is None and dtype is not a
SparseDtype and data is a SparseArray.
kind [{‘integer’, ‘block’}, default ‘integer’] The type of storage for sparse locations.
• ‘block’: Stores a block and block_length for each contiguous span of sparse values.
This is best when sparse data tends to be clumped together, with large regions of
fill-value values between sparse values.
• ‘integer’: uses an integer to store the location of each sparse value.
dtype [np.dtype or SparseDtype, optional] The dtype to use for the SparseArray. For numpy
dtypes, this determines the dtype of self.sp_values. For SparseDtype, this deter-
mines self.sp_values and self.fill_value.
copy [bool, default False] Whether to explicitly copy the incoming data array.
Examples
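A brief sketch (the values are assumed for illustration); with integer data the default fill value is 0, per the table above:
>>> sp = pd.arrays.SparseArray([0, 0, 1, 2, 0])
>>> sp.fill_value
0
>>> sp.sp_values
array([1, 2])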
Attributes
None
Methods
None
pandas.SparseDtype
dtype na_value
float np.nan
int 0
bool False
datetime64 pd.NaT
timedelta64 pd.NaT
Attributes
None
Methods
None
The Series.sparse accessor may be used to access sparse-specific attributes and methods if the Series contains
sparse values. See Sparse accessor for more.
When working with text data, where each valid element is a string or missing, we recommend using StringDtype
(with the alias "string").
pandas.arrays.StringArray
Warning: StringArray is considered experimental. The implementation and parts of the API may change
without warning.
Parameters
values [array-like] The array of data.
Warning: Currently, this expects an object-dtype ndarray where the elements are
Python strings or pandas.NA. This may change without warning in the future.
Use pandas.array() with dtype="string" for a stable way of creating a
StringArray from any sequence.
Notes
Examples
Unlike arrays instantiated with dtype="object", StringArray will convert the values to strings.
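A short sketch using the recommended pandas.array() constructor (the values are assumed):
>>> pd.array(["a", None, "c"], dtype="string")
<StringArray>
['a', <NA>, 'c']
Length: 3, dtype: string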
Attributes
None
Methods
None
pandas.StringDtype
class pandas.StringDtype
Extension dtype for string data.
New in version 1.0.0.
Warning: StringDtype is considered experimental. The implementation and parts of the API may change
without warning.
In particular, StringDtype.na_value may change to no longer be numpy.nan.
Examples
>>> pd.StringDtype()
StringDtype
Attributes
None
Methods
None
The Series.str accessor is available for Series backed by an arrays.StringArray. See String handling
for more.
The boolean dtype (with the alias "boolean") provides support for storing boolean data (True, False values) with
missing values, which is not possible with a bool numpy.ndarray.
arrays.BooleanArray(values, mask[, copy]) Array of boolean (True/False) data with missing values.
pandas.arrays.BooleanArray
Warning: BooleanArray is considered experimental. The implementation and parts of the API may change
without warning.
Parameters
values [numpy.ndarray] A 1-d boolean-dtype array with the data.
mask [numpy.ndarray] A 1-d boolean-dtype array indicating missing values (True indicates
missing).
copy [bool, default False] Whether to copy the values and mask arrays.
Returns
BooleanArray
Examples
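A brief sketch (the values are assumed); pandas.array() with dtype="boolean" is the recommended way to construct one:
>>> pd.array([True, False, None], dtype="boolean")
<BooleanArray>
[True, False, <NA>]
Length: 3, dtype: boolean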
Attributes
None
Methods
None
pandas.BooleanDtype
class pandas.BooleanDtype
Extension dtype for boolean data.
New in version 1.0.0.
Warning: BooleanDtype is considered experimental. The implementation and parts of the API may change
without warning.
Examples
>>> pd.BooleanDtype()
BooleanDtype
Attributes
None
Methods
None
3.6.1 Index
Many of these methods or variants thereof are available on the objects that contain an index (Series/DataFrame)
and those should most likely be used before calling these methods directly.
Index([data, dtype, copy, name, tupleize_cols]) Immutable sequence used for indexing and alignment.
pandas.Index
Notes
Examples
>>> pd.Index(list('abc'))
Index(['a', 'b', 'c'], dtype='object')
Attributes
pandas.Index.T
property Index.T
Return the transpose, which is by definition self.
pandas.Index.array
Index.array
The ExtensionArray of the data backing this Series or Index.
New in version 0.24.0.
Returns
ExtensionArray An ExtensionArray of the values stored within. For extension types,
this is the actual array. For NumPy native types, this is a thin (no copy) wrapper
around numpy.ndarray.
.array differs from .values, which may require converting the data to a different form.
See also:
Notes
This table lays out the different array types for each extension dtype within pandas.
For any 3rd-party extension types, the array type will be an ExtensionArray.
For all remaining dtypes .array will be an arrays.PandasArray wrapping the actual ndarray stored within. If you absolutely need a NumPy array (possibly with copying / coercing data), then use Series.to_numpy() instead.
Examples
For regular NumPy types like int, and float, a PandasArray is returned.
>>> pd.Series([1, 2, 3]).array
<PandasArray>
[1, 2, 3]
Length: 3, dtype: int64
pandas.Index.asi8
property Index.asi8
Integer representation of the values.
Returns
ndarray An ndarray with int64 dtype.
pandas.Index.dtype
Index.dtype
Return the dtype object of the underlying data.
pandas.Index.has_duplicates
property Index.has_duplicates
Check if the Index has duplicate values.
Returns
bool Whether or not the Index has duplicate values.
Examples
pandas.Index.hasnans
Index.hasnans
Return True if there are any NaNs; enables various performance speedups.
pandas.Index.inferred_type
Index.inferred_type
Return a string of the type inferred from the values.
pandas.Index.is_all_dates
Index.is_all_dates
Whether or not the index values only consist of dates.
pandas.Index.is_monotonic
property Index.is_monotonic
Alias for is_monotonic_increasing.
pandas.Index.is_monotonic_decreasing
property Index.is_monotonic_decreasing
Return whether the index is monotonic decreasing (values only equal or decreasing).
Examples
pandas.Index.is_monotonic_increasing
property Index.is_monotonic_increasing
Return whether the index is monotonic increasing (values only equal or increasing).
Examples
pandas.Index.is_unique
Index.is_unique
Return if the index has unique values.
pandas.Index.name
property Index.name
Return Index or MultiIndex name.
pandas.Index.nbytes
property Index.nbytes
Return the number of bytes in the underlying data.
pandas.Index.ndim
property Index.ndim
Number of dimensions of the underlying data, by definition 1.
pandas.Index.nlevels
property Index.nlevels
Number of levels.
pandas.Index.shape
property Index.shape
Return a tuple of the shape of the underlying data.
pandas.Index.size
property Index.size
Return the number of elements in the underlying data.
pandas.Index.values
property Index.values
Return an array representing the data in the Index.
Returns
array: numpy.ndarray or ExtensionArray
See also:
empty
names
Methods
pandas.Index.all
Index.all()
Return whether all elements are Truthy.
Parameters
*args These parameters will be passed to numpy.all.
**kwargs These parameters will be passed to numpy.all.
Returns
all [bool or array_like (if axis is specified)] A single element array_like may be con-
verted to bool.
See also:
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal
to zero.
Examples
all: True, because nonzero integers are considered True.
any: True, because 1 is considered True.
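An illustrative sketch of both checks (the example indexes are assumed):
>>> pd.Index([1, 2, 3]).all()
True
>>> pd.Index([0, 0, 0]).any()
False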
pandas.Index.any
Index.any(*args, **kwargs)
Return whether any element is Truthy.
Parameters
*args These parameters will be passed to numpy.any.
**kwargs These parameters will be passed to numpy.any.
Returns
any [bool or array_like (if axis is specified)] A single element array_like may be con-
verted to bool.
See also:
Notes
Not a Number (NaN), positive infinity and negative infinity evaluate to True because these are not equal
to zero.
Examples
pandas.Index.append
Index.append(other)
Append a collection of Index options together.
Parameters
other [Index or list/tuple of indices]
Returns
appended [Index]
pandas.Index.argmax
Examples
>>> s.argmax()
2
>>> s.argmin()
0
The maximum cereal calories is the third element and the minimum cereal calories is the first element,
since series is zero-indexed.
pandas.Index.argmin
Examples
>>> s.argmax()
2
>>> s.argmin()
0
The maximum cereal calories is the third element and the minimum cereal calories is the first element,
since series is zero-indexed.
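As an illustrative sketch, a Series s consistent with the results above (calorie values assumed) could be:
>>> s = pd.Series([100.0, 110.0, 120.0],
...               index=["Corn Flakes", "Almond Delight", "Cinnamon Toast Crunch"])
>>> s.argmax()
2
>>> s.argmin()
0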
pandas.Index.argsort
Index.argsort(*args, **kwargs)
Return the integer indices that would sort the index.
Parameters
*args Passed to numpy.ndarray.argsort.
**kwargs Passed to numpy.ndarray.argsort.
Returns
numpy.ndarray Integer indices that would sort the index if used as an indexer.
See also:
Examples
>>> idx[order]
Index(['a', 'b', 'c', 'd'], dtype='object')
pandas.Index.asof
Index.asof(label)
Return the label from the index, or, if not present, the previous one.
Assuming that the index is sorted, return the passed index label if it is in the index, or return the previous
index label if the passed one is not in the index.
Parameters
label [object] The label up to which the method returns the latest index label.
Returns
object The passed label if it is in the index. The previous label if the passed label is not
in the sorted index or NaN if there is no such label.
See also:
Examples
If the label is in the index, the method returns the passed label.
>>> idx.asof('2014-01-02')
'2014-01-02'
If all of the labels in the index are later than the passed label, NaN is returned.
>>> idx.asof('1999-01-02')
nan
pandas.Index.asof_locs
Index.asof_locs(where, mask)
Return the locations (indices) of labels in the index.
As in the asof function, if the label (a particular entry in where) is not in the index, the latest index label
up to the passed label is chosen and its index returned.
If all of the labels in the index are later than a label in where, -1 is returned.
mask is used to ignore NA values in the index during calculation.
Parameters
where [Index] An Index consisting of an array of timestamps.
mask [array-like] Array of booleans denoting where values in the original data are not
NA.
Returns
numpy.ndarray An array of locations (indices) of the labels from the Index which
correspond to the return values of the asof function for every element in where.
pandas.Index.astype
Index.astype(dtype, copy=True)
Create an Index with values cast to dtypes.
The class of a new Index is determined by dtype. When conversion is impossible, a TypeError exception
is raised.
Parameters
dtype [numpy dtype or pandas type] Note that any signed integer dtype is treated as
'int64', and any unsigned integer dtype is treated as 'uint64', regardless of
the size.
copy [bool, default True] By default, astype always returns a newly allocated object.
If copy is set to False and internal requirements on dtype are satisfied, the original
data is used to create a new Index or the original Index is returned.
Returns
Index Index with values cast to specified dtype.
pandas.Index.copy
Notes
In most cases, there should be no functional difference from using deep, but if deep is passed it will
attempt to deepcopy.
pandas.Index.delete
Index.delete(loc)
Make new Index with passed location(-s) deleted.
Parameters
loc [int or list of int] Location of item(-s) which will be deleted. Use a list of locations
to delete more than one value at the same time.
Returns
Index New Index with passed location(-s) deleted.
See also:
numpy.delete Delete any rows and columns from a NumPy array (ndarray).
Examples
pandas.Index.difference
Index.difference(other, sort=None)
Return a new Index with elements of index not in other.
This is the set difference of two Index objects.
Parameters
other [Index or array-like]
sort [False or None, default None] Whether to sort the resulting index. By default, the
values are attempted to be sorted, but any TypeError from incomparable elements
is caught by pandas.
• None : Attempt to sort the result, but catch any TypeErrors from comparing
incomparable elements.
• False : Do not sort the result.
New in version 0.24.0.
Changed in version 0.24.1: Changed the default value from True to None (with-
out change in behaviour).
Returns
difference [Index]
Examples
pandas.Index.drop
Index.drop(labels, errors='raise')
Make new Index with passed list of labels deleted.
Parameters
labels [array-like]
errors [{‘ignore’, ‘raise’}, default ‘raise’] If ‘ignore’, suppress error and existing labels
are dropped.
Returns
dropped [Index]
Raises
KeyError If not all of the labels are found in the selected axis
pandas.Index.drop_duplicates
Index.drop_duplicates(keep='first')
Return Index with duplicate values removed.
Parameters
keep [{‘first’, ‘last’, False}, default ‘first’]
• ‘first’ : Drop duplicates except for the first occurrence.
• ‘last’ : Drop duplicates except for the last occurrence.
• False : Drop all duplicates.
Returns
deduplicated [Index]
See also:
Examples
The keep parameter controls which duplicate values are removed. The value ‘first’ keeps the first occur-
rence for each set of duplicated entries. The default value of keep is ‘first’.
>>> idx.drop_duplicates(keep='first')
Index(['lama', 'cow', 'beetle', 'hippo'], dtype='object')
The value ‘last’ keeps the last occurrence for each set of duplicated entries.
>>> idx.drop_duplicates(keep='last')
Index(['cow', 'beetle', 'lama', 'hippo'], dtype='object')
pandas.Index.droplevel
Index.droplevel(level=0)
Return index with requested level(s) removed.
If resulting index has only 1 level left, the result will be of Index type, not MultiIndex.
Parameters
level [int, str, or list-like, default 0] If a string is given, must be the name of a level If
list-like, elements must be names or indexes of levels.
Returns
Index or MultiIndex
Examples
>>> mi = pd.MultiIndex.from_arrays(
... [[1, 2], [3, 4], [5, 6]], names=['x', 'y', 'z'])
>>> mi
MultiIndex([(1, 3, 5),
(2, 4, 6)],
names=['x', 'y', 'z'])
>>> mi.droplevel()
MultiIndex([(3, 5),
(4, 6)],
names=['y', 'z'])
>>> mi.droplevel(2)
MultiIndex([(1, 3),
(2, 4)],
names=['x', 'y'])
>>> mi.droplevel('z')
MultiIndex([(1, 3),
(2, 4)],
names=['x', 'y'])
pandas.Index.dropna
Index.dropna(how='any')
Return Index without NA/NaN values.
Parameters
how [{‘any’, ‘all’}, default ‘any’] If the Index is a MultiIndex, drop the value when
any or all levels are NaN.
Returns
Index
pandas.Index.duplicated
Index.duplicated(keep='first')
Indicate duplicate index values.
Duplicated values are indicated as True values in the resulting array. Either all duplicates, all except the
first, or all except the last occurrence of duplicates can be indicated.
Parameters
keep [{‘first’, ‘last’, False}, default ‘first’] The value or values in a set of duplicates to
mark as missing.
• ‘first’ : Mark duplicates as True except for the first occurrence.
• ‘last’ : Mark duplicates as True except for the last occurrence.
• False : Mark all duplicates as True.
Returns
numpy.ndarray
See also:
Examples
By default, for each set of duplicated values, the first occurrence is set to False and all others to True, which is equivalent to:
>>> idx.duplicated(keep='first')
array([False, False, True, False, True])
By using ‘last’, the last occurrence of each set of duplicated values is set on False and all others on True:
>>> idx.duplicated(keep='last')
array([ True, False, True, False, False])
>>> idx.duplicated(keep=False)
array([ True, False, True, False, True])
pandas.Index.equals
Index.equals(other)
Determine if two Index objects are equal.
The things that are being compared are:
• The elements inside the Index object.
• The order of the elements inside the Index object.
Parameters
other [Any] The other object to compare against.
Returns
bool True if “other” is an Index and it has the same elements and order as the calling
index; False otherwise.
Examples
>>> idx1.equals(idx2)
False
pandas.Index.factorize
Index.factorize(sort=False, na_sentinel=-1)
Encode the object as an enumerated type or categorical variable.
This method is useful for obtaining a numeric representation of an array when all that matters is identifying
distinct values. factorize is available as both a top-level function pandas.factorize(), and as a
method Series.factorize() and Index.factorize().
Parameters
sort [bool, default False] Sort uniques and shuffle codes to maintain the relationship.
na_sentinel [int or None, default -1] Value to mark “not found”. If None, will not drop
the NaN from the uniques of the values.
Changed in version 1.1.2.
Returns
codes [ndarray] An integer ndarray that's an indexer into uniques. uniques.take(codes) will have the same values as values.
uniques [ndarray, Index, or Categorical] The unique valid values. When values is Cat-
egorical, uniques is a Categorical. When values is some other pandas object, an
Index is returned. Otherwise, a 1-D ndarray is returned.
Note: Even if there’s a missing value in values, uniques will not contain an entry
for it.
See also:
Examples
These examples all show factorize as a top-level method like pd.factorize(values). The results
are identical for methods like Series.factorize().
With sort=True, the uniques will be sorted, and codes will be shuffled so that the relationship is maintained.
Missing values are indicated in codes with na_sentinel (-1 by default). Note that missing values are never
included in uniques.
Thus far, we’ve only factorized lists (which are internally coerced to NumPy arrays). When factorizing
pandas objects, the type of uniques will differ. For Categoricals, a Categorical is returned.
If NaN is in the values, and we want to include NaN in the uniques of the values, it can be achieved by
setting na_sentinel=None.
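A brief sketch of the basic call (the values are assumed for illustration):
>>> codes, uniques = pd.factorize(['b', 'a', 'b', None])
>>> codes
array([ 0,  1,  0, -1])
>>> uniques
array(['b', 'a'], dtype=object)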
pandas.Index.fillna
Index.fillna(value=None, downcast=None)
Fill NA/NaN values with the specified value.
Parameters
value [scalar] Scalar value to use to fill holes (e.g. 0). This value cannot be a list-like.
downcast [dict, default is None] A dict of item->dtype of what to downcast if possible,
or the string ‘infer’ which will try to downcast to an appropriate equal type (e.g.
float64 to int64 if possible).
Returns
Index
See also:
pandas.Index.format
pandas.Index.get_indexer
Examples
Notice that the return value is an array of locations in index and x is marked by -1, as it is not in index.
pandas.Index.get_indexer_for
Index.get_indexer_for(target, **kwargs)
Guaranteed return of an indexer even when non-unique.
This dispatches to get_indexer or get_indexer_non_unique as appropriate.
Returns
numpy.ndarray List of indices.
pandas.Index.get_indexer_non_unique
Index.get_indexer_non_unique(target)
Compute indexer and mask for new index given the current index. The indexer should be then used as an
input to ndarray.take to align the current data to the new index.
Parameters
target [Index]
Returns
indexer [ndarray of int] Integers from 0 to n - 1 indicating that the index at these
positions matches the corresponding target values. Missing values in the target are
marked by -1.
missing [ndarray of int] An indexer into the target of the values not found. These
correspond to the -1 in the indexer array.
pandas.Index.get_level_values
Index.get_level_values(level)
Return an Index of values for requested level.
This is primarily useful to get an individual level of values from a MultiIndex, but is provided on Index as
well for compatibility.
Parameters
level [int or str] It is either the integer position or the name of the level.
Returns
Index Calling object, as there is only one level in the Index.
See also:
Notes
Examples
>>> idx.get_level_values(0)
Index(['a', 'b', 'c'], dtype='object')
pandas.Index.get_loc
tolerance [int or float, optional] Maximum distance from index value for inexact
matches. The value of the index at the matching location must satisfy the equa-
tion abs(index[loc] - key) <= tolerance.
Returns
loc [int if unique index, slice if monotonic index, else mask]
Examples
pandas.Index.get_slice_bound
pandas.Index.get_value
Index.get_value(series, key)
Fast lookup of value from 1-dimensional ndarray.
Only use this if you know what you’re doing.
Returns
scalar or Series
pandas.Index.groupby
Index.groupby(values)
Group the index labels by a given array of values.
Parameters
values [array] Values used to determine the groups.
Returns
dict {group name -> group labels}
pandas.Index.holds_integer
Index.holds_integer()
Whether the type is an integer type.
pandas.Index.identical
Index.identical(other)
Similar to equals, but checks that object attributes and types are also equal.
Returns
bool If two Index objects have equal elements and same type True, otherwise False.
pandas.Index.insert
Index.insert(loc, item)
Make new Index inserting new item at location.
Follows Python list.append semantics for negative values.
Parameters
loc [int]
item [object]
Returns
new_index [Index]
pandas.Index.intersection
Index.intersection(other, sort=False)
Form the intersection of two Index objects.
This returns a new Index with elements common to the index and other.
Parameters
other [Index or array-like]
sort [False or None, default False] Whether to sort the resulting index.
• False : do not sort the result.
• None : sort the result, except when self and other are equal or when the values
cannot be compared.
New in version 0.24.0.
Changed in version 0.24.1: Changed the default from True to False, to match
the behaviour of 0.23.4 and earlier.
Returns
intersection [Index]
Examples
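A minimal sketch (the example indexes are assumed):
>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([3, 4, 5, 6])
>>> idx1.intersection(idx2)
Int64Index([3, 4], dtype='int64')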
pandas.Index.is_
Index.is_(other)
More flexible, faster check like is but that works through views.
Note: this is not the same as Index.identical(), which checks that metadata is also the same.
Parameters
other [object] Other object to compare against.
Returns
bool True if both have same underlying data, False otherwise.
See also:
pandas.Index.is_boolean
Index.is_boolean()
Check if the Index only consists of booleans.
Returns
bool Whether or not the Index only consists of booleans.
See also:
is_mixed Check if the Index holds data with mixed data types.
Examples
pandas.Index.is_categorical
Index.is_categorical()
Check if the Index holds categorical data.
Returns
bool True if the Index is categorical.
See also:
Examples
pandas.Index.is_floating
Index.is_floating()
Check if the Index is a floating type.
The Index may consist of only floats, NaNs, or a mix of floats, integers, or NaNs.
Returns
bool Whether or not the Index only consists of floats, NaNs, or a mix of floats, integers, or NaNs.
See also:
Examples
pandas.Index.is_integer
Index.is_integer()
Check if the Index only consists of integers.
Returns
bool Whether or not the Index only consists of integers.
See also:
Examples
pandas.Index.is_interval
Index.is_interval()
Check if the Index holds Interval objects.
Returns
bool Whether or not the Index holds Interval objects.
See also:
Examples
pandas.Index.is_mixed
Index.is_mixed()
Check if the Index holds data with mixed data types.
Returns
bool Whether or not the Index holds data with mixed data types.
See also:
Examples
pandas.Index.is_numeric
Index.is_numeric()
Check if the Index only consists of numeric data.
Returns
bool Whether or not the Index only consists of numeric data.
See also:
Examples
pandas.Index.is_object
Index.is_object()
Check if the Index is of the object dtype.
Returns
bool Whether or not the Index is of the object dtype.
See also:
Examples
pandas.Index.is_type_compatible
Index.is_type_compatible(kind)
Whether the index type is compatible with the provided type.
pandas.Index.isin
Index.isin(values, level=None)
Return a boolean array where the index values are in values.
Compute boolean array of whether each index value is found in the passed set of values. The length of
the returned boolean array matches the length of the index.
Parameters
values [set or list-like] Sought values.
level [str or int, optional] Name or position of the index level to use (if the index is a
MultiIndex).
Returns
is_contained [ndarray] NumPy array of boolean values.
See also:
Notes
In the case of MultiIndex you must either specify values as a list-like object containing tuples that are the
same length as the number of levels, or specify level. Otherwise it will raise a ValueError.
If level is specified:
• if it is the name of one and only one index level, use that level;
• otherwise it should be a number indicating level position.
Examples
Check whether the strings in the ‘color’ level of the MultiIndex are in a list of colors.
>>> dti.isin(['2000-03-11'])
array([ True, False, False])
pandas.Index.isna
Index.isna()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None, numpy.NaN or pd.NaT, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns
numpy.ndarray A boolean array of whether my values are NA.
See also:
Examples
pandas.Index.isnull
Index.isnull()
Detect missing values.
Return a boolean same-sized object indicating if the values are NA. NA values, such as None, numpy.NaN or pd.NaT, get mapped to True values. Everything else gets mapped to False values. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set pandas.options.mode.use_inf_as_na = True).
Returns
numpy.ndarray A boolean array of whether my values are NA.
See also:
Examples
pandas.Index.item
Index.item()
Return the first element of the underlying data as a Python scalar.
Returns
scalar The first element of the Index.
Raises
ValueError If the data is not length-1.
pandas.Index.join
pandas.Index.map
Index.map(mapper, na_action=None)
Map values using input correspondence (a dict, Series, or function).
Parameters
mapper [function, dict, or Series] Mapping correspondence.
na_action [{None, ‘ignore’}] If ‘ignore’, propagate NA values, without passing them
to the mapping correspondence.
Returns
applied [Union[Index, MultiIndex], inferred] The output of the mapping function ap-
plied to the index. If the function returns a tuple with more than one element a
MultiIndex will be returned.
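An illustrative sketch (the index and mapping are assumed):
>>> idx = pd.Index([1, 2, 3])
>>> idx.map(lambda x: x * 10)
Int64Index([10, 20, 30], dtype='int64')
>>> idx.map({1: 'a', 2: 'b', 3: 'c'})
Index(['a', 'b', 'c'], dtype='object')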
pandas.Index.max
Examples
pandas.Index.memory_usage
Index.memory_usage(deep=False)
Memory usage of the values.
Parameters
deep [bool, default False] Introspect the data deeply, interrogate object dtypes for
system-level memory consumption.
Returns
bytes used
See also:
Notes
Memory usage does not include memory consumed by elements that are not components of the array if
deep=False or if used on PyPy
pandas.Index.min
Examples
pandas.Index.notna
Index.notna()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to
True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set
pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN,
get mapped to False values.
Returns
numpy.ndarray Boolean array to indicate which entries are not NA.
See also:
Examples
Show which entries in an Index are not NA. The result is an array.
pandas.Index.notnull
Index.notnull()
Detect existing (non-missing) values.
Return a boolean same-sized object indicating if the values are not NA. Non-missing values get mapped to
True. Characters such as empty strings '' or numpy.inf are not considered NA values (unless you set
pandas.options.mode.use_inf_as_na = True). NA values, such as None or numpy.NaN,
get mapped to False values.
Returns
numpy.ndarray Boolean array to indicate which entries are not NA.
See also:
Examples
Show which entries in an Index are not NA. The result is an array.
pandas.Index.nunique
Index.nunique(dropna=True)
Return number of unique elements in the object.
Excludes NA values by default.
Parameters
dropna [bool, default True] Don’t include NaN in the count.
Returns
int
See also:
Examples
>>> s.nunique()
4
pandas.Index.putmask
Index.putmask(mask, value)
Return a new Index of the values set with the mask.
Returns
Index
See also:
pandas.Index.ravel
Index.ravel(order='C')
Return an ndarray of the flattened values of the underlying data.
Returns
numpy.ndarray Flattened array.
See also:
pandas.Index.reindex
pandas.Index.rename
Index.rename(name, inplace=False)
Alter Index or MultiIndex name.
Able to set new names without level. Defaults to returning new index. Length of names must match
number of levels in MultiIndex.
Parameters
name [label or list of labels] Name(s) to set.
inplace [bool, default False] Modifies the object directly, instead of creating a new
Index or MultiIndex.
Returns
Index or None The same type as the caller or None if inplace=True.
See also:
Examples
pandas.Index.repeat
Index.repeat(repeats, axis=None)
Repeat elements of a Index.
Returns a new Index where each element of the current Index is repeated consecutively a given number
of times.
Parameters
repeats [int or array of ints] The number of repetitions for each element. This should
be a non-negative integer. Repeating 0 times will return an empty Index.
axis [None] Must be None. Has no effect but is accepted for compatibility with numpy.
Returns
repeated_index [Index] Newly created Index with repeated elements.
See also:
Examples
pandas.Index.searchsorted
Note: The Index must be monotonically sorted, otherwise wrong locations will likely be returned. Pandas
does not check this for you.
Parameters
value [array_like] Values to insert into self.
side [{‘left’, ‘right’}, optional] If ‘left’, the index of the first suitable location found
is given. If ‘right’, return the last such index. If there is no suitable index, return
either 0 or N (where N is the length of self ).
sorter [1-D array_like, optional] Optional array of integer indices that sort self into
ascending order. They are typically the result of np.argsort.
Returns
int or array of int A scalar or array of insertion points with the same shape as value.
Changed in version 0.24.0: If value is a scalar, an int is now always returned. Pre-
viously, scalar inputs returned an 1-item array for Series and Categorical.
See also:
Notes
Examples
>>> ser.searchsorted(4)
3
>>> ser.searchsorted('3/14/2000')
3
>>> ser.searchsorted('bread')
1
If the values are not monotonically sorted, wrong locations may be returned:
>>> ser = pd.Series([2, 1, 3])
>>> ser
0    2
1    1
2    3
dtype: int64
>>> ser.searchsorted(1)
0  # wrong result, correct would be 1
pandas.Index.set_names
Examples
pandas.Index.set_value
Notes
pandas.Index.shift
Index.shift(periods=1, freq=None)
Shift index by desired number of time frequency increments.
This method is for shifting the values of datetime-like indexes by a specified time increment a given
number of times.
Parameters
periods [int, default 1] Number of periods (or increments) to shift by, can be positive
or negative.
freq [pandas.DateOffset, pandas.Timedelta or str, optional] Frequency increment to
shift by. If None, the index is shifted by its own freq attribute. Offset aliases are
valid strings, e.g., ‘D’, ‘W’, ‘M’ etc.
Returns
pandas.Index Shifted index.
See also:
Notes
This method is only implemented for datetime-like index classes, i.e., DatetimeIndex, PeriodIndex and
TimedeltaIndex.
Examples
The default value of freq is the freq attribute of the index, which is ‘MS’ (month start) in this example.
>>> month_starts.shift(10)
DatetimeIndex(['2011-11-01', '2011-12-01', '2012-01-01', '2012-02-01',
'2012-03-01'],
dtype='datetime64[ns]', freq='MS')
pandas.Index.slice_indexer
Notes
This function assumes that the data is sorted, so use at your own peril
Examples
This is a method on all index types. For example you can do:
pandas.Index.slice_locs
Notes
Examples
pandas.Index.sort
Index.sort(*args, **kwargs)
Use sort_values instead.
pandas.Index.sort_values
Examples
>>> idx.sort_values()
Int64Index([1, 10, 100, 1000], dtype='int64')
Sort values in descending order, and also get the indices idx was sorted by.
pandas.Index.sortlevel
pandas.Index.str
Index.str()
Vectorized string functions for Series and Index.
NAs stay NA unless handled otherwise by a particular method. Patterned after Python’s string methods,
with some inspiration from R’s stringr package.
Examples
>>> s = pd.Series(["A_Str_Series"])
>>> s
0 A_Str_Series
dtype: object
>>> s.str.split("_")
0 [A, Str, Series]
dtype: object
pandas.Index.symmetric_difference
• None : Attempt to sort the result, but catch any TypeErrors from comparing
incomparable elements.
• False : Do not sort the result.
New in version 0.24.0.
Changed in version 0.24.1: Changed the default value from True to None (with-
out change in behaviour).
Returns
symmetric_difference [Index]
Notes
symmetric_difference contains elements that appear in either idx1 or idx2 but not both. Equiv-
alent to the Index created by idx1.difference(idx2) | idx2.difference(idx1) with du-
plicates dropped.
Examples
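A brief sketch along the lines of the note above (the indexes are assumed):
>>> idx1 = pd.Index([1, 2, 3, 4])
>>> idx2 = pd.Index([2, 3, 4, 5])
>>> idx1.symmetric_difference(idx2)
Int64Index([1, 5], dtype='int64')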
pandas.Index.take
numpy.ndarray.take Return an array formed from the elements of a at the given indices.
pandas.Index.to_flat_index
Index.to_flat_index()
Identity method.
New in version 0.24.0.
This is implemented for compatibility with subclass implementations when chaining.
Returns
pd.Index Caller.
See also:
pandas.Index.to_frame
Index.to_frame(index=True, name=None)
Create a DataFrame with a column containing the Index.
New in version 0.24.0.
Parameters
index [bool, default True] Set the index of the returned DataFrame as the original In-
dex.
name [object, default None] The passed name should substitute for the index name (if
it has one).
Returns
DataFrame DataFrame containing the original Index data.
See also:
Examples
>>> idx.to_frame(index=False)
animal
0 Ant
1 Bear
2 Cow
pandas.Index.to_list
Index.to_list()
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Times-
tamp/Timedelta/Interval/Period)
Returns
list
See also:
numpy.ndarray.tolist Return the array as an a.ndim-levels deep nested list of Python scalars.
pandas.Index.to_native_types
Index.to_native_types(slicer=None, **kwargs)
Format specified values of self and return them.
Deprecated since version 1.2.0.
Parameters
slicer [int, array-like] An indexer into self that specifies which values are used in the
formatting process.
kwargs [dict] Options for specifying how the values should be formatted. These op-
tions include the following:
1) na_rep [str] The value that serves as a placeholder for NULL values
2) quoting [bool or None] Whether or not there are quoted values in self
3) date_format [str] The format used to represent date-like values.
Returns
numpy.ndarray Formatted values.
pandas.Index.to_numpy
Notes
The returned array will be the same up to equality (values equal in self will be equal in the returned array;
likewise for values that are not equal). When self contains an ExtensionArray, the dtype may be different.
For example, for a category-dtype Series, to_numpy() will return a NumPy array and the categorical
dtype will be lost.
For NumPy dtypes, this will be a reference to the actual data stored in this Series or Index (assuming
copy=False). Modifying the result in place will modify the data stored in the Series or Index (not that
we recommend doing that).
For extension types, to_numpy() may require copying data and coercing the result to a NumPy type
(possibly object), which may be expensive. When you need a no-copy reference to the underlying data,
Series.array should be used instead.
This table lays out the different dtypes and default return types of to_numpy() for various dtypes within
pandas.
Examples
Specify the dtype to control how datetime-aware data is represented. Use dtype=object to return an
ndarray of pandas Timestamp objects, each with the correct tz.
>>> ser.to_numpy(dtype="datetime64[ns]")
...
array(['1999-12-31T23:00:00.000000000', '2000-01-01T23:00:00...'],
dtype='datetime64[ns]')
pandas.Index.to_series
Index.to_series(index=None, name=None)
Create a Series with both index and values equal to the index keys.
Useful with map for returning an indexer based on an index.
Parameters
index [Index, optional] Index of resulting Series. If None, defaults to original index.
name [str, optional] Name of resulting Series. If None, defaults to name of original
index.
Returns
Series The dtype will be based on the type of the Index values.
See also:
Examples
>>> idx.to_series()
animal
Ant Ant
Bear Bear
Cow Cow
Name: animal, dtype: object
>>> idx.to_series(name='zoo')
animal
Ant Ant
Bear Bear
Cow Cow
Name: zoo, dtype: object
pandas.Index.tolist
Index.tolist()
Return a list of the values.
These are each a scalar type, which is a Python scalar (for str, int, float) or a pandas scalar (for Times-
tamp/Timedelta/Interval/Period)
Returns
list
See also:
numpy.ndarray.tolist Return the array as an a.ndim-levels deep nested list of Python scalars.
pandas.Index.transpose
Index.transpose(*args, **kwargs)
Return the transpose, which is by definition self.
Returns
Index
pandas.Index.union
Index.union(other, sort=None)
Form the union of two Index objects.
If the Index objects are incompatible, both Index objects will be cast to dtype(‘object’) first.
Changed in version 0.25.0.
Parameters
other [Index or array-like]
sort [bool or None, default None] Whether to sort the resulting Index.
• None : Sort the result, except when
1. self and other are equal.
2. self or other has length 0.
3. Some values in self or other cannot be compared. A RuntimeWarning is
issued in this case.
• False : do not sort the result.
New in version 0.24.0.
Changed in version 0.24.1: Changed the default value from True to None (with-
out change in behaviour).
Returns
union [Index]
Examples
pandas.Index.unique
Index.unique(level=None)
Return unique values in the index.
Unique values are returned in order of appearance; this does NOT sort.
Parameters
level [int or str, optional, default None] Only return values from specified level (for
MultiIndex).
Returns
Index without duplicates
See also:
pandas.Index.value_counts
Examples
With normalize set to True, returns the relative frequency by dividing all values by the sum of values.
bins
Bins can be useful for going from a continuous variable to a categorical variable; instead of counting unique occurrences of values, divide the index into the specified number of half-open bins.
>>> s.value_counts(bins=3)
(0.996, 2.0] 2
(2.0, 3.0] 2
(3.0, 4.0] 1
dtype: int64
dropna
With dropna set to False we can also see NaN index values.
>>> s.value_counts(dropna=False)
3.0 2
2.0 1
NaN 1
4.0 1
1.0 1
dtype: int64
pandas.Index.where
Index.where(cond, other=None)
Replace values where the condition is False.
The replacement is taken from other.
Parameters
cond [bool array-like with the same length as self] Condition to select the values on.
other [scalar, or array-like, default None] Replacement if the condition is False.
Returns
pandas.Index A copy of self with values replaced from other where the condition is
False.
See also:
Examples
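A brief sketch (the example index and condition are assumed):
>>> idx = pd.Index(['car', 'bike', 'train', 'tractor'])
>>> idx.where(idx.isin(['car', 'train']), 'other')
Index(['car', 'other', 'train', 'other'], dtype='object')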
view
Properties
pandas.Index.names
property Index.names
pandas.Index.empty
property Index.empty
Missing values
Conversion
pandas.Index.view
Index.view(cls=None)
Sorting
Index.argsort(*args, **kwargs) Return the integer indices that would sort the index.
Index.searchsorted(value[, side, sorter]) Find indices where elements should be inserted to main-
tain order.
Index.sort_values([return_indexer, . . . ]) Return a sorted copy of the index.
Time-specific operations
Selecting
Index.asof(label) Return the label from the index, or, if not present, the
previous one.
Index.asof_locs(where, mask) Return the locations (indices) of labels in the index.
Index.get_indexer(target[, method, limit, . . . ]) Compute indexer and mask for new index given the cur-
rent index.
Index.get_indexer_for(target, **kwargs) Guaranteed return of an indexer even when non-unique.
Index.get_indexer_non_unique(target) Compute indexer and mask for new index given the cur-
rent index.
Index.get_level_values(level) Return an Index of values for requested level.
Index.get_loc(key[, method, tolerance]) Get integer location, slice or boolean mask for requested
label.
Index.get_slice_bound(label, side, kind) Calculate slice bound that corresponds to given label.
Index.get_value(series, key) Fast lookup of value from 1-dimensional ndarray.
Index.isin(values[, level]) Return a boolean array where the index values are in
values.
Index.slice_indexer([start, end, step, kind]) Compute the slice indexer for input labels and step.
Index.slice_locs([start, end, step, kind]) Compute slice locations for input labels.
RangeIndex([start, stop, step, dtype, copy, . . . ]) Immutable Index implementing a monotonic integer
range.
Int64Index([data, dtype, copy, name]) Immutable sequence used for indexing and alignment.
UInt64Index([data, dtype, copy, name]) Immutable sequence used for indexing and alignment.
Float64Index([data, dtype, copy, name]) Immutable sequence used for indexing and alignment.
pandas.RangeIndex
Attributes
pandas.RangeIndex.start
RangeIndex.start
The value of the start parameter (0 if this was not supplied).
pandas.RangeIndex.stop
RangeIndex.stop
The value of the stop parameter.
pandas.RangeIndex.step
RangeIndex.step
The value of the step parameter (1 if this was not supplied).
Methods
pandas.RangeIndex.from_range
pandas.Int64Index
Notes
Attributes
None
Methods
None
pandas.UInt64Index
Notes
Attributes
None
Methods
None
pandas.Float64Index
Notes
Attributes
None
Methods
None
RangeIndex.start The value of the start parameter (0 if this was not sup-
plied).
RangeIndex.stop The value of the stop parameter.
RangeIndex.step The value of the step parameter (1 if this was not sup-
plied).
RangeIndex.from_range(data[, name, dtype]) Create RangeIndex from a range object.
3.6.3 CategoricalIndex
pandas.CategoricalIndex
Notes
Examples
>>> ci = pd.CategoricalIndex(
... ["a", "b", "c", "a", "b", "c"], ordered=True, categories=["c", "b", "a"]
... )
>>> ci
CategoricalIndex(['a', 'b', 'c', 'a', 'b', 'c'],
categories=['c', 'b', 'a'], ordered=True, dtype='category')
>>> ci.min()
'c'
Attributes
pandas.CategoricalIndex.codes
property CategoricalIndex.codes
The category codes of this categorical.
Codes are an array of integers which are the positions of the actual values in the categories array.
There is no setter, use the other categorical methods and the normal item setter to change values in the
categorical.
Returns
ndarray[int] A non-writable view of the codes array.
pandas.CategoricalIndex.categories
property CategoricalIndex.categories
The categories of this categorical.
Setting assigns new values to each category (effectively a rename of each individual category).
The assigned value has to be a list-like object. All items must be unique and the number of items in the
new categories must be the same as the number of items in the old categories.
Assigning to categories is an in-place operation!
Raises
ValueError If the new categories do not validate as categories or if the number of new categories does not equal the number of old categories.
See also:
pandas.CategoricalIndex.ordered
property CategoricalIndex.ordered
Whether the categories have an ordered relationship.
Methods
pandas.CategoricalIndex.rename_categories
CategoricalIndex.rename_categories(*args, **kwargs)
Rename categories.
Parameters
new_categories [list-like, dict-like or callable] New categories which will replace old
categories.
• list-like: all items must be unique and the number of items in the new cate-
gories must match the existing number of categories.
• dict-like: specifies a mapping from old categories to new. Categories not con-
tained in the mapping are passed through and extra categories in the mapping
are ignored.
• callable : a callable that is called on all items in the old categories and whose
return values comprise the new categories.
inplace [bool, default False] Whether or not to rename the categories inplace or return
a copy of this categorical with renamed categories.
Returns
cat [Categorical or None] Categorical with renamed categories or None if inplace=True.
Raises
ValueError If new categories are list-like and do not have the same number of items as the current categories or do not validate as categories.
See also:
Examples
For dict-like new_categories, extra keys are ignored and categories not in the dictionary are passed
through
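A sketch of that behaviour (the values are assumed for illustration):
>>> ci = pd.CategoricalIndex(['a', 'a', 'b'])
>>> ci.rename_categories({'a': 'A', 'c': 'C'})
CategoricalIndex(['A', 'A', 'b'], categories=['A', 'b'], ordered=False, dtype='category')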
pandas.CategoricalIndex.reorder_categories
CategoricalIndex.reorder_categories(*args, **kwargs)
Reorder categories as specified in new_categories.
new_categories need to include all old categories and no new category items.
Parameters
new_categories [Index-like] The categories in new order.
ordered [bool, optional] Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.
inplace [bool, default False] Whether or not to reorder the categories inplace or return
a copy of this categorical with reordered categories.
Returns
cat [Categorical or None] Categorical with reordered categories or None if inplace=True.
Raises
ValueError If the new categories do not contain all old category items or any new ones
See also:
pandas.CategoricalIndex.add_categories
CategoricalIndex.add_categories(*args, **kwargs)
Add new categories.
new_categories will be included at the last/highest place in the categories and will be unused directly after
this call.
Parameters
new_categories [category or list-like of category] The new categories to be included.
inplace [bool, default False] Whether or not to add the categories inplace or return a
copy of this categorical with added categories.
Returns
cat [Categorical or None] Categorical with new categories added or None if
inplace=True.
Raises
ValueError If the new categories include old categories or do not validate as categories
See also:
pandas.CategoricalIndex.remove_categories
CategoricalIndex.remove_categories(*args, **kwargs)
Remove the specified categories.
removals must be included in the old categories. Values which were in the removed categories will be set
to NaN
Parameters
removals [category or list of categories] The categories which should be removed.
inplace [bool, default False] Whether or not to remove the categories inplace or return
a copy of this categorical with removed categories.
Returns
cat [Categorical or None] Categorical with removed categories or None if
inplace=True.
Raises
pandas.CategoricalIndex.remove_unused_categories
CategoricalIndex.remove_unused_categories(*args, **kwargs)
Remove categories which are not used.
Parameters
inplace [bool, default False] Whether or not to drop unused categories inplace or return
a copy of this categorical with unused categories dropped.
Deprecated since version 1.2.0.
Returns
cat [Categorical or None] Categorical with unused categories dropped or None if
inplace=True.
See also:
pandas.CategoricalIndex.set_categories
CategoricalIndex.set_categories(*args, **kwargs)
Set the categories to the specified new_categories.
new_categories can include new categories (which will result in unused categories) or remove old categories (which results in values set to NaN). If rename==True, the categories will simply be renamed (fewer or more items than in the old categories will result in values set to NaN or in unused categories respectively).
This method can be used to perform more than one action of adding, removing, and reordering simultaneously and is therefore faster than performing the individual steps via the more specialised methods.
On the other hand this method does not do checks (e.g., whether the old categories are included in the new categories on a reorder), which can result in surprising changes, for example when using special string dtypes, which do not consider an S1 string equal to a single-character Python string.
Parameters
new_categories [Index-like] The categories in new order.
ordered [bool, default False] Whether or not the categorical is treated as an ordered categorical. If not given, do not change the ordered information.
rename [bool, default False] Whether or not the new_categories should be considered
as a rename of the old categories or as reordered categories.
inplace [bool, default False] Whether or not to reorder the categories in-place or return
a copy of this categorical with reordered categories.
Returns
Categorical with reordered categories or None if inplace.
Raises
ValueError If new_categories does not validate as categories
See also:
pandas.CategoricalIndex.as_ordered
CategoricalIndex.as_ordered(*args, **kwargs)
Set the Categorical to be ordered.
Parameters
inplace [bool, default False] Whether or not to set the ordered attribute in-place or
return a copy of this categorical with ordered set to True.
Returns
Categorical or None Ordered Categorical or None if inplace=True.
pandas.CategoricalIndex.as_unordered
CategoricalIndex.as_unordered(*args, **kwargs)
Set the Categorical to be unordered.
Parameters
inplace [bool, default False] Whether or not to set the ordered attribute in-place or
return a copy of this categorical with ordered set to False.
Returns
Categorical or None Unordered Categorical or None if inplace=True.
pandas.CategoricalIndex.map
CategoricalIndex.map(mapper)
Map values using input correspondence (a dict, Series, or function).
Maps the values (their categories, not the codes) of the index to new categories. If the mapping corre-
spondence is one-to-one the result is a CategoricalIndex which has the same order property as the
original, otherwise an Index is returned.
If a dict or Series is used any unmapped category is mapped to NaN. Note that if this happens an
Index will be returned.
Parameters
mapper [function, dict, or Series] Mapping correspondence.
Returns
pandas.CategoricalIndex or pandas.Index Mapped index.
See also:
Examples
If a dict is used, all unmapped categories are mapped to NaN and the result is an Index:
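A sketch of that case (the values are assumed):
>>> idx = pd.CategoricalIndex(['a', 'b', 'c'])
>>> idx.map({'a': 'first', 'b': 'second'})
Index(['first', 'second', nan], dtype='object')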
Categorical components
pandas.CategoricalIndex.equals
CategoricalIndex.equals(other)
Determine if two CategoricalIndex objects contain the same elements.
Returns
bool If two CategoricalIndex objects have equal elements True, otherwise False.
3.6.4 IntervalIndex
IntervalIndex(data[, closed, dtype, copy, . . . ]) Immutable index of intervals that are closed on the same
side.
pandas.IntervalIndex
Notes
Examples
Attributes
pandas.IntervalIndex.closed
IntervalIndex.closed
Whether the intervals are closed on the left-side, right-side, both or neither.
pandas.IntervalIndex.length
property IntervalIndex.length
Return the length of the Interval.
pandas.IntervalIndex.is_empty
IntervalIndex.is_empty
Indicates if an interval is empty, meaning it contains no points.
New in version 0.25.0.
Returns
bool or ndarray A boolean indicating if a scalar Interval is empty, or a boolean
ndarray positionally indicating if an Interval in an IntervalArray or
IntervalIndex is empty.
Examples
pandas.IntervalIndex.is_non_overlapping_monotonic
IntervalIndex.is_non_overlapping_monotonic
Return True if the IntervalArray is non-overlapping (no Intervals share points) and is either monotonic
increasing or monotonic decreasing, else False.
pandas.IntervalIndex.is_overlapping
property IntervalIndex.is_overlapping
Return True if the IntervalIndex has overlapping intervals, else False.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have
an open endpoint in common do not overlap.
New in version 0.24.0.
Returns
bool Boolean indicating if the IntervalIndex has overlapping intervals.
See also:
Examples
pandas.IntervalIndex.values
IntervalIndex.values
Return the IntervalIndex’s data as an IntervalArray.
left
right
mid
Methods
from_arrays(left, right[, closed, name, . . . ]) Construct from two arrays defining the left and right
bounds.
from_tuples(data[, closed, name, copy, dtype]) Construct an IntervalIndex from an array-like of tu-
ples.
from_breaks(breaks[, closed, name, copy, dtype]) Construct an IntervalIndex from an array of splits.
contains(*args, **kwargs) Check elementwise if the Intervals contain the value.
overlaps(*args, **kwargs) Check elementwise if an Interval overlaps the values
in the IntervalArray.
pandas.IntervalIndex.from_arrays
Notes
Each element of left must be less than or equal to the right element at the same position. If an element is
missing, it must be missing in both left and right. A TypeError is raised when using an unsupported type
for left or right. At the moment, ‘category’, ‘object’, and ‘string’ subtypes are not supported.
Examples
pandas.IntervalIndex.from_tuples
Examples
pandas.IntervalIndex.from_breaks
Examples
pandas.IntervalIndex.contains
IntervalIndex.contains(*args, **kwargs)
Check elementwise if the Intervals contain the value.
Return a boolean mask whether the value is contained in the Intervals of the IntervalArray.
New in version 0.25.0.
Parameters
other [scalar] The value to check whether it is contained in the Intervals.
Returns
boolean array
See also:
Examples
>>> intervals.contains(0.5)
array([ True, False, False])
pandas.IntervalIndex.overlaps
IntervalIndex.overlaps(*args, **kwargs)
Check elementwise if an Interval overlaps the values in the IntervalArray.
Two intervals overlap if they share a common point, including closed endpoints. Intervals that only have
an open endpoint in common do not overlap.
New in version 0.24.0.
Parameters
other [IntervalArray] Interval to check against for an overlap.
Returns
ndarray Boolean array positionally indicating where an overlap occurs.
See also:
Examples
pandas.IntervalIndex.set_closed
IntervalIndex.set_closed(*args, **kwargs)
Return an IntervalArray identical to the current one, but closed on the specified side.
New in version 0.24.0.
Parameters
closed [{‘left’, ‘right’, ‘both’, ‘neither’}] Whether the intervals are closed on the left-
side, right-side, both or neither.
Returns
new_index [IntervalArray]
Examples
pandas.IntervalIndex.to_tuples
IntervalIndex.to_tuples(*args, **kwargs)
Return an ndarray of tuples of the form (left, right).
Parameters
na_tuple [bool, default True] Returns NA as a tuple if True, (nan, nan), or just as
the NA value itself if False, nan.
Returns
tuples: ndarray
IntervalIndex components
IntervalIndex.from_arrays(left, right[, . . . ]) Construct from two arrays defining the left and right
bounds.
IntervalIndex.from_tuples(data[, closed, . . . ]) Construct an IntervalIndex from an array-like of tuples.
IntervalIndex.from_breaks(breaks[, closed, . . . ]) Construct an IntervalIndex from an array of splits.
IntervalIndex.left
IntervalIndex.right
IntervalIndex.mid
IntervalIndex.closed Whether the intervals are closed on the left-side, right-
side, both or neither.
IntervalIndex.length Return the length of the Interval.
IntervalIndex.values Return the IntervalIndex’s data as an IntervalArray.
IntervalIndex.is_empty Indicates if an interval is empty, meaning it contains no
points.
IntervalIndex.is_non_overlapping_monotonic Return True if the IntervalArray is non-overlapping (no Intervals share points) and is either monotonic increasing or monotonic decreasing, else False.
IntervalIndex.is_overlapping Return True if the IntervalIndex has overlapping inter-
vals, else False.
pandas.IntervalIndex.left
IntervalIndex.left
pandas.IntervalIndex.right
IntervalIndex.right
pandas.IntervalIndex.mid
IntervalIndex.mid
pandas.IntervalIndex.get_loc
Examples
>>> i1, i2 = pd.Interval(0, 1), pd.Interval(1, 2)
>>> index = pd.IntervalIndex([i1, i2])
>>> index.get_loc(1.5)
1
If a label is in several intervals, you get the locations of all the relevant intervals.
>>> i3 = pd.Interval(0, 2)
>>> overlapping_index = pd.IntervalIndex([i1, i2, i3])
>>> overlapping_index.get_loc(0.5)
array([ True, False, True])
pandas.IntervalIndex.get_indexer
Examples
Notice that the return value is an array of locations in index and x is marked by -1, as it is not in index.
3.6.5 MultiIndex
pandas.MultiIndex
Notes
Examples
A new MultiIndex is typically constructed using one of the helper methods MultiIndex.from_arrays(), MultiIndex.from_product() and MultiIndex.from_tuples(). For example (using .from_arrays):
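(The original example is not reproduced in this extract; a minimal sketch, with names chosen for illustration:)
>>> arrays = [[1, 1, 2, 2], ['red', 'blue', 'red', 'blue']]
>>> pd.MultiIndex.from_arrays(arrays, names=('number', 'color'))
MultiIndex([(1,  'red'),
            (1, 'blue'),
            (2,  'red'),
            (2, 'blue')],
           names=['number', 'color'])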
See further examples for how to construct a MultiIndex in the doc strings of the mentioned helper methods.
Attributes
pandas.MultiIndex.names
property MultiIndex.names
Names of levels in MultiIndex.
Examples
>>> mi = pd.MultiIndex.from_arrays(
... [[1, 2], [3, 4], [5, 6]], names=['x', 'y', 'z'])
>>> mi
MultiIndex([(1, 3, 5),
(2, 4, 6)],
names=['x', 'y', 'z'])
>>> mi.names
FrozenList(['x', 'y', 'z'])
pandas.MultiIndex.nlevels
property MultiIndex.nlevels
Integer number of levels in this MultiIndex.
Examples
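A minimal illustration:
>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi.nlevels
3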
pandas.MultiIndex.levshape
property MultiIndex.levshape
A tuple with the length of each level.
Examples
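A minimal illustration:
>>> mi = pd.MultiIndex.from_arrays([['a'], ['b'], ['c']])
>>> mi.levshape
(1, 1, 1)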
levels
codes
Methods
pandas.MultiIndex.from_arrays
Examples
pandas.MultiIndex.from_tuples
Examples
pandas.MultiIndex.from_product
Examples
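A minimal sketch (names chosen for illustration):
>>> numbers = [0, 1, 2]
>>> colors = ['green', 'purple']
>>> pd.MultiIndex.from_product([numbers, colors],
...                            names=['number', 'color'])
MultiIndex([(0,  'green'),
            (0, 'purple'),
            (1,  'green'),
            (1, 'purple'),
            (2,  'green'),
            (2, 'purple')],
           names=['number', 'color'])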
pandas.MultiIndex.from_frame
Examples
>>> df = pd.DataFrame([['HI', 'Temp'], ['HI', 'Precip'],
...                    ['NJ', 'Temp'], ['NJ', 'Precip']],
...                   columns=['a', 'b'])
>>> pd.MultiIndex.from_frame(df)
MultiIndex([('HI', 'Temp'),
            ('HI', 'Precip'),
            ('NJ', 'Temp'),
            ('NJ', 'Precip')],
           names=['a', 'b'])
pandas.MultiIndex.set_levels
Examples
If any of the levels passed to set_levels() exceeds the existing length, all of the values from that
argument will be stored in the MultiIndex levels, though the values will be truncated in the MultiIndex
output.
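A minimal sketch of typical usage (values chosen for illustration; not the docstring's own example):
>>> idx = pd.MultiIndex.from_tuples([(1, 'a'), (2, 'b')], names=['x', 'y'])
>>> idx.set_levels([[10, 20], ['aa', 'bb']])
MultiIndex([(10, 'aa'),
            (20, 'bb')],
           names=['x', 'y'])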
pandas.MultiIndex.set_codes
Examples
(The construction of idx is truncated in this extract; a reconstruction consistent with the repr shown:)
>>> idx = pd.MultiIndex.from_tuples(
...     [(1, 'one'), (1, 'two'), (2, 'one'), (2, 'two')],
...     names=['foo', 'bar'])
>>> idx
MultiIndex([(1, 'one'),
            (1, 'two'),
            (2, 'one'),
            (2, 'two')],
           names=['foo', 'bar'])
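A plausible continuation showing the reordering set_codes produces (codes chosen for illustration):
>>> idx.set_codes([[1, 0, 1, 0], [0, 0, 1, 1]])
MultiIndex([(2, 'one'),
            (1, 'one'),
            (2, 'two'),
            (1, 'two')],
           names=['foo', 'bar'])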
pandas.MultiIndex.to_frame
MultiIndex.to_frame(index=True, name=None)
Create a DataFrame with the levels of the MultiIndex as columns.
Column ordering is determined by the DataFrame constructor with data as a dict.
New in version 0.24.0.
Parameters
index [bool, default True] Set the index of the returned DataFrame as the original MultiIndex.
name [list / sequence of str, optional] The passed names should substitute index level
names.
Returns
DataFrame [a DataFrame containing the original MultiIndex data.]
See also:
Examples
>>> mi = pd.MultiIndex.from_arrays([['a', 'b'], ['c', 'd']])
>>> df = mi.to_frame()
>>> df
     0  1
a c  a  c
b d  b  d
>>> df = mi.to_frame(index=False)
>>> df
   0  1
0  a  c
1  b  d
pandas.MultiIndex.to_flat_index
MultiIndex.to_flat_index()
Convert a MultiIndex to an Index of Tuples containing the level values.
New in version 0.24.0.
Returns
pd.Index Index with the MultiIndex data represented in Tuples.
Notes
This method will simply return the caller if called by anything other than a MultiIndex.
Examples
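A minimal sketch:
>>> index = pd.MultiIndex.from_product(
...     [['foo', 'bar'], ['baz', 'qux']],
...     names=['a', 'b'])
>>> index.to_flat_index()
Index([('foo', 'baz'), ('foo', 'qux'),
       ('bar', 'baz'), ('bar', 'qux')],
      dtype='object')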
pandas.MultiIndex.is_lexsorted
MultiIndex.is_lexsorted()
Return True if the codes are lexicographically sorted.
Returns
bool
Examples
In the below examples, the first level of the MultiIndex is sorted because a<b<c, so there is no need to look at the next level.
>>> pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['d', 'e', 'f']]).is_lexsorted()
True
>>> pd.MultiIndex.from_arrays([['a', 'b', 'c'], ['d', 'f', 'e']]).is_lexsorted()
True
In case there is a tie, the lexicographical sorting looks at the next level of the MultiIndex.
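A plausible continuation illustrating the tie case (values chosen for illustration):
>>> pd.MultiIndex.from_arrays([['a', 'a', 'b'], ['d', 'e', 'f']]).is_lexsorted()
True
>>> pd.MultiIndex.from_arrays([['a', 'a', 'b'], ['f', 'e', 'd']]).is_lexsorted()
False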
pandas.MultiIndex.sortlevel
ascending [bool, default True] False to sort in descending order. Can also be a list to
specify a directed ordering.
sort_remaining [bool, default True] Sort by the remaining levels after level.
Returns
sorted_index [pd.MultiIndex] Resulting index.
indexer [np.ndarray] Indices of output values in original index.
Examples
>>> mi = pd.MultiIndex.from_arrays([[0, 0], [2, 1]])
>>> mi.sortlevel()
(MultiIndex([(0, 1),
(0, 2)],
), array([1, 0]))
>>> mi.sortlevel(sort_remaining=False)
(MultiIndex([(0, 2),
(0, 1)],
), array([0, 1]))
>>> mi.sortlevel(1)
(MultiIndex([(0, 1),
(0, 2)],
), array([1, 0]))
pandas.MultiIndex.droplevel
MultiIndex.droplevel(level=0)
Return index with requested level(s) removed.
If resulting index has only 1 level left, the result will be of Index type, not MultiIndex.
Parameters
level [int, str, or list-like, default 0] If a string is given, it must be the name of a level. If list-like, elements must be names or indexes of levels.
Returns
Index or MultiIndex
Examples
>>> mi = pd.MultiIndex.from_arrays(
... [[1, 2], [3, 4], [5, 6]], names=['x', 'y', 'z'])
>>> mi
MultiIndex([(1, 3, 5),
(2, 4, 6)],
names=['x', 'y', 'z'])
>>> mi.droplevel()
MultiIndex([(3, 5),
(4, 6)],
names=['y', 'z'])
>>> mi.droplevel(2)
MultiIndex([(1, 3),
(2, 4)],
names=['x', 'y'])
>>> mi.droplevel('z')
MultiIndex([(1, 3),
(2, 4)],
names=['x', 'y'])
pandas.MultiIndex.swaplevel
MultiIndex.swaplevel(i=-2, j=-1)
Swap level i with level j.
Calling this method does not change the ordering of the values.
Parameters
i [int, str, default -2] First level of index to be swapped. Can pass level name as string.
Type of parameters can be mixed.
j [int, str, default -1] Second level of index to be swapped. Can pass level name as
string. Type of parameters can be mixed.
Returns
MultiIndex A new MultiIndex.
See also:
Examples
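A minimal sketch:
>>> mi = pd.MultiIndex(levels=[['a', 'b'], ['bb', 'aa']],
...                    codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
>>> mi.swaplevel(0, 1)
MultiIndex([('bb', 'a'),
            ('aa', 'a'),
            ('bb', 'b'),
            ('aa', 'b')],
           )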
pandas.MultiIndex.reorder_levels
MultiIndex.reorder_levels(order)
Rearrange levels using input order. May not drop or duplicate levels.
Parameters
order [list of int or list of str] List representing new level order. Reference level by
number (position) or by key (label).
Returns
MultiIndex
Examples
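A minimal sketch:
>>> mi = pd.MultiIndex.from_arrays([[1, 2], [3, 4]], names=['x', 'y'])
>>> mi.reorder_levels(order=[1, 0])
MultiIndex([(3, 1),
            (4, 2)],
           names=['y', 'x'])
>>> mi.reorder_levels(order=['y', 'x'])
MultiIndex([(3, 1),
            (4, 2)],
           names=['y', 'x'])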
pandas.MultiIndex.remove_unused_levels
MultiIndex.remove_unused_levels()
Create new MultiIndex from current that removes unused levels.
Unused level(s) means levels that are not expressed in the labels. The resulting MultiIndex will have the
same outward appearance, meaning the same .values and ordering. It will also be .equals() to the original.
Returns
MultiIndex
Examples
>>> mi = pd.MultiIndex.from_product([range(2), list('ab')])
>>> mi[2:]
MultiIndex([(1, 'a'),
            (1, 'b')],
           )
The 0 from the first level is not represented and can be removed, as shown below.
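A plausible continuation showing the effect:
>>> mi2 = mi[2:].remove_unused_levels()
>>> mi2.levels
FrozenList([[1], ['a', 'b']])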
pandas.MultiIndex.get_locs
MultiIndex.get_locs(seq)
Get location for a sequence of labels.
Parameters
seq [label, slice, list, mask or a sequence of such] You should use one of the above for
each level. If a level should not be used, set it to slice(None).
Returns
numpy.ndarray NumPy array of integers suitable for passing to iloc.
See also:
Examples
>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')])  # setup reconstructed
>>> mi.get_locs('b')
array([1, 2], dtype=int64)
pandas.IndexSlice
Notes
Examples
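A minimal sketch of the slicing IndexSlice enables (data constructed for illustration):
>>> midx = pd.MultiIndex.from_product([['a', 'b'], [1, 2, 3]])
>>> s = pd.Series(range(6), index=midx)
>>> idx = pd.IndexSlice
>>> s.loc[idx[:, 2:3]]
a  2    1
   3    2
b  2    4
   3    5
dtype: int64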
MultiIndex constructors
MultiIndex properties
pandas.MultiIndex.levels
MultiIndex.levels
pandas.MultiIndex.codes
property MultiIndex.codes
MultiIndex components
MultiIndex selecting
pandas.MultiIndex.get_loc
MultiIndex.get_loc(key, method=None)
Get location for a label or a tuple of labels.
The location is returned as an integer/slice or boolean mask.
Parameters
key [label or tuple of labels (one for each level)]
method [None]
Returns
loc [int, slice object or boolean mask] If the key is past the lexsort depth, the return may be
a boolean mask array, otherwise it is always a slice or int.
See also:
Index.get_loc The get_loc method for (single-level) index.
MultiIndex.slice_locs Get slice location given start label(s) and end label(s).
MultiIndex.get_locs Get location for a label/slice/list/mask or a sequence of such.
Notes
The key cannot be a slice, list of same-level labels, a boolean mask, or a sequence of such. If you want to use
those, use MultiIndex.get_locs() instead.
Examples
>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')])
>>> mi.get_loc('b')
slice(1, 3, None)
pandas.MultiIndex.get_loc_level
Examples
>>> mi = pd.MultiIndex.from_arrays([list('abb'), list('def')],
...                                names=['A', 'B'])
>>> mi.get_loc_level('b')
(slice(1, 3, None), Index(['e', 'f'], dtype='object', name='B'))
pandas.MultiIndex.get_indexer
limit [int, optional] Maximum number of consecutive labels in target to match for inexact
matches.
tolerance [optional] Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must satisfy the equation
abs(index[indexer] - target) <= tolerance.
Tolerance may be a scalar value, which applies the same tolerance to all values, or list-
like, which applies variable tolerance per element. List-like includes list, tuple, array,
Series, and must be the same size as the index and its dtype must exactly match the
index’s type.
Returns
indexer [ndarray of int] Integers from 0 to n - 1 indicating that the index at these positions
matches the corresponding target values. Missing values in the target are marked by -1.
Examples
Notice that the return value is an array of locations in index and x is marked by -1, as it is not in index.
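A minimal sketch of the -1 marking described above, using a plain Index for illustration (the docstring is shared across index types):
>>> index = pd.Index(['c', 'a', 'b'])
>>> index.get_indexer(['a', 'b', 'x'])
array([ 1,  2, -1])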
pandas.MultiIndex.get_level_values
MultiIndex.get_level_values(level)
Return vector of label values for requested level.
Length of returned vector is equal to the length of the index.
Parameters
level [int or str] level is either the integer position of the level in the MultiIndex, or the
name of the level.
Returns
values [Index] Values is a level of this MultiIndex converted to a single Index (or subclass
thereof).
Examples
Create a MultiIndex:
>>> mi = pd.MultiIndex.from_arrays((list('abc'), list('def')))
>>> mi.names = ['level_1', 'level_2']
Get level values by supplying the level as either an integer or a name:
>>> mi.get_level_values(0)
Index(['a', 'b', 'c'], dtype='object', name='level_1')
>>> mi.get_level_values('level_2')
Index(['d', 'e', 'f'], dtype='object', name='level_2')
3.6.6 DatetimeIndex
pandas.DatetimeIndex
Notes
To learn more about the frequency strings, please see this link.
Attributes
pandas.DatetimeIndex.year
property DatetimeIndex.year
The year of the datetime.
Examples
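A minimal sketch (the same pattern applies to the other datetime components that follow):
>>> datetime_series = pd.Series(
...     pd.date_range("2000-01-01", periods=3, freq="Y"))
>>> datetime_series.dt.year
0    2000
1    2001
2    2002
dtype: int64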
pandas.DatetimeIndex.month
property DatetimeIndex.month
The month as January=1, December=12.
Examples
pandas.DatetimeIndex.day
property DatetimeIndex.day
The day of the datetime.
Examples
pandas.DatetimeIndex.hour
property DatetimeIndex.hour
The hours of the datetime.
Examples
pandas.DatetimeIndex.minute
property DatetimeIndex.minute
The minutes of the datetime.
Examples
pandas.DatetimeIndex.second
property DatetimeIndex.second
The seconds of the datetime.
Examples
pandas.DatetimeIndex.microsecond
property DatetimeIndex.microsecond
The microseconds of the datetime.
Examples
pandas.DatetimeIndex.nanosecond
property DatetimeIndex.nanosecond
The nanoseconds of the datetime.
Examples
pandas.DatetimeIndex.date
property DatetimeIndex.date
Returns numpy array of python datetime.date objects (namely, the date part of Timestamps without timezone information).
pandas.DatetimeIndex.time
property DatetimeIndex.time
Returns numpy array of datetime.time. The time part of the Timestamps.
pandas.DatetimeIndex.timetz
property DatetimeIndex.timetz
Returns numpy array of datetime.time also containing timezone information. The time part of the Times-
tamps.
pandas.DatetimeIndex.dayofyear
property DatetimeIndex.dayofyear
The ordinal day of the year.
pandas.DatetimeIndex.day_of_year
property DatetimeIndex.day_of_year
The ordinal day of the year.
pandas.DatetimeIndex.weekofyear
property DatetimeIndex.weekofyear
The week ordinal of the year.
Deprecated since version 1.1.0.
weekofyear and week have been deprecated. Please use DatetimeIndex.isocalendar().week instead.
pandas.DatetimeIndex.week
property DatetimeIndex.week
The week ordinal of the year.
Deprecated since version 1.1.0.
weekofyear and week have been deprecated. Please use DatetimeIndex.isocalendar().week instead.
pandas.DatetimeIndex.dayofweek
property DatetimeIndex.dayofweek
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available on both Series with datetime values (using the dt accessor) and DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
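A minimal sketch:
>>> s = pd.date_range('2016-12-31', '2017-01-08', freq='D').to_series()
>>> s.dt.dayofweek
2016-12-31    5
2017-01-01    6
2017-01-02    0
2017-01-03    1
2017-01-04    2
2017-01-05    3
2017-01-06    4
2017-01-07    5
2017-01-08    6
Freq: D, dtype: int64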
pandas.DatetimeIndex.day_of_week
property DatetimeIndex.day_of_week
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available on both Series with datetime values (using the dt accessor) and DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
pandas.DatetimeIndex.weekday
property DatetimeIndex.weekday
The day of the week with Monday=0, Sunday=6.
Return the day of the week. It is assumed the week starts on Monday, which is denoted by 0, and ends on Sunday, which is denoted by 6. This method is available on both Series with datetime values (using the dt accessor) and DatetimeIndex.
Returns
Series or Index Containing integers indicating the day number.
See also:
Series.dt.dayofweek Alias.
Series.dt.weekday Alias.
Series.dt.day_name Returns the name of the day of the week.
Examples
pandas.DatetimeIndex.quarter
property DatetimeIndex.quarter
The quarter of the date.
pandas.DatetimeIndex.tz
property DatetimeIndex.tz
Return timezone, if any.
Returns
datetime.tzinfo, pytz.tzinfo.BaseTZInfo, dateutil.tz.tz.tzfile, or None Returns None
when the array is tz-naive.
pandas.DatetimeIndex.freq
property DatetimeIndex.freq
Return the frequency object if it is set, otherwise None.
pandas.DatetimeIndex.freqstr
property DatetimeIndex.freqstr
Return the frequency object as a string if it is set, otherwise None.
pandas.DatetimeIndex.is_month_start
property DatetimeIndex.is_month_start
Indicates whether the date is the first day of the month.
Returns
Series or array For Series, returns a Series with boolean values. For DatetimeIndex,
returns a boolean array.
See also:
is_month_start Return a boolean indicating whether the date is the first day of the month.
is_month_end Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
>>> s
0 2018-02-27
1 2018-02-28
2 2018-03-01
dtype: datetime64[ns]
>>> s.dt.is_month_start  # continuation reconstructed for the dates above
0    False
1    False
2     True
dtype: bool
pandas.DatetimeIndex.is_month_end
property DatetimeIndex.is_month_end
Indicates whether the date is the last day of the month.
Returns
Series or array For Series, returns a Series with boolean values. For DatetimeIndex,
returns a boolean array.
See also:
is_month_start Return a boolean indicating whether the date is the first day of the month.
is_month_end Return a boolean indicating whether the date is the last day of the month.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
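A minimal sketch (dates chosen for illustration):
>>> s = pd.Series(pd.date_range("2018-02-27", periods=3))
>>> s.dt.is_month_end
0    False
1     True
2    False
dtype: bool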
pandas.DatetimeIndex.is_quarter_start
property DatetimeIndex.is_quarter_start
Indicator for whether the date is the first day of a quarter.
Returns
is_quarter_start [Series or DatetimeIndex] The same type as the original data with
boolean values. Series will have the same name and index. DatetimeIndex will
have the same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> idx = pd.date_range('2017-03-30', periods=4)  # setup reconstructed
>>> idx.is_quarter_start
array([False, False,  True, False])
pandas.DatetimeIndex.is_quarter_end
property DatetimeIndex.is_quarter_end
Indicator for whether the date is the last day of a quarter.
Returns
is_quarter_end [Series or DatetimeIndex] The same type as the original data with
boolean values. Series will have the same name and index. DatetimeIndex will
have the same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> idx = pd.date_range('2017-03-30', periods=4)  # setup reconstructed
>>> idx.is_quarter_end
array([False,  True, False, False])
pandas.DatetimeIndex.is_year_start
property DatetimeIndex.is_year_start
Indicate whether the date is the first day of a year.
Returns
Series or DatetimeIndex The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))  # setup reconstructed
>>> dates.dt.is_year_start
0    False
1    False
2     True
dtype: bool
>>> idx = pd.date_range("2017-12-30", periods=3)
>>> idx.is_year_start
array([False, False,  True])
pandas.DatetimeIndex.is_year_end
property DatetimeIndex.is_year_end
Indicate whether the date is the last day of the year.
Returns
Series or DatetimeIndex The same type as the original data with boolean values. Series will have the same name and index. DatetimeIndex will have the same name.
See also:
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
>>> dates = pd.Series(pd.date_range("2017-12-30", periods=3))  # setup reconstructed
>>> dates.dt.is_year_end
0    False
1     True
2    False
dtype: bool
>>> idx = pd.date_range("2017-12-30", periods=3)
>>> idx.is_year_end
array([False,  True, False])
pandas.DatetimeIndex.is_leap_year
property DatetimeIndex.is_leap_year
Boolean indicator if the date belongs to a leap year.
A leap year is a year with 366 days (instead of 365), including February 29 as an intercalary day. Leap years are years that are multiples of four, with the exception of years divisible by 100 but not by 400.
Returns
Series or ndarray Booleans indicating if dates belong to a leap year.
Examples
This method is available on Series with datetime values under the .dt accessor, and directly on DatetimeIndex.
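A minimal sketch:
>>> idx = pd.date_range("2012-01-01", "2015-01-01", freq="Y")
>>> idx.is_leap_year
array([ True, False, False])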
pandas.DatetimeIndex.inferred_freq
DatetimeIndex.inferred_freq: Optional[str]
Tries to return a string representing a frequency guess, generated by infer_freq. Returns None if it can’t
autodetect the frequency.
Methods
pandas.DatetimeIndex.normalize
DatetimeIndex.normalize(*args, **kwargs)
Convert times to midnight.
The time component of the date-time is converted to midnight, i.e. 00:00:00. This is useful in cases when the time does not matter. Length is unaltered. The timezones are unaffected.
This method is available on Series with datetime values under the .dt accessor, and directly on Datetime
Array/Index.
Returns
DatetimeArray, DatetimeIndex or Series The same type as the original data. Series
will have the same name and index. DatetimeIndex will have the same name.
See also:
Examples
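A minimal sketch (timezone chosen for illustration; repr wrapping may differ):
>>> idx = pd.date_range(start='2014-08-01 10:00', freq='H',
...                     periods=3, tz='Asia/Calcutta')
>>> idx.normalize()
DatetimeIndex(['2014-08-01 00:00:00+05:30', '2014-08-01 00:00:00+05:30',
               '2014-08-01 00:00:00+05:30'],
              dtype='datetime64[ns, Asia/Calcutta]', freq=None)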
pandas.DatetimeIndex.strftime
DatetimeIndex.strftime(*args, **kwargs)
Convert to Index using specified date_format.
Return an Index of formatted strings specified by date_format, which supports the same string format as
the python standard library. Details of the string format can be found in python string format doc.
Parameters
date_format [str] Date format string (e.g. “%Y-%m-%d”).
Returns
ndarray NumPy ndarray of formatted strings.
See also:
Examples
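A minimal sketch:
>>> rng = pd.date_range(pd.Timestamp("2018-03-10 09:00"),
...                     periods=3, freq='s')
>>> rng.strftime('%B %d, %Y, %r')
Index(['March 10, 2018, 09:00:00 AM', 'March 10, 2018, 09:00:01 AM',
       'March 10, 2018, 09:00:02 AM'],
      dtype='object')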
pandas.DatetimeIndex.snap
DatetimeIndex.snap(freq='S')
Snap time stamps to nearest occurring frequency.
Returns
DatetimeIndex
pandas.DatetimeIndex.tz_convert
DatetimeIndex.tz_convert(tz)
Convert tz-aware Datetime Array/Index from one time zone to another.
Parameters
tz [str, pytz.timezone, dateutil.tz.tzfile or None] Time zone for time. Corresponding
timestamps would be converted to this time zone of the Datetime Array/Index. A
tz of None will convert to UTC and remove the timezone information.
Returns
Array or Index
Raises
TypeError If Datetime Array/Index is tz-naive.
See also:
Examples
With the tz parameter, we can change the DatetimeIndex to other time zones:
>>> dti = pd.date_range(start='2014-08-01 09:00', freq='H',
...                     periods=3, tz='Europe/Berlin')
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
               '2014-08-01 10:00:00+02:00',
               '2014-08-01 11:00:00+02:00'],
              dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert('US/Central')
DatetimeIndex(['2014-08-01 02:00:00-05:00',
               '2014-08-01 03:00:00-05:00',
               '2014-08-01 04:00:00-05:00'],
              dtype='datetime64[ns, US/Central]', freq='H')
With tz=None, we can remove the timezone (after converting to UTC if necessary):
>>> dti
DatetimeIndex(['2014-08-01 09:00:00+02:00',
               '2014-08-01 10:00:00+02:00',
               '2014-08-01 11:00:00+02:00'],
              dtype='datetime64[ns, Europe/Berlin]', freq='H')
>>> dti.tz_convert(None)
DatetimeIndex(['2014-08-01 07:00:00',
               '2014-08-01 08:00:00',
               '2014-08-01 09:00:00'],
              dtype='datetime64[ns]', freq='H')
pandas.DatetimeIndex.tz_localize
• ‘shift_backward’ will shift the nonexistent time backward to the closest exist-
ing time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
Same type as self Array/Index converted to the specified time zone.
Raises
TypeError If the Datetime Array/Index is tz-aware and tz is not None.
See also:
Examples
With tz=None, we can remove the time zone information while keeping the local time (not converted to UTC):
>>> tz_aware = pd.DatetimeIndex(['2018-03-01 09:00', '2018-03-02 09:00',
...                              '2018-03-03 09:00'], tz='US/Eastern')  # setup reconstructed
>>> tz_aware.tz_localize(None)
DatetimeIndex(['2018-03-01 09:00:00', '2018-03-02 09:00:00',
               '2018-03-03 09:00:00'],
              dtype='datetime64[ns]', freq=None)
Be careful with DST changes. When there is sequential data, pandas can infer the DST time. In some cases, inferring the DST is impossible; in such cases, you can pass an ndarray to the ambiguous parameter to set the DST explicitly. If the DST transition causes nonexistent times, you can shift these dates forward or backward with a timedelta object, or with ‘shift_forward’ or ‘shift_backward’, as shown below.
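A minimal sketch of the nonexistent handling (timezone and dates chosen for illustration):
>>> s = pd.to_datetime(pd.Series(['2015-03-29 02:30:00',
...                               '2015-03-29 03:30:00']))
>>> s.dt.tz_localize('Europe/Warsaw', nonexistent='shift_forward')
0   2015-03-29 03:00:00+02:00
1   2015-03-29 03:30:00+02:00
dtype: datetime64[ns, Europe/Warsaw]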
pandas.DatetimeIndex.round
DatetimeIndex.round(*args, **kwargs)
Perform round operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to round the index to. Must be a fixed frequency
like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible
freq values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIn-
dex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
Examples
DatetimeIndex
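(The rng used below is not reproduced in this extract; a plausible construction, consistent with the outputs that follow:)
>>> rng = pd.date_range('1/1/2018 11:59:00', periods=3, freq='min')
>>> rng.round("H")
DatetimeIndex(['2018-01-01 12:00:00', '2018-01-01 12:00:00',
               '2018-01-01 12:00:00'],
              dtype='datetime64[ns]', freq=None)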
Series
>>> pd.Series(rng).dt.round("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.DatetimeIndex.floor
DatetimeIndex.floor(*args, **kwargs)
Perform floor operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to floor the index to. Must be a fixed frequency
like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible
freq values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIn-
dex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST
time (note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’] A
nonexistent time does not exist in a particular timezone where clocks moved for-
ward due to DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing
time
• ‘shift_backward’ will shift the nonexistent time backward to the closest exist-
ing time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a DatetimeIn-
dex or TimedeltaIndex, or a Series with the same index for a Series.
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.floor("H")
0 2018-01-01 11:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.DatetimeIndex.ceil
DatetimeIndex.ceil(*args, **kwargs)
Perform ceil operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to ceil the index to. Must be a fixed frequency
like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible
freq values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIn-
dex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST
time (note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’] A
nonexistent time does not exist in a particular timezone where clocks moved for-
ward due to DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing
time
• ‘shift_backward’ will shift the nonexistent time backward to the closest exist-
ing time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a DatetimeIn-
dex or TimedeltaIndex, or a Series with the same index for a Series.
Raises
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.ceil("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 13:00:00
dtype: datetime64[ns]
pandas.DatetimeIndex.to_period
DatetimeIndex.to_period(*args, **kwargs)
Cast to PeriodArray/Index at a particular frequency.
Converts DatetimeArray/Index to PeriodArray/Index.
Parameters
freq [str or Offset, optional] One of pandas’ offset strings or an Offset object. Will be
inferred by default.
Returns
PeriodArray/Index
Raises
ValueError When converting a DatetimeArray/Index with non-regular values, so that
a frequency cannot be inferred.
See also:
Examples
pandas.DatetimeIndex.to_perioddelta
DatetimeIndex.to_perioddelta(freq)
Calculate TimedeltaArray of difference between index values and index converted to PeriodArray at specified freq. Used for vectorized offsets.
Parameters
freq [Period frequency]
Returns
TimedeltaArray/Index
pandas.DatetimeIndex.to_pydatetime
DatetimeIndex.to_pydatetime(*args, **kwargs)
Return Datetime Array/Index as object ndarray of datetime.datetime objects.
Returns
datetimes [ndarray]
pandas.DatetimeIndex.to_series
If keep_tz is False:
Series will have a datetime64[ns] dtype. TZ aware objects will have the tz
removed.
Changed in version 1.0.0: The default value is now True. In a future version, this
keyword will be removed entirely. Stop passing the argument to obtain the future
behavior and silence the warning.
index [Index, optional] Index of resulting Series. If None, defaults to original index.
name [str, optional] Name of resulting Series. If None, defaults to name of original
index.
Returns
Series
pandas.DatetimeIndex.to_frame
DatetimeIndex.to_frame(index=True, name=None)
Create a DataFrame with a column containing the Index.
New in version 0.24.0.
Parameters
index [bool, default True] Set the index of the returned DataFrame as the original Index.
name [object, default None] The passed name should substitute for the index name (if
it has one).
Returns
DataFrame DataFrame containing the original Index data.
See also:
Examples
>>> idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal')  # setup reconstructed
>>> idx.to_frame(index=False)
  animal
0    Ant
1   Bear
2    Cow
pandas.DatetimeIndex.month_name
DatetimeIndex.month_name(*args, **kwargs)
Return the month names of the DateTimeIndex with specified locale.
Parameters
locale [str, optional] Locale determining the language in which to return the month
name. Default is English locale.
Returns
Index Index of month names.
Examples
pandas.DatetimeIndex.day_name
DatetimeIndex.day_name(*args, **kwargs)
Return the day names of the DateTimeIndex with specified locale.
Parameters
locale [str, optional] Locale determining the language in which to return the day name.
Default is English locale.
Returns
Index Index of day names.
Examples
pandas.DatetimeIndex.mean
DatetimeIndex.mean(*args, **kwargs)
Return the mean value of the Array.
New in version 0.25.0.
Parameters
skipna [bool, default True] Whether to ignore any NaT elements.
axis [int, optional, default 0]
Returns
scalar Timestamp or Timedelta.
See also:
Notes
mean is only defined for Datetime and Timedelta dtypes, not for Period.
std
Time/date components
Selecting
pandas.DatetimeIndex.indexer_at_time
DatetimeIndex.indexer_at_time(time, asof=False)
Return index locations of values at particular time of day (e.g. 9:30AM).
Parameters
time [datetime.time or str] Time passed in either as object (datetime.time) or as string in ap-
propriate format (“%H:%M”, “%H%M”, “%I:%M%p”, “%I%M%p”, “%H:%M:%S”,
“%H%M%S”, “%I:%M:%S%p”, “%I%M%S%p”).
Returns
values_at_time [array of integers]
See also:
indexer_between_time Get index locations of values between particular times of day.
DataFrame.at_time Select values at particular time of day.
pandas.DatetimeIndex.indexer_between_time
Time-specific operations
Conversion
Methods
3.6.7 TimedeltaIndex
TimedeltaIndex([data, unit, freq, closed, ...]) Immutable ndarray of timedelta64 data, represented internally as int64, and which can be boxed to timedelta objects.
pandas.TimedeltaIndex
Notes
To learn more about the frequency strings, please see this link.
Attributes
pandas.TimedeltaIndex.days
property TimedeltaIndex.days
Number of days for each element.
pandas.TimedeltaIndex.seconds
property TimedeltaIndex.seconds
Number of seconds (>= 0 and less than 1 day) for each element.
pandas.TimedeltaIndex.microseconds
property TimedeltaIndex.microseconds
Number of microseconds (>= 0 and less than 1 second) for each element.
pandas.TimedeltaIndex.nanoseconds
property TimedeltaIndex.nanoseconds
Number of nanoseconds (>= 0 and less than 1 microsecond) for each element.
pandas.TimedeltaIndex.components
property TimedeltaIndex.components
Return a dataframe of the components (days, hours, minutes, seconds, milliseconds, microseconds,
nanoseconds) of the Timedeltas.
Returns
a DataFrame
pandas.TimedeltaIndex.inferred_freq
TimedeltaIndex.inferred_freq
Tries to return a string representing a frequency guess, generated by infer_freq. Returns None if it can’t
autodetect the frequency.
Methods
pandas.TimedeltaIndex.to_pytimedelta
TimedeltaIndex.to_pytimedelta(*args, **kwargs)
Return Timedelta Array/Index as object ndarray of datetime.timedelta objects.
Returns
datetimes [ndarray]
pandas.TimedeltaIndex.to_series
TimedeltaIndex.to_series(index=None, name=None)
Create a Series with both index and values equal to the index keys.
Useful with map for returning an indexer based on an index.
Parameters
index [Index, optional] Index of resulting Series. If None, defaults to original index.
name [str, optional] Name of resulting Series. If None, defaults to name of original
index.
Returns
Series The dtype will be based on the type of the Index values.
See also:
Examples
pandas.TimedeltaIndex.round
TimedeltaIndex.round(*args, **kwargs)
Perform round operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to round the index to. Must be a fixed frequency
like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible
freq values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIn-
dex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.round("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.TimedeltaIndex.floor
TimedeltaIndex.floor(*args, **kwargs)
Perform floor operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to floor the index to. Must be a fixed frequency
like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible
freq values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIn-
dex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST
time (note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’] A
nonexistent time does not exist in a particular timezone where clocks moved for-
ward due to DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing
time
• ‘shift_backward’ will shift the nonexistent time backward to the closest exist-
ing time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a DatetimeIn-
dex or TimedeltaIndex, or a Series with the same index for a Series.
Raises
ValueError if the freq cannot be converted.
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.floor("H")
0 2018-01-01 11:00:00
1 2018-01-01 12:00:00
2 2018-01-01 12:00:00
dtype: datetime64[ns]
pandas.TimedeltaIndex.ceil
TimedeltaIndex.ceil(*args, **kwargs)
Perform ceil operation on the data to the specified freq.
Parameters
freq [str or Offset] The frequency level to ceil the index to. Must be a fixed frequency
like ‘S’ (second) not ‘ME’ (month end). See frequency aliases for a list of possible
freq values.
ambiguous [‘infer’, bool-ndarray, ‘NaT’, default ‘raise’] Only relevant for DatetimeIn-
dex:
• ‘infer’ will attempt to infer fall dst-transition hours based on order
• bool-ndarray where True signifies a DST time, False designates a non-DST
time (note that this flag is only applicable for ambiguous times)
• ‘NaT’ will return NaT where there are ambiguous times
• ‘raise’ will raise an AmbiguousTimeError if there are ambiguous times.
New in version 0.24.0.
nonexistent [‘shift_forward’, ‘shift_backward’, ‘NaT’, timedelta, default ‘raise’] A
nonexistent time does not exist in a particular timezone where clocks moved for-
ward due to DST.
• ‘shift_forward’ will shift the nonexistent time forward to the closest existing
time
• ‘shift_backward’ will shift the nonexistent time backward to the closest exist-
ing time
• ‘NaT’ will return NaT where there are nonexistent times
• timedelta objects will shift nonexistent times by the timedelta
• ‘raise’ will raise an NonExistentTimeError if there are nonexistent times.
New in version 0.24.0.
Returns
DatetimeIndex, TimedeltaIndex, or Series Index of the same type for a DatetimeIn-
dex or TimedeltaIndex, or a Series with the same index for a Series.
Raises
Examples
DatetimeIndex
Series
>>> pd.Series(rng).dt.ceil("H")
0 2018-01-01 12:00:00
1 2018-01-01 12:00:00
2 2018-01-01 13:00:00
dtype: datetime64[ns]
pandas.TimedeltaIndex.to_frame
TimedeltaIndex.to_frame(index=True, name=None)
Create a DataFrame with a column containing the Index.
New in version 0.24.0.
Parameters
index [bool, default True] Set the index of the returned DataFrame as the original Index.
name [object, default None] The passed name should substitute for the index name (if
it has one).
Returns
DataFrame DataFrame containing the original Index data.
See also:
Examples
>>> idx = pd.Index(['Ant', 'Bear', 'Cow'], name='animal')  # setup reconstructed
>>> idx.to_frame(index=False)
  animal
0    Ant
1   Bear
2    Cow
pandas.TimedeltaIndex.mean
TimedeltaIndex.mean(*args, **kwargs)
Return the mean value of the Array.
New in version 0.25.0.
Parameters
skipna [bool, default True] Whether to ignore any NaT elements.
axis [int, optional, default 0]
Returns
scalar Timestamp or Timedelta.
See also:
Notes
mean is only defined for Datetime and Timedelta dtypes, not for Period.
Components
Conversion
Methods
3.6.8 PeriodIndex
PeriodIndex([data, ordinal, freq, tz, ...]) Immutable ndarray holding ordinal values indicating regular periods in time.
pandas.PeriodIndex
Examples
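A minimal sketch (repr may vary by version):
>>> idx = pd.PeriodIndex(year=[2000, 2002], quarter=[1, 3])
>>> idx
PeriodIndex(['2000Q1', '2002Q3'], dtype='period[Q-DEC]', freq='Q-DEC')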
Attributes
pandas.PeriodIndex.day
property PeriodIndex.day
The days of the period.
pandas.PeriodIndex.dayofweek
property PeriodIndex.dayofweek
The day of the week with Monday=0, Sunday=6.
pandas.PeriodIndex.day_of_week
property PeriodIndex.day_of_week
The day of the week with Monday=0, Sunday=6.
pandas.PeriodIndex.dayofyear
property PeriodIndex.dayofyear
The ordinal day of the year.
pandas.PeriodIndex.day_of_year
property PeriodIndex.day_of_year
The ordinal day of the year.
pandas.PeriodIndex.days_in_month
property PeriodIndex.days_in_month
The number of days in the month.
pandas.PeriodIndex.daysinmonth
property PeriodIndex.daysinmonth
The number of days in the month.
pandas.PeriodIndex.freq
property PeriodIndex.freq
Return the frequency object if it is set, otherwise None.
pandas.PeriodIndex.freqstr
property PeriodIndex.freqstr
Return the frequency object as a string if it is set, otherwise None.
pandas.PeriodIndex.hour
property PeriodIndex.hour
The hour of the period.
pandas.PeriodIndex.is_leap_year
property PeriodIndex.is_leap_year
Logical indicating if the date belongs to a leap year.
pandas.PeriodIndex.minute
property PeriodIndex.minute
The minute of the period.
pandas.PeriodIndex.month
property PeriodIndex.month
The month as January=1, December=12.
pandas.PeriodIndex.quarter
property PeriodIndex.quarter
The quarter of the date.
pandas.PeriodIndex.second
property PeriodIndex.second
The second of the period.
pandas.PeriodIndex.week
property PeriodIndex.week
The week ordinal of the year.
pandas.PeriodIndex.weekday
property PeriodIndex.weekday
The day of the week with Monday=0, Sunday=6.
pandas.PeriodIndex.weekofyear
property PeriodIndex.weekofyear
The week ordinal of the year.
pandas.PeriodIndex.year
property PeriodIndex.year
The year of the period.
end_time
qyear
start_time
Methods
pandas.PeriodIndex.asfreq
PeriodIndex.asfreq(freq=None, how='E')
Convert the Period Array/Index to the specified frequency freq.
Parameters
freq [str] A frequency.
how [str {‘E’, ‘S’}] Whether the elements should be aligned to the end or start within the period.
• ‘E’, ‘END’, or ‘FINISH’ for end,
• ‘S’, ‘START’, or ‘BEGIN’ for start.
January 31st (‘END’) vs. January 1st (‘START’) for example.
Returns
Period Array/Index Constructed with the new frequency.
Examples
>>> pidx = pd.period_range('2010-01-01', '2015-01-01', freq='A')
>>> pidx.asfreq('M')
PeriodIndex(['2010-12', '2011-12', '2012-12', '2013-12', '2014-12',
             '2015-12'], dtype='period[M]', freq='M')
pandas.PeriodIndex.strftime
PeriodIndex.strftime(*args, **kwargs)
Convert to Index using specified date_format.
Return an Index of formatted strings specified by date_format, which supports the same string format as
the python standard library. Details of the string format can be found in python string format doc.
Parameters
date_format [str] Date format string (e.g. “%Y-%m-%d”).
Returns
ndarray NumPy ndarray of formatted strings.
See also:
Examples
pandas.PeriodIndex.to_timestamp
PeriodIndex.to_timestamp(freq=None, how='start')
Cast to DatetimeArray/Index.
Parameters
freq [str or DateOffset, optional] Target frequency. The default is ‘D’ for week or
longer, ‘S’ otherwise.
how [{‘s’, ‘e’, ‘start’, ‘end’}] Whether to use the start or end of the time period being
converted.
Returns
DatetimeArray/Index
Properties
pandas.PeriodIndex.end_time
property PeriodIndex.end_time
pandas.PeriodIndex.qyear
property PeriodIndex.qyear
pandas.PeriodIndex.start_time
property PeriodIndex.start_time
Methods
3.7.1 DateOffset
pandas.tseries.offsets.DateOffset
class pandas.tseries.offsets.DateOffset
Standard kind of date increment used for a date range.
Works exactly like relativedelta in terms of the keyword args you pass in. Use of the keyword n is discouraged; you would be better off specifying n in the keywords you use, but regardless it is there for you. n is needed for DateOffset subclasses.
DateOffsets work as follows. Each offset specifies a set of dates that conform to the DateOffset. For example, Bday defines this set to be the set of dates that are weekdays (M-F). To test if a date is in the set of a DateOffset dateOffset, we can use the is_on_offset method: dateOffset.is_on_offset(date).
If a date is not on a valid date, the rollback and rollforward methods can be used to roll the date to the nearest
valid date before/after the date.
DateOffsets can be created to move dates forward a given number of valid dates. For example, Bday(2) can be
added to a date to move it two business days forward. If the date does not start on a valid date, first it is moved
to a valid date. Thus pseudo code is:

def __add__(date):
    date = rollback(date)  # does nothing if date is valid
    return date + <n number of periods>
When a date offset is created for a negative number of periods, the date is first rolled forward. The pseudo code
is:

def __add__(date):
    date = rollforward(date)  # does nothing if date is valid
    return date + <n number of periods>
Zero presents a problem. Should it roll forward or back? We arbitrarily have it rollforward:
date + BDay(0) == BDay.rollforward(date)
Since 0 is a bit weird, we suggest avoiding its use.
Parameters
n [int, default 1] The number of time periods the offset represents.
normalize [bool, default False] Whether to round the result of a DateOffset addition down
to the previous midnight.
**kwds Temporal parameters that add to or replace the offset value.
Parameters that add to the offset (like Timedelta):
• years
• months
• weeks
• days
• hours
• minutes
• seconds
• microseconds
• nanoseconds
Parameters that replace the offset value:
• year
• month
• day
• weekday
• hour
• minute
• second
• microsecond
• nanosecond.
See also:
dateutil.relativedelta.relativedelta The relativedelta type is designed to be applied to an existing datetime and can replace specific components of that datetime, or represents an interval of time.
Examples
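A minimal sketch:
>>> from pandas.tseries.offsets import DateOffset
>>> ts = pd.Timestamp('2017-01-01 09:10:11')
>>> ts + DateOffset(months=3)
Timestamp('2017-04-01 09:10:11')
>>> ts + DateOffset(months=2)
Timestamp('2017-03-01 09:10:11')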
Attributes
pandas.tseries.offsets.DateOffset.base
DateOffset.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.DateOffset.__call__
DateOffset.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.DateOffset.rollback
DateOffset.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.DateOffset.rollforward
DateOffset.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
DateOffset.freqstr
DateOffset.kwds
DateOffset.name
DateOffset.nanos
DateOffset.normalize
DateOffset.rule_code
DateOffset.n
pandas.tseries.offsets.DateOffset.freqstr
DateOffset.freqstr
pandas.tseries.offsets.DateOffset.kwds
DateOffset.kwds
pandas.tseries.offsets.DateOffset.name
DateOffset.name
pandas.tseries.offsets.DateOffset.nanos
DateOffset.nanos
pandas.tseries.offsets.DateOffset.normalize
DateOffset.normalize
pandas.tseries.offsets.DateOffset.rule_code
DateOffset.rule_code
pandas.tseries.offsets.DateOffset.n
DateOffset.n
Methods
DateOffset.apply(other)
DateOffset.apply_index(other)
DateOffset.copy
DateOffset.isAnchored
DateOffset.onOffset
DateOffset.is_anchored
DateOffset.is_on_offset
DateOffset.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.DateOffset.apply
DateOffset.apply(other)
pandas.tseries.offsets.DateOffset.apply_index
DateOffset.apply_index(other)
pandas.tseries.offsets.DateOffset.copy
DateOffset.copy()
pandas.tseries.offsets.DateOffset.isAnchored
DateOffset.isAnchored()
pandas.tseries.offsets.DateOffset.onOffset
DateOffset.onOffset()
pandas.tseries.offsets.DateOffset.is_anchored
DateOffset.is_anchored()
pandas.tseries.offsets.DateOffset.is_on_offset
DateOffset.is_on_offset()
3.7.2 BusinessDay
pandas.tseries.offsets.BusinessDay
class pandas.tseries.offsets.BusinessDay
DateOffset subclass representing possibly n business days.
Attributes
pandas.tseries.offsets.BusinessDay.base
BusinessDay.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.BusinessDay.offset
BusinessDay.offset
Alias for self._offset.
calendar
freqstr
holidays
kwds
n
name
nanos
normalize
rule_code
weekmask
Methods
pandas.tseries.offsets.BusinessDay.__call__
BusinessDay.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.BusinessDay.rollback
BusinessDay.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BusinessDay.rollforward
BusinessDay.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Alias:
pandas.tseries.offsets.BDay
pandas.tseries.offsets.BDay
alias of pandas._libs.tslibs.offsets.BusinessDay
Properties
BusinessDay.freqstr
BusinessDay.kwds
BusinessDay.name
BusinessDay.nanos
BusinessDay.normalize
BusinessDay.rule_code
BusinessDay.n
BusinessDay.weekmask
BusinessDay.holidays
BusinessDay.calendar
pandas.tseries.offsets.BusinessDay.freqstr
BusinessDay.freqstr
pandas.tseries.offsets.BusinessDay.kwds
BusinessDay.kwds
pandas.tseries.offsets.BusinessDay.name
BusinessDay.name
pandas.tseries.offsets.BusinessDay.nanos
BusinessDay.nanos
pandas.tseries.offsets.BusinessDay.normalize
BusinessDay.normalize
pandas.tseries.offsets.BusinessDay.rule_code
BusinessDay.rule_code
pandas.tseries.offsets.BusinessDay.n
BusinessDay.n
pandas.tseries.offsets.BusinessDay.weekmask
BusinessDay.weekmask
pandas.tseries.offsets.BusinessDay.holidays
BusinessDay.holidays
pandas.tseries.offsets.BusinessDay.calendar
BusinessDay.calendar
Methods
BusinessDay.apply(other)
BusinessDay.apply_index(other)
BusinessDay.copy
BusinessDay.isAnchored
BusinessDay.onOffset
BusinessDay.is_anchored
BusinessDay.is_on_offset
BusinessDay.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.BusinessDay.apply
BusinessDay.apply(other)
pandas.tseries.offsets.BusinessDay.apply_index
BusinessDay.apply_index(other)
pandas.tseries.offsets.BusinessDay.copy
BusinessDay.copy()
pandas.tseries.offsets.BusinessDay.isAnchored
BusinessDay.isAnchored()
pandas.tseries.offsets.BusinessDay.onOffset
BusinessDay.onOffset()
pandas.tseries.offsets.BusinessDay.is_anchored
BusinessDay.is_anchored()
pandas.tseries.offsets.BusinessDay.is_on_offset
BusinessDay.is_on_offset()
3.7.3 BusinessHour
pandas.tseries.offsets.BusinessHour
class pandas.tseries.offsets.BusinessHour
DateOffset subclass representing possibly n business hours.
Parameters
n [int, default 1] The number of business hours represented.
normalize [bool, default False] Normalize start/end dates to midnight before generating date
range.
weekmask [str, Default ‘Mon Tue Wed Thu Fri’] Weekmask of valid business days, passed
to numpy.busdaycalendar.
start [str, default “09:00”] Start time of your custom business hour in 24h format.
end [str, default: “17:00”] End time of your custom business hour in 24h format.
Attributes
pandas.tseries.offsets.BusinessHour.base
BusinessHour.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.BusinessHour.next_bday
BusinessHour.next_bday
Used for moving to next business day.
pandas.tseries.offsets.BusinessHour.offset
BusinessHour.offset
Alias for self._offset.
calendar
end
freqstr
holidays
kwds
n
name
nanos
normalize
rule_code
start
weekmask
Methods
pandas.tseries.offsets.BusinessHour.__call__
BusinessHour.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.BusinessHour.rollback
BusinessHour.rollback(other)
Roll provided date backward to next offset only if not on offset.
pandas.tseries.offsets.BusinessHour.rollforward
BusinessHour.rollforward(other)
Roll provided date forward to next offset only if not on offset.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
BusinessHour.freqstr
BusinessHour.kwds
BusinessHour.name
BusinessHour.nanos
BusinessHour.normalize
BusinessHour.rule_code
BusinessHour.n
BusinessHour.start
BusinessHour.end
BusinessHour.weekmask
BusinessHour.holidays
BusinessHour.calendar
pandas.tseries.offsets.BusinessHour.freqstr
BusinessHour.freqstr
pandas.tseries.offsets.BusinessHour.kwds
BusinessHour.kwds
pandas.tseries.offsets.BusinessHour.name
BusinessHour.name
pandas.tseries.offsets.BusinessHour.nanos
BusinessHour.nanos
pandas.tseries.offsets.BusinessHour.normalize
BusinessHour.normalize
pandas.tseries.offsets.BusinessHour.rule_code
BusinessHour.rule_code
pandas.tseries.offsets.BusinessHour.n
BusinessHour.n
pandas.tseries.offsets.BusinessHour.start
BusinessHour.start
pandas.tseries.offsets.BusinessHour.end
BusinessHour.end
pandas.tseries.offsets.BusinessHour.weekmask
BusinessHour.weekmask
pandas.tseries.offsets.BusinessHour.holidays
BusinessHour.holidays
pandas.tseries.offsets.BusinessHour.calendar
BusinessHour.calendar
Methods
BusinessHour.apply(other)
BusinessHour.apply_index(other)
BusinessHour.copy
BusinessHour.isAnchored
BusinessHour.onOffset
BusinessHour.is_anchored
BusinessHour.is_on_offset
BusinessHour.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.BusinessHour.apply
BusinessHour.apply(other)
pandas.tseries.offsets.BusinessHour.apply_index
BusinessHour.apply_index(other)
pandas.tseries.offsets.BusinessHour.copy
BusinessHour.copy()
pandas.tseries.offsets.BusinessHour.isAnchored
BusinessHour.isAnchored()
pandas.tseries.offsets.BusinessHour.onOffset
BusinessHour.onOffset()
pandas.tseries.offsets.BusinessHour.is_anchored
BusinessHour.is_anchored()
pandas.tseries.offsets.BusinessHour.is_on_offset
BusinessHour.is_on_offset()
3.7.4 CustomBusinessDay
pandas.tseries.offsets.CustomBusinessDay
class pandas.tseries.offsets.CustomBusinessDay
DateOffset subclass representing custom business days excluding holidays.
Parameters
n [int, default 1]
normalize [bool, default False] Normalize start/end dates to midnight before generating date
range.
weekmask [str, Default ‘Mon Tue Wed Thu Fri’] Weekmask of valid business days, passed
to numpy.busdaycalendar.
holidays [list] List/array of dates to exclude from the set of valid business days, passed to
numpy.busdaycalendar.
calendar [pd.HolidayCalendar or np.busdaycalendar]
offset [timedelta, default timedelta(0)]
Attributes
pandas.tseries.offsets.CustomBusinessDay.base
CustomBusinessDay.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CustomBusinessDay.offset
CustomBusinessDay.offset
Alias for self._offset.
calendar
freqstr
holidays
kwds
n
name
nanos
normalize
rule_code
weekmask
Methods
pandas.tseries.offsets.CustomBusinessDay.__call__
CustomBusinessDay.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.CustomBusinessDay.rollback
CustomBusinessDay.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CustomBusinessDay.rollforward
CustomBusinessDay.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Alias:
pandas.tseries.offsets.CDay
pandas.tseries.offsets.CDay
alias of pandas._libs.tslibs.offsets.CustomBusinessDay
Properties
CustomBusinessDay.freqstr
CustomBusinessDay.kwds
CustomBusinessDay.name
CustomBusinessDay.nanos
CustomBusinessDay.normalize
CustomBusinessDay.rule_code
CustomBusinessDay.n
CustomBusinessDay.weekmask
CustomBusinessDay.calendar
CustomBusinessDay.holidays
pandas.tseries.offsets.CustomBusinessDay.freqstr
CustomBusinessDay.freqstr
pandas.tseries.offsets.CustomBusinessDay.kwds
CustomBusinessDay.kwds
pandas.tseries.offsets.CustomBusinessDay.name
CustomBusinessDay.name
pandas.tseries.offsets.CustomBusinessDay.nanos
CustomBusinessDay.nanos
pandas.tseries.offsets.CustomBusinessDay.normalize
CustomBusinessDay.normalize
pandas.tseries.offsets.CustomBusinessDay.rule_code
CustomBusinessDay.rule_code
pandas.tseries.offsets.CustomBusinessDay.n
CustomBusinessDay.n
pandas.tseries.offsets.CustomBusinessDay.weekmask
CustomBusinessDay.weekmask
pandas.tseries.offsets.CustomBusinessDay.calendar
CustomBusinessDay.calendar
pandas.tseries.offsets.CustomBusinessDay.holidays
CustomBusinessDay.holidays
Methods
CustomBusinessDay.apply_index
CustomBusinessDay.apply(other)
CustomBusinessDay.copy
CustomBusinessDay.isAnchored
CustomBusinessDay.onOffset
CustomBusinessDay.is_anchored
CustomBusinessDay.is_on_offset
CustomBusinessDay.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.CustomBusinessDay.apply_index
CustomBusinessDay.apply_index()
pandas.tseries.offsets.CustomBusinessDay.apply
CustomBusinessDay.apply(other)
pandas.tseries.offsets.CustomBusinessDay.copy
CustomBusinessDay.copy()
pandas.tseries.offsets.CustomBusinessDay.isAnchored
CustomBusinessDay.isAnchored()
pandas.tseries.offsets.CustomBusinessDay.onOffset
CustomBusinessDay.onOffset()
pandas.tseries.offsets.CustomBusinessDay.is_anchored
CustomBusinessDay.is_anchored()
pandas.tseries.offsets.CustomBusinessDay.is_on_offset
CustomBusinessDay.is_on_offset()
3.7.5 CustomBusinessHour
pandas.tseries.offsets.CustomBusinessHour
class pandas.tseries.offsets.CustomBusinessHour
DateOffset subclass representing possibly n custom business hours.
Parameters
n [int, default 1] The number of business hours represented.
normalize [bool, default False] Normalize start/end dates to midnight before generating date
range.
weekmask [str, Default ‘Mon Tue Wed Thu Fri’] Weekmask of valid business days, passed
to numpy.busdaycalendar.
start [str, default “09:00”] Start time of your custom business hour in 24h format.
end [str, default: “17:00”] End time of your custom business hour in 24h format.
Attributes
pandas.tseries.offsets.CustomBusinessHour.base
CustomBusinessHour.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CustomBusinessHour.next_bday
CustomBusinessHour.next_bday
Used for moving to next business day.
pandas.tseries.offsets.CustomBusinessHour.offset
CustomBusinessHour.offset
Alias for self._offset.
calendar
end
freqstr
holidays
kwds
n
name
nanos
normalize
rule_code
start
weekmask
Methods
pandas.tseries.offsets.CustomBusinessHour.__call__
CustomBusinessHour.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.CustomBusinessHour.rollback
CustomBusinessHour.rollback(other)
Roll provided date backward to next offset only if not on offset.
pandas.tseries.offsets.CustomBusinessHour.rollforward
CustomBusinessHour.rollforward(other)
Roll provided date forward to next offset only if not on offset.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
CustomBusinessHour.freqstr
CustomBusinessHour.kwds
CustomBusinessHour.name
CustomBusinessHour.nanos
CustomBusinessHour.normalize
CustomBusinessHour.rule_code
CustomBusinessHour.n
CustomBusinessHour.weekmask
CustomBusinessHour.calendar
CustomBusinessHour.holidays
CustomBusinessHour.start
CustomBusinessHour.end
pandas.tseries.offsets.CustomBusinessHour.freqstr
CustomBusinessHour.freqstr
pandas.tseries.offsets.CustomBusinessHour.kwds
CustomBusinessHour.kwds
pandas.tseries.offsets.CustomBusinessHour.name
CustomBusinessHour.name
pandas.tseries.offsets.CustomBusinessHour.nanos
CustomBusinessHour.nanos
pandas.tseries.offsets.CustomBusinessHour.normalize
CustomBusinessHour.normalize
pandas.tseries.offsets.CustomBusinessHour.rule_code
CustomBusinessHour.rule_code
pandas.tseries.offsets.CustomBusinessHour.n
CustomBusinessHour.n
pandas.tseries.offsets.CustomBusinessHour.weekmask
CustomBusinessHour.weekmask
pandas.tseries.offsets.CustomBusinessHour.calendar
CustomBusinessHour.calendar
pandas.tseries.offsets.CustomBusinessHour.holidays
CustomBusinessHour.holidays
pandas.tseries.offsets.CustomBusinessHour.start
CustomBusinessHour.start
pandas.tseries.offsets.CustomBusinessHour.end
CustomBusinessHour.end
Methods
CustomBusinessHour.apply(other)
CustomBusinessHour.apply_index(other)
CustomBusinessHour.copy
CustomBusinessHour.isAnchored
CustomBusinessHour.onOffset
CustomBusinessHour.is_anchored
CustomBusinessHour.is_on_offset
CustomBusinessHour.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.CustomBusinessHour.apply
CustomBusinessHour.apply(other)
pandas.tseries.offsets.CustomBusinessHour.apply_index
CustomBusinessHour.apply_index(other)
pandas.tseries.offsets.CustomBusinessHour.copy
CustomBusinessHour.copy()
pandas.tseries.offsets.CustomBusinessHour.isAnchored
CustomBusinessHour.isAnchored()
pandas.tseries.offsets.CustomBusinessHour.onOffset
CustomBusinessHour.onOffset()
pandas.tseries.offsets.CustomBusinessHour.is_anchored
CustomBusinessHour.is_anchored()
pandas.tseries.offsets.CustomBusinessHour.is_on_offset
CustomBusinessHour.is_on_offset()
3.7.6 MonthEnd
pandas.tseries.offsets.MonthEnd
class pandas.tseries.offsets.MonthEnd
DateOffset of one month end.
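For example (outputs assume pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import MonthEnd
>>> pd.Timestamp("2021-01-15") + MonthEnd()
Timestamp('2021-01-31 00:00:00')
>>> MonthEnd().rollforward(pd.Timestamp("2021-01-31"))  # already on offset, unchanged
Timestamp('2021-01-31 00:00:00')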
Attributes
pandas.tseries.offsets.MonthEnd.base
MonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.MonthEnd.__call__
MonthEnd.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.MonthEnd.rollback
MonthEnd.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.MonthEnd.rollforward
MonthEnd.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
MonthEnd.freqstr
MonthEnd.kwds
MonthEnd.name
MonthEnd.nanos
MonthEnd.normalize
MonthEnd.rule_code
MonthEnd.n
pandas.tseries.offsets.MonthEnd.freqstr
MonthEnd.freqstr
pandas.tseries.offsets.MonthEnd.kwds
MonthEnd.kwds
pandas.tseries.offsets.MonthEnd.name
MonthEnd.name
pandas.tseries.offsets.MonthEnd.nanos
MonthEnd.nanos
pandas.tseries.offsets.MonthEnd.normalize
MonthEnd.normalize
pandas.tseries.offsets.MonthEnd.rule_code
MonthEnd.rule_code
pandas.tseries.offsets.MonthEnd.n
MonthEnd.n
Methods
MonthEnd.apply(other)
MonthEnd.apply_index(other)
MonthEnd.copy
MonthEnd.isAnchored
MonthEnd.onOffset
MonthEnd.is_anchored
MonthEnd.is_on_offset
MonthEnd.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.MonthEnd.apply
MonthEnd.apply(other)
pandas.tseries.offsets.MonthEnd.apply_index
MonthEnd.apply_index(other)
pandas.tseries.offsets.MonthEnd.copy
MonthEnd.copy()
pandas.tseries.offsets.MonthEnd.isAnchored
MonthEnd.isAnchored()
pandas.tseries.offsets.MonthEnd.onOffset
MonthEnd.onOffset()
pandas.tseries.offsets.MonthEnd.is_anchored
MonthEnd.is_anchored()
pandas.tseries.offsets.MonthEnd.is_on_offset
MonthEnd.is_on_offset()
3.7.7 MonthBegin
pandas.tseries.offsets.MonthBegin
class pandas.tseries.offsets.MonthBegin
DateOffset of one month at the beginning of the month.
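A short example (outputs assume pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import MonthBegin
>>> pd.Timestamp("2021-01-15") + MonthBegin()
Timestamp('2021-02-01 00:00:00')
>>> MonthBegin().rollback(pd.Timestamp("2021-01-15"))
Timestamp('2021-01-01 00:00:00')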
Attributes
pandas.tseries.offsets.MonthBegin.base
MonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.MonthBegin.__call__
MonthBegin.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.MonthBegin.rollback
MonthBegin.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.MonthBegin.rollforward
MonthBegin.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
MonthBegin.freqstr
MonthBegin.kwds
MonthBegin.name
MonthBegin.nanos
MonthBegin.normalize
MonthBegin.rule_code
MonthBegin.n
pandas.tseries.offsets.MonthBegin.freqstr
MonthBegin.freqstr
pandas.tseries.offsets.MonthBegin.kwds
MonthBegin.kwds
pandas.tseries.offsets.MonthBegin.name
MonthBegin.name
pandas.tseries.offsets.MonthBegin.nanos
MonthBegin.nanos
pandas.tseries.offsets.MonthBegin.normalize
MonthBegin.normalize
pandas.tseries.offsets.MonthBegin.rule_code
MonthBegin.rule_code
pandas.tseries.offsets.MonthBegin.n
MonthBegin.n
Methods
MonthBegin.apply(other)
MonthBegin.apply_index(other)
MonthBegin.copy
MonthBegin.isAnchored
MonthBegin.onOffset
MonthBegin.is_anchored
MonthBegin.is_on_offset
MonthBegin.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.MonthBegin.apply
MonthBegin.apply(other)
pandas.tseries.offsets.MonthBegin.apply_index
MonthBegin.apply_index(other)
pandas.tseries.offsets.MonthBegin.copy
MonthBegin.copy()
pandas.tseries.offsets.MonthBegin.isAnchored
MonthBegin.isAnchored()
pandas.tseries.offsets.MonthBegin.onOffset
MonthBegin.onOffset()
pandas.tseries.offsets.MonthBegin.is_anchored
MonthBegin.is_anchored()
pandas.tseries.offsets.MonthBegin.is_on_offset
MonthBegin.is_on_offset()
3.7.8 BusinessMonthEnd
pandas.tseries.offsets.BusinessMonthEnd
class pandas.tseries.offsets.BusinessMonthEnd
DateOffset increments between the last business day of each month.
Examples
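An illustrative example: January 31, 2021 falls on a Sunday, so the last business day of the month is Friday the 29th (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import BMonthEnd  # alias of BusinessMonthEnd
>>> pd.Timestamp("2021-01-15") + BMonthEnd()
Timestamp('2021-01-29 00:00:00')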
Attributes
pandas.tseries.offsets.BusinessMonthEnd.base
BusinessMonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.BusinessMonthEnd.__call__
BusinessMonthEnd.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.BusinessMonthEnd.rollback
BusinessMonthEnd.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BusinessMonthEnd.rollforward
BusinessMonthEnd.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Alias:
pandas.tseries.offsets.BMonthEnd
pandas.tseries.offsets.BMonthEnd
alias of pandas._libs.tslibs.offsets.BusinessMonthEnd
Properties
BusinessMonthEnd.freqstr
BusinessMonthEnd.kwds
BusinessMonthEnd.name
BusinessMonthEnd.nanos
BusinessMonthEnd.normalize
BusinessMonthEnd.rule_code
BusinessMonthEnd.n
pandas.tseries.offsets.BusinessMonthEnd.freqstr
BusinessMonthEnd.freqstr
pandas.tseries.offsets.BusinessMonthEnd.kwds
BusinessMonthEnd.kwds
pandas.tseries.offsets.BusinessMonthEnd.name
BusinessMonthEnd.name
pandas.tseries.offsets.BusinessMonthEnd.nanos
BusinessMonthEnd.nanos
pandas.tseries.offsets.BusinessMonthEnd.normalize
BusinessMonthEnd.normalize
pandas.tseries.offsets.BusinessMonthEnd.rule_code
BusinessMonthEnd.rule_code
pandas.tseries.offsets.BusinessMonthEnd.n
BusinessMonthEnd.n
Methods
BusinessMonthEnd.apply(other)
BusinessMonthEnd.apply_index(other)
BusinessMonthEnd.copy
BusinessMonthEnd.isAnchored
BusinessMonthEnd.onOffset
BusinessMonthEnd.is_anchored
BusinessMonthEnd.is_on_offset
BusinessMonthEnd.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.BusinessMonthEnd.apply
BusinessMonthEnd.apply(other)
pandas.tseries.offsets.BusinessMonthEnd.apply_index
BusinessMonthEnd.apply_index(other)
pandas.tseries.offsets.BusinessMonthEnd.copy
BusinessMonthEnd.copy()
pandas.tseries.offsets.BusinessMonthEnd.isAnchored
BusinessMonthEnd.isAnchored()
pandas.tseries.offsets.BusinessMonthEnd.onOffset
BusinessMonthEnd.onOffset()
pandas.tseries.offsets.BusinessMonthEnd.is_anchored
BusinessMonthEnd.is_anchored()
pandas.tseries.offsets.BusinessMonthEnd.is_on_offset
BusinessMonthEnd.is_on_offset()
3.7.9 BusinessMonthBegin
pandas.tseries.offsets.BusinessMonthBegin
class pandas.tseries.offsets.BusinessMonthBegin
DateOffset of one month, anchored at the first business day of the month.
Examples
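An illustrative example: May 1, 2021 is a Saturday, so the first business day of May is Monday the 3rd (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import BMonthBegin  # alias of BusinessMonthBegin
>>> pd.Timestamp("2021-04-15") + BMonthBegin()
Timestamp('2021-05-03 00:00:00')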
Attributes
pandas.tseries.offsets.BusinessMonthBegin.base
BusinessMonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.BusinessMonthBegin.__call__
BusinessMonthBegin.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.BusinessMonthBegin.rollback
BusinessMonthBegin.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BusinessMonthBegin.rollforward
BusinessMonthBegin.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Alias:
pandas.tseries.offsets.BMonthBegin
pandas.tseries.offsets.BMonthBegin
alias of pandas._libs.tslibs.offsets.BusinessMonthBegin
Properties
BusinessMonthBegin.freqstr
BusinessMonthBegin.kwds
BusinessMonthBegin.name
BusinessMonthBegin.nanos
BusinessMonthBegin.normalize
BusinessMonthBegin.rule_code
BusinessMonthBegin.n
pandas.tseries.offsets.BusinessMonthBegin.freqstr
BusinessMonthBegin.freqstr
pandas.tseries.offsets.BusinessMonthBegin.kwds
BusinessMonthBegin.kwds
pandas.tseries.offsets.BusinessMonthBegin.name
BusinessMonthBegin.name
pandas.tseries.offsets.BusinessMonthBegin.nanos
BusinessMonthBegin.nanos
pandas.tseries.offsets.BusinessMonthBegin.normalize
BusinessMonthBegin.normalize
pandas.tseries.offsets.BusinessMonthBegin.rule_code
BusinessMonthBegin.rule_code
pandas.tseries.offsets.BusinessMonthBegin.n
BusinessMonthBegin.n
Methods
BusinessMonthBegin.apply(other)
BusinessMonthBegin.apply_index(other)
BusinessMonthBegin.copy
BusinessMonthBegin.isAnchored
BusinessMonthBegin.onOffset
BusinessMonthBegin.is_anchored
BusinessMonthBegin.is_on_offset
BusinessMonthBegin.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.BusinessMonthBegin.apply
BusinessMonthBegin.apply(other)
pandas.tseries.offsets.BusinessMonthBegin.apply_index
BusinessMonthBegin.apply_index(other)
pandas.tseries.offsets.BusinessMonthBegin.copy
BusinessMonthBegin.copy()
pandas.tseries.offsets.BusinessMonthBegin.isAnchored
BusinessMonthBegin.isAnchored()
pandas.tseries.offsets.BusinessMonthBegin.onOffset
BusinessMonthBegin.onOffset()
pandas.tseries.offsets.BusinessMonthBegin.is_anchored
BusinessMonthBegin.is_anchored()
pandas.tseries.offsets.BusinessMonthBegin.is_on_offset
BusinessMonthBegin.is_on_offset()
3.7.10 CustomBusinessMonthEnd
pandas.tseries.offsets.CustomBusinessMonthEnd
class pandas.tseries.offsets.CustomBusinessMonthEnd
Attributes
pandas.tseries.offsets.CustomBusinessMonthEnd.base
CustomBusinessMonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CustomBusinessMonthEnd.cbday_roll
CustomBusinessMonthEnd.cbday_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CustomBusinessMonthEnd.month_roll
CustomBusinessMonthEnd.month_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CustomBusinessMonthEnd.offset
CustomBusinessMonthEnd.offset
Alias for self._offset.
calendar
freqstr
holidays
kwds
m_offset
n
name
nanos
normalize
rule_code
weekmask
Methods
pandas.tseries.offsets.CustomBusinessMonthEnd.__call__
CustomBusinessMonthEnd.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.CustomBusinessMonthEnd.rollback
CustomBusinessMonthEnd.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CustomBusinessMonthEnd.rollforward
CustomBusinessMonthEnd.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Alias:
pandas.tseries.offsets.CBMonthEnd
pandas.tseries.offsets.CBMonthEnd
alias of pandas._libs.tslibs.offsets.CustomBusinessMonthEnd
Properties
CustomBusinessMonthEnd.freqstr
CustomBusinessMonthEnd.kwds
CustomBusinessMonthEnd.m_offset
CustomBusinessMonthEnd.name
CustomBusinessMonthEnd.nanos
CustomBusinessMonthEnd.normalize
CustomBusinessMonthEnd.rule_code
CustomBusinessMonthEnd.n
CustomBusinessMonthEnd.weekmask
CustomBusinessMonthEnd.calendar
CustomBusinessMonthEnd.holidays
pandas.tseries.offsets.CustomBusinessMonthEnd.freqstr
CustomBusinessMonthEnd.freqstr
pandas.tseries.offsets.CustomBusinessMonthEnd.kwds
CustomBusinessMonthEnd.kwds
pandas.tseries.offsets.CustomBusinessMonthEnd.m_offset
CustomBusinessMonthEnd.m_offset
pandas.tseries.offsets.CustomBusinessMonthEnd.name
CustomBusinessMonthEnd.name
pandas.tseries.offsets.CustomBusinessMonthEnd.nanos
CustomBusinessMonthEnd.nanos
pandas.tseries.offsets.CustomBusinessMonthEnd.normalize
CustomBusinessMonthEnd.normalize
pandas.tseries.offsets.CustomBusinessMonthEnd.rule_code
CustomBusinessMonthEnd.rule_code
pandas.tseries.offsets.CustomBusinessMonthEnd.n
CustomBusinessMonthEnd.n
pandas.tseries.offsets.CustomBusinessMonthEnd.weekmask
CustomBusinessMonthEnd.weekmask
pandas.tseries.offsets.CustomBusinessMonthEnd.calendar
CustomBusinessMonthEnd.calendar
pandas.tseries.offsets.CustomBusinessMonthEnd.holidays
CustomBusinessMonthEnd.holidays
Methods
CustomBusinessMonthEnd.apply(other)
CustomBusinessMonthEnd.apply_index(other)
CustomBusinessMonthEnd.copy
CustomBusinessMonthEnd.isAnchored
CustomBusinessMonthEnd.onOffset
CustomBusinessMonthEnd.is_anchored
CustomBusinessMonthEnd.is_on_offset
CustomBusinessMonthEnd.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.CustomBusinessMonthEnd.apply
CustomBusinessMonthEnd.apply(other)
pandas.tseries.offsets.CustomBusinessMonthEnd.apply_index
CustomBusinessMonthEnd.apply_index(other)
pandas.tseries.offsets.CustomBusinessMonthEnd.copy
CustomBusinessMonthEnd.copy()
pandas.tseries.offsets.CustomBusinessMonthEnd.isAnchored
CustomBusinessMonthEnd.isAnchored()
pandas.tseries.offsets.CustomBusinessMonthEnd.onOffset
CustomBusinessMonthEnd.onOffset()
pandas.tseries.offsets.CustomBusinessMonthEnd.is_anchored
CustomBusinessMonthEnd.is_anchored()
pandas.tseries.offsets.CustomBusinessMonthEnd.is_on_offset
CustomBusinessMonthEnd.is_on_offset()
3.7.11 CustomBusinessMonthBegin
pandas.tseries.offsets.CustomBusinessMonthBegin
class pandas.tseries.offsets.CustomBusinessMonthBegin
Attributes
pandas.tseries.offsets.CustomBusinessMonthBegin.base
CustomBusinessMonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
pandas.tseries.offsets.CustomBusinessMonthBegin.cbday_roll
CustomBusinessMonthBegin.cbday_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CustomBusinessMonthBegin.month_roll
CustomBusinessMonthBegin.month_roll
Define default roll function to be called in apply method.
pandas.tseries.offsets.CustomBusinessMonthBegin.offset
CustomBusinessMonthBegin.offset
Alias for self._offset.
calendar
freqstr
holidays
kwds
m_offset
n
name
nanos
normalize
rule_code
weekmask
Methods
pandas.tseries.offsets.CustomBusinessMonthBegin.__call__
CustomBusinessMonthBegin.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.CustomBusinessMonthBegin.rollback
CustomBusinessMonthBegin.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.CustomBusinessMonthBegin.rollforward
CustomBusinessMonthBegin.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Alias:
pandas.tseries.offsets.CBMonthBegin
pandas.tseries.offsets.CBMonthBegin
alias of pandas._libs.tslibs.offsets.CustomBusinessMonthBegin
Properties
CustomBusinessMonthBegin.freqstr
CustomBusinessMonthBegin.kwds
CustomBusinessMonthBegin.m_offset
CustomBusinessMonthBegin.name
CustomBusinessMonthBegin.nanos
CustomBusinessMonthBegin.normalize
CustomBusinessMonthBegin.rule_code
CustomBusinessMonthBegin.n
CustomBusinessMonthBegin.weekmask
CustomBusinessMonthBegin.calendar
CustomBusinessMonthBegin.holidays
pandas.tseries.offsets.CustomBusinessMonthBegin.freqstr
CustomBusinessMonthBegin.freqstr
pandas.tseries.offsets.CustomBusinessMonthBegin.kwds
CustomBusinessMonthBegin.kwds
pandas.tseries.offsets.CustomBusinessMonthBegin.m_offset
CustomBusinessMonthBegin.m_offset
pandas.tseries.offsets.CustomBusinessMonthBegin.name
CustomBusinessMonthBegin.name
pandas.tseries.offsets.CustomBusinessMonthBegin.nanos
CustomBusinessMonthBegin.nanos
pandas.tseries.offsets.CustomBusinessMonthBegin.normalize
CustomBusinessMonthBegin.normalize
pandas.tseries.offsets.CustomBusinessMonthBegin.rule_code
CustomBusinessMonthBegin.rule_code
pandas.tseries.offsets.CustomBusinessMonthBegin.n
CustomBusinessMonthBegin.n
pandas.tseries.offsets.CustomBusinessMonthBegin.weekmask
CustomBusinessMonthBegin.weekmask
pandas.tseries.offsets.CustomBusinessMonthBegin.calendar
CustomBusinessMonthBegin.calendar
pandas.tseries.offsets.CustomBusinessMonthBegin.holidays
CustomBusinessMonthBegin.holidays
Methods
CustomBusinessMonthBegin.apply(other)
CustomBusinessMonthBegin.apply_index(other)
CustomBusinessMonthBegin.copy
CustomBusinessMonthBegin.isAnchored
CustomBusinessMonthBegin.onOffset
CustomBusinessMonthBegin.is_anchored
CustomBusinessMonthBegin.is_on_offset
CustomBusinessMonthBegin.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.CustomBusinessMonthBegin.apply
CustomBusinessMonthBegin.apply(other)
pandas.tseries.offsets.CustomBusinessMonthBegin.apply_index
CustomBusinessMonthBegin.apply_index(other)
pandas.tseries.offsets.CustomBusinessMonthBegin.copy
CustomBusinessMonthBegin.copy()
pandas.tseries.offsets.CustomBusinessMonthBegin.isAnchored
CustomBusinessMonthBegin.isAnchored()
pandas.tseries.offsets.CustomBusinessMonthBegin.onOffset
CustomBusinessMonthBegin.onOffset()
pandas.tseries.offsets.CustomBusinessMonthBegin.is_anchored
CustomBusinessMonthBegin.is_anchored()
pandas.tseries.offsets.CustomBusinessMonthBegin.is_on_offset
CustomBusinessMonthBegin.is_on_offset()
3.7.12 SemiMonthEnd
pandas.tseries.offsets.SemiMonthEnd
class pandas.tseries.offsets.SemiMonthEnd
Two DateOffsets per month, repeating on the last day of the month and on day_of_month.
Parameters
n [int]
normalize [bool, default False]
day_of_month [int, {1, 3, . . . , 27}, default 15]
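For illustration, SemiMonthEnd anchors on the 15th and on the month end; the matching frequency string is “SM” (outputs assume pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import SemiMonthEnd
>>> pd.Timestamp("2021-01-20") + SemiMonthEnd()
Timestamp('2021-01-31 00:00:00')
>>> pd.date_range("2021-01-01", periods=4, freq="SM")  # 2021-01-15, 2021-01-31, 2021-02-15, 2021-02-28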
Attributes
pandas.tseries.offsets.SemiMonthEnd.base
SemiMonthEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
day_of_month
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.SemiMonthEnd.__call__
SemiMonthEnd.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.SemiMonthEnd.rollback
SemiMonthEnd.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.SemiMonthEnd.rollforward
SemiMonthEnd.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
SemiMonthEnd.freqstr
SemiMonthEnd.kwds
SemiMonthEnd.name
SemiMonthEnd.nanos
SemiMonthEnd.normalize
SemiMonthEnd.rule_code
SemiMonthEnd.n
SemiMonthEnd.day_of_month
pandas.tseries.offsets.SemiMonthEnd.freqstr
SemiMonthEnd.freqstr
pandas.tseries.offsets.SemiMonthEnd.kwds
SemiMonthEnd.kwds
pandas.tseries.offsets.SemiMonthEnd.name
SemiMonthEnd.name
pandas.tseries.offsets.SemiMonthEnd.nanos
SemiMonthEnd.nanos
pandas.tseries.offsets.SemiMonthEnd.normalize
SemiMonthEnd.normalize
pandas.tseries.offsets.SemiMonthEnd.rule_code
SemiMonthEnd.rule_code
pandas.tseries.offsets.SemiMonthEnd.n
SemiMonthEnd.n
pandas.tseries.offsets.SemiMonthEnd.day_of_month
SemiMonthEnd.day_of_month
Methods
SemiMonthEnd.apply(other)
SemiMonthEnd.apply_index(other)
SemiMonthEnd.copy
SemiMonthEnd.isAnchored
SemiMonthEnd.onOffset
SemiMonthEnd.is_anchored
SemiMonthEnd.is_on_offset
SemiMonthEnd.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.SemiMonthEnd.apply
SemiMonthEnd.apply(other)
pandas.tseries.offsets.SemiMonthEnd.apply_index
SemiMonthEnd.apply_index(other)
pandas.tseries.offsets.SemiMonthEnd.copy
SemiMonthEnd.copy()
pandas.tseries.offsets.SemiMonthEnd.isAnchored
SemiMonthEnd.isAnchored()
pandas.tseries.offsets.SemiMonthEnd.onOffset
SemiMonthEnd.onOffset()
pandas.tseries.offsets.SemiMonthEnd.is_anchored
SemiMonthEnd.is_anchored()
pandas.tseries.offsets.SemiMonthEnd.is_on_offset
SemiMonthEnd.is_on_offset()
3.7.13 SemiMonthBegin
pandas.tseries.offsets.SemiMonthBegin
class pandas.tseries.offsets.SemiMonthBegin
Two DateOffsets per month, repeating on the first day of the month and on day_of_month.
Parameters
n [int]
normalize [bool, default False]
day_of_month [int, {2, 3, . . . , 27}, default 15]
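For illustration, SemiMonthBegin anchors on the 1st and on day_of_month; the matching frequency string is “SMS” (outputs assume pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import SemiMonthBegin
>>> pd.Timestamp("2021-01-20") + SemiMonthBegin()
Timestamp('2021-02-01 00:00:00')
>>> pd.date_range("2021-01-01", periods=4, freq="SMS")  # 1st and 15th of January and February 2021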
Attributes
pandas.tseries.offsets.SemiMonthBegin.base
SemiMonthBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
day_of_month
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.SemiMonthBegin.__call__
SemiMonthBegin.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.SemiMonthBegin.rollback
SemiMonthBegin.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.SemiMonthBegin.rollforward
SemiMonthBegin.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
SemiMonthBegin.freqstr
SemiMonthBegin.kwds
SemiMonthBegin.name
SemiMonthBegin.nanos
SemiMonthBegin.normalize
SemiMonthBegin.rule_code
SemiMonthBegin.n
SemiMonthBegin.day_of_month
pandas.tseries.offsets.SemiMonthBegin.freqstr
SemiMonthBegin.freqstr
pandas.tseries.offsets.SemiMonthBegin.kwds
SemiMonthBegin.kwds
pandas.tseries.offsets.SemiMonthBegin.name
SemiMonthBegin.name
pandas.tseries.offsets.SemiMonthBegin.nanos
SemiMonthBegin.nanos
pandas.tseries.offsets.SemiMonthBegin.normalize
SemiMonthBegin.normalize
pandas.tseries.offsets.SemiMonthBegin.rule_code
SemiMonthBegin.rule_code
pandas.tseries.offsets.SemiMonthBegin.n
SemiMonthBegin.n
pandas.tseries.offsets.SemiMonthBegin.day_of_month
SemiMonthBegin.day_of_month
Methods
SemiMonthBegin.apply(other)
SemiMonthBegin.apply_index(other)
SemiMonthBegin.copy
SemiMonthBegin.isAnchored
SemiMonthBegin.onOffset
SemiMonthBegin.is_anchored
SemiMonthBegin.is_on_offset
SemiMonthBegin.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.SemiMonthBegin.apply
SemiMonthBegin.apply(other)
pandas.tseries.offsets.SemiMonthBegin.apply_index
SemiMonthBegin.apply_index(other)
pandas.tseries.offsets.SemiMonthBegin.copy
SemiMonthBegin.copy()
pandas.tseries.offsets.SemiMonthBegin.isAnchored
SemiMonthBegin.isAnchored()
pandas.tseries.offsets.SemiMonthBegin.onOffset
SemiMonthBegin.onOffset()
pandas.tseries.offsets.SemiMonthBegin.is_anchored
SemiMonthBegin.is_anchored()
pandas.tseries.offsets.SemiMonthBegin.is_on_offset
SemiMonthBegin.is_on_offset()
3.7.14 Week
pandas.tseries.offsets.Week
class pandas.tseries.offsets.Week
Weekly offset.
Parameters
weekday [int or None, default None] Always generate specific day of week. 0 for Monday.
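For example (January 1, 2021 is a Friday; outputs assume pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import Week
>>> pd.Timestamp("2021-01-01") + Week()             # plain weekly step
Timestamp('2021-01-08 00:00:00')
>>> pd.Timestamp("2021-01-01") + Week(weekday=0)    # anchored on Mondays
Timestamp('2021-01-04 00:00:00')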
Attributes
pandas.tseries.offsets.Week.base
Week.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
weekday
Methods
pandas.tseries.offsets.Week.__call__
Week.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Week.rollback
Week.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Week.rollforward
Week.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Week.freqstr
Week.kwds
Week.name
Week.nanos
Week.normalize
Week.rule_code
Week.n
Week.weekday
pandas.tseries.offsets.Week.freqstr
Week.freqstr
pandas.tseries.offsets.Week.kwds
Week.kwds
pandas.tseries.offsets.Week.name
Week.name
pandas.tseries.offsets.Week.nanos
Week.nanos
pandas.tseries.offsets.Week.normalize
Week.normalize
pandas.tseries.offsets.Week.rule_code
Week.rule_code
pandas.tseries.offsets.Week.n
Week.n
pandas.tseries.offsets.Week.weekday
Week.weekday
Methods
Week.apply(other)
Week.apply_index(other)
Week.copy
Week.isAnchored
Week.onOffset
Week.is_anchored
Week.is_on_offset
Week.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.Week.apply
Week.apply(other)
pandas.tseries.offsets.Week.apply_index
Week.apply_index(other)
pandas.tseries.offsets.Week.copy
Week.copy()
pandas.tseries.offsets.Week.isAnchored
Week.isAnchored()
pandas.tseries.offsets.Week.onOffset
Week.onOffset()
pandas.tseries.offsets.Week.is_anchored
Week.is_anchored()
pandas.tseries.offsets.Week.is_on_offset
Week.is_on_offset()
3.7.15 WeekOfMonth
pandas.tseries.offsets.WeekOfMonth
class pandas.tseries.offsets.WeekOfMonth
Describes monthly dates like “the Tuesday of the 2nd week of each month”.
Parameters
n [int]
week [int {0, 1, 2, 3, . . . }, default 0] A specific integer for the week of the month, e.g. 0 is
the 1st week of the month, 1 is the 2nd week, etc.
weekday [int {0, 1, . . . , 6}, default 0] A specific integer for the day of the week.
• 0 is Monday
• 1 is Tuesday
• 2 is Wednesday
• 3 is Thursday
• 4 is Friday
• 5 is Saturday
• 6 is Sunday.
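For instance, “the Wednesday of the 2nd week of each month” corresponds to week=1, weekday=2 (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import WeekOfMonth
>>> pd.Timestamp("2021-01-01") + WeekOfMonth(week=1, weekday=2)
Timestamp('2021-01-13 00:00:00')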
Attributes
pandas.tseries.offsets.WeekOfMonth.base
WeekOfMonth.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
week
weekday
Methods
pandas.tseries.offsets.WeekOfMonth.__call__
WeekOfMonth.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.WeekOfMonth.rollback
WeekOfMonth.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.WeekOfMonth.rollforward
WeekOfMonth.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
WeekOfMonth.freqstr
WeekOfMonth.kwds
WeekOfMonth.name
WeekOfMonth.nanos
WeekOfMonth.normalize
WeekOfMonth.rule_code
WeekOfMonth.n
WeekOfMonth.week
pandas.tseries.offsets.WeekOfMonth.freqstr
WeekOfMonth.freqstr
pandas.tseries.offsets.WeekOfMonth.kwds
WeekOfMonth.kwds
pandas.tseries.offsets.WeekOfMonth.name
WeekOfMonth.name
pandas.tseries.offsets.WeekOfMonth.nanos
WeekOfMonth.nanos
pandas.tseries.offsets.WeekOfMonth.normalize
WeekOfMonth.normalize
pandas.tseries.offsets.WeekOfMonth.rule_code
WeekOfMonth.rule_code
pandas.tseries.offsets.WeekOfMonth.n
WeekOfMonth.n
pandas.tseries.offsets.WeekOfMonth.week
WeekOfMonth.week
Methods
WeekOfMonth.apply(other)
WeekOfMonth.apply_index(other)
WeekOfMonth.copy
WeekOfMonth.isAnchored
WeekOfMonth.onOffset
WeekOfMonth.is_anchored
WeekOfMonth.is_on_offset
WeekOfMonth.__call__(*args, **kwargs) Call self as a function.
WeekOfMonth.weekday
pandas.tseries.offsets.WeekOfMonth.apply
WeekOfMonth.apply(other)
pandas.tseries.offsets.WeekOfMonth.apply_index
WeekOfMonth.apply_index(other)
pandas.tseries.offsets.WeekOfMonth.copy
WeekOfMonth.copy()
pandas.tseries.offsets.WeekOfMonth.isAnchored
WeekOfMonth.isAnchored()
pandas.tseries.offsets.WeekOfMonth.onOffset
WeekOfMonth.onOffset()
pandas.tseries.offsets.WeekOfMonth.is_anchored
WeekOfMonth.is_anchored()
pandas.tseries.offsets.WeekOfMonth.is_on_offset
WeekOfMonth.is_on_offset()
pandas.tseries.offsets.WeekOfMonth.weekday
WeekOfMonth.weekday
3.7.16 LastWeekOfMonth
pandas.tseries.offsets.LastWeekOfMonth
class pandas.tseries.offsets.LastWeekOfMonth
Describes monthly dates in the last week of the month, like “the last Tuesday of each month”.
Parameters
n [int, default 1]
weekday [int {0, 1, . . . , 6}, default 0] A specific integer for the day of the week.
• 0 is Monday
• 1 is Tuesday
• 2 is Wednesday
• 3 is Thursday
• 4 is Friday
• 5 is Saturday
• 6 is Sunday.
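For instance, “the last Tuesday of each month” corresponds to weekday=1 (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import LastWeekOfMonth
>>> pd.Timestamp("2021-01-01") + LastWeekOfMonth(weekday=1)
Timestamp('2021-01-26 00:00:00')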
Attributes
pandas.tseries.offsets.LastWeekOfMonth.base
LastWeekOfMonth.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
week
weekday
Methods
pandas.tseries.offsets.LastWeekOfMonth.__call__
LastWeekOfMonth.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.LastWeekOfMonth.rollback
LastWeekOfMonth.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.LastWeekOfMonth.rollforward
LastWeekOfMonth.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
LastWeekOfMonth.freqstr
LastWeekOfMonth.kwds
LastWeekOfMonth.name
LastWeekOfMonth.nanos
LastWeekOfMonth.normalize
LastWeekOfMonth.rule_code
LastWeekOfMonth.n
LastWeekOfMonth.weekday
LastWeekOfMonth.week
pandas.tseries.offsets.LastWeekOfMonth.freqstr
LastWeekOfMonth.freqstr
pandas.tseries.offsets.LastWeekOfMonth.kwds
LastWeekOfMonth.kwds
pandas.tseries.offsets.LastWeekOfMonth.name
LastWeekOfMonth.name
pandas.tseries.offsets.LastWeekOfMonth.nanos
LastWeekOfMonth.nanos
pandas.tseries.offsets.LastWeekOfMonth.normalize
LastWeekOfMonth.normalize
pandas.tseries.offsets.LastWeekOfMonth.rule_code
LastWeekOfMonth.rule_code
pandas.tseries.offsets.LastWeekOfMonth.n
LastWeekOfMonth.n
pandas.tseries.offsets.LastWeekOfMonth.weekday
LastWeekOfMonth.weekday
pandas.tseries.offsets.LastWeekOfMonth.week
LastWeekOfMonth.week
Methods
LastWeekOfMonth.apply(other)
LastWeekOfMonth.apply_index(other)
LastWeekOfMonth.copy
LastWeekOfMonth.isAnchored
LastWeekOfMonth.onOffset
LastWeekOfMonth.is_anchored
LastWeekOfMonth.is_on_offset
LastWeekOfMonth.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.LastWeekOfMonth.apply
LastWeekOfMonth.apply(other)
pandas.tseries.offsets.LastWeekOfMonth.apply_index
LastWeekOfMonth.apply_index(other)
pandas.tseries.offsets.LastWeekOfMonth.copy
LastWeekOfMonth.copy()
pandas.tseries.offsets.LastWeekOfMonth.isAnchored
LastWeekOfMonth.isAnchored()
pandas.tseries.offsets.LastWeekOfMonth.onOffset
LastWeekOfMonth.onOffset()
pandas.tseries.offsets.LastWeekOfMonth.is_anchored
LastWeekOfMonth.is_anchored()
pandas.tseries.offsets.LastWeekOfMonth.is_on_offset
LastWeekOfMonth.is_on_offset()
3.7.17 BQuarterEnd
pandas.tseries.offsets.BQuarterEnd
class pandas.tseries.offsets.BQuarterEnd
DateOffset increments between the last business day of each Quarter.
startingMonth = 1 corresponds to dates like 1/31/2007, 4/30/2007, . . .
startingMonth = 2 corresponds to dates like 2/28/2007, 5/31/2007, . . .
startingMonth = 3 corresponds to dates like 3/30/2007, 6/29/2007, . . .
Examples
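An illustrative example: with startingMonth=3 the quarters end in March, June, September and December, and March 31, 2021 is a Wednesday (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import BQuarterEnd
>>> pd.Timestamp("2021-01-15") + BQuarterEnd(startingMonth=3)
Timestamp('2021-03-31 00:00:00')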
Attributes
pandas.tseries.offsets.BQuarterEnd.base
BQuarterEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
startingMonth
Methods
pandas.tseries.offsets.BQuarterEnd.__call__
BQuarterEnd.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.BQuarterEnd.rollback
BQuarterEnd.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BQuarterEnd.rollforward
BQuarterEnd.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
BQuarterEnd.freqstr
BQuarterEnd.kwds
BQuarterEnd.name
BQuarterEnd.nanos
BQuarterEnd.normalize
BQuarterEnd.rule_code
BQuarterEnd.n
BQuarterEnd.startingMonth
pandas.tseries.offsets.BQuarterEnd.freqstr
BQuarterEnd.freqstr
pandas.tseries.offsets.BQuarterEnd.kwds
BQuarterEnd.kwds
pandas.tseries.offsets.BQuarterEnd.name
BQuarterEnd.name
pandas.tseries.offsets.BQuarterEnd.nanos
BQuarterEnd.nanos
pandas.tseries.offsets.BQuarterEnd.normalize
BQuarterEnd.normalize
pandas.tseries.offsets.BQuarterEnd.rule_code
BQuarterEnd.rule_code
pandas.tseries.offsets.BQuarterEnd.n
BQuarterEnd.n
pandas.tseries.offsets.BQuarterEnd.startingMonth
BQuarterEnd.startingMonth
Methods
BQuarterEnd.apply(other)
BQuarterEnd.apply_index(other)
BQuarterEnd.copy
BQuarterEnd.isAnchored
BQuarterEnd.onOffset
BQuarterEnd.is_anchored
BQuarterEnd.is_on_offset
BQuarterEnd.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.BQuarterEnd.apply
BQuarterEnd.apply(other)
pandas.tseries.offsets.BQuarterEnd.apply_index
BQuarterEnd.apply_index(other)
pandas.tseries.offsets.BQuarterEnd.copy
BQuarterEnd.copy()
pandas.tseries.offsets.BQuarterEnd.isAnchored
BQuarterEnd.isAnchored()
pandas.tseries.offsets.BQuarterEnd.onOffset
BQuarterEnd.onOffset()
pandas.tseries.offsets.BQuarterEnd.is_anchored
BQuarterEnd.is_anchored()
pandas.tseries.offsets.BQuarterEnd.is_on_offset
BQuarterEnd.is_on_offset()
3.7.18 BQuarterBegin
pandas.tseries.offsets.BQuarterBegin
class pandas.tseries.offsets.BQuarterBegin
DateOffset increments between the first business day of each Quarter.
startingMonth = 1 corresponds to dates like 1/01/2007, 4/01/2007, . . .
startingMonth = 2 corresponds to dates like 2/01/2007, 5/01/2007, . . .
startingMonth = 3 corresponds to dates like 3/01/2007, 6/01/2007, . . .
Examples
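An illustrative example: with startingMonth=1 the quarters begin in January, April, July and October, and April 1, 2021 is a Thursday (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import BQuarterBegin
>>> pd.Timestamp("2021-02-15") + BQuarterBegin(startingMonth=1)
Timestamp('2021-04-01 00:00:00')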
Attributes
pandas.tseries.offsets.BQuarterBegin.base
BQuarterBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
startingMonth
Methods
pandas.tseries.offsets.BQuarterBegin.__call__
BQuarterBegin.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.BQuarterBegin.rollback
BQuarterBegin.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BQuarterBegin.rollforward
BQuarterBegin.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
BQuarterBegin.freqstr
BQuarterBegin.kwds
BQuarterBegin.name
BQuarterBegin.nanos
BQuarterBegin.normalize
BQuarterBegin.rule_code
BQuarterBegin.n
BQuarterBegin.startingMonth
pandas.tseries.offsets.BQuarterBegin.freqstr
BQuarterBegin.freqstr
pandas.tseries.offsets.BQuarterBegin.kwds
BQuarterBegin.kwds
pandas.tseries.offsets.BQuarterBegin.name
BQuarterBegin.name
pandas.tseries.offsets.BQuarterBegin.nanos
BQuarterBegin.nanos
pandas.tseries.offsets.BQuarterBegin.normalize
BQuarterBegin.normalize
pandas.tseries.offsets.BQuarterBegin.rule_code
BQuarterBegin.rule_code
pandas.tseries.offsets.BQuarterBegin.n
BQuarterBegin.n
pandas.tseries.offsets.BQuarterBegin.startingMonth
BQuarterBegin.startingMonth
Methods
BQuarterBegin.apply(other)
BQuarterBegin.apply_index(other)
BQuarterBegin.copy
BQuarterBegin.isAnchored
BQuarterBegin.onOffset
BQuarterBegin.is_anchored
BQuarterBegin.is_on_offset
BQuarterBegin.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.BQuarterBegin.apply
BQuarterBegin.apply(other)
pandas.tseries.offsets.BQuarterBegin.apply_index
BQuarterBegin.apply_index(other)
pandas.tseries.offsets.BQuarterBegin.copy
BQuarterBegin.copy()
pandas.tseries.offsets.BQuarterBegin.isAnchored
BQuarterBegin.isAnchored()
pandas.tseries.offsets.BQuarterBegin.onOffset
BQuarterBegin.onOffset()
pandas.tseries.offsets.BQuarterBegin.is_anchored
BQuarterBegin.is_anchored()
pandas.tseries.offsets.BQuarterBegin.is_on_offset
BQuarterBegin.is_on_offset()
3.7.19 QuarterEnd
pandas.tseries.offsets.QuarterEnd
class pandas.tseries.offsets.QuarterEnd
DateOffset increments between Quarter end dates.
startingMonth = 1 corresponds to dates like 1/31/2007, 4/30/2007, . . .
startingMonth = 2 corresponds to dates like 2/28/2007, 5/31/2007, . . .
startingMonth = 3 corresponds to dates like 3/31/2007, 6/30/2007, . . .
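For example (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import QuarterEnd
>>> pd.Timestamp("2021-01-15") + QuarterEnd(startingMonth=3)
Timestamp('2021-03-31 00:00:00')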
Attributes
pandas.tseries.offsets.QuarterEnd.base
QuarterEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
startingMonth
Methods
pandas.tseries.offsets.QuarterEnd.__call__
QuarterEnd.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.QuarterEnd.rollback
QuarterEnd.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.QuarterEnd.rollforward
QuarterEnd.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
QuarterEnd.freqstr
QuarterEnd.kwds
QuarterEnd.name
QuarterEnd.nanos
QuarterEnd.normalize
QuarterEnd.rule_code
QuarterEnd.n
QuarterEnd.startingMonth
pandas.tseries.offsets.QuarterEnd.freqstr
QuarterEnd.freqstr
pandas.tseries.offsets.QuarterEnd.kwds
QuarterEnd.kwds
pandas.tseries.offsets.QuarterEnd.name
QuarterEnd.name
pandas.tseries.offsets.QuarterEnd.nanos
QuarterEnd.nanos
pandas.tseries.offsets.QuarterEnd.normalize
QuarterEnd.normalize
pandas.tseries.offsets.QuarterEnd.rule_code
QuarterEnd.rule_code
pandas.tseries.offsets.QuarterEnd.n
QuarterEnd.n
pandas.tseries.offsets.QuarterEnd.startingMonth
QuarterEnd.startingMonth
Methods
QuarterEnd.apply(other)
QuarterEnd.apply_index(other)
QuarterEnd.copy
QuarterEnd.isAnchored
QuarterEnd.onOffset
QuarterEnd.is_anchored
QuarterEnd.is_on_offset
QuarterEnd.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.QuarterEnd.apply
QuarterEnd.apply(other)
pandas.tseries.offsets.QuarterEnd.apply_index
QuarterEnd.apply_index(other)
pandas.tseries.offsets.QuarterEnd.copy
QuarterEnd.copy()
pandas.tseries.offsets.QuarterEnd.isAnchored
QuarterEnd.isAnchored()
pandas.tseries.offsets.QuarterEnd.onOffset
QuarterEnd.onOffset()
pandas.tseries.offsets.QuarterEnd.is_anchored
QuarterEnd.is_anchored()
pandas.tseries.offsets.QuarterEnd.is_on_offset
QuarterEnd.is_on_offset()
3.7.20 QuarterBegin
pandas.tseries.offsets.QuarterBegin
class pandas.tseries.offsets.QuarterBegin
DateOffset increments between Quarter start dates.
startingMonth = 1 corresponds to dates like 1/01/2007, 4/01/2007, . . .
startingMonth = 2 corresponds to dates like 2/01/2007, 5/01/2007, . . .
startingMonth = 3 corresponds to dates like 3/01/2007, 6/01/2007, . . .
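For example (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import QuarterBegin
>>> pd.Timestamp("2021-02-15") + QuarterBegin(startingMonth=1)
Timestamp('2021-04-01 00:00:00')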
Attributes
pandas.tseries.offsets.QuarterBegin.base
QuarterBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
startingMonth
Methods
pandas.tseries.offsets.QuarterBegin.__call__
QuarterBegin.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.QuarterBegin.rollback
QuarterBegin.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.QuarterBegin.rollforward
QuarterBegin.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
QuarterBegin.freqstr
QuarterBegin.kwds
QuarterBegin.name
QuarterBegin.nanos
QuarterBegin.normalize
QuarterBegin.rule_code
QuarterBegin.n
QuarterBegin.startingMonth
pandas.tseries.offsets.QuarterBegin.freqstr
QuarterBegin.freqstr
pandas.tseries.offsets.QuarterBegin.kwds
QuarterBegin.kwds
pandas.tseries.offsets.QuarterBegin.name
QuarterBegin.name
pandas.tseries.offsets.QuarterBegin.nanos
QuarterBegin.nanos
pandas.tseries.offsets.QuarterBegin.normalize
QuarterBegin.normalize
pandas.tseries.offsets.QuarterBegin.rule_code
QuarterBegin.rule_code
pandas.tseries.offsets.QuarterBegin.n
QuarterBegin.n
pandas.tseries.offsets.QuarterBegin.startingMonth
QuarterBegin.startingMonth
Methods
QuarterBegin.apply(other)
QuarterBegin.apply_index(other)
QuarterBegin.copy
QuarterBegin.isAnchored
QuarterBegin.onOffset
QuarterBegin.is_anchored
QuarterBegin.is_on_offset
QuarterBegin.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.QuarterBegin.apply
QuarterBegin.apply(other)
pandas.tseries.offsets.QuarterBegin.apply_index
QuarterBegin.apply_index(other)
pandas.tseries.offsets.QuarterBegin.copy
QuarterBegin.copy()
pandas.tseries.offsets.QuarterBegin.isAnchored
QuarterBegin.isAnchored()
pandas.tseries.offsets.QuarterBegin.onOffset
QuarterBegin.onOffset()
pandas.tseries.offsets.QuarterBegin.is_anchored
QuarterBegin.is_anchored()
pandas.tseries.offsets.QuarterBegin.is_on_offset
QuarterBegin.is_on_offset()
3.7.21 BYearEnd
pandas.tseries.offsets.BYearEnd
class pandas.tseries.offsets.BYearEnd
DateOffset increments between the last business day of the year.
Examples
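An illustrative example: December 31, 2021 is a Friday and therefore also the last business day of the year (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import BYearEnd
>>> pd.Timestamp("2021-05-01") + BYearEnd()
Timestamp('2021-12-31 00:00:00')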
Attributes
pandas.tseries.offsets.BYearEnd.base
BYearEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
month
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.BYearEnd.__call__
BYearEnd.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.BYearEnd.rollback
BYearEnd.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BYearEnd.rollforward
BYearEnd.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
BYearEnd.freqstr
BYearEnd.kwds
BYearEnd.name
BYearEnd.nanos
BYearEnd.normalize
BYearEnd.rule_code
BYearEnd.n
BYearEnd.month
pandas.tseries.offsets.BYearEnd.freqstr
BYearEnd.freqstr
pandas.tseries.offsets.BYearEnd.kwds
BYearEnd.kwds
pandas.tseries.offsets.BYearEnd.name
BYearEnd.name
pandas.tseries.offsets.BYearEnd.nanos
BYearEnd.nanos
pandas.tseries.offsets.BYearEnd.normalize
BYearEnd.normalize
pandas.tseries.offsets.BYearEnd.rule_code
BYearEnd.rule_code
pandas.tseries.offsets.BYearEnd.n
BYearEnd.n
pandas.tseries.offsets.BYearEnd.month
BYearEnd.month
Methods
BYearEnd.apply(other)
BYearEnd.apply_index(other)
BYearEnd.copy
BYearEnd.isAnchored
BYearEnd.onOffset
BYearEnd.is_anchored
BYearEnd.is_on_offset
BYearEnd.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.BYearEnd.apply
BYearEnd.apply(other)
pandas.tseries.offsets.BYearEnd.apply_index
BYearEnd.apply_index(other)
pandas.tseries.offsets.BYearEnd.copy
BYearEnd.copy()
pandas.tseries.offsets.BYearEnd.isAnchored
BYearEnd.isAnchored()
pandas.tseries.offsets.BYearEnd.onOffset
BYearEnd.onOffset()
pandas.tseries.offsets.BYearEnd.is_anchored
BYearEnd.is_anchored()
pandas.tseries.offsets.BYearEnd.is_on_offset
BYearEnd.is_on_offset()
3.7.22 BYearBegin
pandas.tseries.offsets.BYearBegin
class pandas.tseries.offsets.BYearBegin
DateOffset increments between the first business day of the year.
Examples
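An illustrative example: January 1, 2022 is a Saturday, so the first business day of 2022 is Monday January 3 (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import BYearBegin
>>> pd.Timestamp("2021-05-01") + BYearBegin()
Timestamp('2022-01-03 00:00:00')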
Attributes
pandas.tseries.offsets.BYearBegin.base
BYearBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
month
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.BYearBegin.__call__
BYearBegin.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.BYearBegin.rollback
BYearBegin.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.BYearBegin.rollforward
BYearBegin.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
BYearBegin.freqstr
BYearBegin.kwds
BYearBegin.name
BYearBegin.nanos
BYearBegin.normalize
BYearBegin.rule_code
BYearBegin.n
BYearBegin.month
pandas.tseries.offsets.BYearBegin.freqstr
BYearBegin.freqstr
pandas.tseries.offsets.BYearBegin.kwds
BYearBegin.kwds
pandas.tseries.offsets.BYearBegin.name
BYearBegin.name
pandas.tseries.offsets.BYearBegin.nanos
BYearBegin.nanos
pandas.tseries.offsets.BYearBegin.normalize
BYearBegin.normalize
pandas.tseries.offsets.BYearBegin.rule_code
BYearBegin.rule_code
pandas.tseries.offsets.BYearBegin.n
BYearBegin.n
pandas.tseries.offsets.BYearBegin.month
BYearBegin.month
Methods
BYearBegin.apply(other)
BYearBegin.apply_index(other)
BYearBegin.copy
BYearBegin.isAnchored
BYearBegin.onOffset
BYearBegin.is_anchored
BYearBegin.is_on_offset
BYearBegin.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.BYearBegin.apply
BYearBegin.apply(other)
pandas.tseries.offsets.BYearBegin.apply_index
BYearBegin.apply_index(other)
pandas.tseries.offsets.BYearBegin.copy
BYearBegin.copy()
pandas.tseries.offsets.BYearBegin.isAnchored
BYearBegin.isAnchored()
pandas.tseries.offsets.BYearBegin.onOffset
BYearBegin.onOffset()
pandas.tseries.offsets.BYearBegin.is_anchored
BYearBegin.is_anchored()
pandas.tseries.offsets.BYearBegin.is_on_offset
BYearBegin.is_on_offset()
3.7.23 YearEnd
pandas.tseries.offsets.YearEnd
class pandas.tseries.offsets.YearEnd
DateOffset increments between calendar year ends.
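For example, the month attribute changes the anchor month (outputs assume pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import YearEnd
>>> pd.Timestamp("2021-05-01") + YearEnd()
Timestamp('2021-12-31 00:00:00')
>>> pd.Timestamp("2021-05-01") + YearEnd(month=6)
Timestamp('2021-06-30 00:00:00')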
Attributes
pandas.tseries.offsets.YearEnd.base
YearEnd.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
month
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.YearEnd.__call__
YearEnd.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.YearEnd.rollback
YearEnd.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.YearEnd.rollforward
YearEnd.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
YearEnd.freqstr
YearEnd.kwds
YearEnd.name
YearEnd.nanos
YearEnd.normalize
YearEnd.rule_code
YearEnd.n
YearEnd.month
pandas.tseries.offsets.YearEnd.freqstr
YearEnd.freqstr
pandas.tseries.offsets.YearEnd.kwds
YearEnd.kwds
pandas.tseries.offsets.YearEnd.name
YearEnd.name
pandas.tseries.offsets.YearEnd.nanos
YearEnd.nanos
pandas.tseries.offsets.YearEnd.normalize
YearEnd.normalize
pandas.tseries.offsets.YearEnd.rule_code
YearEnd.rule_code
pandas.tseries.offsets.YearEnd.n
YearEnd.n
pandas.tseries.offsets.YearEnd.month
YearEnd.month
Methods
YearEnd.apply(other)
YearEnd.apply_index(other)
YearEnd.copy
YearEnd.isAnchored
YearEnd.onOffset
YearEnd.is_anchored
YearEnd.is_on_offset
YearEnd.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.YearEnd.apply
YearEnd.apply(other)
pandas.tseries.offsets.YearEnd.apply_index
YearEnd.apply_index(other)
pandas.tseries.offsets.YearEnd.copy
YearEnd.copy()
pandas.tseries.offsets.YearEnd.isAnchored
YearEnd.isAnchored()
pandas.tseries.offsets.YearEnd.onOffset
YearEnd.onOffset()
pandas.tseries.offsets.YearEnd.is_anchored
YearEnd.is_anchored()
pandas.tseries.offsets.YearEnd.is_on_offset
YearEnd.is_on_offset()
3.7.24 YearBegin
pandas.tseries.offsets.YearBegin
class pandas.tseries.offsets.YearBegin
DateOffset increments between calendar year begin dates.
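For example (output assumes pandas 1.2):
>>> import pandas as pd
>>> from pandas.tseries.offsets import YearBegin
>>> pd.Timestamp("2021-05-01") + YearBegin()
Timestamp('2022-01-01 00:00:00')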
Attributes
pandas.tseries.offsets.YearBegin.base
YearBegin.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
month
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.YearBegin.__call__
YearBegin.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.YearBegin.rollback
YearBegin.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.YearBegin.rollforward
YearBegin.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
YearBegin.freqstr
YearBegin.kwds
YearBegin.name
YearBegin.nanos
YearBegin.normalize
YearBegin.rule_code
YearBegin.n
YearBegin.month
pandas.tseries.offsets.YearBegin.freqstr
YearBegin.freqstr
pandas.tseries.offsets.YearBegin.kwds
YearBegin.kwds
pandas.tseries.offsets.YearBegin.name
YearBegin.name
pandas.tseries.offsets.YearBegin.nanos
YearBegin.nanos
pandas.tseries.offsets.YearBegin.normalize
YearBegin.normalize
pandas.tseries.offsets.YearBegin.rule_code
YearBegin.rule_code
pandas.tseries.offsets.YearBegin.n
YearBegin.n
pandas.tseries.offsets.YearBegin.month
YearBegin.month
Methods
YearBegin.apply(other)
YearBegin.apply_index(other)
YearBegin.copy
YearBegin.isAnchored
YearBegin.onOffset
YearBegin.is_anchored
YearBegin.is_on_offset
YearBegin.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.YearBegin.apply
YearBegin.apply(other)
pandas.tseries.offsets.YearBegin.apply_index
YearBegin.apply_index(other)
pandas.tseries.offsets.YearBegin.copy
YearBegin.copy()
pandas.tseries.offsets.YearBegin.isAnchored
YearBegin.isAnchored()
pandas.tseries.offsets.YearBegin.onOffset
YearBegin.onOffset()
pandas.tseries.offsets.YearBegin.is_anchored
YearBegin.is_anchored()
pandas.tseries.offsets.YearBegin.is_on_offset
YearBegin.is_on_offset()
3.7.25 FY5253
pandas.tseries.offsets.FY5253
class pandas.tseries.offsets.FY5253
Describes a 52-53 week fiscal year. This is also known as a 4-4-5 calendar.
It is used by companies that desire that their fiscal year always end on the same day of the week.
It is a method of managing accounting periods. It is a common calendar structure for some industries, such as
retail, manufacturing and the parking industry.
For more information see: https://en.wikipedia.org/wiki/4-4-5_calendar
The year may either:
• end on the last X day of the Y month.
• end on the last X day closest to the last day of the Y month.
X is a specific day of the week and Y is a certain month of the year.
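A minimal construction sketch for a fiscal year ending on the Saturday nearest to the end of August, a common retail convention (exact dates depend on the calendar and are worth verifying):
>>> import pandas as pd
>>> from pandas.tseries.offsets import FY5253
>>> fy = FY5253(weekday=5, startingMonth=8, variation="nearest")
>>> fy.rollforward(pd.Timestamp("2021-01-01"))  # next fiscal year end on or after Jan 1, 2021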
Attributes
pandas.tseries.offsets.FY5253.base
FY5253.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
startingMonth
variation
weekday
Methods
pandas.tseries.offsets.FY5253.__call__
FY5253.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.FY5253.rollback
FY5253.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.FY5253.rollforward
FY5253.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
get_rule_code_suffix
get_year_end
isAnchored
is_anchored
is_on_offset
onOffset
Properties
FY5253.freqstr
FY5253.kwds
FY5253.name
FY5253.nanos
FY5253.normalize
FY5253.rule_code
FY5253.n
FY5253.startingMonth
FY5253.variation
FY5253.weekday
pandas.tseries.offsets.FY5253.freqstr
FY5253.freqstr
pandas.tseries.offsets.FY5253.kwds
FY5253.kwds
pandas.tseries.offsets.FY5253.name
FY5253.name
pandas.tseries.offsets.FY5253.nanos
FY5253.nanos
pandas.tseries.offsets.FY5253.normalize
FY5253.normalize
pandas.tseries.offsets.FY5253.rule_code
FY5253.rule_code
pandas.tseries.offsets.FY5253.n
FY5253.n
pandas.tseries.offsets.FY5253.startingMonth
FY5253.startingMonth
pandas.tseries.offsets.FY5253.variation
FY5253.variation
pandas.tseries.offsets.FY5253.weekday
FY5253.weekday
Methods
FY5253.apply(other)
FY5253.apply_index(other)
FY5253.copy
FY5253.get_rule_code_suffix
FY5253.get_year_end
FY5253.isAnchored
FY5253.onOffset
FY5253.is_anchored
FY5253.is_on_offset
FY5253.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.FY5253.apply
FY5253.apply(other)
pandas.tseries.offsets.FY5253.apply_index
FY5253.apply_index(other)
pandas.tseries.offsets.FY5253.copy
FY5253.copy()
pandas.tseries.offsets.FY5253.get_rule_code_suffix
FY5253.get_rule_code_suffix()
pandas.tseries.offsets.FY5253.get_year_end
FY5253.get_year_end()
pandas.tseries.offsets.FY5253.isAnchored
FY5253.isAnchored()
pandas.tseries.offsets.FY5253.onOffset
FY5253.onOffset()
pandas.tseries.offsets.FY5253.is_anchored
FY5253.is_anchored()
pandas.tseries.offsets.FY5253.is_on_offset
FY5253.is_on_offset()
3.7.26 FY5253Quarter
pandas.tseries.offsets.FY5253Quarter
class pandas.tseries.offsets.FY5253Quarter
DateOffset increments between business quarter dates for a 52-53 week fiscal year (also known as a 4-4-5 calendar).
It is used by companies that desire that their fiscal year always end on the same day of the week.
It is a method of managing accounting periods. It is a common calendar structure for some industries, such as
retail, manufacturing and the parking industry.
For more information see: https://en.wikipedia.org/wiki/4-4-5_calendar
The year may either:
• end on the last X day of the Y month.
• end on the last X day closest to the last day of the Y month.
X is a specific day of the week. Y is a certain month of the year.
startingMonth = 1 corresponds to dates like 1/31/2007, 4/30/2007, . . .
startingMonth = 2 corresponds to dates like 2/28/2007, 5/31/2007, . . .
startingMonth = 3 corresponds to dates like 3/30/2007, 6/29/2007, . . .
Parameters
n [int]
weekday [int {0, 1, . . . , 6}, default 0] A specific integer for the day of the week.
• 0 is Monday
• 1 is Tuesday
• 2 is Wednesday
• 3 is Thursday
• 4 is Friday
• 5 is Saturday
• 6 is Sunday.
startingMonth [int {1, 2, . . . , 12}, default 1] The month in which fiscal years end.
qtr_with_extra_week [int {1, 2, 3, 4}, default 1] The quarter number that gets the extra
(14th) week when needed.
variation [str, default “nearest”] Method of employing the 4-4-5 calendar.
There are two options:
• “nearest” means the year end is the weekday closest to the last day of the month in the year.
• “last” means the year end is the final weekday of the final month in the fiscal year.
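A minimal construction sketch for the quarters of a fiscal year ending on the Saturday nearest to the end of December, with the extra week (when needed) assigned to quarter 1 (exact dates depend on the calendar and are worth verifying):
>>> import pandas as pd
>>> from pandas.tseries.offsets import FY5253Quarter
>>> fyq = FY5253Quarter(weekday=5, startingMonth=12, variation="nearest", qtr_with_extra_week=1)
>>> fyq.rollforward(pd.Timestamp("2021-01-01"))  # next fiscal quarter end on or after Jan 1, 2021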
Attributes
pandas.tseries.offsets.FY5253Quarter.base
FY5253Quarter.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
qtr_with_extra_week
rule_code
startingMonth
variation
weekday
Methods
pandas.tseries.offsets.FY5253Quarter.__call__
FY5253Quarter.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.FY5253Quarter.rollback
FY5253Quarter.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.FY5253Quarter.rollforward
FY5253Quarter.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
get_rule_code_suffix
get_weeks
isAnchored
is_anchored
is_on_offset
onOffset
year_has_extra_week
Properties
FY5253Quarter.freqstr
FY5253Quarter.kwds
FY5253Quarter.name
FY5253Quarter.nanos
FY5253Quarter.normalize
FY5253Quarter.rule_code
FY5253Quarter.n
FY5253Quarter.qtr_with_extra_week
FY5253Quarter.startingMonth
FY5253Quarter.variation
FY5253Quarter.weekday
pandas.tseries.offsets.FY5253Quarter.freqstr
FY5253Quarter.freqstr
pandas.tseries.offsets.FY5253Quarter.kwds
FY5253Quarter.kwds
pandas.tseries.offsets.FY5253Quarter.name
FY5253Quarter.name
pandas.tseries.offsets.FY5253Quarter.nanos
FY5253Quarter.nanos
pandas.tseries.offsets.FY5253Quarter.normalize
FY5253Quarter.normalize
pandas.tseries.offsets.FY5253Quarter.rule_code
FY5253Quarter.rule_code
pandas.tseries.offsets.FY5253Quarter.n
FY5253Quarter.n
pandas.tseries.offsets.FY5253Quarter.qtr_with_extra_week
FY5253Quarter.qtr_with_extra_week
pandas.tseries.offsets.FY5253Quarter.startingMonth
FY5253Quarter.startingMonth
pandas.tseries.offsets.FY5253Quarter.variation
FY5253Quarter.variation
pandas.tseries.offsets.FY5253Quarter.weekday
FY5253Quarter.weekday
Methods
FY5253Quarter.apply(other)
FY5253Quarter.apply_index(other)
FY5253Quarter.copy
FY5253Quarter.get_rule_code_suffix
FY5253Quarter.get_weeks
FY5253Quarter.isAnchored
FY5253Quarter.onOffset
FY5253Quarter.is_anchored
FY5253Quarter.is_on_offset
FY5253Quarter.year_has_extra_week
FY5253Quarter.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.FY5253Quarter.apply
FY5253Quarter.apply(other)
pandas.tseries.offsets.FY5253Quarter.apply_index
FY5253Quarter.apply_index(other)
pandas.tseries.offsets.FY5253Quarter.copy
FY5253Quarter.copy()
pandas.tseries.offsets.FY5253Quarter.get_rule_code_suffix
FY5253Quarter.get_rule_code_suffix()
pandas.tseries.offsets.FY5253Quarter.get_weeks
FY5253Quarter.get_weeks()
pandas.tseries.offsets.FY5253Quarter.isAnchored
FY5253Quarter.isAnchored()
pandas.tseries.offsets.FY5253Quarter.onOffset
FY5253Quarter.onOffset()
pandas.tseries.offsets.FY5253Quarter.is_anchored
FY5253Quarter.is_anchored()
pandas.tseries.offsets.FY5253Quarter.is_on_offset
FY5253Quarter.is_on_offset()
pandas.tseries.offsets.FY5253Quarter.year_has_extra_week
FY5253Quarter.year_has_extra_week()
3.7.27 Easter
pandas.tseries.offsets.Easter
class pandas.tseries.offsets.Easter
DateOffset for the Easter holiday using logic defined in dateutil.
Right now uses the revised method which is valid in years 1583-4099.
Attributes
pandas.tseries.offsets.Easter.base
Easter.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Easter.__call__
Easter.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Easter.rollback
Easter.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Easter.rollforward
Easter.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Easter.freqstr
Easter.kwds
Easter.name
Easter.nanos
Easter.normalize
Easter.rule_code
Easter.n
pandas.tseries.offsets.Easter.freqstr
Easter.freqstr
pandas.tseries.offsets.Easter.kwds
Easter.kwds
pandas.tseries.offsets.Easter.name
Easter.name
pandas.tseries.offsets.Easter.nanos
Easter.nanos
pandas.tseries.offsets.Easter.normalize
Easter.normalize
pandas.tseries.offsets.Easter.rule_code
Easter.rule_code
pandas.tseries.offsets.Easter.n
Easter.n
Methods
Easter.apply(other)
Easter.apply_index(other)
Easter.copy
Easter.isAnchored
Easter.onOffset
Easter.is_anchored
Easter.is_on_offset
Easter.__call__(*args, **kwargs) Call self as a function.
pandas.tseries.offsets.Easter.apply
Easter.apply(other)
pandas.tseries.offsets.Easter.apply_index
Easter.apply_index(other)
pandas.tseries.offsets.Easter.copy
Easter.copy()
pandas.tseries.offsets.Easter.isAnchored
Easter.isAnchored()
pandas.tseries.offsets.Easter.onOffset
Easter.onOffset()
pandas.tseries.offsets.Easter.is_anchored
Easter.is_anchored()
pandas.tseries.offsets.Easter.is_on_offset
Easter.is_on_offset()
3.7.28 Tick
Tick
Attributes
pandas.tseries.offsets.Tick
class pandas.tseries.offsets.Tick
Attributes
pandas.tseries.offsets.Tick.base
Tick.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Tick.__call__
Tick.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Tick.rollback
Tick.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Tick.rollforward
Tick.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Tick.delta
Tick.freqstr
Tick.kwds
Tick.name
Tick.nanos
Tick.normalize
Tick.rule_code
Tick.n
pandas.tseries.offsets.Tick.delta
Tick.delta
pandas.tseries.offsets.Tick.freqstr
Tick.freqstr
pandas.tseries.offsets.Tick.kwds
Tick.kwds
pandas.tseries.offsets.Tick.name
Tick.name
pandas.tseries.offsets.Tick.nanos
Tick.nanos
pandas.tseries.offsets.Tick.normalize
Tick.normalize
pandas.tseries.offsets.Tick.rule_code
Tick.rule_code
pandas.tseries.offsets.Tick.n
Tick.n
Methods
Tick.copy
Tick.isAnchored
Tick.onOffset
Tick.is_anchored
Tick.is_on_offset
Tick.__call__(*args, **kwargs) Call self as a function.
Tick.apply
Tick.apply_index(other)
pandas.tseries.offsets.Tick.copy
Tick.copy()
pandas.tseries.offsets.Tick.isAnchored
Tick.isAnchored()
pandas.tseries.offsets.Tick.onOffset
Tick.onOffset()
pandas.tseries.offsets.Tick.is_anchored
Tick.is_anchored()
pandas.tseries.offsets.Tick.is_on_offset
Tick.is_on_offset()
pandas.tseries.offsets.Tick.apply
Tick.apply()
pandas.tseries.offsets.Tick.apply_index
Tick.apply_index(other)
3.7.29 Day
Day
Attributes
pandas.tseries.offsets.Day
class pandas.tseries.offsets.Day
Attributes
pandas.tseries.offsets.Day.base
Day.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Day.__call__
Day.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Day.rollback
Day.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Day.rollforward
Day.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Day.delta
Day.freqstr
Day.kwds
Day.name
Day.nanos
Day.normalize
Day.rule_code
Day.n
pandas.tseries.offsets.Day.delta
Day.delta
pandas.tseries.offsets.Day.freqstr
Day.freqstr
pandas.tseries.offsets.Day.kwds
Day.kwds
pandas.tseries.offsets.Day.name
Day.name
pandas.tseries.offsets.Day.nanos
Day.nanos
pandas.tseries.offsets.Day.normalize
Day.normalize
pandas.tseries.offsets.Day.rule_code
Day.rule_code
pandas.tseries.offsets.Day.n
Day.n
Methods
Day.copy
Day.isAnchored
Day.onOffset
Day.is_anchored
Day.is_on_offset
Day.__call__(*args, **kwargs) Call self as a function.
Day.apply
Day.apply_index(other)
pandas.tseries.offsets.Day.copy
Day.copy()
pandas.tseries.offsets.Day.isAnchored
Day.isAnchored()
pandas.tseries.offsets.Day.onOffset
Day.onOffset()
pandas.tseries.offsets.Day.is_anchored
Day.is_anchored()
pandas.tseries.offsets.Day.is_on_offset
Day.is_on_offset()
pandas.tseries.offsets.Day.apply
Day.apply()
pandas.tseries.offsets.Day.apply_index
Day.apply_index(other)
3.7.30 Hour
Hour
Attributes
pandas.tseries.offsets.Hour
class pandas.tseries.offsets.Hour
Attributes
pandas.tseries.offsets.Hour.base
Hour.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Hour.__call__
Hour.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Hour.rollback
Hour.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Hour.rollforward
Hour.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Hour.delta
Hour.freqstr
Hour.kwds
Hour.name
Hour.nanos
Hour.normalize
Hour.rule_code
Hour.n
pandas.tseries.offsets.Hour.delta
Hour.delta
pandas.tseries.offsets.Hour.freqstr
Hour.freqstr
pandas.tseries.offsets.Hour.kwds
Hour.kwds
pandas.tseries.offsets.Hour.name
Hour.name
pandas.tseries.offsets.Hour.nanos
Hour.nanos
pandas.tseries.offsets.Hour.normalize
Hour.normalize
pandas.tseries.offsets.Hour.rule_code
Hour.rule_code
pandas.tseries.offsets.Hour.n
Hour.n
Methods
Hour.copy
Hour.isAnchored
Hour.onOffset
Hour.is_anchored
Hour.is_on_offset
Hour.__call__(*args, **kwargs) Call self as a function.
Hour.apply
Hour.apply_index(other)
pandas.tseries.offsets.Hour.copy
Hour.copy()
pandas.tseries.offsets.Hour.isAnchored
Hour.isAnchored()
pandas.tseries.offsets.Hour.onOffset
Hour.onOffset()
pandas.tseries.offsets.Hour.is_anchored
Hour.is_anchored()
pandas.tseries.offsets.Hour.is_on_offset
Hour.is_on_offset()
pandas.tseries.offsets.Hour.apply
Hour.apply()
pandas.tseries.offsets.Hour.apply_index
Hour.apply_index(other)
3.7.31 Minute
Minute
Attributes
pandas.tseries.offsets.Minute
class pandas.tseries.offsets.Minute
Attributes
pandas.tseries.offsets.Minute.base
Minute.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Minute.__call__
Minute.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Minute.rollback
Minute.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Minute.rollforward
Minute.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Minute.delta
Minute.freqstr
Minute.kwds
Minute.name
Minute.nanos
Minute.normalize
Minute.rule_code
Minute.n
pandas.tseries.offsets.Minute.delta
Minute.delta
pandas.tseries.offsets.Minute.freqstr
Minute.freqstr
pandas.tseries.offsets.Minute.kwds
Minute.kwds
pandas.tseries.offsets.Minute.name
Minute.name
pandas.tseries.offsets.Minute.nanos
Minute.nanos
pandas.tseries.offsets.Minute.normalize
Minute.normalize
pandas.tseries.offsets.Minute.rule_code
Minute.rule_code
pandas.tseries.offsets.Minute.n
Minute.n
Methods
Minute.copy
Minute.isAnchored
Minute.onOffset
Minute.is_anchored
Minute.is_on_offset
Minute.__call__(*args, **kwargs) Call self as a function.
Minute.apply
Minute.apply_index(other)
pandas.tseries.offsets.Minute.copy
Minute.copy()
pandas.tseries.offsets.Minute.isAnchored
Minute.isAnchored()
pandas.tseries.offsets.Minute.onOffset
Minute.onOffset()
pandas.tseries.offsets.Minute.is_anchored
Minute.is_anchored()
pandas.tseries.offsets.Minute.is_on_offset
Minute.is_on_offset()
pandas.tseries.offsets.Minute.apply
Minute.apply()
pandas.tseries.offsets.Minute.apply_index
Minute.apply_index(other)
3.7.32 Second
Second
Attributes
pandas.tseries.offsets.Second
class pandas.tseries.offsets.Second
Attributes
pandas.tseries.offsets.Second.base
Second.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Second.__call__
Second.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Second.rollback
Second.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Second.rollforward
Second.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Second.delta
Second.freqstr
Second.kwds
Second.name
Second.nanos
Second.normalize
Second.rule_code
Second.n
pandas.tseries.offsets.Second.delta
Second.delta
pandas.tseries.offsets.Second.freqstr
Second.freqstr
pandas.tseries.offsets.Second.kwds
Second.kwds
pandas.tseries.offsets.Second.name
Second.name
pandas.tseries.offsets.Second.nanos
Second.nanos
pandas.tseries.offsets.Second.normalize
Second.normalize
pandas.tseries.offsets.Second.rule_code
Second.rule_code
pandas.tseries.offsets.Second.n
Second.n
Methods
Second.copy
Second.isAnchored
Second.onOffset
Second.is_anchored
Second.is_on_offset
Second.__call__(*args, **kwargs) Call self as a function.
Second.apply
Second.apply_index(other)
pandas.tseries.offsets.Second.copy
Second.copy()
pandas.tseries.offsets.Second.isAnchored
Second.isAnchored()
pandas.tseries.offsets.Second.onOffset
Second.onOffset()
pandas.tseries.offsets.Second.is_anchored
Second.is_anchored()
pandas.tseries.offsets.Second.is_on_offset
Second.is_on_offset()
pandas.tseries.offsets.Second.apply
Second.apply()
pandas.tseries.offsets.Second.apply_index
Second.apply_index(other)
3.7.33 Milli
Milli
Attributes
pandas.tseries.offsets.Milli
class pandas.tseries.offsets.Milli
Attributes
pandas.tseries.offsets.Milli.base
Milli.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Milli.__call__
Milli.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Milli.rollback
Milli.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Milli.rollforward
Milli.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Milli.delta
Milli.freqstr
Milli.kwds
Milli.name
Milli.nanos
Milli.normalize
Milli.rule_code
Milli.n
pandas.tseries.offsets.Milli.delta
Milli.delta
pandas.tseries.offsets.Milli.freqstr
Milli.freqstr
pandas.tseries.offsets.Milli.kwds
Milli.kwds
pandas.tseries.offsets.Milli.name
Milli.name
pandas.tseries.offsets.Milli.nanos
Milli.nanos
pandas.tseries.offsets.Milli.normalize
Milli.normalize
pandas.tseries.offsets.Milli.rule_code
Milli.rule_code
pandas.tseries.offsets.Milli.n
Milli.n
Methods
Milli.copy
Milli.isAnchored
Milli.onOffset
Milli.is_anchored
Milli.is_on_offset
Milli.__call__(*args, **kwargs) Call self as a function.
Milli.apply
Milli.apply_index(other)
pandas.tseries.offsets.Milli.copy
Milli.copy()
pandas.tseries.offsets.Milli.isAnchored
Milli.isAnchored()
pandas.tseries.offsets.Milli.onOffset
Milli.onOffset()
pandas.tseries.offsets.Milli.is_anchored
Milli.is_anchored()
pandas.tseries.offsets.Milli.is_on_offset
Milli.is_on_offset()
pandas.tseries.offsets.Milli.apply
Milli.apply()
pandas.tseries.offsets.Milli.apply_index
Milli.apply_index(other)
3.7.34 Micro
Micro
Attributes
pandas.tseries.offsets.Micro
class pandas.tseries.offsets.Micro
Attributes
pandas.tseries.offsets.Micro.base
Micro.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Micro.__call__
Micro.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Micro.rollback
Micro.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Micro.rollforward
Micro.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Micro.delta
Micro.freqstr
Micro.kwds
Micro.name
Micro.nanos
Micro.normalize
Micro.rule_code
Micro.n
pandas.tseries.offsets.Micro.delta
Micro.delta
pandas.tseries.offsets.Micro.freqstr
Micro.freqstr
pandas.tseries.offsets.Micro.kwds
Micro.kwds
pandas.tseries.offsets.Micro.name
Micro.name
pandas.tseries.offsets.Micro.nanos
Micro.nanos
pandas.tseries.offsets.Micro.normalize
Micro.normalize
pandas.tseries.offsets.Micro.rule_code
Micro.rule_code
pandas.tseries.offsets.Micro.n
Micro.n
Methods
Micro.copy
Micro.isAnchored
Micro.onOffset
Micro.is_anchored
Micro.is_on_offset
Micro.__call__(*args, **kwargs) Call self as a function.
Micro.apply
Micro.apply_index(other)
pandas.tseries.offsets.Micro.copy
Micro.copy()
pandas.tseries.offsets.Micro.isAnchored
Micro.isAnchored()
pandas.tseries.offsets.Micro.onOffset
Micro.onOffset()
pandas.tseries.offsets.Micro.is_anchored
Micro.is_anchored()
pandas.tseries.offsets.Micro.is_on_offset
Micro.is_on_offset()
pandas.tseries.offsets.Micro.apply
Micro.apply()
pandas.tseries.offsets.Micro.apply_index
Micro.apply_index(other)
3.7.35 Nano
Nano
Attributes
pandas.tseries.offsets.Nano
class pandas.tseries.offsets.Nano
Attributes
pandas.tseries.offsets.Nano.base
Nano.base
Returns a copy of the calling offset object with n=1 and all other attributes equal.
delta
freqstr
kwds
n
name
nanos
normalize
rule_code
Methods
pandas.tseries.offsets.Nano.__call__
Nano.__call__(*args, **kwargs)
Call self as a function.
pandas.tseries.offsets.Nano.rollback
Nano.rollback()
Roll provided date backward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
pandas.tseries.offsets.Nano.rollforward
Nano.rollforward()
Roll provided date forward to next offset only if not on offset.
Returns
TimeStamp Rolled timestamp if not on offset, otherwise unchanged timestamp.
apply
apply_index
copy
isAnchored
is_anchored
is_on_offset
onOffset
Properties
Nano.delta
Nano.freqstr
Nano.kwds
Nano.name
Nano.nanos
Nano.normalize
Nano.rule_code
Nano.n
pandas.tseries.offsets.Nano.delta
Nano.delta
pandas.tseries.offsets.Nano.freqstr
Nano.freqstr
pandas.tseries.offsets.Nano.kwds
Nano.kwds
pandas.tseries.offsets.Nano.name
Nano.name
pandas.tseries.offsets.Nano.nanos
Nano.nanos
pandas.tseries.offsets.Nano.normalize
Nano.normalize
pandas.tseries.offsets.Nano.rule_code
Nano.rule_code
pandas.tseries.offsets.Nano.n
Nano.n
Methods
Nano.copy
Nano.isAnchored
Nano.onOffset
Nano.is_anchored
Nano.is_on_offset
Nano.__call__(*args, **kwargs) Call self as a function.
Nano.apply
Nano.apply_index(other)
pandas.tseries.offsets.Nano.copy
Nano.copy()
pandas.tseries.offsets.Nano.isAnchored
Nano.isAnchored()
pandas.tseries.offsets.Nano.onOffset
Nano.onOffset()
pandas.tseries.offsets.Nano.is_anchored
Nano.is_anchored()
pandas.tseries.offsets.Nano.is_on_offset
Nano.is_on_offset()
pandas.tseries.offsets.Nano.apply
Nano.apply()
pandas.tseries.offsets.Nano.apply_index
Nano.apply_index(other)
3.8 Frequencies
3.8.1 pandas.tseries.frequencies.to_offset
pandas.tseries.frequencies.to_offset(freq)
Return DateOffset object from string or tuple representation or datetime.timedelta object.
Parameters
freq [str, tuple, datetime.timedelta, DateOffset or None]
Returns
DateOffset or None
Raises
ValueError If freq is an invalid frequency.
Examples
>>> import pandas as pd
>>> from pandas.tseries.frequencies import to_offset
>>> from pandas.tseries.offsets import Hour
>>> to_offset("5min")
<5 * Minutes>
>>> to_offset("1D1H")
<25 * Hours>
>>> to_offset("2W")
<2 * Weeks: weekday=6>
>>> to_offset("2B")
<2 * BusinessDays>
>>> to_offset(pd.Timedelta(days=1))
<Day>
>>> to_offset(Hour())
<Hour>
3.9 Window
pandas.core.window.rolling.Rolling.count
Rolling.count()
The rolling count of any non-NaN observations inside the window.
Returns
Series or DataFrame Returned object type is determined by the caller of the rolling calcu-
lation.
See also:
pandas.Series.rolling Calling object with Series data.
pandas.DataFrame.rolling Calling object with DataFrames.
pandas.DataFrame.count Count of the full DataFrame.
Examples
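A minimal illustrative example (the input Series is chosen for illustration, not recovered from the original text):
>>> s = pd.Series([2, 3, float("nan"), 10])
>>> s.rolling(2, min_periods=1).count()
0    1.0
1    2.0
2    1.0
3    1.0
dtype: float64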
pandas.core.window.rolling.Rolling.sum
Rolling.sum(*args, **kwargs)
Calculate rolling sum of given DataFrame or Series.
Parameters
*args, **kwargs For compatibility with other rolling methods. Has no effect on the com-
puted value.
Returns
Series or DataFrame Same type as the input, with the same index, containing the rolling
sum.
See also:
pandas.Series.sum Reducing sum for Series.
pandas.DataFrame.sum Reducing sum for DataFrame.
Examples
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> df = pd.DataFrame({"A": s, "B": s ** 2})
>>> s.rolling(3).sum()
0 NaN
1 NaN
2 6.0
3 9.0
4 12.0
dtype: float64
>>> s.expanding(3).sum()
0 NaN
1 NaN
2 6.0
3 10.0
4 15.0
dtype: float64
>>> df.rolling(3).sum()
A B
0 NaN NaN
1 NaN NaN
2 6.0 14.0
3 9.0 29.0
4 12.0 50.0
pandas.core.window.rolling.Rolling.mean
Rolling.mean(*args, **kwargs)
Calculate the rolling mean of the values.
Parameters
*args Under Review.
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the rolling calcu-
lation.
See also:
pandas.Series.rolling Calling object with Series data.
pandas.DataFrame.rolling Calling object with DataFrames.
pandas.Series.mean Equivalent method for Series.
pandas.DataFrame.mean Equivalent method for DataFrame.
Examples
The below examples will show rolling mean calculations with window sizes of two and three, respectively.
>>> s = pd.Series([1, 2, 3, 4])
>>> s.rolling(2).mean()
0 NaN
1 1.5
2 2.5
3 3.5
dtype: float64
>>> s.rolling(3).mean()
0 NaN
1 NaN
2 2.0
3 3.0
dtype: float64
pandas.core.window.rolling.Rolling.median
Rolling.median(**kwargs)
Calculate the rolling median.
Parameters
**kwargs For compatibility with other rolling methods. Has no effect on the computed
median.
Returns
Series or DataFrame Returned type is the same as the original object.
See also:
pandas.Series.rolling Calling object with Series data.
pandas.DataFrame.rolling Calling object with DataFrames.
pandas.Series.median Equivalent method for Series.
pandas.DataFrame.median Equivalent method for DataFrame.
Examples
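A short example with a Series built for illustration:
>>> s = pd.Series([0, 1, 2, 3, 4])
>>> s.rolling(3).median()
0    NaN
1    NaN
2    1.0
3    2.0
4    3.0
dtype: float64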
pandas.core.window.rolling.Rolling.var
Notes
The default ddof of 1 used in Series.var() is different than the default ddof of 0 in numpy.var().
A minimum of 1 period is required for the rolling calculation.
Examples
>>> s = pd.Series([5, 5, 6, 7, 5, 5, 5])
>>> s.expanding(3).var()
0 NaN
1 NaN
2 0.333333
3 0.916667
4 0.800000
5 0.700000
6 0.619048
dtype: float64
pandas.core.window.rolling.Rolling.std
Notes
The default ddof of 1 used in Series.std is different than the default ddof of 0 in numpy.std.
A minimum of one period is required for the rolling calculation.
Examples
>>> s = pd.Series([5, 5, 6, 7, 5, 5, 5])
>>> s.expanding(3).std()
0 NaN
1 NaN
2 0.577350
3 0.957427
4 0.894427
5 0.836660
6 0.786796
dtype: float64
pandas.core.window.rolling.Rolling.min
Rolling.min(*args, **kwargs)
Calculate the rolling minimum.
Parameters
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the rolling calcu-
lation.
See also:
pandas.Series.rolling Calling object with a Series.
pandas.DataFrame.rolling Calling object with a DataFrame.
pandas.Series.min Similar method for Series.
pandas.DataFrame.min Similar method for DataFrame.
Examples
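A minimal example (input values chosen for illustration):
>>> s = pd.Series([4, 3, 5, 2, 6])
>>> s.rolling(3).min()
0    NaN
1    NaN
2    3.0
3    2.0
4    2.0
dtype: float64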
pandas.core.window.rolling.Rolling.max
Rolling.max(*args, **kwargs)
Calculate the rolling maximum.
Parameters
*args, **kwargs Arguments and keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.Series.rolling Calling object with Series data.
pandas.DataFrame.rolling Calling object with DataFrame data.
pandas.Series.max Similar method for Series.
pandas.DataFrame.max Similar method for DataFrame.
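A minimal example (the Series below is illustrative, not part of the original reference text):
>>> s = pd.Series([1, 2, 3, 4])
>>> s.rolling(2).max()
0    NaN
1    2.0
2    3.0
3    4.0
dtype: float64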
pandas.core.window.rolling.Rolling.corr
Notes
Examples
The below example shows a rolling calculation with a window size of four matching the equivalent function call
using numpy.corrcoef().
>>> v1 = [3, 3, 3, 5, 8]
>>> v2 = [3, 4, 4, 4, 8]
>>> # numpy returns a 2X2 array, the correlation coefficient
>>> # is the number at entry [0][1]
>>> print(f"{np.corrcoef(v1[:-1], v2[:-1])[0][1]:.6f}")
0.333333
>>> print(f"{np.corrcoef(v1[1:], v2[1:])[0][1]:.6f}")
0.916949
>>> s1 = pd.Series(v1)
>>> s2 = pd.Series(v2)
>>> s1.rolling(4).corr(s2)
0 NaN
1 NaN
2 NaN
3 0.333333
4 0.916949
dtype: float64
The below example shows a similar rolling calculation on a DataFrame using the pairwise option.
>>> matrix = np.array([[51., 35.], [49., 30.], [47., 32.], [46., 31.], [50., 36.]])
pandas.core.window.rolling.Rolling.cov
pandas.core.window.rolling.Rolling.skew
Rolling.skew(**kwargs)
Unbiased rolling skewness.
Parameters
**kwargs Keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.Series.rolling Calling object with Series data.
pandas.core.window.rolling.Rolling.kurt
Rolling.kurt(**kwargs)
Calculate unbiased rolling kurtosis.
This function uses Fisher’s definition of kurtosis without bias.
Parameters
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the rolling calcu-
lation.
See also:
pandas.Series.rolling Calling object with Series data.
pandas.DataFrame.rolling Calling object with DataFrames.
pandas.Series.kurt Equivalent method for Series.
pandas.DataFrame.kurt Equivalent method for DataFrame.
scipy.stats.skew Third moment of a probability density.
scipy.stats.kurtosis Reference SciPy method.
Notes
Examples
The example below will show a rolling calculation with a window size of four matching the equivalent function
call using scipy.stats.
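A minimal sketch of such a comparison (the data are chosen for illustration and the numeric outputs are omitted; the point is that rolling(4).kurt() agrees with scipy.stats.kurtosis(..., bias=False) on each length-4 window):
>>> import scipy.stats
>>> arr = [1, 2, 3, 4, 999]
>>> # unbiased (Fisher) kurtosis of the two length-4 windows, computed with scipy
>>> scipy.stats.kurtosis(arr[:-1], bias=False)
>>> scipy.stats.kurtosis(arr[1:], bias=False)
>>> # the last two values of the rolling result match the scipy values above
>>> pd.Series(arr).rolling(4).kurt()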
pandas.core.window.rolling.Rolling.apply
Notes
See Numba engine for extended documentation and performance considerations for the Numba engine.
pandas.core.window.rolling.Rolling.aggregate
Notes
Examples
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
>>> df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
>>> df.rolling(2).sum()
A B C
0 NaN NaN NaN
1 3.0 9.0 15.0
2 5.0 11.0 17.0
pandas.core.window.rolling.Rolling.quantile
Examples
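A short illustrative example (the Series and quantile value are arbitrary):
>>> s = pd.Series([1, 2, 3, 4])
>>> s.rolling(2).quantile(0.4, interpolation='lower')
0    NaN
1    1.0
2    2.0
3    3.0
dtype: float64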
pandas.core.window.rolling.Rolling.sem
Notes
Examples
>>> s = pd.Series([0, 1, 2, 3])
>>> s.expanding().sem()
0 NaN
1 0.707107
2 0.707107
3 0.745356
dtype: float64
pandas.core.window.rolling.Window.mean
Window.mean(*args, **kwargs)
Calculate the window mean of the values.
Parameters
*args Under Review.
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the window cal-
culation.
See also:
pandas.Series.window Calling object with Series data.
pandas.DataFrame.window Calling object with DataFrames.
pandas.Series.mean Equivalent method for Series.
pandas.DataFrame.mean Equivalent method for DataFrame.
Examples
The below examples will show rolling mean calculations with window sizes of two and three, respectively.
>>> s = pd.Series([1, 2, 3, 4])
>>> s.rolling(2).mean()
0 NaN
1 1.5
2 2.5
3 3.5
dtype: float64
>>> s.rolling(3).mean()
0 NaN
1 NaN
2 2.0
3 3.0
dtype: float64
pandas.core.window.rolling.Window.sum
Window.sum(*args, **kwargs)
Calculate window sum of given DataFrame or Series.
Parameters
*args, **kwargs For compatibility with other window methods. Has no effect on the com-
puted value.
Returns
Series or DataFrame Same type as the input, with the same index, containing the window
sum.
See also:
pandas.Series.sum Reducing sum for Series.
pandas.DataFrame.sum Reducing sum for DataFrame.
Examples
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> df = pd.DataFrame({"A": s, "B": s ** 2})
>>> s.rolling(3).sum()
0 NaN
1 NaN
2 6.0
3 9.0
4 12.0
dtype: float64
>>> s.expanding(3).sum()
0 NaN
1 NaN
2 6.0
3 10.0
4 15.0
dtype: float64
>>> df.rolling(3).sum()
A B
0 NaN NaN
1 NaN NaN
2 6.0 14.0
3 9.0 29.0
4 12.0 50.0
pandas.core.window.rolling.Window.var
Notes
The default ddof of 1 used in Series.var() is different than the default ddof of 0 in numpy.var().
A minimum of 1 period is required for the rolling calculation.
Examples
>>> s = pd.Series([5, 5, 6, 7, 5, 5, 5])
>>> s.expanding(3).var()
0 NaN
1 NaN
2 0.333333
3 0.916667
4 0.800000
5 0.700000
6 0.619048
dtype: float64
pandas.core.window.rolling.Window.std
Notes
The default ddof of 1 used in Series.std is different than the default ddof of 0 in numpy.std.
A minimum of one period is required for the rolling calculation.
Examples
>>> s = pd.Series([5, 5, 6, 7, 5, 5, 5])
>>> s.expanding(3).std()
0 NaN
1 NaN
2 0.577350
3 0.957427
4 0.894427
5 0.836660
6 0.786796
dtype: float64
pandas.core.window.expanding.Expanding.count
Expanding.count()
The expanding count of any non-NaN observations inside the window.
Returns
Series or DataFrame Returned object type is determined by the caller of the expanding
calculation.
See also:
pandas.Series.expanding Calling object with Series data.
pandas.DataFrame.expanding Calling object with DataFrames.
pandas.DataFrame.count Count of the full DataFrame.
Examples
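A minimal example (input data chosen for illustration):
>>> s = pd.Series([1, 2, float("nan"), 4])
>>> s.expanding().count()
0    1.0
1    2.0
2    2.0
3    3.0
dtype: float64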
pandas.core.window.expanding.Expanding.sum
Expanding.sum(*args, **kwargs)
Calculate expanding sum of given DataFrame or Series.
Parameters
*args, **kwargs For compatibility with other expanding methods. Has no effect on the
computed value.
Returns
Series or DataFrame Same type as the input, with the same index, containing the expanding
sum.
See also:
pandas.Series.sum Reducing sum for Series.
pandas.DataFrame.sum Reducing sum for DataFrame.
Examples
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> df = pd.DataFrame({"A": s, "B": s ** 2})
>>> s.rolling(3).sum()
0 NaN
1 NaN
2 6.0
3 9.0
4 12.0
dtype: float64
>>> s.expanding(3).sum()
0 NaN
1 NaN
2 6.0
3 10.0
4 15.0
dtype: float64
>>> df.rolling(3).sum()
A B
0 NaN NaN
1 NaN NaN
2 6.0 14.0
3 9.0 29.0
4 12.0 50.0
pandas.core.window.expanding.Expanding.mean
Expanding.mean(*args, **kwargs)
Calculate the expanding mean of the values.
Parameters
*args Under Review.
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the expanding
calculation.
See also:
pandas.Series.expanding Calling object with Series data.
pandas.DataFrame.expanding Calling object with DataFrames.
pandas.Series.mean Equivalent method for Series.
pandas.DataFrame.mean Equivalent method for DataFrame.
Examples
The below examples will show rolling mean calculations with window sizes of two and three, respectively.
>>> s = pd.Series([1, 2, 3, 4])
>>> s.rolling(2).mean()
0 NaN
1 1.5
2 2.5
3 3.5
dtype: float64
>>> s.rolling(3).mean()
0 NaN
1 NaN
2 2.0
3 3.0
dtype: float64
pandas.core.window.expanding.Expanding.median
Expanding.median(**kwargs)
Calculate the expanding median.
Parameters
**kwargs For compatibility with other expanding methods. Has no effect on the computed
median.
Returns
Series or DataFrame Returned type is the same as the original object.
See also:
pandas.Series.expanding Calling object with Series data.
pandas.DataFrame.expanding Calling object with DataFrames.
pandas.Series.median Equivalent method for Series.
pandas.DataFrame.median Equivalent method for DataFrame.
Examples
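A short illustrative example (the Series is arbitrary):
>>> s = pd.Series([1, 2, 3, 4, 5])
>>> s.expanding().median()
0    1.0
1    1.5
2    2.0
3    2.5
4    3.0
dtype: float64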
pandas.core.window.expanding.Expanding.var
Notes
The default ddof of 1 used in Series.var() is different than the default ddof of 0 in numpy.var().
A minimum of 1 period is required for the rolling calculation.
Examples
>>> s = pd.Series([5, 5, 6, 7, 5, 5, 5])
>>> s.expanding(3).var()
0 NaN
1 NaN
2 0.333333
3 0.916667
4 0.800000
5 0.700000
6 0.619048
dtype: float64
pandas.core.window.expanding.Expanding.std
Notes
The default ddof of 1 used in Series.std is different than the default ddof of 0 in numpy.std.
A minimum of one period is required for the rolling calculation.
Examples
>>> s = pd.Series([5, 5, 6, 7, 5, 5, 5])
>>> s.expanding(3).std()
0 NaN
1 NaN
2 0.577350
3 0.957427
4 0.894427
5 0.836660
6 0.786796
dtype: float64
pandas.core.window.expanding.Expanding.min
Expanding.min(*args, **kwargs)
Calculate the expanding minimum.
Parameters
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the expanding
calculation.
See also:
pandas.Series.expanding Calling object with a Series.
pandas.DataFrame.expanding Calling object with a DataFrame.
pandas.Series.min Similar method for Series.
pandas.DataFrame.min Similar method for DataFrame.
Examples
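A minimal example (input values chosen for illustration):
>>> s = pd.Series([3, 2, 4, 1, 5])
>>> s.expanding().min()
0    3.0
1    2.0
2    2.0
3    1.0
4    1.0
dtype: float64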
pandas.core.window.expanding.Expanding.max
Expanding.max(*args, **kwargs)
Calculate the expanding maximum.
Parameters
*args, **kwargs Arguments and keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.Series.expanding Calling object with Series data.
pandas.core.window.expanding.Expanding.corr
Notes
Examples
The below example shows a rolling calculation with a window size of four matching the equivalent function call
using numpy.corrcoef().
>>> v1 = [3, 3, 3, 5, 8]
>>> v2 = [3, 4, 4, 4, 8]
>>> # numpy returns a 2X2 array, the correlation coefficient
>>> # is the number at entry [0][1]
>>> print(f"{np.corrcoef(v1[:-1], v2[:-1])[0][1]:.6f}")
0.333333
>>> print(f"{np.corrcoef(v1[1:], v2[1:])[0][1]:.6f}")
0.916949
>>> s1 = pd.Series(v1)
>>> s2 = pd.Series(v2)
>>> s1.rolling(4).corr(s2)
0 NaN
1 NaN
2 NaN
3 0.333333
4 0.916949
dtype: float64
The below example shows a similar rolling calculation on a DataFrame using the pairwise option.
>>> matrix = np.array([[51., 35.], [49., 30.], [47., 32.], [46., 31.], [50., 36.]])
pandas.core.window.expanding.Expanding.cov
pandas.core.window.expanding.Expanding.skew
Expanding.skew(**kwargs)
Unbiased expanding skewness.
Parameters
**kwargs Keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.Series.expanding Calling object with Series data.
pandas.DataFrame.expanding Calling object with DataFrame data.
pandas.Series.skew Similar method for Series.
pandas.DataFrame.skew Similar method for DataFrame.
pandas.core.window.expanding.Expanding.kurt
Expanding.kurt(**kwargs)
Calculate unbiased expanding kurtosis.
This function uses Fisher’s definition of kurtosis without bias.
Parameters
**kwargs Under Review.
Returns
Series or DataFrame Returned object type is determined by the caller of the expanding
calculation.
See also:
pandas.Series.expanding Calling object with Series data.
pandas.DataFrame.expanding Calling object with DataFrames.
pandas.Series.kurt Equivalent method for Series.
pandas.DataFrame.kurt Equivalent method for DataFrame.
scipy.stats.skew Third moment of a probability density.
scipy.stats.kurtosis Reference SciPy method.
Notes
Examples
The example below will show an expanding calculation with a window size of four matching the equivalent
function call using scipy.stats.
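A minimal sketch of such a comparison (data chosen for illustration, numeric outputs omitted; the last two expanding values agree with scipy.stats.kurtosis(..., bias=False) on the corresponding prefixes):
>>> import scipy.stats
>>> arr = [1, 2, 3, 4, 999]
>>> scipy.stats.kurtosis(arr[:-1], bias=False)
>>> scipy.stats.kurtosis(arr, bias=False)
>>> # the last two values of the expanding result match the scipy values above
>>> pd.Series(arr).expanding(4).kurt()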
pandas.core.window.expanding.Expanding.apply
• 'numba' : Runs rolling apply through JIT compiled code from numba. Only
available when raw is set to True.
• None : Defaults to 'cython' or globally setting compute.use_numba
New in version 1.0.0.
engine_kwargs [dict, default None]
• For 'cython' engine, there are no accepted engine_kwargs
• For 'numba' engine, the engine can accept nopython, nogil and parallel
dictionary keys. The values must either be True or False. The de-
fault engine_kwargs for the 'numba' engine is {'nopython': True,
'nogil': False, 'parallel': False} and will be applied to both the
func and the apply rolling aggregation.
New in version 1.0.0.
args [tuple, default None] Positional arguments to be passed into func.
kwargs [dict, default None] Keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.Series.expanding Calling object with Series data.
pandas.DataFrame.expanding Calling object with DataFrame data.
pandas.Series.apply Similar method for Series.
pandas.DataFrame.apply Similar method for DataFrame.
Notes
See Numba engine for extended documentation and performance considerations for the Numba engine.
pandas.core.window.expanding.Expanding.aggregate
Notes
Examples
>>> df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6], "C": [7, 8, 9]})
>>> df
A B C
0 1 4 7
1 2 5 8
2 3 6 9
>>> df.ewm(alpha=0.5).mean()
A B C
0 1.000000 4.000000 7.000000
1 1.666667 4.666667 7.666667
2 2.428571 5.428571 8.428571
pandas.core.window.expanding.Expanding.quantile
Series or DataFrame Returned object type is determined by the caller of the expanding
calculation.
See also:
pandas.Series.quantile Computes value at the given quantile over all data in Series.
pandas.DataFrame.quantile Computes values at the given quantile over requested axis in DataFrame.
Examples
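A short illustrative example (the Series and quantile are arbitrary):
>>> s = pd.Series([1, 2, 3, 4])
>>> s.expanding().quantile(0.5)
0    1.0
1    1.5
2    2.0
3    2.5
dtype: float64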
pandas.core.window.expanding.Expanding.sem
Notes
Examples
>>> s = pd.Series([0, 1, 2, 3])
>>> s.expanding().sem()
0 NaN
1 0.707107
2 0.707107
3 0.745356
dtype: float64
pandas.core.window.ewm.ExponentialMovingWindow.mean
ExponentialMovingWindow.mean(*args, **kwargs)
Exponential weighted moving average.
Parameters
*args, **kwargs Arguments and keyword arguments to be passed into func.
Returns
Series or DataFrame Return type is determined by the caller.
See also:
pandas.Series.ewm Calling object with Series data.
pandas.DataFrame.ewm Calling object with DataFrame data.
pandas.Series.mean Similar method for Series.
pandas.DataFrame.mean Similar method for DataFrame.
pandas.core.window.ewm.ExponentialMovingWindow.std
pandas.core.window.ewm.ExponentialMovingWindow.var
pandas.core.window.ewm.ExponentialMovingWindow.corr
See also:
pandas.Series.ewm Calling object with Series data.
pandas.DataFrame.ewm Calling object with DataFrame data.
pandas.Series.corr Similar method for Series.
pandas.DataFrame.corr Similar method for DataFrame.
pandas.core.window.ewm.ExponentialMovingWindow.cov
pandas.api.indexers.BaseIndexer
Methods
pandas.api.indexers.BaseIndexer.get_window_bounds
pandas.api.indexers.FixedForwardWindowIndexer
Examples
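A minimal sketch of a forward-looking window (the frame and window size are chosen for illustration):
>>> df = pd.DataFrame({'B': [0, 1, 2, np.nan, 4]})
>>> indexer = pd.api.indexers.FixedForwardWindowIndexer(window_size=2)
>>> df.rolling(window=indexer, min_periods=1).sum()
     B
0  1.0
1  3.0
2  2.0
3  4.0
4  4.0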
Methods
pandas.api.indexers.FixedForwardWindowIndexer.get_window_bounds
FixedForwardWindowIndexer.get_window_bounds(num_values=0, min_periods=None,
center=None, closed=None)
Computes the bounds of a window.
Parameters
num_values [int, default 0] number of values that will be aggregated over
window_size [int, default 0] the number of rows in a window
min_periods [int, default None] min_periods passed from the top level rolling API
center [bool, default None] center passed from the top level rolling API
closed [str, default None] closed passed from the top level rolling API
win_type [str, default None] win_type passed from the top level rolling API
Returns
A tuple of ndarray[int64]s, indicating the boundaries of each
window
pandas.api.indexers.VariableOffsetWindowIndexer
Methods
pandas.api.indexers.VariableOffsetWindowIndexer.get_window_bounds
VariableOffsetWindowIndexer.get_window_bounds(num_values=0,
min_periods=None, center=None,
closed=None)
Computes the bounds of a window.
Parameters
num_values [int, default 0] number of values that will be aggregated over
window_size [int, default 0] the number of rows in a window
min_periods [int, default None] min_periods passed from the top level rolling API
center [bool, default None] center passed from the top level rolling API
closed [str, default None] closed passed from the top level rolling API
win_type [str, default None] win_type passed from the top level rolling API
Returns
A tuple of ndarray[int64]s, indicating the boundaries of each
window
3.10 GroupBy
pandas.core.groupby.GroupBy.__iter__
GroupBy.__iter__()
Groupby iterator.
Returns
Generator yielding sequence of (name, subsetted object)
for each group
pandas.core.groupby.GroupBy.groups
property GroupBy.groups
Dict {group name -> group labels}.
pandas.core.groupby.GroupBy.indices
property GroupBy.indices
Dict {group name -> group indices}.
pandas.core.groupby.GroupBy.get_group
GroupBy.get_group(name, obj=None)
Construct DataFrame from group with provided name.
Parameters
name [object] The name of the group to get as a DataFrame.
obj [DataFrame, default None] The DataFrame to take the DataFrame out of. If it is None,
the object groupby was called on will be used.
Returns
group [same type as obj]
pandas.Grouper
Examples
>>> df = pd.DataFrame(
... {
... "Animal": ["Falcon", "Parrot", "Falcon", "Falcon", "Parrot"],
... "Speed": [100, 5, 200, 300, 15],
... }
... )
>>> df
(continues on next page)
>>> df = pd.DataFrame(
... {
... "Publish date": [
... pd.Timestamp("2000-01-02"),
... pd.Timestamp("2000-01-02"),
... pd.Timestamp("2000-01-09"),
... pd.Timestamp("2000-01-16")
... ],
... "ID": [0, 1, 2, 3],
... "Price": [10, 20, 30, 40]
... }
... )
>>> df
Publish date ID Price
0 2000-01-02 0 10
1 2000-01-02 1 20
2 2000-01-09 2 30
3 2000-01-16 3 40
>>> df.groupby(pd.Grouper(key="Publish date", freq="1W")).mean()
ID Price
Publish date
2000-01-02 0.5 15.0
2000-01-09 2.0 30.0
2000-01-16 3.0 40.0
If you want to adjust the start of the bins based on a fixed timestamp:
>>> rng = pd.date_range('2000-10-01 23:30:00', '2000-10-02 00:30:00', freq='7min')
>>> ts = pd.Series(np.arange(len(rng)) * 3, index=rng)
>>> ts.groupby(pd.Grouper(freq='17min')).sum()
2000-10-01 23:14:00 0
2000-10-01 23:31:00 9
2000-10-01 23:48:00 21
2000-10-02 00:05:00 54
2000-10-02 00:22:00 24
Freq: 17T, dtype: int64
If you want to adjust the start of the bins with an offset Timedelta, the origin or offset arguments of Grouper
can be used (see the sketch below). To replace the use of the deprecated base argument, you can now use
offset; the sketch below shows a call that is equivalent to base=2:
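A minimal sketch of these variants, reusing the 7-minute series ts defined above (the argument values are illustrative and the outputs are omitted):
>>> # anchor the bins at the first timestamp of the series instead of midnight
>>> ts.groupby(pd.Grouper(freq='17min', origin='start')).sum()
>>> # the same anchoring expressed as an explicit offset from midnight
>>> ts.groupby(pd.Grouper(freq='17min', offset='23h30min')).sum()
>>> # replacement for the deprecated base=2
>>> ts.groupby(pd.Grouper(freq='17min', offset='2min')).sum()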
Attributes
ax
groups
GroupBy.apply(func, *args, **kwargs) Apply function func group-wise and combine the results together.
GroupBy.agg(func, *args, **kwargs)
SeriesGroupBy.aggregate([func, engine, ...]) Aggregate using one or more operations over the specified axis.
DataFrameGroupBy.aggregate([func, engine, ...]) Aggregate using one or more operations over the specified axis.
SeriesGroupBy.transform(func, *args[, ...]) Call function producing a like-indexed Series on each group and return a Series having the same indexes as the original object filled with the transformed values.
DataFrameGroupBy.transform(func, *args[, ...]) Call function producing a like-indexed DataFrame on each group and return a DataFrame having the same indexes as the original object filled with the transformed values.
GroupBy.pipe(func, *args, **kwargs) Apply a function func with arguments to this GroupBy object and return the function’s result.
pandas.core.groupby.GroupBy.apply
pandas.core.groupby.GroupBy.agg
pandas.core.groupby.SeriesGroupBy.aggregate
Series.groupby.transform Transforms the Series on each group based on the given function.
Series.aggregate Aggregate using one or more operations over the specified axis.
Notes
When using engine='numba', there will be no “fall back” behavior internally. The group data and group
index will be passed as numpy arrays to the JITed user defined function, and no alternative execution attempts
will be tried.
Examples
>>> s = pd.Series([1, 2, 3, 4])
>>> s
0 1
1 2
2 3
3 4
dtype: int64
The output column names can be controlled by passing the desired column names and aggregations as keyword
arguments.
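For example (a sketch reusing the Series s shown above; the keyword names minimum and maximum are arbitrary output column names):
>>> s.groupby([1, 1, 2, 2]).agg(
...     minimum='min',
...     maximum='max',
... )
   minimum  maximum
1        1        2
2        3        4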
pandas.core.groupby.DataFrameGroupBy.aggregate
Notes
When using engine='numba', there will be no “fall back” behavior internally. The group data and group
index will be passed as numpy arrays to the JITed user defined function, and no alternative execution attempts
will be tried.
Examples
>>> df = pd.DataFrame(
... {
... "A": [1, 1, 2, 2],
... "B": [1, 2, 3, 4],
... "C": [0.362838, 0.227877, 1.267767, -0.562860],
... }
... )
>>> df
A B C
0 1 1 0.362838
1 1 2 0.227877
2 2 3 1.267767
3 2 4 -0.562860
>>> df.groupby('A').agg('min')
B C
A
1 1 0.227877
2 3 -0.562860
Multiple aggregations
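Passing a list of functions applies each of them to every column; a short sketch using the df defined above:
>>> df.groupby('A').agg(['min', 'max'])
    B             C
  min max       min       max
A
1   1   2  0.227877  0.362838
2   3   4 -0.562860  1.267767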
To control the output names with different aggregations per column, pandas supports “named aggregation”
>>> df.groupby("A").agg(
... b_min=pd.NamedAgg(column="B", aggfunc="min"),
... c_sum=pd.NamedAgg(column="C", aggfunc="sum"))
b_min c_sum
A
1 1 0.590715
2 3 0.704907
pandas.core.groupby.SeriesGroupBy.transform
Returns
Series
See also:
Series.groupby.apply Apply function func group-wise and combine the results together.
Series.groupby.aggregate Aggregate using one or more operations over the specified axis.
Series.transform Transforms the Series on each group based on the given function.
Notes
Each group is endowed with the attribute ‘name’ in case you need to know which group you are working on.
The current implementation imposes three requirements on f:
• f must return a value that either has the same shape as the input subframe or can be broadcast to the shape
of the input subframe. For example, if f returns a scalar it will be broadcast to have the same shape as the
input subframe.
• if this is a DataFrame, f must support application column-by-column in the subframe. If f also supports
application to the entire subframe, then a fast path is used starting from the second chunk.
• f must not mutate groups. Mutation is not supported and may produce unexpected results.
When using engine='numba', there will be no “fall back” behavior internally. The group data and group
index will be passed as numpy arrays to the JITed user defined function, and no alternative execution attempts
will be tried.
Examples
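A minimal illustrative example (the Series below is built for illustration):
>>> ser = pd.Series([1.0, 2.0, 3.0, 4.0], index=['a', 'a', 'b', 'b'])
>>> ser.groupby(level=0).transform(lambda x: x - x.mean())
a   -0.5
a    0.5
b   -0.5
b    0.5
dtype: float64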
pandas.core.groupby.DataFrameGroupBy.transform
Notes
Each group is endowed with the attribute ‘name’ in case you need to know which group you are working on.
The current implementation imposes three requirements on f:
• f must return a value that either has the same shape as the input subframe or can be broadcast to the shape
of the input subframe. For example, if f returns a scalar it will be broadcast to have the same shape as the
input subframe.
• if this is a DataFrame, f must support application column-by-column in the subframe. If f also supports
application to the entire subframe, then a fast path is used starting from the second chunk.
• f must not mutate groups. Mutation is not supported and may produce unexpected results.
When using engine='numba', there will be no “fall back” behavior internally. The group data and group
index will be passed as numpy arrays to the JITed user defined function, and no alternative execution attempts
will be tried.
Examples
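A minimal illustrative example (the frame below is built for illustration):
>>> df = pd.DataFrame({'key': ['a', 'a', 'b', 'b'],
...                    'value': [1.0, 2.0, 3.0, 4.0]})
>>> df.groupby('key').transform('mean')
   value
0    1.5
1    1.5
2    3.5
3    3.5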
pandas.core.groupby.GroupBy.pipe
>>> (df.groupby('group')
... .pipe(f)
... .pipe(g, arg1=a)
... .pipe(h, arg2=b, arg3=c))
Notes
Examples
To get the difference between each group’s maximum and minimum value in one pass, you can do, for example:
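(A sketch with a small frame built for illustration; the column names are arbitrary.)
>>> df = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'value': [1, 2, 3, 4]})
>>> df.groupby('group').pipe(lambda x: x.max() - x.min())
       value
group
a          1
b          1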
GroupBy.all([skipna]) Return True if all values in the group are truthful, else
False.
GroupBy.any([skipna]) Return True if any value in the group is truthful, else
False.
GroupBy.bfill([limit]) Backward fill the values.
GroupBy.backfill([limit]) Backward fill the values.
GroupBy.count() Compute count of group, excluding missing values.
GroupBy.cumcount([ascending]) Number each item in each group from 0 to the length of
that group - 1.
GroupBy.cummax([axis]) Cumulative max for each group.
GroupBy.cummin([axis]) Cumulative min for each group.
GroupBy.cumprod([axis]) Cumulative product for each group.
GroupBy.cumsum([axis]) Cumulative sum for each group.
GroupBy.ffill([limit]) Forward fill the values.
GroupBy.first([numeric_only, min_count]) Compute first of group values.
GroupBy.head([n]) Return first n rows of each group.
GroupBy.last([numeric_only, min_count]) Compute last of group values.
GroupBy.max([numeric_only, min_count]) Compute max of group values.
GroupBy.mean([numeric_only]) Compute mean of groups, excluding missing values.
GroupBy.median([numeric_only]) Compute median of groups, excluding missing values.
GroupBy.min([numeric_only, min_count]) Compute min of group values.
GroupBy.ngroup([ascending]) Number each group from 0 to the number of groups - 1.
GroupBy.nth(n[, dropna]) Take the nth row from each group if n is an int, or a
subset of rows if n is a list of ints.
GroupBy.ohlc() Compute open, high, low and close values of a group,
excluding missing values.
GroupBy.pad([limit]) Forward fill the values.
GroupBy.prod([numeric_only, min_count]) Compute prod of group values.
GroupBy.rank([method, ascending, na_option, . . . ]) Provide the rank of values within each group.
GroupBy.pct_change([periods, fill_method, . . . ]) Calculate pct_change of each value to previous entry in
group.
GroupBy.size() Compute group sizes.
GroupBy.sem([ddof]) Compute standard error of the mean of groups, exclud-
ing missing values.
GroupBy.std([ddof]) Compute standard deviation of groups, excluding miss-
ing values.
GroupBy.sum([numeric_only, min_count]) Compute sum of group values.
GroupBy.var([ddof]) Compute variance of groups, excluding missing values.
GroupBy.tail([n]) Return last n rows of each group.
pandas.core.groupby.GroupBy.all
GroupBy.all(skipna=True)
Return True if all values in the group are truthful, else False.
Parameters
skipna [bool, default True] Flag to ignore nan values during truth testing.
Returns
Series or DataFrame DataFrame or Series of boolean values, where a value is True if all
elements are True within its respective group, False otherwise.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.any
GroupBy.any(skipna=True)
Return True if any value in the group is truthful, else False.
Parameters
skipna [bool, default True] Flag to ignore nan values during truth testing.
Returns
Series or DataFrame DataFrame or Series of boolean values, where a value is True if any
element is True within its respective group, False otherwise.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.bfill
GroupBy.bfill(limit=None)
Backward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.backfill Backward fill the missing values in the dataset.
DataFrame.backfill Backward fill the missing values in the dataset.
Series.fillna Fill NaN values of a Series.
DataFrame.fillna Fill NaN values of a DataFrame.
pandas.core.groupby.GroupBy.backfill
GroupBy.backfill(limit=None)
Backward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.backfill Backward fill the missing values in the dataset.
DataFrame.backfill Backward fill the missing values in the dataset.
Series.fillna Fill NaN values of a Series.
DataFrame.fillna Fill NaN values of a DataFrame.
pandas.core.groupby.GroupBy.count
GroupBy.count()
Compute count of group, excluding missing values.
Returns
Series or DataFrame Count of values within each group.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.cumcount
GroupBy.cumcount(ascending=True)
Number each item in each group from 0 to the length of that group - 1.
Essentially this is equivalent to self.apply(lambda x: pd.Series(np.arange(len(x)), x.index)).
Parameters
ascending [bool, default True] If False, number in reverse, from length of group - 1 to 0.
Returns
Series Sequence number of each element within each group.
See also:
ngroup Number the groups themselves.
Examples
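A short illustrative example (input data chosen for illustration):
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'])
>>> df.groupby('A').cumcount()
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64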
pandas.core.groupby.GroupBy.cummax
GroupBy.cummax(axis=0, **kwargs)
Cumulative max for each group.
Returns
Series or DataFrame
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.cummin
GroupBy.cummin(axis=0, **kwargs)
Cumulative min for each group.
Returns
Series or DataFrame
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.cumprod
pandas.core.groupby.GroupBy.cumsum
pandas.core.groupby.GroupBy.ffill
GroupBy.ffill(limit=None)
Forward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.pad Forward fill the missing values in the dataset.
DataFrame.pad Forward fill the missing values in the dataset.
Series.fillna Fill NaN values of a Series.
DataFrame.fillna Fill NaN values of a DataFrame.
pandas.core.groupby.GroupBy.first
GroupBy.first(numeric_only=False, min_count=- 1)
Compute first of group values.
Parameters
numeric_only [bool, default False] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data.
min_count [int, default -1] The required number of valid values to perform the operation. If
fewer than min_count non-NA values are present the result will be NA.
Returns
Series or DataFrame Computed first of values within each group.
pandas.core.groupby.GroupBy.head
GroupBy.head(n=5)
Return first n rows of each group.
Similar to .apply(lambda x: x.head(n)), but it returns a subset of rows from the original DataFrame
with original index and order preserved (as_index flag is ignored).
Does not work for negative values of n.
Returns
Series or DataFrame
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
Examples
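A minimal example (the frame is built for illustration):
>>> df = pd.DataFrame([[1, 2], [1, 4], [5, 6]], columns=['A', 'B'])
>>> df.groupby('A').head(1)
   A  B
0  1  2
2  5  6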
pandas.core.groupby.GroupBy.last
GroupBy.last(numeric_only=False, min_count=- 1)
Compute last of group values.
Parameters
numeric_only [bool, default False] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data.
min_count [int, default -1] The required number of valid values to perform the operation. If
fewer than min_count non-NA values are present the result will be NA.
Returns
Series or DataFrame Computed last of values within each group.
pandas.core.groupby.GroupBy.max
GroupBy.max(numeric_only=False, min_count=- 1)
Compute max of group values.
Parameters
numeric_only [bool, default False] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data.
min_count [int, default -1] The required number of valid values to perform the operation. If
fewer than min_count non-NA values are present the result will be NA.
Returns
Series or DataFrame Computed max of values within each group.
pandas.core.groupby.GroupBy.mean
GroupBy.mean(numeric_only=True)
Compute mean of groups, excluding missing values.
Parameters
numeric_only [bool, default True] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data.
Returns
pandas.Series or pandas.DataFrame
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
Examples
Groupby one column and return the mean of the remaining columns in each group.
>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
...                    'B': [np.nan, 2, 3, 4, 5],
...                    'C': [1, 2, 1, 1, 2]}, columns=['A', 'B', 'C'])
>>> df.groupby('A').mean()
B C
A
1 3.0 1.333333
2 4.0 1.500000
Groupby two columns and return the mean of the remaining column.
Groupby one column and return the mean of only particular column in the group.
>>> df.groupby('A')['B'].mean()
A
1 3.0
2 4.0
Name: B, dtype: float64
pandas.core.groupby.GroupBy.median
GroupBy.median(numeric_only=True)
Compute median of groups, excluding missing values.
For multiple groupings, the result index will be a MultiIndex
Parameters
numeric_only [bool, default True] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data.
Returns
Series or DataFrame Median of values within each group.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.min
GroupBy.min(numeric_only=False, min_count=- 1)
Compute min of group values.
Parameters
numeric_only [bool, default False] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data.
min_count [int, default -1] The required number of valid values to perform the operation. If
fewer than min_count non-NA values are present the result will be NA.
Returns
Series or DataFrame Computed min of values within each group.
pandas.core.groupby.GroupBy.ngroup
GroupBy.ngroup(ascending=True)
Number each group from 0 to the number of groups - 1.
This is the enumerative complement of cumcount. Note that the numbers given to the groups match the order in
which the groups would be seen when iterating over the groupby object, not the order they are first observed.
Parameters
ascending [bool, default True] If False, number in reverse, from number of group - 1 to 0.
Returns
Series Unique numbers for each group.
See also:
cumcount Number the rows in each group.
Examples
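A short illustrative example (input data chosen for illustration):
>>> df = pd.DataFrame({'A': ['a', 'a', 'a', 'b', 'b', 'a']})
>>> df.groupby('A').ngroup()
0    0
1    0
2    0
3    1
4    1
5    0
dtype: int64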
pandas.core.groupby.GroupBy.nth
GroupBy.nth(n, dropna=None)
Take the nth row from each group if n is an int, or a subset of rows if n is a list of ints.
If dropna, will take the nth non-null row, dropna is either ‘all’ or ‘any’; this is equivalent to calling
dropna(how=dropna) before the groupby.
Parameters
n [int or list of ints] A single nth value for the row or a list of nth values.
dropna [None or str, optional] Apply the specified dropna operation before counting which
row is the nth row. Needs to be None, ‘any’ or ‘all’.
Returns
Series or DataFrame N-th value within each group.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
Examples
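A minimal example (the frame is built for illustration):
>>> df = pd.DataFrame({'A': [1, 1, 2, 1, 2],
...                    'B': [np.nan, 2, 3, 4, 5]})
>>> g = df.groupby('A')
>>> g.nth(0)
     B
A
1  NaN
2  3.0
>>> g.nth(1)
     B
A
1  2.0
2  5.0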
pandas.core.groupby.GroupBy.ohlc
GroupBy.ohlc()
Compute open, high, low and close values of a group, excluding missing values.
For multiple groupings, the result index will be a MultiIndex
Returns
DataFrame Open, high, low and close values within each group.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.pad
GroupBy.pad(limit=None)
Forward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.pad Forward fill the missing values in the dataset.
DataFrame.pad Forward fill the missing values in the dataset.
Series.fillna Fill NaN values of a Series.
DataFrame.fillna Fill NaN values of a DataFrame.
pandas.core.groupby.GroupBy.prod
GroupBy.prod(numeric_only=True, min_count=0)
Compute prod of group values.
Parameters
numeric_only [bool, default True] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data.
min_count [int, default 0] The required number of valid values to perform the operation. If
fewer than min_count non-NA values are present the result will be NA.
Returns
Series or DataFrame Computed prod of values within each group.
pandas.core.groupby.GroupBy.rank
pandas.core.groupby.GroupBy.pct_change
pandas.core.groupby.GroupBy.size
GroupBy.size()
Compute group sizes.
Returns
DataFrame or Series Number of rows in each group as a Series if as_index is True or a
DataFrame if as_index is False.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.sem
GroupBy.sem(ddof=1)
Compute standard error of the mean of groups, excluding missing values.
For multiple groupings, the result index will be a MultiIndex.
Parameters
ddof [int, default 1] Degrees of freedom.
Returns
Series or DataFrame Standard error of the mean of values within each group.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.std
GroupBy.std(ddof=1)
Compute standard deviation of groups, excluding missing values.
For multiple groupings, the result index will be a MultiIndex.
Parameters
ddof [int, default 1] Degrees of freedom.
Returns
Series or DataFrame Standard deviation of values within each group.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.sum
GroupBy.sum(numeric_only=True, min_count=0)
Compute sum of group values.
Parameters
numeric_only [bool, default True] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data.
min_count [int, default 0] The required number of valid values to perform the operation. If
fewer than min_count non-NA values are present the result will be NA.
Returns
Series or DataFrame Computed sum of values within each group.
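A brief sketch of how min_count affects the result (df is a toy frame introduced for illustration, not taken from the original docs): a group consisting only of NA values sums to 0.0 by default, but becomes NA once min_count requires at least one valid value.
>>> df = pd.DataFrame({"A": ["x", "x", "y"], "B": [1.0, np.nan, np.nan]})
>>> df.groupby("A").sum()
     B
A
x  1.0
y  0.0
>>> df.groupby("A").sum(min_count=1)
     B
A
x  1.0
y  NaN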
pandas.core.groupby.GroupBy.var
GroupBy.var(ddof=1)
Compute variance of groups, excluding missing values.
For multiple groupings, the result index will be a MultiIndex.
Parameters
ddof [int, default 1] Degrees of freedom.
Returns
Series or DataFrame Variance of values within each group.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.GroupBy.tail
GroupBy.tail(n=5)
Return last n rows of each group.
Similar to .apply(lambda x: x.tail(n)), but it returns a subset of rows from the original DataFrame
with original index and order preserved (as_index flag is ignored).
Does not work for negative values of n.
Returns
Series or DataFrame
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
Examples
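The original example was lost in extraction; a minimal sketch with a toy frame df (introduced here only for illustration) showing that the original index and row order are preserved:
>>> df = pd.DataFrame([['a', 1], ['a', 2], ['b', 1], ['b', 2]],
...                   columns=['A', 'B'])
>>> df.groupby('A').tail(1)
   A  B
1  a  2
3  b  2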
The following methods are available in both SeriesGroupBy and DataFrameGroupBy objects, but may differ
slightly: the DataFrameGroupBy version usually permits the specification of an axis argument, and
often an argument indicating whether to restrict application to columns of a specific data type.
DataFrameGroupBy.all([skipna]) Return True if all values in the group are truthful, else False.
DataFrameGroupBy.any([skipna]) Return True if any value in the group is truthful, else False.
DataFrameGroupBy.backfill([limit]) Backward fill the values.
DataFrameGroupBy.bfill([limit]) Backward fill the values.
DataFrameGroupBy.corr Compute pairwise correlation of columns, excluding NA/null values.
DataFrameGroupBy.count() Compute count of group, excluding missing values.
DataFrameGroupBy.cov Compute pairwise covariance of columns, excluding NA/null values.
DataFrameGroupBy.cumcount([ascending]) Number each item in each group from 0 to the length of that group - 1.
DataFrameGroupBy.cummax([axis]) Cumulative max for each group.
DataFrameGroupBy.cummin([axis]) Cumulative min for each group.
DataFrameGroupBy.cumprod([axis]) Cumulative product for each group.
DataFrameGroupBy.cumsum([axis]) Cumulative sum for each group.
DataFrameGroupBy.describe(**kwargs) Generate descriptive statistics.
DataFrameGroupBy.diff First discrete difference of element.
DataFrameGroupBy.ffill([limit]) Forward fill the values.
DataFrameGroupBy.fillna Fill NA/NaN values using the specified method.
DataFrameGroupBy.filter(func[, dropna]) Return a copy of a DataFrame excluding filtered elements.
DataFrameGroupBy.hist Make a histogram of each of the DataFrame's columns.
DataFrameGroupBy.idxmax([axis, skipna]) Return index of first occurrence of maximum over requested axis.
DataFrameGroupBy.idxmin([axis, skipna]) Return index of first occurrence of minimum over requested axis.
DataFrameGroupBy.mad Return the mean absolute deviation of the values over the requested axis.
DataFrameGroupBy.nunique([dropna]) Return DataFrame with counts of unique elements in each position.
DataFrameGroupBy.pad([limit]) Forward fill the values.
DataFrameGroupBy.pct_change([periods, ...]) Calculate pct_change of each value to previous entry in group.
pandas.core.groupby.DataFrameGroupBy.all
DataFrameGroupBy.all(skipna=True)
Return True if all values in the group are truthful, else False.
Parameters
skipna [bool, default True] Flag to ignore nan values during truth testing.
Returns
Series or DataFrame DataFrame or Series of boolean values, where a value is True if all
elements are True within its respective group, False otherwise.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.DataFrameGroupBy.any
DataFrameGroupBy.any(skipna=True)
Return True if any value in the group is truthful, else False.
Parameters
skipna [bool, default True] Flag to ignore nan values during truth testing.
Returns
Series or DataFrame DataFrame or Series of boolean values, where a value is True if any
element is True within its respective group, False otherwise.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.DataFrameGroupBy.backfill
DataFrameGroupBy.backfill(limit=None)
Backward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.backfill Backward fill the missing values in the dataset.
DataFrame.backfill Backward fill the missing values in the dataset.
Series.fillna Fill NaN values of a Series.
DataFrame.fillna Fill NaN values of a DataFrame.
pandas.core.groupby.DataFrameGroupBy.bfill
DataFrameGroupBy.bfill(limit=None)
Backward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.backfill Backward fill the missing values in the dataset.
DataFrame.backfill Backward fill the missing values in the dataset.
Series.fillna Fill NaN values of a Series.
DataFrame.fillna Fill NaN values of a DataFrame.
pandas.core.groupby.DataFrameGroupBy.corr
property DataFrameGroupBy.corr
Compute pairwise correlation of columns, excluding NA/null values.
Parameters
method [{‘pearson’, ‘kendall’, ‘spearman’} or callable] Method of correlation:
• pearson : standard correlation coefficient
• kendall : Kendall Tau correlation coefficient
• spearman : Spearman rank correlation
• callable: callable with input two 1d ndarrays and returning a float. Note that
the returned matrix from corr will have 1 along the diagonals and will be symmetric
regardless of the callable's behavior.
New in version 0.24.0.
min_periods [int, optional] Minimum number of observations required per pair of columns
to have a valid result. Currently only available for Pearson and Spearman correlation.
Returns
DataFrame Correlation matrix.
Examples
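The example did not survive extraction; a minimal sketch using a custom correlation callable (df and histogram_intersection are introduced here only for illustration; pd and np are assumed imported). Note the forced 1.0 on the diagonal:
>>> def histogram_intersection(a, b):
...     v = np.minimum(a, b).sum().round(decimals=1)
...     return v
>>> df = pd.DataFrame([(.2, .3), (.0, .6), (.6, .0), (.2, .1)],
...                   columns=['dogs', 'cats'])
>>> df.corr(method=histogram_intersection)
      dogs  cats
dogs   1.0   0.3
cats   0.3   1.0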
pandas.core.groupby.DataFrameGroupBy.count
DataFrameGroupBy.count()
Compute count of group, excluding missing values.
Returns
DataFrame Count of values within each group.
pandas.core.groupby.DataFrameGroupBy.cov
property DataFrameGroupBy.cov
Compute pairwise covariance of columns, excluding NA/null values.
Compute the pairwise covariance among the series of a DataFrame. The returned data frame is the covariance
matrix of the columns of the DataFrame.
Both NA and null values are automatically excluded from the calculation. (See the note below about bias
from missing values.) A threshold can be set for the minimum number of observations for each value created.
Comparisons with observations below this threshold will be returned as NaN.
This method is generally used for the analysis of time series data to understand the relationship between different
measures across time.
Parameters
min_periods [int, optional] Minimum number of observations required per pair of columns
to have a valid result.
ddof [int, default 1] Delta degrees of freedom. The divisor used in calculations is N -
ddof, where N represents the number of elements.
New in version 1.1.0.
Returns
DataFrame The covariance matrix of the series of the DataFrame.
See also:
Series.cov Compute covariance with another Series.
core.window.ExponentialMovingWindow.cov Exponential weighted sample covariance.
Notes
Returns the covariance matrix of the DataFrame’s time series. The covariance is normalized by N-ddof.
For DataFrames that have Series that are missing data (assuming that data is missing at random) the returned
covariance matrix will be an unbiased estimate of the variance and covariance between the member Series.
However, for many applications this estimate may not be acceptable because the estimate covariance matrix
is not guaranteed to be positive semi-definite. This could lead to estimate correlations having absolute values
which are greater than one, and/or a non-invertible covariance matrix. See Estimation of covariance matrices
for more details.
Examples
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(1000, 5),
... columns=['a', 'b', 'c', 'd', 'e'])
>>> df.cov()
a b c d e
a 0.998438 -0.020161 0.059277 -0.008943 0.014144
b -0.020161 1.059352 -0.008543 -0.024738 0.009826
c 0.059277 -0.008543 1.010670 -0.001486 -0.000271
d -0.008943 -0.024738 -0.001486 0.921297 -0.013692
e 0.014144 0.009826 -0.000271 -0.013692 0.977795
>>> np.random.seed(42)
>>> df = pd.DataFrame(np.random.randn(20, 3),
... columns=['a', 'b', 'c'])
>>> df.loc[df.index[:5], 'a'] = np.nan
>>> df.loc[df.index[5:10], 'b'] = np.nan
>>> df.cov(min_periods=12)
a b c
a 0.316741 NaN -0.150812
b NaN 1.248003 0.191417
c -0.150812 0.191417 0.895202
pandas.core.groupby.DataFrameGroupBy.cumcount
DataFrameGroupBy.cumcount(ascending=True)
Number each item in each group from 0 to the length of that group - 1.
Essentially this is equivalent to numbering the rows within each group in their order of appearance (see the example below).
Parameters
ascending [bool, default True] If False, number in reverse, from length of group - 1 to 0.
Returns
Series Sequence number of each element within each group.
See also:
ngroup Number the groups themselves.
Examples
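The example was lost in extraction; a minimal sketch with a toy frame df introduced for illustration:
>>> df = pd.DataFrame([['a'], ['a'], ['a'], ['b'], ['b'], ['a']], columns=['A'])
>>> df.groupby('A').cumcount()
0    0
1    1
2    2
3    0
4    1
5    3
dtype: int64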
pandas.core.groupby.DataFrameGroupBy.cummax
DataFrameGroupBy.cummax(axis=0, **kwargs)
Cumulative max for each group.
Returns
Series or DataFrame
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.DataFrameGroupBy.cummin
DataFrameGroupBy.cummin(axis=0, **kwargs)
Cumulative min for each group.
Returns
Series or DataFrame
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.DataFrameGroupBy.cumprod
pandas.core.groupby.DataFrameGroupBy.cumsum
pandas.core.groupby.DataFrameGroupBy.describe
DataFrameGroupBy.describe(**kwargs)
Generate descriptive statistics.
Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s
distribution, excluding NaN values.
Analyzes both numeric and object series, as well as DataFrame column sets of mixed data types. The output
will vary depending on what is provided. Refer to the notes below for more detail.
Parameters
percentiles [list-like of numbers, optional] The percentiles to include in the output. All
should fall between 0 and 1. The default is [.25, .5, .75], which returns the
25th, 50th, and 75th percentiles.
include [‘all’, list-like of dtypes or None (default), optional] A white list of data types to
include in the result. Ignored for Series. Here are the options:
• ‘all’ : All columns of the input will be included in the output.
• A list-like of dtypes : Limits the results to the provided data types. To limit the
result to numeric types submit numpy.number. To limit it instead to object
columns submit the numpy.object data type. Strings can also be used in the
style of select_dtypes (e.g. df.describe(include=['O'])). To select
pandas categorical columns, use 'category'.
• None (default) : The result will include all numeric columns.
exclude [list-like of dtypes or None (default), optional] A black list of data types to omit
from the result. Ignored for Series. Here are the options:
• A list-like of dtypes : Excludes the provided data types from the result. To exclude
numeric types submit numpy.number. To exclude object columns submit the
data type numpy.object. Strings can also be used in the style of
select_dtypes (e.g. df.describe(exclude=['O'])). To exclude pandas
categorical columns, use 'category'.
• None (default) : The result will exclude nothing.
datetime_is_numeric [bool, default False] Whether to treat datetime dtypes as numeric.
This affects statistics calculated for the column. For DataFrame input, this also controls
whether datetime columns are included by default.
New in version 1.1.0.
Returns
Series or DataFrame Summary statistics of the Series or Dataframe provided.
See also:
DataFrame.count Count number of non-NA/null observations.
DataFrame.max Maximum of the values in the object.
DataFrame.min Minimum of the values in the object.
DataFrame.mean Mean of the values.
DataFrame.std Standard deviation of the observations.
DataFrame.select_dtypes Subset of a DataFrame including/excluding columns based on their dtype.
Notes
For numeric data, the result’s index will include count, mean, std, min, max as well as lower, 50 and upper
percentiles. By default the lower percentile is 25 and the upper percentile is 75. The 50 percentile is the same
as the median.
For object data (e.g. strings or timestamps), the result’s index will include count, unique, top, and freq.
The top is the most common value. The freq is the most common value’s frequency. Timestamps also include
the first and last items.
If multiple object values have the highest count, then the count and top results will be arbitrarily chosen from
among those with the highest count.
For mixed data types provided via a DataFrame, the default is to return only an analysis of numeric columns.
If the dataframe consists only of object and categorical data without any numeric columns, the default is to
return an analysis of both the object and categorical columns. If include='all' is provided as an option,
the result will include a union of attributes of each type.
The include and exclude parameters can be used to limit which columns in a DataFrame are analyzed for the
output. The parameters are ignored when analyzing a Series.
Examples
>>> s = pd.Series([
... np.datetime64("2000-01-01"),
... np.datetime64("2010-01-01"),
... np.datetime64("2010-01-01")
... ])
>>> s.describe(datetime_is_numeric=True)
count 3
mean 2006-09-01 08:00:00
...
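The frame used by the remaining describe calls is not defined in this extract; a definition consistent with the outputs shown below would be (a sketch, not taken verbatim from the original):
>>> df = pd.DataFrame({'categorical': pd.Categorical(['d', 'e', 'f']),
...                    'numeric': [1, 2, 3],
...                    'object': ['a', 'b', 'c']})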
>>> df.describe(include=[object])
object
count 3
unique 3
top a
freq 1
>>> df.describe(include=['category'])
categorical
count 3
unique 3
top d
freq 1
>>> df.describe(exclude=[np.number])
categorical object
count 3 3
unique 3 3
top f a
freq 1 1
>>> df.describe(exclude=[object])
categorical numeric
count 3 3.0
unique 3 NaN
top f NaN
freq 1 NaN
mean NaN 2.0
std NaN 1.0
min NaN 1.0
25% NaN 1.5
50% NaN 2.0
75% NaN 2.5
max NaN 3.0
pandas.core.groupby.DataFrameGroupBy.diff
property DataFrameGroupBy.diff
First discrete difference of element.
Calculates the difference of a Dataframe element compared with another element in the Dataframe (default is
element in previous row).
Parameters
periods [int, default 1] Periods to shift for calculating difference, accepts negative values.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] Take difference over rows (0) or columns (1).
Returns
DataFrame First differences of the DataFrame.
See also:
DataFrame.pct_change Percent change over given number of periods.
DataFrame.shift Shift index by desired number of periods with an optional time freq.
Series.diff First discrete difference of object.
Notes
For boolean dtypes, this uses operator.xor() rather than operator.sub(). The result is calculated
according to current dtype in Dataframe, however dtype of the result is always float64.
Examples
>>> df.diff()
a b c
0 NaN NaN NaN
1 1.0 0.0 3.0
2 1.0 1.0 5.0
3 1.0 1.0 7.0
4 1.0 2.0 9.0
5 1.0 3.0 11.0
>>> df.diff(periods=3)
a b c
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 3.0 2.0 15.0
4 3.0 4.0 21.0
5 3.0 6.0 27.0
>>> df.diff(periods=-1)
a b c
0 -1.0 0.0 -3.0
1 -1.0 -1.0 -5.0
2 -1.0 -1.0 -7.0
3 -1.0 -2.0 -9.0
4 -1.0 -3.0 -11.0
5 NaN NaN NaN
pandas.core.groupby.DataFrameGroupBy.ffill
DataFrameGroupBy.ffill(limit=None)
Forward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.pad Forward fill missing values in a Series.
DataFrame.pad Object with missing values filled or None if inplace=True.
Series.fillna Fill NaN values of a Series.
DataFrame.fillna Fill NaN values of a DataFrame.
pandas.core.groupby.DataFrameGroupBy.fillna
property DataFrameGroupBy.fillna
Fill NA/NaN values using the specified method.
Parameters
value [scalar, dict, Series, or DataFrame] Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for each index (for a
Series) or column (for a DataFrame). Values not in the dict/Series/DataFrame will not
be filled. This value cannot be a list.
method [{‘backfill’, ‘bfill’, ‘pad’, ‘ffill’, None}, default None] Method to use for filling
holes in reindexed Series: ‘pad’ / ‘ffill’ propagate the last valid observation forward to the
next valid one; ‘backfill’ / ‘bfill’ use the next valid observation to fill the gap.
axis [{0 or ‘index’, 1 or ‘columns’}] Axis along which to fill missing values.
inplace [bool, default False] If True, fill in-place. Note: this will modify any other views on
this object (e.g., a no-copy slice for a column in a DataFrame).
limit [int, default None] If method is specified, this is the maximum number of consecutive
NaN values to forward/backward fill. In other words, if there is a gap with more than this
number of consecutive NaNs, it will only be partially filled. If method is not specified,
this is the maximum number of entries along the entire axis where NaNs will be filled.
Must be greater than 0 if not None.
downcast [dict, default is None] A dict of item->dtype of what to downcast if possible, or
the string ‘infer’ which will try to downcast to an appropriate equal type (e.g. float64 to
int64 if possible).
Returns
DataFrame or None Object with missing values filled or None if inplace=True.
See also:
interpolate Fill NaN values using interpolation.
reindex Conform object to new index.
asfreq Convert TimeSeries to specified frequency.
Examples
>>> df.fillna(0)
A B C D
0 0.0 2.0 0.0 0
...
>>> df.fillna(method='ffill')
A B C D
0 NaN 2.0 NaN 0
1 3.0 4.0 NaN 1
2 3.0 4.0 NaN 5
3 3.0 3.0 NaN 4
Replace all NaN elements in columns ‘A’, ‘B’, ‘C’, and ‘D’ with 0, 1, 2, and 3 respectively, as sketched below.
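The call that the sentence above refers to was dropped from this extract; assuming the same df as in the preceding fillna examples, it would read:
>>> values = {'A': 0, 'B': 1, 'C': 2, 'D': 3}
>>> df.fillna(value=values)
     A    B    C  D
0  0.0  2.0  2.0  0
1  3.0  4.0  2.0  1
2  0.0  1.0  2.0  5
3  0.0  3.0  2.0  4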
pandas.core.groupby.DataFrameGroupBy.filter
Notes
Each subframe is endowed the attribute ‘name’ in case you need to know which group you are working on.
Examples
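The example did not survive extraction; a minimal sketch keeping only the groups whose mean of column B exceeds 3 (df is a toy frame introduced for illustration):
>>> df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar'],
...                    'B': [1, 2, 3, 4, 5, 6],
...                    'C': [2.0, 5.0, 8.0, 1.0, 2.0, 9.0]})
>>> grouped = df.groupby('A')
>>> grouped.filter(lambda x: x['B'].mean() > 3.0)
     A  B    C
1  bar  2  5.0
3  bar  4  1.0
5  bar  6  9.0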
pandas.core.groupby.DataFrameGroupBy.hist
property DataFrameGroupBy.hist
Make a histogram of each of the DataFrame's columns.
A histogram is a representation of the distribution of data. This function calls
matplotlib.pyplot.hist() on each series in the DataFrame, resulting in one histogram per column.
Parameters
data [DataFrame] The pandas object holding the data.
column [str or sequence] If passed, will be used to limit data to a subset of columns.
by [object, optional] If passed, then used to form histograms for separate groups.
grid [bool, default True] Whether to show axis grid lines.
xlabelsize [int, default None] If specified changes the x-axis label size.
xrot [float, default None] Rotation of x axis labels. For example, a value of 90 displays the
x labels rotated 90 degrees clockwise.
ylabelsize [int, default None] If specified changes the y-axis label size.
yrot [float, default None] Rotation of y axis labels. For example, a value of 90 displays the
y labels rotated 90 degrees clockwise.
ax [Matplotlib axes object, default None] The axes to plot the histogram on.
sharex [bool, default True if ax is None else False] In case subplots=True, share x axis and
set some x axis labels to invisible; defaults to True if ax is None otherwise False if an ax
is passed in. Note that passing in both an ax and sharex=True will alter all x axis labels
for all subplots in a figure.
sharey [bool, default False] In case subplots=True, share y axis and set some y axis labels to
invisible.
figsize [tuple] The size in inches of the figure to create. Uses the value in
matplotlib.rcParams by default.
layout [tuple, optional] Tuple of (rows, columns) for the layout of the histograms.
bins [int or sequence, default 10] Number of histogram bins to be used. If an integer is given,
bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges,
including left edge of first bin and right edge of last bin. In this case, bins is returned
unmodified.
backend [str, default None] Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the
plotting.backend for the whole session, set pd.options.plotting.backend.
New in version 1.0.0.
legend [bool, default False] Whether to show the legend.
New in version 1.1.0.
**kwargs All other plotting keyword arguments to be passed to matplotlib.pyplot.hist().
Returns
matplotlib.AxesSubplot or numpy.ndarray of them
See also:
matplotlib.pyplot.hist Plot a histogram using matplotlib.
Examples
This example draws a histogram based on the length and width of some animals, displayed in three bins
>>> df = pd.DataFrame({
... 'length': [1.5, 0.5, 1.2, 0.9, 3],
... 'width': [0.7, 0.2, 0.15, 0.2, 1.1]
... }, index=['pig', 'rabbit', 'duck', 'chicken', 'horse'])
>>> hist = df.hist(bins=3)
pandas.core.groupby.DataFrameGroupBy.idxmax
DataFrameGroupBy.idxmax(axis=0, skipna=True)
Return index of first occurrence of maximum over requested axis.
NA/null values are excluded.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The axis to use. 0 or ‘index’ for row-wise, 1
or ‘columns’ for column-wise.
skipna [bool, default True] Exclude NA/null values. If an entire row/column is NA, the
result will be NA.
Returns
Series Indexes of maxima along the specified axis.
Raises
ValueError
• If the row/column is empty
See also:
Notes
Examples
>>> df
consumption co2_emissions
Pork 10.51 37.20
Wheat Products 103.11 19.66
Beef 55.48 1712.00
By default, it returns the index for the maximum value in each column.
>>> df.idxmax()
consumption Wheat Products
co2_emissions Beef
dtype: object
To return the index for the maximum value in each row, use axis="columns".
>>> df.idxmax(axis="columns")
Pork co2_emissions
Wheat Products consumption
Beef co2_emissions
dtype: object
pandas.core.groupby.DataFrameGroupBy.idxmin
DataFrameGroupBy.idxmin(axis=0, skipna=True)
Return index of first occurrence of minimum over requested axis.
NA/null values are excluded.
Parameters
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The axis to use. 0 or ‘index’ for row-wise, 1
or ‘columns’ for column-wise.
skipna [bool, default True] Exclude NA/null values. If an entire row/column is NA, the
result will be NA.
Returns
Series Indexes of minima along the specified axis.
Raises
ValueError
Notes
Examples
>>> df
consumption co2_emissions
Pork 10.51 37.20
Wheat Products 103.11 19.66
Beef 55.48 1712.00
By default, it returns the index for the minimum value in each column.
>>> df.idxmin()
consumption Pork
co2_emissions Wheat Products
dtype: object
To return the index for the minimum value in each row, use axis="columns".
>>> df.idxmin(axis="columns")
Pork consumption
Wheat Products co2_emissions
Beef consumption
dtype: object
pandas.core.groupby.DataFrameGroupBy.mad
property DataFrameGroupBy.mad
Return the mean absolute deviation of the values over the requested axis.
Parameters
axis [{index (0), columns (1)}] Axis for the function to be applied on.
skipna [bool, default None] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical), count along
a particular level, collapsing into a Series.
Returns
Series or DataFrame (if level specified)
pandas.core.groupby.DataFrameGroupBy.nunique
DataFrameGroupBy.nunique(dropna=True)
Return DataFrame with counts of unique elements in each position.
Parameters
dropna [bool, default True] Don’t include NaN in the counts.
Returns
nunique: DataFrame
Examples
>>> df.groupby('id').nunique()
value1 value2
id
egg 1 1
ham 1 2
spam 2 1
pandas.core.groupby.DataFrameGroupBy.pad
DataFrameGroupBy.pad(limit=None)
Forward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame Object with missing values filled.
See also:
Series.pad Forward fill missing values in a Series.
pandas.core.groupby.DataFrameGroupBy.pct_change
pandas.core.groupby.DataFrameGroupBy.plot
property DataFrameGroupBy.plot
Class implementing the .plot attribute for groupby objects.
pandas.core.groupby.DataFrameGroupBy.quantile
DataFrameGroupBy.quantile(q=0.5, interpolation='linear')
Return group values at the given quantile, a la numpy.percentile.
Parameters
q [float or array-like, default 0.5 (50% quantile)] Value(s) between 0 and 1 providing the
quantile(s) to compute.
interpolation [{‘linear’, ‘lower’, ‘higher’, ‘midpoint’, ‘nearest’}] Method to use when the
desired quantile falls between two points.
Returns
Series or DataFrame Return type determined by caller of GroupBy object.
See also:
Series.quantile Similar method for Series.
DataFrame.quantile Similar method for DataFrame.
numpy.percentile NumPy method to compute qth percentile.
Examples
>>> df = pd.DataFrame([
... ['a', 1], ['a', 2], ['a', 3],
... ['b', 1], ['b', 3], ['b', 5]
... ], columns=['key', 'val'])
>>> df.groupby('key').quantile()
val
key
a 2.0
b 3.0
pandas.core.groupby.DataFrameGroupBy.rank
pandas.core.groupby.DataFrameGroupBy.resample
Examples
Downsample the DataFrame into 3 minute bins and sum the values of the timestamps falling into a bin.
>>> df.groupby('a').resample('3T').sum()
a b
a
0 2000-01-01 00:00:00 0 2
2000-01-01 00:03:00 0 1
5 2000-01-01 00:00:00 5 1
>>> df.groupby('a').resample('30S').sum()
a b
a
0 2000-01-01 00:00:00 0 1
2000-01-01 00:00:30 0 0
2000-01-01 00:01:00 0 1
2000-01-01 00:01:30 0 0
2000-01-01 00:02:00 0 0
2000-01-01 00:02:30 0 0
2000-01-01 00:03:00 0 1
5 2000-01-01 00:02:00 5 1
>>> df.groupby('a').resample('M').sum()
a b
a
0 2000-01-31 0 3
5 2000-01-31 5 1
Downsample the series into 3 minute bins as above, but close the right side of the bin interval.
Downsample the series into 3 minute bins and close the right side of the bin interval, but label each bin using
the right edge instead of the left.
pandas.core.groupby.DataFrameGroupBy.sample
Examples
>>> df = pd.DataFrame(
... {"a": ["red"] * 2 + ["blue"] * 2 + ["black"] * 2, "b": range(6)}
... )
>>> df
a b
0 red 0
1 red 1
2 blue 2
3 blue 3
4 black 4
5 black 5
Select one row at random for each distinct value in column a. Sampling probabilities within each group can be
controlled with weights, and the random_state argument can be used to guarantee reproducibility:
>>> df.groupby("a").sample(
... n=1,
... weights=[1, 1, 1, 0, 0, 1],
... random_state=1,
... )
a b
5 black 5
2 blue 2
0 red 0
pandas.core.groupby.DataFrameGroupBy.shift
pandas.core.groupby.DataFrameGroupBy.size
DataFrameGroupBy.size()
Compute group sizes.
Returns
DataFrame or Series Number of rows in each group as a Series if as_index is True or a
DataFrame if as_index is False.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.groupby.DataFrameGroupBy.skew
property DataFrameGroupBy.skew
Return unbiased skew over requested axis.
Normalized by N-1.
Parameters
axis [{index (0), columns (1)}] Axis for the function to be applied on.
skipna [bool, default True] Exclude NA/null values when computing the result.
level [int or level name, default None] If the axis is a MultiIndex (hierarchical), count along
a particular level, collapsing into a Series.
numeric_only [bool, default None] Include only float, int, boolean columns. If None, will
attempt to use everything, then use only numeric data. Not implemented for Series.
**kwargs Additional keyword arguments to be passed to the function.
Returns
Series or DataFrame (if level specified)
pandas.core.groupby.DataFrameGroupBy.take
property DataFrameGroupBy.take
Return the elements in the given positional indices along an axis.
This means that we are not indexing according to actual values in the index attribute of the object. We are
indexing according to the actual position of the element in the object.
Parameters
indices [array-like] An array of ints indicating which positions to take.
axis [{0 or ‘index’, 1 or ‘columns’, None}, default 0] The axis on which to select elements.
0 means that we are selecting rows, 1 means that we are selecting columns.
is_copy [bool] Before pandas 1.0, is_copy=False can be specified to ensure that the
return value is an actual copy. Starting with pandas 1.0, take always returns a copy,
and the keyword is therefore deprecated.
Deprecated since version 1.0.0.
**kwargs For compatibility with numpy.take(). Has no effect on the output.
Returns
taken [same type as caller] An array-like containing the elements taken from the object.
See also:
DataFrame.loc Select a subset of a DataFrame by labels.
DataFrame.iloc Select a subset of a DataFrame by positions.
numpy.take Take elements from an array along an axis.
Examples
We may take elements using negative integers for positive indices, starting from the end of the object, just like
with Python lists.
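The example itself was dropped from this extract; a minimal sketch of positional (including negative) indexing with take (df is a toy frame introduced for illustration; np is assumed imported):
>>> df = pd.DataFrame([('falcon', 'bird', 389.0),
...                    ('parrot', 'bird', 24.0),
...                    ('lion', 'mammal', 80.5),
...                    ('monkey', 'mammal', np.nan)],
...                   columns=['name', 'class', 'max_speed'],
...                   index=[0, 2, 3, 1])
>>> df.take([-1, -2])
     name   class  max_speed
1  monkey  mammal        NaN
3    lion  mammal       80.5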
pandas.core.groupby.DataFrameGroupBy.tshift
property DataFrameGroupBy.tshift
Shift the time index, using the index’s frequency if available.
Deprecated since version 1.1.0: Use shift instead.
Parameters
periods [int] Number of periods to move, can be positive or negative.
freq [DateOffset, timedelta, or str, default None] Increment to use from the tseries module
or time rule expressed as a string (e.g. ‘EOM’).
axis [{0 or ‘index’, 1 or ‘columns’, None}, default 0] Corresponds to the axis that contains
the Index.
Returns
shifted [Series/DataFrame]
Notes
If freq is not specified then tries to use the freq or inferred_freq attributes of the index. If neither of those
attributes exists, a ValueError is raised.
The following methods are available only for SeriesGroupBy objects.
pandas.core.groupby.SeriesGroupBy.hist
property SeriesGroupBy.hist
Draw histogram of the input series using matplotlib.
Parameters
by [object, optional] If passed, then used to form histograms for separate groups.
ax [matplotlib axis object] If not passed, uses gca().
grid [bool, default True] Whether to show axis grid lines.
xlabelsize [int, default None] If specified changes the x-axis label size.
xrot [float, default None] Rotation of x axis labels.
ylabelsize [int, default None] If specified changes the y-axis label size.
yrot [float, default None] Rotation of y axis labels.
figsize [tuple, default None] Figure size in inches by default.
bins [int or sequence, default 10] Number of histogram bins to be used. If an integer is given,
bins + 1 bin edges are calculated and returned. If bins is a sequence, gives bin edges,
including left edge of first bin and right edge of last bin. In this case, bins is returned
unmodified.
backend [str, default None] Backend to use instead of the backend specified in the option
plotting.backend. For instance, ‘matplotlib’. Alternatively, to specify the
plotting.backend for the whole session, set pd.options.plotting.backend.
New in version 1.0.0.
legend [bool, default False] Whether to show the legend.
New in version 1.1.0.
**kwargs To be passed to the actual plotting function.
Returns
matplotlib.AxesSubplot A histogram plot.
See also:
matplotlib.axes.Axes.hist Plot a histogram using matplotlib.
pandas.core.groupby.SeriesGroupBy.nlargest
property SeriesGroupBy.nlargest
Return the largest n elements.
Parameters
n [int, default 5] Return this many descending sorted values.
keep [{‘first’, ‘last’, ‘all’}, default ‘first’] When there are duplicate values that cannot all fit
in a Series of n elements:
• ‘first’ : return the first n occurrences in order of appearance.
• ‘last’ : return the last n occurrences in reverse order of appearance.
• ‘all’ : keep all occurrences. This can result in a Series of size larger than n.
Returns
Series The n largest values in the Series, sorted in decreasing order.
See also:
Series.nsmallest Get the n smallest elements.
Series.sort_values Sort Series by values.
Series.head Return the first n rows.
Notes
Examples
>>> s.nlargest()
France 65000000
Italy 59000000
Malta 434000
Maldives 434000
Brunei 434000
dtype: int64
The n largest elements where n=3. Default keep value is ‘first’ so Malta will be kept.
>>> s.nlargest(3)
France 65000000
Italy 59000000
Malta 434000
dtype: int64
The n largest elements where n=3 and keeping the last duplicates. Brunei will be kept since it is the last with
value 434000 based on the index order.
The n largest elements where n=3 with all duplicates kept. Note that the returned Series has five elements due
to the three duplicates.
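The calls described in the two preceding sentences were dropped from this extract; with the same s they would read (a sketch, reconstructed from the behavior described above):
>>> s.nlargest(3, keep='last')
France    65000000
Italy     59000000
Brunei      434000
dtype: int64
>>> s.nlargest(3, keep='all')
France      65000000
Italy       59000000
Malta         434000
Maldives      434000
Brunei        434000
dtype: int64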
pandas.core.groupby.SeriesGroupBy.nsmallest
property SeriesGroupBy.nsmallest
Return the smallest n elements.
Parameters
n [int, default 5] Return this many ascending sorted values.
keep [{‘first’, ‘last’, ‘all’}, default ‘first’] When there are duplicate values that cannot all fit
in a Series of n elements:
• ‘first’ : return the first n occurrences in order of appearance.
• ‘last’ : return the last n occurrences in reverse order of appearance.
• ‘all’ : keep all occurrences. This can result in a Series of size larger than n.
Returns
Series The n smallest values in the Series, sorted in increasing order.
See also:
Series.nlargest Get the n largest elements.
Series.sort_values Sort Series by values.
Series.head Return the first n rows.
Notes
Faster than .sort_values().head(n) for small n relative to the size of the Series object.
Examples
>>> s.nsmallest()
Montserrat 5200
Nauru 11300
Tuvalu 11300
Anguilla 11300
Iceland 337000
dtype: int64
The n smallest elements where n=3. Default keep value is ‘first’ so Nauru and Tuvalu will be kept.
>>> s.nsmallest(3)
Montserrat 5200
Nauru 11300
Tuvalu 11300
dtype: int64
The n smallest elements where n=3 and keeping the last duplicates. Anguilla and Tuvalu will be kept since they
are the last with value 11300 based on the index order.
The n smallest elements where n=3 with all duplicates kept. Note that the returned Series has four elements due
to the three duplicates.
pandas.core.groupby.SeriesGroupBy.nunique
SeriesGroupBy.nunique(dropna=True)
Return number of unique elements in the group.
Returns
Series Number of unique values within each group.
pandas.core.groupby.SeriesGroupBy.unique
property SeriesGroupBy.unique
Return unique values of Series object.
Uniques are returned in order of appearance. Hash table-based unique, therefore does NOT sort.
Returns
ndarray or ExtensionArray The unique values returned as a NumPy array. See Notes.
See also:
unique Top-level unique method for any 1-d array-like object.
Index.unique Return Index with unique values from an Index object.
Notes
Returns the unique values as a NumPy array. In case of an extension-array backed Series, a new
ExtensionArray of that type with just the unique values is returned. This includes
• Categorical
• Period
• Datetime with Timezone
• Interval
• Sparse
• IntegerNA
See Examples section.
Examples
>>> pd.Series(pd.Categorical(list('baabc'))).unique()
['b', 'a', 'c']
Categories (3, object): ['b', 'a', 'c']
pandas.core.groupby.SeriesGroupBy.value_counts
pandas.core.groupby.SeriesGroupBy.is_monotonic_increasing
property SeriesGroupBy.is_monotonic_increasing
Alias for is_monotonic.
pandas.core.groupby.SeriesGroupBy.is_monotonic_decreasing
property SeriesGroupBy.is_monotonic_decreasing
Return boolean if values in the object are monotonic_decreasing.
Returns
bool
The following methods are available only for DataFrameGroupBy objects.
pandas.core.groupby.DataFrameGroupBy.corrwith
property DataFrameGroupBy.corrwith
Compute pairwise correlation.
Pairwise correlation is computed between rows or columns of DataFrame with rows or columns of Series or
DataFrame. DataFrames are first aligned along both axes before computing the correlations.
Parameters
other [DataFrame, Series] Object with which to compute correlations.
axis [{0 or ‘index’, 1 or ‘columns’}, default 0] The axis to use. 0 or ‘index’ to compute
column-wise, 1 or ‘columns’ for row-wise.
drop [bool, default False] Drop missing indices from result.
method [{‘pearson’, ‘kendall’, ‘spearman’} or callable] Method of correlation:
• pearson : standard correlation coefficient
• kendall : Kendall Tau correlation coefficient
• spearman : Spearman rank correlation
• callable: callable with input two 1d ndarrays and returning a float.
New in version 0.24.0.
Returns
Series Pairwise correlations.
See also:
DataFrame.corr Compute pairwise correlation of columns.
pandas.core.groupby.DataFrameGroupBy.boxplot
Examples
You can create boxplots for grouped data and show them as separate subplots:
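The example itself did not survive extraction; a minimal sketch, with df and its columns introduced only for illustration (pd and np are assumed imported):
>>> df = pd.DataFrame(np.random.randn(10, 2), columns=['Col1', 'Col2'])
>>> df['group'] = np.random.choice(['A', 'B'], size=10)
>>> boxplots = df.groupby('group').boxplot()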
3.11 Resampling
pandas.core.resample.Resampler.__iter__
Resampler.__iter__()
Resampler iterator.
Returns
Generator yielding sequence of (name, subsetted object)
for each group.
See also:
GroupBy.__iter__ Generator yielding sequence for each group.
pandas.core.resample.Resampler.groups
property Resampler.groups
Dict {group name -> group labels}.
pandas.core.resample.Resampler.indices
property Resampler.indices
Dict {group name -> group indices}.
pandas.core.resample.Resampler.get_group
Resampler.get_group(name, obj=None)
Construct DataFrame from group with provided name.
Parameters
name [object] The name of the group to get as a DataFrame.
obj [DataFrame, default None] The DataFrame to take the DataFrame out of. If it is None,
the object groupby was called on will be used.
Returns
group [same type as obj]
Resampler.apply(func, *args, **kwargs) Aggregate using one or more operations over the specified axis.
Resampler.aggregate(func, *args, **kwargs) Aggregate using one or more operations over the specified axis.
Resampler.transform(arg, *args, **kwargs) Call function producing a like-indexed Series on each group and return a Series with the transformed values.
Resampler.pipe(func, *args, **kwargs) Apply a function func with arguments to this Resampler object and return the function's result.
pandas.core.resample.Resampler.apply
Notes
Examples
>>> s = pd.Series([1, 2, 3, 4, 5],
...               index=pd.date_range('20130101', periods=5, freq='s'))
>>> s
2013-01-01 00:00:00 1
2013-01-01 00:00:01 2
2013-01-01 00:00:02 3
2013-01-01 00:00:03 4
2013-01-01 00:00:04 5
Freq: S, dtype: int64
>>> r = s.resample('2s')
>>> r
DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left,
label=left, convention=start]
>>> r.agg(np.sum)
2013-01-01 00:00:00 3
2013-01-01 00:00:02 7
2013-01-01 00:00:04 5
Freq: 2S, dtype: int64
>>> r.agg(['sum','mean','max'])
sum mean max
2013-01-01 00:00:00 3 1.5 2
2013-01-01 00:00:02 7 3.5 4
2013-01-01 00:00:04 5 5.0 5
pandas.core.resample.Resampler.aggregate
Notes
Examples
>>> s = pd.Series([1, 2, 3, 4, 5],
...               index=pd.date_range('20130101', periods=5, freq='s'))
>>> s
2013-01-01 00:00:00 1
2013-01-01 00:00:01 2
2013-01-01 00:00:02 3
2013-01-01 00:00:03 4
2013-01-01 00:00:04 5
Freq: S, dtype: int64
>>> r = s.resample('2s')
>>> r
DatetimeIndexResampler [freq=<2 * Seconds>, axis=0, closed=left,
label=left, convention=start]
>>> r.agg(np.sum)
2013-01-01 00:00:00 3
2013-01-01 00:00:02 7
2013-01-01 00:00:04 5
Freq: 2S, dtype: int64
>>> r.agg(['sum','mean','max'])
sum mean max
2013-01-01 00:00:00 3 1.5 2
2013-01-01 00:00:02 7 3.5 4
2013-01-01 00:00:04 5 5.0 5
pandas.core.resample.Resampler.transform
Examples
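The example was lost in extraction; a minimal sketch demeaning values within each hourly bin while keeping the original index (s is a toy Series introduced for illustration):
>>> s = pd.Series([1, 2, 3, 4],
...               index=pd.date_range('20180101', periods=4, freq='30min'))
>>> s.resample('H').transform(lambda x: x - x.mean())
2018-01-01 00:00:00   -0.5
2018-01-01 00:30:00    0.5
2018-01-01 01:00:00   -0.5
2018-01-01 01:30:00    0.5
Freq: 30T, dtype: float64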
pandas.core.resample.Resampler.pipe
>>> (df.groupby('group')
... .pipe(f)
... .pipe(g, arg1=a)
... .pipe(h, arg2=b, arg3=c))
Notes
Examples
To get the difference between each 2-day period’s maximum and minimum value in one pass, you can do
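The call the sentence above refers to was dropped; a minimal sketch with a toy daily frame df introduced for illustration:
>>> df = pd.DataFrame({'A': [1, 2, 3, 4]},
...                   index=pd.date_range('2012-08-02', periods=4))
>>> df.resample('2D').pipe(lambda x: x.max() - x.min())
            A
2012-08-02  1
2012-08-04  1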
3.11.3 Upsampling
pandas.core.resample.Resampler.ffill
Resampler.ffill(limit=None)
Forward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
An upsampled Series.
See also:
Series.fillna Fill NA/NaN values using the specified method.
DataFrame.fillna Fill NA/NaN values using the specified method.
pandas.core.resample.Resampler.backfill
Resampler.backfill(limit=None)
Backward fill the new missing values in the resampled data.
In statistics, imputation is the process of replacing missing data with substituted values [1]. When resampling
data, missing values may appear (e.g., when the resampling frequency is higher than the original frequency).
The backward fill will replace NaN values that appeared in the resampled data with the next value in the original
sequence. Missing values that existed in the original data will not be modified.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series, DataFrame An upsampled Series or DataFrame with backward filled NaN values.
See also:
References
[1]
Examples
Resampling a Series:
>>> s.resample('30min').backfill()
2018-01-01 00:00:00 1
2018-01-01 00:30:00 2
2018-01-01 01:00:00 2
2018-01-01 01:30:00 3
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
>>> s.resample('15min').backfill(limit=2)
2018-01-01 00:00:00 1.0
2018-01-01 00:15:00 NaN
2018-01-01 00:30:00 2.0
2018-01-01 00:45:00 2.0
2018-01-01 01:00:00 2.0
2018-01-01 01:15:00 NaN
2018-01-01 01:30:00 3.0
2018-01-01 01:45:00 3.0
2018-01-01 02:00:00 3.0
Freq: 15T, dtype: float64
>>> df.resample('30min').backfill()
a b
2018-01-01 00:00:00 2.0 1
2018-01-01 00:30:00 NaN 3
2018-01-01 01:00:00 NaN 3
2018-01-01 01:30:00 6.0 5
2018-01-01 02:00:00 6.0 5
>>> df.resample('15min').backfill(limit=2)
a b
2018-01-01 00:00:00 2.0 1.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:30:00 NaN 3.0
2018-01-01 00:45:00 NaN 3.0
2018-01-01 01:00:00 NaN 3.0
2018-01-01 01:15:00 NaN NaN
2018-01-01 01:30:00 6.0 5.0
2018-01-01 01:45:00 6.0 5.0
2018-01-01 02:00:00 6.0 5.0
pandas.core.resample.Resampler.bfill
Resampler.bfill(limit=None)
Backward fill the new missing values in the resampled data.
In statistics, imputation is the process of replacing missing data with substituted values [1]. When resampling
data, missing values may appear (e.g., when the resampling frequency is higher than the original frequency).
The backward fill will replace NaN values that appeared in the resampled data with the next value in the original
sequence. Missing values that existed in the original data will not be modified.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series, DataFrame An upsampled Series or DataFrame with backward filled NaN values.
See also:
bfill Alias of backfill.
fillna Fill NaN values using the specified method, which can be ‘backfill’.
nearest Fill NaN values with nearest neighbor starting from center.
pad Forward fill NaN values.
Series.fillna Fill NaN values in the Series using the specified method, which can be ‘backfill’.
DataFrame.fillna Fill NaN values in the DataFrame using the specified method, which can be ‘backfill’.
References
[1]
Examples
Resampling a Series:
>>> s.resample('30min').backfill()
2018-01-01 00:00:00 1
2018-01-01 00:30:00 2
2018-01-01 01:00:00 2
2018-01-01 01:30:00 3
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
>>> s.resample('15min').backfill(limit=2)
2018-01-01 00:00:00 1.0
2018-01-01 00:15:00 NaN
2018-01-01 00:30:00 2.0
2018-01-01 00:45:00 2.0
2018-01-01 01:00:00 2.0
2018-01-01 01:15:00 NaN
2018-01-01 01:30:00 3.0
2018-01-01 01:45:00 3.0
2018-01-01 02:00:00 3.0
Freq: 15T, dtype: float64
>>> df.resample('30min').backfill()
a b
2018-01-01 00:00:00 2.0 1
2018-01-01 00:30:00 NaN 3
2018-01-01 01:00:00 NaN 3
2018-01-01 01:30:00 6.0 5
2018-01-01 02:00:00 6.0 5
>>> df.resample('15min').backfill(limit=2)
a b
2018-01-01 00:00:00 2.0 1.0
2018-01-01 00:15:00 NaN NaN
2018-01-01 00:30:00 NaN 3.0
...
pandas.core.resample.Resampler.pad
Resampler.pad(limit=None)
Forward fill the values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
An upsampled Series.
See also:
Series.fillna Fill NA/NaN values using the specified method.
DataFrame.fillna Fill NA/NaN values using the specified method.
pandas.core.resample.Resampler.nearest
Resampler.nearest(limit=None)
Resample by using the nearest value.
When resampling data, missing values may appear (e.g., when the resampling frequency is higher than the
original frequency). The nearest method will replace NaN values that appeared in the resampled data with the
value from the nearest member of the sequence, based on the index value. Missing values that existed in the
original data will not be modified. If limit is given, fill only this many values in each direction for each of the
original values.
Parameters
limit [int, optional] Limit of how many values to fill.
Returns
Series or DataFrame An upsampled Series or DataFrame with NaN values filled with their
nearest value.
See also:
backfill Backward fill the new missing values in the resampled data.
pad Forward fill NaN values.
Examples
>>> s.resample('15min').nearest()
2018-01-01 00:00:00 1
2018-01-01 00:15:00 1
2018-01-01 00:30:00 2
2018-01-01 00:45:00 2
2018-01-01 01:00:00 2
Freq: 15T, dtype: int64
>>> s.resample('15min').nearest(limit=1)
2018-01-01 00:00:00 1.0
2018-01-01 00:15:00 1.0
2018-01-01 00:30:00 NaN
2018-01-01 00:45:00 2.0
2018-01-01 01:00:00 2.0
Freq: 15T, dtype: float64
pandas.core.resample.Resampler.fillna
Resampler.fillna(method, limit=None)
Fill missing values introduced by upsampling.
In statistics, imputation is the process of replacing missing data with substituted values [1]. When resampling
data, missing values may appear (e.g., when the resampling frequency is higher than the original frequency).
Missing values that existed in the original data will not be modified.
Parameters
method [{‘pad’, ‘backfill’, ‘ffill’, ‘bfill’, ‘nearest’}] Method to use for filling holes in
resampled data:
• ‘pad’ or ‘ffill’: use previous valid observation to fill gap (forward fill).
• ‘backfill’ or ‘bfill’: use next valid observation to fill gap.
• ‘nearest’: use nearest valid observation to fill gap.
limit [int, optional] Limit of how many consecutive missing values to fill.
Returns
Series or DataFrame An upsampled Series or DataFrame with missing values filled.
See also:
backfill Backward fill NaN values in the resampled data.
pad Forward fill NaN values in the resampled data.
nearest Fill NaN values in the resampled data with nearest neighbor starting from center.
References
[1]
Examples
Resampling a Series:
>>> s.resample("30min").asfreq()
2018-01-01 00:00:00 1.0
2018-01-01 00:30:00 NaN
2018-01-01 01:00:00 2.0
2018-01-01 01:30:00 NaN
2018-01-01 02:00:00 3.0
Freq: 30T, dtype: float64
>>> s.resample('30min').fillna("backfill")
2018-01-01 00:00:00 1
2018-01-01 00:30:00 2
2018-01-01 01:00:00 2
2018-01-01 01:30:00 3
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
>>> s.resample('30min').fillna("pad")
2018-01-01 00:00:00 1
2018-01-01 00:30:00 1
...
>>> s.resample('30min').fillna("nearest")
2018-01-01 00:00:00 1
2018-01-01 00:30:00 2
2018-01-01 01:00:00 2
2018-01-01 01:30:00 3
2018-01-01 02:00:00 3
Freq: 30T, dtype: int64
>>> sm.resample('30min').fillna('backfill')
2018-01-01 00:00:00 1.0
2018-01-01 00:30:00 NaN
2018-01-01 01:00:00 NaN
2018-01-01 01:30:00 3.0
2018-01-01 02:00:00 3.0
Freq: 30T, dtype: float64
>>> sm.resample('30min').fillna('pad')
2018-01-01 00:00:00 1.0
2018-01-01 00:30:00 1.0
2018-01-01 01:00:00 NaN
2018-01-01 01:30:00 NaN
2018-01-01 02:00:00 3.0
Freq: 30T, dtype: float64
>>> sm.resample('30min').fillna('nearest')
2018-01-01 00:00:00 1.0
2018-01-01 00:30:00 NaN
2018-01-01 01:00:00 NaN
2018-01-01 01:30:00 3.0
2018-01-01 02:00:00 3.0
Freq: 30T, dtype: float64
DataFrame resampling is done column-wise. All the same options are available.
>>> df = pd.DataFrame({'a': [2, np.nan, 6], 'b': [1, 3, 5]},
... index=pd.date_range('20180101', periods=3,
... freq='h'))
>>> df
a b
2018-01-01 00:00:00 2.0 1
...
>>> df.resample('30min').fillna("bfill")
a b
2018-01-01 00:00:00 2.0 1
2018-01-01 00:30:00 NaN 3
2018-01-01 01:00:00 NaN 3
2018-01-01 01:30:00 6.0 5
2018-01-01 02:00:00 6.0 5
pandas.core.resample.Resampler.asfreq
Resampler.asfreq(fill_value=None)
Return the values at the new freq, essentially a reindex.
Parameters
fill_value [scalar, optional] Value to use for missing values, applied during upsampling (note
this does not fill NaNs that already were present).
Returns
DataFrame or Series Values at the specified freq.
See also:
Series.asfreq Convert TimeSeries to specified frequency.
DataFrame.asfreq Convert TimeSeries to specified frequency.
pandas.core.resample.Resampler.interpolate
Notes
The ‘krogh’, ‘piecewise_polynomial’, ‘spline’, ‘pchip’ and ‘akima’ methods are wrappers around the respective
SciPy implementations of similar names. These use the actual numerical values of the index. For more
information on their behavior, see the SciPy documentation and SciPy tutorial.
Examples
Filling in NaN in a Series by padding, but filling at most two consecutive NaN at a time.
Filling in NaN in a Series via polynomial interpolation or splines: Both ‘polynomial’ and ‘spline’ methods
require that you also specify an order (int).
Fill the DataFrame forward (that is, going down) along each column using linear interpolation.
Note how the last entry in column ‘a’ is interpolated differently, because there is no entry after it to use for
interpolation. Note how the first entry in column ‘b’ remains NaN, because there is no entry before it to use for
interpolation.
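The worked examples above were dropped from this extract; a minimal sketch of upsampling followed by linear interpolation (s is a toy Series introduced for illustration):
>>> s = pd.Series([1, 3, 5],
...               index=pd.date_range('2018-01-01', periods=3, freq='H'))
>>> s.resample('30min').interpolate()
2018-01-01 00:00:00    1.0
2018-01-01 00:30:00    2.0
2018-01-01 01:00:00    3.0
2018-01-01 01:30:00    4.0
2018-01-01 02:00:00    5.0
Freq: 30T, dtype: float64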
pandas.core.resample.Resampler.count
Resampler.count()
Compute count of group, excluding missing values.
Returns
Series or DataFrame Count of values within each group.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.resample.Resampler.nunique
Resampler.nunique(_method='nunique')
Return number of unique elements in the group.
Returns
Series Number of unique values within each group.
pandas.core.resample.Resampler.first
pandas.core.resample.Resampler.last
pandas.core.resample.Resampler.max
pandas.core.resample.Resampler.mean
Examples
Groupby one column and return the mean of the remaining columns in each group.
>>> df.groupby('A').mean()
B C
A
1 3.0 1.333333
2 4.0 1.500000
Groupby two columns and return the mean of the remaining column.
Groupby one column and return the mean of only particular column in the group.
>>> df.groupby('A')['B'].mean()
A
1 3.0
2 4.0
Name: B, dtype: float64
pandas.core.resample.Resampler.median
pandas.core.resample.Resampler.min
pandas.core.resample.Resampler.ohlc
pandas.core.resample.Resampler.prod
pandas.core.resample.Resampler.size
Resampler.size()
Compute group sizes.
Returns
DataFrame or Series Number of rows in each group as a Series if as_index is True or a
DataFrame if as_index is False.
See also:
Series.groupby Apply a function groupby to a Series.
DataFrame.groupby Apply a function groupby to each row or column of a DataFrame.
pandas.core.resample.Resampler.sem
pandas.core.resample.Resampler.std
pandas.core.resample.Resampler.sum
pandas.core.resample.Resampler.var
pandas.core.resample.Resampler.quantile
Resampler.quantile(q=0.5, **kwargs)
Return value at the given quantile.
New in version 0.24.0.
Parameters
q [float or array-like, default 0.5 (50% quantile)]
Returns
DataFrame or Series Quantile of values within each group.
See also:
Series.quantile Return a series, where the index is q and the values are the quantiles.
DataFrame.quantile Return a DataFrame, where the columns are the columns of self, and the values are
the quantiles.
DataFrameGroupBy.quantile Return a DataFrame, where the columns are groupby columns, and the
values are its quantiles.
3.12 Style
Styler(data[, precision, table_styles, ...]) Helps style a DataFrame or Series according to the data with HTML and CSS.
Styler.from_custom_template(searchpath, name) Factory function for creating a subclass of Styler.
pandas.io.formats.style.Styler
<uuid> is the unique identifier, <num_row> is the row number and <num_col> is
the column number.
na_rep [str, optional] Representation for missing values. If na_rep is None, no special
formatting is applied.
New in version 1.0.0.
uuid_len [int, default 5] If uuid is not specified, the length of the uuid to randomly generate,
expressed in hex characters, in range [0, 32].
New in version 1.2.0.
See also:
DataFrame.style Return a Styler object containing methods for building a styled HTML representation
for the DataFrame.
Notes
Most styling will be done by passing style functions into Styler.apply or Styler.applymap. Style
functions should return values with strings containing CSS 'attr: value' that will be applied to the
indicated cells.
If using in the Jupyter notebook, Styler has defined a _repr_html_ to automatically render itself. Otherwise
call Styler.render to get the generated HTML.
CSS classes are attached to the generated HTML
• Index and Column names include index_name and level<k> where k is its level in a MultiIndex
• Index label cells include
– row_heading
– row<n> where n is the numeric position of the row
– level<k> where k is the level in a MultiIndex
• Column label cells include
– col_heading
– col<n> where n is the numeric position of the column
– level<k> where k is the level in a MultiIndex
• Blank cells include blank
• Data cells include data
Attributes
Methods
pandas.io.formats.style.Styler.apply
Notes
The output shape of func should match the input, i.e. if x is the input row, column, or table (depending
on axis), then func(x).shape == x.shape should be true.
This is similar to DataFrame.apply, except that axis=None applies the function to the entire
DataFrame at once, rather than column-wise or row-wise.
Examples
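The example was lost in extraction; a minimal sketch highlighting the maximum of each column (highlight_max and df are introduced here only for illustration; pd and np are assumed imported):
>>> def highlight_max(x):
...     return ['background-color: yellow' if v == x.max() else '' for v in x]
>>> df = pd.DataFrame(np.random.randn(5, 2), columns=['A', 'B'])
>>> df.style.apply(highlight_max)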
pandas.io.formats.style.Styler.applymap
Styler.where Updates the HTML representation with a style which is selected in accordance with
the return value of a function.
pandas.io.formats.style.Styler.background_gradient
axis [{0 or ‘index’, 1 or ‘columns’, None}, default 0] Apply to each column (axis=0
or 'index'), to each row (axis=1 or 'columns'), or to the entire DataFrame
at once with axis=None.
subset [IndexSlice] A valid slice for data to limit the style application to.
text_color_threshold [float or int] Luminance threshold for determining text color.
Facilitates text visibility across varying background colors. From 0 to 1. 0 = all
text is dark colored, 1 = all text is light colored.
New in version 0.24.0.
vmin [float, optional] Minimum data value that corresponds to colormap minimum
value. When None (default): the minimum value of the data will be used.
New in version 1.0.0.
vmax [float, optional] Maximum data value that corresponds to colormap maximum
value. When None (default): the maximum value of the data will be used.
New in version 1.0.0.
Returns
self [Styler]
Raises
ValueError If text_color_threshold is not a value from 0 to 1.
Notes
Set text_color_threshold or tune low and high to keep the text legible by not using the entire
range of the color map. The range of the data is extended by low * (x.max() - x.min()) and
high * (x.max() - x.min()) before normalizing.
pandas.io.formats.style.Styler.bar
• ‘mid’ : the center of the cell is at (max-min)/2, or if values are all negative
(positive) the zero is aligned at the right (left) of the cell.
vmin [float, optional] Minimum bar value, defining the left hand limit of the bar drawing
range, lower values are clipped to vmin. When None (default): the minimum
value of the data will be used.
New in version 0.24.0.
vmax [float, optional] Maximum bar value, defining the right hand limit of the bar drawing
range, higher values are clipped to vmax. When None (default): the maximum
value of the data will be used.
New in version 0.24.0.
New in version 0.24.0.
Returns
self [Styler]
pandas.io.formats.style.Styler.clear
Styler.clear()
Reset the styler, removing any previously applied styles.
Returns None.
pandas.io.formats.style.Styler.export
Styler.export()
Export the styles applied to the current Styler.
Can be applied to a second style with Styler.use.
Returns
styles [list]
See also:
pandas.io.formats.style.Styler.format
Returns
self [Styler]
Notes
Examples
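For instance, a format string can be applied to the whole frame or per column (illustrative data):
>>> df = pd.DataFrame({'A': [1.23456, 2.34567], 'B': [0.1, 0.2]})  # illustrative data
>>> df.style.format("{:.2f}")
>>> df.style.format({'B': "{:.1%}"})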
pandas.io.formats.style.Styler.from_custom_template
pandas.io.formats.style.Styler.hide_columns
Styler.hide_columns(subset)
Hide columns from rendering.
Parameters
subset [IndexSlice] An argument to DataFrame.loc that identifies which columns
are hidden.
Returns
self [Styler]
pandas.io.formats.style.Styler.hide_index
Styler.hide_index()
Hide any indices from rendering.
Returns
self [Styler]
pandas.io.formats.style.Styler.highlight_max
pandas.io.formats.style.Styler.highlight_min
pandas.io.formats.style.Styler.highlight_null
Styler.highlight_null(null_color='red', subset=None)
Shade the background null_color for missing values.
Parameters
null_color [str, default ‘red’]
subset [label or list of labels, default None] A valid slice for data to limit the style
application to.
New in version 1.1.0.
Returns
self [Styler]
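For example (illustrative data; the subset keyword requires pandas 1.1 or later):
>>> df = pd.DataFrame({'A': [1, None], 'B': [3, 4]})  # illustrative data
>>> df.style.highlight_null(null_color='yellow')
>>> df.style.highlight_null(subset=['A'])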
pandas.io.formats.style.Styler.pipe
Notes
Like DataFrame.pipe(), this method can simplify the application of several user-defined functions
to a styler. Instead of writing:
f(g(df.style.set_precision(3), arg1=a), arg2=b, arg3=c)
users can write:
(df.style.set_precision(3)
.pipe(g, arg1=a)
.pipe(f, arg2=b, arg3=c))
In particular, this allows users to define functions that take a styler object, along with other parameters,
and return the styler after making styling changes (such as calling Styler.apply() or Styler.
set_properties()). Using .pipe, these user-defined style “transformations” can be interleaved
with calls to the built-in Styler interface.
Examples
The user-defined format_conversion function above can be called within a sequence of other style
modifications:
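For instance, assuming a format_conversion helper along these lines (the helper body and the example data are illustrative):
>>> def format_conversion(styler):
...     # right-align all cells and render the 'conversion' column as a percentage
...     return (styler.set_properties(**{'text-align': 'right'})
...                   .format({'conversion': '{:.1%}'}))
>>> df = pd.DataFrame({'trial': [0, 1, 2, 3, 4],
...                    'conversion': [0.75, 0.85, 0.60, 0.70, 0.72]})  # illustrative data
>>> (df.style
...    .highlight_min(subset=['conversion'], color='yellow')
...    .pipe(format_conversion)
...    .set_caption("Results with minimum conversion highlighted."))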
pandas.io.formats.style.Styler.render
Styler.render(**kwargs)
Render the built up styles to HTML.
Parameters
**kwargs Any additional keyword arguments are passed through to self.
template.render. This is useful when you need to provide additional vari-
ables for a custom template.
Returns
rendered [str] The rendered HTML.
Notes
Styler objects have defined the _repr_html_ method which automatically calls self.render()
when it’s the last item in a Notebook cell. When calling Styler.render() directly, wrap the result
in IPython.display.HTML to view the rendered HTML in the notebook.
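For example, assuming df is a numeric DataFrame (a sketch, not part of the original docstring):
>>> from IPython.display import HTML
>>> styled = df.style.applymap(lambda v: 'color: red' if v < 0 else '')
>>> HTML(styled.render())  # display the rendered HTML in a notebook cell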
Pandas uses the following keys in render. Arguments passed in **kwargs take precedence, so think
carefully if you want to override them:
• head
• cellstyle
• body
• uuid
• precision
• table_styles
• caption
• table_attributes
pandas.io.formats.style.Styler.set_caption
Styler.set_caption(caption)
Set the caption on a Styler.
Parameters
caption [str]
Returns
self [Styler]
pandas.io.formats.style.Styler.set_na_rep
Styler.set_na_rep(na_rep)
Set the missing data representation on a Styler.
New in version 1.0.0.
Parameters
na_rep [str]
Returns
self [Styler]
pandas.io.formats.style.Styler.set_precision
Styler.set_precision(precision)
Set the precision used to render.
Parameters
precision [int]
Returns
self [Styler]
pandas.io.formats.style.Styler.set_properties
Styler.set_properties(subset=None, **kwargs)
Method to set one or more non-data dependent properties for each cell.
Parameters
subset [IndexSlice] A valid slice for data to limit the style application to.
**kwargs [dict] A dictionary of property, value pairs to be set for each cell.
Returns
self [Styler]
Examples
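Illustrative usage (any valid CSS property/value pairs can be passed):
>>> df = pd.DataFrame(np.random.randn(10, 4))  # illustrative data
>>> df.style.set_properties(color="white", align="right")
>>> df.style.set_properties(**{'background-color': 'yellow'})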
pandas.io.formats.style.Styler.set_table_attributes
Styler.set_table_attributes(attributes)
Set the table attributes.
These are the items that show up in the opening <table> tag in addition to automatic (by default) id.
Parameters
attributes [str]
Returns
self [Styler]
Examples
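Illustrative usage:
>>> df = pd.DataFrame(np.random.randn(10, 4))  # illustrative data
>>> df.style.set_table_attributes('class="pure-table"')
>>> # produces <table class="pure-table" id="T_..."> ... in the rendered HTML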
pandas.io.formats.style.Styler.set_table_styles
overwrite [boolean, default True] Styles are replaced if True, or extended if False. CSS
rules are preserved so most recent styles set will dominate if selectors intersect.
New in version 1.2.0.
Returns
self [Styler]
Examples
>>> df.style.set_table_styles({
... 'A': [{'selector': '',
... 'props': [('color', 'red')]}],
... 'B': [{'selector': 'td',
... 'props': [('color', 'blue')]}]
... }, overwrite=False)
>>> df.style.set_table_styles({
... 0: [{'selector': 'td:hover',
... 'props': [('font-size', '25px')]}]
... }, axis=1, overwrite=False)
pandas.io.formats.style.Styler.set_td_classes
Styler.set_td_classes(classes)
Add string based CSS class names to data cells that will appear within the Styler HTML result. These
classes are added within specified <td> elements.
Parameters
classes [DataFrame] DataFrame containing strings that will be translated to CSS
classes, mapped by identical column and index values that must exist on the un-
derlying Styler data. None, NaN values, and empty strings will be ignored and not
affect the rendered HTML.
Returns
self [Styler]
Examples
>>> df = pd.DataFrame([[1]])
>>> css = pd.DataFrame(["other-class"])
>>> s = Styler(df, uuid="_", cell_ids=False).set_td_classes(css)
>>> s.hide_index().render()
'<style type="text/css" ></style>'
'<table id="T__" >'
' <thead>'
' <tr><th class="col_heading level0 col0" >0</th></tr>'
' </thead>'
' <tbody>'
' <tr><td class="data row0 col0 other-class" >1</td></tr>'
' </tbody>'
'</table>'
pandas.io.formats.style.Styler.set_uuid
Styler.set_uuid(uuid)
Set the uuid for a Styler.
Parameters
uuid [str]
Returns
self [Styler]
pandas.io.formats.style.Styler.to_excel
To write a single Styler to an Excel .xlsx file it is only necessary to specify a target file name. To write to
multiple sheets it is necessary to create an ExcelWriter object with a target file name, and specify a sheet
in the file to write to.
Multiple sheets may be written to by specifying unique sheet_name. With all data written to the file it
is necessary to save the changes. Note that creating an ExcelWriter object with a file name that already
exists will result in the contents of the existing file being erased.
Parameters
excel_writer [path-like, file-like, or ExcelWriter object] File path or existing Excel-
Writer.
sheet_name [str, default ‘Sheet1’] Name of sheet which will contain DataFrame.
na_rep [str, default ‘’] Missing data representation.
float_format [str, optional] Format string for floating point numbers. For example
float_format="%.2f" will format 0.1234 to 0.12.
columns [sequence or list of str, optional] Columns to write.
header [bool or list of str, default True] Write out the column names. If a list of strings
is given, it is assumed to be aliases for the column names.
index [bool, default True] Write row names (index).
index_label [str or sequence, optional] Column label for index column(s) if desired.
If not specified, and header and index are True, then the index names are used. A
sequence should be given if the DataFrame uses MultiIndex.
startrow [int, default 0] Upper left cell row to dump data frame.
startcol [int, default 0] Upper left cell column to dump data frame.
engine [str, optional] Write engine to use, ‘openpyxl’ or ‘xlsxwriter’. You can also set
this via the options io.excel.xlsx.writer, io.excel.xls.writer,
and io.excel.xlsm.writer.
Deprecated since version 1.2.0: As the xlwt package is no longer maintained, the
xlwt engine will be removed in a future version of pandas.
merge_cells [bool, default True] Write MultiIndex and Hierarchical Rows as merged
cells.
encoding [str, optional] Encoding of the resulting excel file. Only necessary for xlwt,
other writers support unicode natively.
inf_rep [str, default ‘inf’] Representation for infinity (there is no native representation
for infinity in Excel).
verbose [bool, default True] Display more information in the error logs.
freeze_panes [tuple of int (length 2), optional] Specifies the one-based bottommost
row and rightmost column that is to be frozen.
storage_options [dict, optional] Extra options that make sense for a particular storage
connection, e.g. host, port, username, password, etc., if using a URL that will be
parsed by fsspec, e.g., starting “s3://”, “gcs://”. An error will be raised if pro-
viding this argument with a non-fsspec URL. See the fsspec and backend storage
implementation docs for the set of allowed keys and values.
New in version 1.2.0.
See also:
Notes
For compatibility with to_csv(), to_excel serializes lists and dicts to strings before writing.
Once a workbook has been saved it is not possible to write further data without rewriting the whole work-
book.
Examples
>>> df1.to_excel("output.xlsx",
... sheet_name='Sheet_name_1')
If you wish to write to more than one sheet in the workbook, it is necessary to specify an ExcelWriter
object:
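For example (df1 and df2 are assumed to be existing DataFrames):
>>> df2 = df1.copy()
>>> with pd.ExcelWriter('output.xlsx') as writer:
...     df1.to_excel(writer, sheet_name='Sheet_name_1')
...     df2.to_excel(writer, sheet_name='Sheet_name_2')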
To set the library that is used to write the Excel file, you can pass the engine keyword (the default engine
is automatically chosen depending on the file extension):
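For example, assuming the xlsxwriter package is installed:
>>> df1.to_excel('output1.xlsx', engine='xlsxwriter')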
pandas.io.formats.style.Styler.use
Styler.use(styles)
Set the styles on the current Styler.
Possibly uses styles from Styler.export.
Parameters
styles [list] List of style functions.
Returns
self [Styler]
See also:
pandas.io.formats.style.Styler.where
Styler.env
Styler.template
Styler.loader
pandas.io.formats.style.Styler.env
pandas.io.formats.style.Styler.template
pandas.io.formats.style.Styler.loader
3.13 Plotting
andrews_curves(frame, class_column[, ax, ...])    Generate a matplotlib plot of Andrews curves, for visualising clusters of multivariate data.
autocorrelation_plot(series[, ax])    Autocorrelation plot for time series.
bootstrap_plot(series[, fig, size, samples])    Bootstrap plot on mean, median and mid-range statistics.
boxplot(data[, column, by, ax, fontsize, ...])    Make a box plot from DataFrame columns.
deregister_matplotlib_converters()    Remove pandas formatters and converters.
lag_plot(series[, lag, ax])    Lag plot for time series.
parallel_coordinates(frame, class_column[, ...])    Parallel coordinates plotting.
plot_params    Stores pandas plotting options.
radviz(frame, class_column[, ax, color, ...])    Plot a multidimensional dataset in 2D.
register_matplotlib_converters()    Register pandas formatters and converters with matplotlib.
scatter_matrix(frame[, alpha, figsize, ax, ...])    Draw a matrix of scatter plots.
table(ax, data[, rowLabels, colLabels])    Helper function to convert DataFrame and Series to matplotlib.table.
3.13.1 pandas.plotting.andrews_curves
Examples
>>> df = pd.read_csv(
... 'https://raw.github.com/pandas-dev/'
... 'pandas/master/pandas/tests/io/data/csv/iris.csv'
... )
>>> pd.plotting.andrews_curves(df, 'Name')
3.13.2 pandas.plotting.autocorrelation_plot
Examples
The horizontal lines in the plot correspond to the 95% and 99% confidence bands.
The dashed line is the 99% confidence band.
3.13.3 pandas.plotting.bootstrap_plot
Examples
>>> s = pd.Series(np.random.uniform(size=100))
>>> pd.plotting.bootstrap_plot(s)
3.13.4 pandas.plotting.boxplot
fontsize [float or str] Tick label font size in points or as a string (e.g., large).
rot [int or float, default 0] The rotation angle of labels (in degrees) with respect to the screen
coordinate system.
grid [bool, default True] Setting this to True will show the grid.
figsize [A tuple (width, height) in inches] The size of the figure to create in matplotlib.
layout [tuple (rows, columns), optional] For example, (3, 5) will display the subplots using
3 rows and 5 columns, starting from the top-left.
return_type [{‘axes’, ‘dict’, ‘both’} or None, default ‘axes’] The kind of object to return.
The default is axes.
• ‘axes’ returns the matplotlib axes the boxplot is drawn on.
• ‘dict’ returns a dictionary whose values are the matplotlib Lines of the boxplot.
• ‘both’ returns a namedtuple with the axes and dict.
• when grouping with by, a Series mapping columns to return_type is returned.
If return_type is None, a NumPy array of axes with the same shape as layout
is returned.
**kwargs All other plotting keyword arguments to be passed to matplotlib.pyplot.
boxplot().
Returns
result See Notes.
See also:
Series.plot.hist Make a histogram.
matplotlib.pyplot.boxplot Matplotlib equivalent plot.
Notes
Examples
Boxplots can be created for every column in the dataframe with df.boxplot(), or for a subset by indicating the
columns to be used:
>>> np.random.seed(1234)
>>> df = pd.DataFrame(np.random.randn(10, 4),
... columns=['Col1', 'Col2', 'Col3', 'Col4'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2', 'Col3'])
Boxplots of the distributions of variables, grouped by the values of a third variable, can be created using the option by.
For instance:
>>> df = pd.DataFrame(np.random.randn(10, 2),
... columns=['Col1', 'Col2'])
>>> df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A',
... 'B', 'B', 'B', 'B', 'B'])
>>> boxplot = df.boxplot(by='X')
A list of strings (i.e. ['X', 'Y']) can be passed to boxplot in order to group the data by combination of the
variables in the x-axis:
>>> df = pd.DataFrame(np.random.randn(10, 3),
... columns=['Col1', 'Col2', 'Col3'])
>>> df['X'] = pd.Series(['A', 'A', 'A', 'A', 'A',
... 'B', 'B', 'B', 'B', 'B'])
>>> df['Y'] = pd.Series(['A', 'B', 'A', 'B', 'A',
... 'B', 'A', 'B', 'A', 'B'])
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by=['X', 'Y'])
Additional formatting can be done to the boxplot, like suppressing the grid (grid=False), rotating the labels on the x-axis (rot=45), or changing the font size (fontsize=15):
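Continuing with the df defined above, a plausible call is:
>>> boxplot = df.boxplot(grid=False, rot=45, fontsize=15)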
The parameter return_type can be used to select the type of element returned by boxplot. When
return_type='axes' is selected, the matplotlib axes on which the boxplot is drawn are returned:
If return_type is None, a NumPy array of axes with the same shape as layout is returned:
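Illustrative calls, again using the df defined above; the first returns the matplotlib Axes, the second a NumPy array of axes:
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], return_type='axes')
>>> boxplot = df.boxplot(column=['Col1', 'Col2'], by='X', return_type=None)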
3.13.5 pandas.plotting.deregister_matplotlib_converters
pandas.plotting.deregister_matplotlib_converters()
Remove pandas formatters and converters.
Removes the custom converters added by register(). This attempts to set the state of the registry back to
the state before pandas registered its own units. Converters for pandas’ own types like Timestamp and Period
are removed completely. Converters for types pandas overwrites, like datetime.datetime, are restored to
their original value.
See also:
register_matplotlib_converters Register pandas formatters and converters with matplotlib.
3.13.6 pandas.plotting.lag_plot
Examples
Lag plots are most commonly used to look for patterns in time series data.
Given the following time series
>>> np.random.seed(5)
>>> x = np.cumsum(np.random.normal(loc=1, scale=5, size=50))
>>> s = pd.Series(x)
>>> s.plot()
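A lag plot with a lag of 1 can then be drawn from the series s defined above:
>>> pd.plotting.lag_plot(s, lag=1)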
3.13.7 pandas.plotting.parallel_coordinates
Examples
>>> df = pd.read_csv(
... 'https://raw.github.com/pandas-dev/'
... 'pandas/master/pandas/tests/io/data/csv/iris.csv'
... )
>>> pd.plotting.parallel_coordinates(
... df, 'Name', color=('#556270', '#4ECDC4', '#C7F464')
... )
3.13.8 pandas.plotting.plot_params
3.13.9 pandas.plotting.radviz
Examples
>>> df = pd.DataFrame(
... {
... 'SepalLength': [6.5, 7.7, 5.1, 5.8, 7.6, 5.0, 5.4, 4.6, 6.7, 4.6],
... 'SepalWidth': [3.0, 3.8, 3.8, 2.7, 3.0, 2.3, 3.0, 3.2, 3.3, 3.6],
... 'PetalLength': [5.5, 6.7, 1.9, 5.1, 6.6, 3.3, 4.5, 1.4, 5.7, 1.0],
... 'PetalWidth': [1.8, 2.2, 0.4, 1.9, 2.1, 1.0, 1.5, 0.2, 2.1, 0.2],
... 'Category': [
... 'virginica',
... 'virginica',
... 'setosa',
... 'virginica',
... 'virginica',
... 'versicolor',
... 'versicolor',
... 'setosa',
... 'virginica',
... 'setosa'
... ]
... }
... )
>>> pd.plotting.radviz(df, 'Category')
3.13.10 pandas.plotting.register_matplotlib_converters
pandas.plotting.register_matplotlib_converters()
Register pandas formatters and converters with matplotlib.
This function modifies the global matplotlib.units.registry dictionary. pandas adds custom con-
verters for
• pd.Timestamp
• pd.Period
• np.datetime64
• datetime.datetime
• datetime.date
• datetime.time
See also:
deregister_matplotlib_converters Remove pandas formatters and converters.
3.13.11 pandas.plotting.scatter_matrix
Examples
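Illustrative usage on a random frame:
>>> df = pd.DataFrame(np.random.randn(1000, 4), columns=['A', 'B', 'C', 'D'])  # illustrative data
>>> pd.plotting.scatter_matrix(df, alpha=0.2)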
3.13.12 pandas.plotting.table
describe_option(pat[, _print_desc])    Prints the description for one or more registered options.
reset_option(pat)    Reset one or more options to their default value.
get_option(pat)    Retrieves the value of the specified option.
set_option(pat, value)    Sets the value of the specified option.
option_context(*args)    Context manager to temporarily set options in the with statement context.
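A short sketch of how these functions fit together; the values shown assume a fresh session with default settings:
>>> pd.get_option('display.max_rows')
60
>>> pd.set_option('display.max_rows', 100)
>>> pd.get_option('display.max_rows')
100
>>> pd.reset_option('display.max_rows')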
pandas.describe_option
Notes
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and
pandas will auto-detect the width of the terminal and print a truncated object which fits the screen width.
The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible
to do correct auto-detection. [default: 0] [currently: 0]
display.max_colwidth [int or None] The maximum width in characters of a column in the repr of a pandas data
structure. When the column overflows, a “. . . ” placeholder is embedded in the output. A ‘None’ value
means unlimited. [default: 50] [currently: 50]
display.max_info_columns [int] max_info_columns is used in DataFrame.info method to decide if per column
information will be printed. [default: 100] [currently: 100]
display.max_info_rows [int or None] df.info() will usually show null-counts for each column. For large frames
this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller
dimensions than specified. [default: 1690785] [currently: 1690785]
display.max_rows [int] If max_rows is exceeded, switch to truncate view. Depending on large_repr, objects
are either centrally truncated or printed as a summary view. ‘None’ value means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and
pandas will auto-detect the height of the terminal and print a truncated object which fits the screen height.
The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible
to do correct auto-detection. [default: 60] [currently: 15]
display.max_seq_items [int or None] When pretty-printing a long sequence, no more than max_seq_items will
be printed. If items are omitted, they will be denoted by the addition of “. . . ” to the resulting string.
If set to None, the number of items to be printed is unlimited. [default: 100] [currently: 100]
display.memory_usage [bool, string or None] This specifies if the memory usage of a DataFrame should be
displayed when df.info() is called. Valid values True,False,’deep’ [default: True] [currently: True]
display.min_rows [int] The numbers of rows to show in a truncated view (when max_rows is exceeded). Ig-
nored when max_rows is set to None or 0. When set to None, follows the value of max_rows. [default:
10] [currently: 10]
display.multi_sparse [boolean] “sparsify” MultiIndex display (don’t display repeated elements in outer levels
within groups) [default: True] [currently: True]
display.notebook_repr_html [boolean] When True, IPython notebook will use html representation for pandas
objects (if it is available). [default: True] [currently: True]
display.pprint_nest_depth [int] Controls the number of nested levels to process when pretty-printing [default:
3] [currently: 3]
display.precision [int] Floating point output precision in terms of number of places after the deci-
mal, for regular formatting as well as scientific notation. Similar to precision in numpy.
set_printoptions(). [default: 6] [currently: 6]
display.show_dimensions [boolean or ‘truncate’] Whether to print out dimensions at the end of DataFrame
repr. If ‘truncate’ is specified, only print out the dimensions if the frame is truncated (e.g. not display all
rows and/or columns) [default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide [boolean] Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.unicode.east_asian_width [boolean] Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.width [int] Width of the display in characters. In case python/IPython is running in a terminal this can
be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython
qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width.
[default: 80] [currently: 80]
io.excel.ods.reader [string] The default Excel reader engine for ‘ods’ files. Available options: auto, odf. [de-
fault: auto] [currently: auto]
io.excel.ods.writer [string] The default Excel writer engine for ‘ods’ files. Available options: auto, odf. [de-
fault: auto] [currently: auto]
io.excel.xls.reader [string] The default Excel reader engine for ‘xls’ files. Available options: auto, xlrd. [de-
pandas.reset_option
pandas.reset_option(pat) = <pandas._config.config.CallableDynamicDoc
object>
Reset one or more options to their default value.
Pass “all” as argument to reset all options.
Available options:
• compute.[use_bottleneck, use_numba, use_numexpr]
• display.[chop_threshold, colheader_justify, column_space, date_dayfirst, date_yearfirst, encoding, ex-
pand_frame_repr, float_format]
• display.html.[border, table_schema, use_mathjax]
• display.[large_repr]
• display.latex.[escape, longtable, multicolumn, multicolumn_format, multirow, repr]
• display.[max_categories, max_columns, max_colwidth, max_info_columns, max_info_rows, max_rows,
max_seq_items, memory_usage, min_rows, multi_sparse, notebook_repr_html, pprint_nest_depth, pre-
cision, show_dimensions]
• display.unicode.[ambiguous_as_wide, east_asian_width]
• display.[width]
• io.excel.ods.[reader, writer]
• io.excel.xls.[reader, writer]
• io.excel.xlsb.[reader]
• io.excel.xlsm.[reader, writer]
• io.excel.xlsx.[reader, writer]
• io.hdf.[default_format, dropna_table]
• io.parquet.[engine]
• mode.[chained_assignment, sim_interactive, use_inf_as_na, use_inf_as_null]
• plotting.[backend]
• plotting.matplotlib.[register_converters]
Parameters
pat [str/regex] If specified, only options matching prefix* will be reset. Note: partial
matches are supported for convenience, but unless you use the full option name (e.g.
x.y.z.option_name), your code may break in future versions if new options with similar
names are introduced.
Returns
None
Notes
display.html.use_mathjax [boolean] When True, Jupyter notebook will process table contents using Math-
Jax, rendering mathematical expressions enclosed by the dollar symbol. (default: True) [default: True]
[currently: True]
display.large_repr [‘truncate’/’info’] For DataFrames exceeding max_rows/max_cols, the repr (and HTML
repr) can show a truncated table (the default from 0.13), or switch to the view from df.info() (the behaviour
in earlier versions of pandas). [default: truncate] [currently: truncate]
display.latex.escape [bool] This specifies if the to_latex method of a DataFrame escapes special characters. Valid values: False, True [default: True] [currently: True]
display.latex.longtable [bool] This specifies if the to_latex method of a DataFrame uses the longtable format. Valid values: False, True [default: False] [currently: False]
display.latex.multicolumn [bool] This specifies if the to_latex method of a Dataframe uses multicolumns to
pretty-print MultiIndex columns. Valid values: False,True [default: True] [currently: True]
display.latex.multicolumn_format [bool] This specifies if the to_latex method of a Dataframe uses multi-
columns to pretty-print MultiIndex columns. Valid values: False,True [default: l] [currently: l]
display.latex.multirow [bool] This specifies if the to_latex method of a Dataframe uses multirows to pretty-
print MultiIndex rows. Valid values: False,True [default: False] [currently: False]
display.latex.repr [boolean] Whether to produce a latex DataFrame representation for jupyter environments
that support it. (default: False) [default: False] [currently: False]
display.max_categories [int] This sets the maximum number of categories pandas should output when printing
out a Categorical or a Series of dtype “category”. [default: 8] [currently: 8]
display.max_columns [int] If max_cols is exceeded, switch to truncate view. Depending on large_repr, objects
are either centrally truncated or printed as a summary view. ‘None’ value means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and
pandas will auto-detect the width of the terminal and print a truncated object which fits the screen width.
The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible
to do correct auto-detection. [default: 0] [currently: 0]
display.max_colwidth [int or None] The maximum width in characters of a column in the repr of a pandas data
structure. When the column overflows, a “. . . ” placeholder is embedded in the output. A ‘None’ value
means unlimited. [default: 50] [currently: 50]
display.max_info_columns [int] max_info_columns is used in DataFrame.info method to decide if per column
information will be printed. [default: 100] [currently: 100]
display.max_info_rows [int or None] df.info() will usually show null-counts for each column. For large frames
this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller
dimensions than specified. [default: 1690785] [currently: 1690785]
display.max_rows [int] If max_rows is exceeded, switch to truncate view. Depending on large_repr, objects
are either centrally truncated or printed as a summary view. ‘None’ value means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and
pandas will auto-detect the height of the terminal and print a truncated object which fits the screen height.
The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible
to do correct auto-detection. [default: 60] [currently: 15]
display.max_seq_items [int or None] When pretty-printing a long sequence, no more than max_seq_items will
be printed. If items are omitted, they will be denoted by the addition of “. . . ” to the resulting string.
If set to None, the number of items to be printed is unlimited. [default: 100] [currently: 100]
display.memory_usage [bool, string or None] This specifies if the memory usage of a DataFrame should be
displayed when df.info() is called. Valid values True,False,’deep’ [default: True] [currently: True]
display.min_rows [int] The numbers of rows to show in a truncated view (when max_rows is exceeded). Ig-
nored when max_rows is set to None or 0. When set to None, follows the value of max_rows. [default:
10] [currently: 10]
display.multi_sparse [boolean] “sparsify” MultiIndex display (don’t display repeated elements in outer levels
within groups) [default: True] [currently: True]
display.notebook_repr_html [boolean] When True, IPython notebook will use html representation for pandas
objects (if it is available). [default: True] [currently: True]
display.pprint_nest_depth [int] Controls the number of nested levels to process when pretty-printing [default:
3] [currently: 3]
display.precision [int] Floating point output precision in terms of number of places after the deci-
mal, for regular formatting as well as scientific notation. Similar to precision in numpy.
set_printoptions(). [default: 6] [currently: 6]
display.show_dimensions [boolean or ‘truncate’] Whether to print out dimensions at the end of DataFrame
repr. If ‘truncate’ is specified, only print out the dimensions if the frame is truncated (e.g. not display all
rows and/or columns) [default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide [boolean] Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.unicode.east_asian_width [boolean] Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.width [int] Width of the display in characters. In case python/IPython is running in a terminal this can
be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython
qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width.
[default: 80] [currently: 80]
io.excel.ods.reader [string] The default Excel reader engine for ‘ods’ files. Available options: auto, odf. [de-
fault: auto] [currently: auto]
io.excel.ods.writer [string] The default Excel writer engine for ‘ods’ files. Available options: auto, odf. [de-
fault: auto] [currently: auto]
io.excel.xls.reader [string] The default Excel reader engine for ‘xls’ files. Available options: auto, xlrd. [de-
fault: auto] [currently: auto]
io.excel.xls.writer [string] The default Excel writer engine for ‘xls’ files. Available options: auto, xlwt. [de-
fault: auto] [currently: auto] (Deprecated, use `` instead.)
io.excel.xlsb.reader [string] The default Excel reader engine for ‘xlsb’ files. Available options: auto, pyxlsb.
[default: auto] [currently: auto]
io.excel.xlsm.reader [string] The default Excel reader engine for ‘xlsm’ files. Available options: auto, xlrd,
openpyxl. [default: auto] [currently: auto]
io.excel.xlsm.writer [string] The default Excel writer engine for ‘xlsm’ files. Available options: auto, open-
pyxl. [default: auto] [currently: auto]
io.excel.xlsx.reader [string] The default Excel reader engine for ‘xlsx’ files. Available options: auto, xlrd,
openpyxl. [default: auto] [currently: auto]
io.excel.xlsx.writer [string] The default Excel writer engine for ‘xlsx’ files. Available options: auto, openpyxl,
xlsxwriter. [default: auto] [currently: auto]
io.hdf.default_format [format] Default format for writing; if None, then put will default to ‘fixed’ and append will default to ‘table’. [default: None] [currently: None]
io.hdf.dropna_table [boolean] drop ALL nan rows when appending to a table [default: False] [currently:
False]
io.parquet.engine [string] The default parquet reader/writer engine. Available options: ‘auto’, ‘pyarrow’, ‘fast-
parquet’, the default is ‘auto’ [default: auto] [currently: auto]
mode.chained_assignment [string] Raise an exception, warn, or no action if trying to use chained assignment,
The default is warn [default: warn] [currently: warn]
mode.sim_interactive [boolean] Whether to simulate interactive mode for purposes of testing [default: False]
[currently: False]
mode.use_inf_as_na [boolean] True means treat None, NaN, INF, -INF as NA (old way), False means None
and NaN are null, but INF, -INF are not NA (new way). [default: False] [currently: False]
mode.use_inf_as_null [boolean] use_inf_as_null has been deprecated and will be removed in a future ver-
sion. Use use_inf_as_na instead. [default: False] [currently: False] (Deprecated, use mode.use_inf_as_na
instead.)
plotting.backend [str] The plotting backend to use. The default value is “matplotlib”, the backend provided
with pandas. Other backends can be specified by providing the name of the module that implements the
backend. [default: matplotlib] [currently: matplotlib]
pandas.get_option
Notes
display.chop_threshold [float or None] If set to a float value, all float values smaller than the given threshold will be displayed as exactly 0 by repr and friends. [default: None] [currently: None]
display.colheader_justify [‘left’/’right’] Controls the justification of column headers. Used by DataFrameFormatter. [default: right] [currently: right]
display.column_space No description available. [default: 12] [currently: 12]
display.date_dayfirst [boolean] When True, prints and parses dates with the day first, eg 20/01/2005 [default:
False] [currently: False]
display.date_yearfirst [boolean] When True, prints and parses dates with the year first, eg 2005/01/20 [default:
False] [currently: False]
display.encoding [str/unicode] Defaults to the detected encoding of the console. Specifies the encoding to be
used for strings returned by to_string, these are generally strings meant to be displayed on the console.
[default: utf-8] [currently: utf-8]
display.expand_frame_repr [boolean] Whether to print out the full DataFrame repr for wide DataFrames
across multiple lines, max_columns is still respected, but the output will wrap-around across multiple
“pages” if its width exceeds display.width. [default: True] [currently: True]
display.float_format [callable] The callable should accept a floating point number and return a string with
the desired format of the number. This is used in some places like SeriesFormatter. See for-
mats.format.EngFormatter for an example. [default: None] [currently: None]
display.html.border [int] A border=value attribute is inserted in the <table> tag for the DataFrame
HTML repr. [default: 1] [currently: 1]
display.html.table_schema [boolean] Whether to publish a Table Schema representation for frontends that
support it. (default: False) [default: False] [currently: False]
display.html.use_mathjax [boolean] When True, Jupyter notebook will process table contents using Math-
Jax, rendering mathematical expressions enclosed by the dollar symbol. (default: True) [default: True]
[currently: True]
display.large_repr [‘truncate’/’info’] For DataFrames exceeding max_rows/max_cols, the repr (and HTML
repr) can show a truncated table (the default from 0.13), or switch to the view from df.info() (the behaviour
in earlier versions of pandas). [default: truncate] [currently: truncate]
display.latex.escape [bool] This specifies if the to_latex method of a DataFrame escapes special characters. Valid values: False, True [default: True] [currently: True]
display.latex.longtable [bool] This specifies if the to_latex method of a DataFrame uses the longtable format. Valid values: False, True [default: False] [currently: False]
display.latex.multicolumn [bool] This specifies if the to_latex method of a Dataframe uses multicolumns to
pretty-print MultiIndex columns. Valid values: False,True [default: True] [currently: True]
display.latex.multicolumn_format [bool] This specifies if the to_latex method of a Dataframe uses multi-
columns to pretty-print MultiIndex columns. Valid values: False,True [default: l] [currently: l]
display.latex.multirow [bool] This specifies if the to_latex method of a Dataframe uses multirows to pretty-
print MultiIndex rows. Valid values: False,True [default: False] [currently: False]
display.latex.repr [boolean] Whether to produce a latex DataFrame representation for jupyter environments
that support it. (default: False) [default: False] [currently: False]
display.max_categories [int] This sets the maximum number of categories pandas should output when printing
out a Categorical or a Series of dtype “category”. [default: 8] [currently: 8]
display.max_columns [int] If max_cols is exceeded, switch to truncate view. Depending on large_repr, objects
are either centrally truncated or printed as a summary view. ‘None’ value means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and
pandas will auto-detect the width of the terminal and print a truncated object which fits the screen width.
The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible
to do correct auto-detection. [default: 0] [currently: 0]
display.max_colwidth [int or None] The maximum width in characters of a column in the repr of a pandas data
structure. When the column overflows, a “. . . ” placeholder is embedded in the output. A ‘None’ value
means unlimited. [default: 50] [currently: 50]
display.max_info_columns [int] max_info_columns is used in DataFrame.info method to decide if per column
information will be printed. [default: 100] [currently: 100]
display.max_info_rows [int or None] df.info() will usually show null-counts for each column. For large frames
this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller
dimensions than specified. [default: 1690785] [currently: 1690785]
display.max_rows [int] If max_rows is exceeded, switch to truncate view. Depending on large_repr, objects
are either centrally truncated or printed as a summary view. ‘None’ value means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and
pandas will auto-detect the height of the terminal and print a truncated object which fits the screen height.
The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible
to do correct auto-detection. [default: 60] [currently: 15]
display.max_seq_items [int or None] When pretty-printing a long sequence, no more than max_seq_items will
be printed. If items are omitted, they will be denoted by the addition of “. . . ” to the resulting string.
If set to None, the number of items to be printed is unlimited. [default: 100] [currently: 100]
display.memory_usage [bool, string or None] This specifies if the memory usage of a DataFrame should be
displayed when df.info() is called. Valid values True,False,’deep’ [default: True] [currently: True]
display.min_rows [int] The numbers of rows to show in a truncated view (when max_rows is exceeded). Ig-
nored when max_rows is set to None or 0. When set to None, follows the value of max_rows. [default:
10] [currently: 10]
display.multi_sparse [boolean] “sparsify” MultiIndex display (don’t display repeated elements in outer levels
within groups) [default: True] [currently: True]
display.notebook_repr_html [boolean] When True, IPython notebook will use html representation for pandas
objects (if it is available). [default: True] [currently: True]
display.pprint_nest_depth [int] Controls the number of nested levels to process when pretty-printing [default:
3] [currently: 3]
display.precision [int] Floating point output precision in terms of number of places after the deci-
mal, for regular formatting as well as scientific notation. Similar to precision in numpy.
set_printoptions(). [default: 6] [currently: 6]
display.show_dimensions [boolean or ‘truncate’] Whether to print out dimensions at the end of DataFrame
repr. If ‘truncate’ is specified, only print out the dimensions if the frame is truncated (e.g. not display all
rows and/or columns) [default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide [boolean] Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.unicode.east_asian_width [boolean] Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.width [int] Width of the display in characters. In case python/IPython is running in a terminal this can
be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython
qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width.
[default: 80] [currently: 80]
io.excel.ods.reader [string] The default Excel reader engine for ‘ods’ files. Available options: auto, odf. [de-
fault: auto] [currently: auto]
io.excel.ods.writer [string] The default Excel writer engine for ‘ods’ files. Available options: auto, odf. [de-
fault: auto] [currently: auto]
io.excel.xls.reader [string] The default Excel reader engine for ‘xls’ files. Available options: auto, xlrd. [de-
fault: auto] [currently: auto]
io.excel.xls.writer [string] The default Excel writer engine for ‘xls’ files. Available options: auto, xlwt. [de-
fault: auto] [currently: auto] (Deprecated, use `` instead.)
io.excel.xlsb.reader [string] The default Excel reader engine for ‘xlsb’ files. Available options: auto, pyxlsb.
[default: auto] [currently: auto]
io.excel.xlsm.reader [string] The default Excel reader engine for ‘xlsm’ files. Available options: auto, xlrd,
openpyxl. [default: auto] [currently: auto]
io.excel.xlsm.writer [string] The default Excel writer engine for ‘xlsm’ files. Available options: auto, open-
pyxl. [default: auto] [currently: auto]
io.excel.xlsx.reader [string] The default Excel reader engine for ‘xlsx’ files. Available options: auto, xlrd,
openpyxl. [default: auto] [currently: auto]
io.excel.xlsx.writer [string] The default Excel writer engine for ‘xlsx’ files. Available options: auto, openpyxl,
xlsxwriter. [default: auto] [currently: auto]
io.hdf.default_format [format] Default format for writing; if None, then put will default to ‘fixed’ and append will default to ‘table’. [default: None] [currently: None]
io.hdf.dropna_table [boolean] drop ALL nan rows when appending to a table [default: False] [currently:
False]
io.parquet.engine [string] The default parquet reader/writer engine. Available options: ‘auto’, ‘pyarrow’, ‘fast-
parquet’, the default is ‘auto’ [default: auto] [currently: auto]
mode.chained_assignment [string] Raise an exception, warn, or no action if trying to use chained assignment,
The default is warn [default: warn] [currently: warn]
mode.sim_interactive [boolean] Whether to simulate interactive mode for purposes of testing [default: False]
[currently: False]
mode.use_inf_as_na [boolean] True means treat None, NaN, INF, -INF as NA (old way), False means None
and NaN are null, but INF, -INF are not NA (new way). [default: False] [currently: False]
mode.use_inf_as_null [boolean] use_inf_as_null has been deprecated and will be removed in a future ver-
sion. Use use_inf_as_na instead. [default: False] [currently: False] (Deprecated, use mode.use_inf_as_na
instead.)
plotting.backend [str] The plotting backend to use. The default value is “matplotlib”, the backend provided
with pandas. Other backends can be specified by providing the name of the module that implements the
backend. [default: matplotlib] [currently: matplotlib]
plotting.matplotlib.register_converters [bool or ‘auto’.] Whether to register converters with matplotlib’s units
registry for dates, times, datetimes, and Periods. Toggling to False will remove the converters, restoring
any converters that pandas overwrote. [default: auto] [currently: auto]
pandas.set_option
Parameters
pat [str] Regexp which should match a single option. Note: partial matches are supported
for convenience, but unless you use the full option name (e.g. x.y.z.option_name), your
code may break in future versions if new options with similar names are introduced.
value [object] New value of option.
Returns
None
Raises
OptionError if no such option exists
Notes
display.latex.longtable [bool] This specifies if the to_latex method of a DataFrame uses the longtable format. Valid values: False, True [default: False] [currently: False]
display.latex.multicolumn [bool] This specifies if the to_latex method of a Dataframe uses multicolumns to
pretty-print MultiIndex columns. Valid values: False,True [default: True] [currently: True]
display.latex.multicolumn_format [bool] This specifies if the to_latex method of a Dataframe uses multi-
columns to pretty-print MultiIndex columns. Valid values: False,True [default: l] [currently: l]
display.latex.multirow [bool] This specifies if the to_latex method of a Dataframe uses multirows to pretty-
print MultiIndex rows. Valid values: False,True [default: False] [currently: False]
display.latex.repr [boolean] Whether to produce a latex DataFrame representation for jupyter environments
that support it. (default: False) [default: False] [currently: False]
display.max_categories [int] This sets the maximum number of categories pandas should output when printing
out a Categorical or a Series of dtype “category”. [default: 8] [currently: 8]
display.max_columns [int] If max_cols is exceeded, switch to truncate view. Depending on large_repr, objects
are either centrally truncated or printed as a summary view. ‘None’ value means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and
pandas will auto-detect the width of the terminal and print a truncated object which fits the screen width.
The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible
to do correct auto-detection. [default: 0] [currently: 0]
display.max_colwidth [int or None] The maximum width in characters of a column in the repr of a pandas data
structure. When the column overflows, a “. . . ” placeholder is embedded in the output. A ‘None’ value
means unlimited. [default: 50] [currently: 50]
display.max_info_columns [int] max_info_columns is used in DataFrame.info method to decide if per column
information will be printed. [default: 100] [currently: 100]
display.max_info_rows [int or None] df.info() will usually show null-counts for each column. For large frames
this can be quite slow. max_info_rows and max_info_cols limit this null check only to frames with smaller
dimensions than specified. [default: 1690785] [currently: 1690785]
display.max_rows [int] If max_rows is exceeded, switch to truncate view. Depending on large_repr, objects
are either centrally truncated or printed as a summary view. ‘None’ value means unlimited.
In case python/IPython is running in a terminal and large_repr equals ‘truncate’ this can be set to 0 and
pandas will auto-detect the height of the terminal and print a truncated object which fits the screen height.
The IPython notebook, IPython qtconsole, or IDLE do not run in a terminal and hence it is not possible
to do correct auto-detection. [default: 60] [currently: 15]
display.max_seq_items [int or None] When pretty-printing a long sequence, no more than max_seq_items will
be printed. If items are omitted, they will be denoted by the addition of “. . . ” to the resulting string.
If set to None, the number of items to be printed is unlimited. [default: 100] [currently: 100]
display.memory_usage [bool, string or None] This specifies if the memory usage of a DataFrame should be
displayed when df.info() is called. Valid values True,False,’deep’ [default: True] [currently: True]
display.min_rows [int] The numbers of rows to show in a truncated view (when max_rows is exceeded). Ig-
nored when max_rows is set to None or 0. When set to None, follows the value of max_rows. [default:
10] [currently: 10]
display.multi_sparse [boolean] “sparsify” MultiIndex display (don’t display repeated elements in outer levels
within groups) [default: True] [currently: True]
display.notebook_repr_html [boolean] When True, IPython notebook will use html representation for pandas
objects (if it is available). [default: True] [currently: True]
display.pprint_nest_depth [int] Controls the number of nested levels to process when pretty-printing [default:
3] [currently: 3]
display.precision [int] Floating point output precision in terms of number of places after the deci-
mal, for regular formatting as well as scientific notation. Similar to precision in numpy.
set_printoptions(). [default: 6] [currently: 6]
display.show_dimensions [boolean or ‘truncate’] Whether to print out dimensions at the end of DataFrame
repr. If ‘truncate’ is specified, only print out the dimensions if the frame is truncated (e.g. not display all
rows and/or columns) [default: truncate] [currently: truncate]
display.unicode.ambiguous_as_wide [boolean] Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.unicode.east_asian_width [boolean] Whether to use the Unicode East Asian Width to calculate the display text width. Enabling this may affect the performance. (default: False) [default: False] [currently: False]
display.width [int] Width of the display in characters. In case python/IPython is running in a terminal this can
be set to None and pandas will correctly auto-detect the width. Note that the IPython notebook, IPython
qtconsole, or IDLE do not run in a terminal and hence it is not possible to correctly detect the width.
[default: 80] [currently: 80]
io.excel.ods.reader [string] The default Excel reader engine for ‘ods’ files. Available options: auto, odf. [de-
fault: auto] [currently: auto]
io.excel.ods.writer [string] The default Excel writer engine for ‘ods’ files. Available options: auto, odf. [de-
fault: auto] [currently: auto]
io.excel.xls.reader [string] The default Excel reader engine for ‘xls’ files. Available options: auto, xlrd. [de-
fault: auto] [currently: auto]
io.excel.xls.writer [string] The default Excel writer engine for ‘xls’ files. Available options: auto, xlwt. [de-
fault: auto] [currently: auto] (Deprecated, use `` instead.)
io.excel.xlsb.reader [string] The default Excel reader engine for ‘xlsb’ files. Available options: auto, pyxlsb.
[default: auto] [currently: auto]
io.excel.xlsm.reader [string] The default Excel reader engine for ‘xlsm’ files. Available options: auto, xlrd,
openpyxl. [default: auto] [currently: auto]
io.excel.xlsm.writer [string] The default Excel writer engine for ‘xlsm’ files. Available options: auto, open-
pyxl. [default: auto] [currently: auto]
io.excel.xlsx.reader [string] The default Excel reader engine for ‘xlsx’ files. Available options: auto, xlrd,
openpyxl. [default: auto] [currently: auto]
io.excel.xlsx.writer [string] The default Excel writer engine for ‘xlsx’ files. Available options: auto, openpyxl,
xlsxwriter. [default: auto] [currently: auto]
io.hdf.default_format [format] Default format for writing; if None, then put will default to ‘fixed’ and append will default to ‘table’. [default: None] [currently: None]
io.hdf.dropna_table [boolean] drop ALL nan rows when appending to a table [default: False] [currently:
False]
io.parquet.engine [string] The default parquet reader/writer engine. Available options: ‘auto’, ‘pyarrow’, ‘fast-
parquet’, the default is ‘auto’ [default: auto] [currently: auto]
mode.chained_assignment [string] Raise an exception, warn, or no action if trying to use chained assignment,
The default is warn [default: warn] [currently: warn]
mode.sim_interactive [boolean] Whether to simulate interactive mode for purposes of testing [default: False]
[currently: False]
mode.use_inf_as_na [boolean] True means treat None, NaN, INF, -INF as NA (old way), False means None
and NaN are null, but INF, -INF are not NA (new way). [default: False] [currently: False]
mode.use_inf_as_null [boolean] use_inf_as_null has been deprecated and will be removed in a future ver-
sion. Use use_inf_as_na instead. [default: False] [currently: False] (Deprecated, use mode.use_inf_as_na
instead.)
plotting.backend [str] The plotting backend to use. The default value is “matplotlib”, the backend provided
with pandas. Other backends can be specified by providing the name of the module that implements the
backend. [default: matplotlib] [currently: matplotlib]
plotting.matplotlib.register_converters [bool or ‘auto’.] Whether to register converters with matplotlib’s units
registry for dates, times, datetimes, and Periods. Toggling to False will remove the converters, restoring
any converters that pandas overwrote. [default: auto] [currently: auto]
pandas.option_context
class pandas.option_context(*args)
Context manager to temporarily set options in the with statement context.
You need to invoke as option_context(pat, val, [(pat, val), ...]).
Examples
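For example, options can be set temporarily inside the with block and are restored on exit:
>>> with pd.option_context('display.max_rows', 10, 'display.max_columns', 5):
...     pass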
Methods
pandas.option_context.__call__
option_context.__call__(func)
Call self as a function.
testing.assert_frame_equal(left, right[, ...])    Check that left and right DataFrame are equal.
testing.assert_series_equal(left, right[, ...])    Check that left and right Series are equal.
testing.assert_index_equal(left, right[, ...])    Check that left and right Index are equal.
testing.assert_extension_array_equal(left, right)    Check that left and right ExtensionArrays are equal.
pandas.testing.assert_frame_equal
Examples
This example shows comparing two DataFrames that are equal but with columns of differing dtypes.
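An illustrative sketch; without check_dtype=False the second comparison below would raise an AssertionError because column 'b' is int64 in df1 and float64 in df2:
>>> from pandas.testing import assert_frame_equal
>>> df1 = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})    # illustrative data
>>> df2 = pd.DataFrame({'a': [1, 2], 'b': [3.0, 4.0]})
>>> assert_frame_equal(df1, df1)                      # a frame always equals itself
>>> assert_frame_equal(df1, df2, check_dtype=False)   # ignore the differing dtypes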
pandas.testing.assert_series_equal
check_names [bool, default True] Whether to check the Series and Index names attribute.
check_exact [bool, default False] Whether to compare number exactly.
check_datetimelike_compat [bool, default False] Compare datetime-like which is compa-
rable ignoring dtype.
check_categorical [bool, default True] Whether to compare internal Categorical exactly.
check_category_order [bool, default True] Whether to compare category order of internal
Categoricals.
New in version 1.0.2.
check_freq [bool, default True] Whether to check the freq attribute on a DatetimeIndex or
TimedeltaIndex.
check_flags [bool, default True] Whether to check the flags attribute.
New in version 1.2.0.
rtol [float, default 1e-5] Relative tolerance. Only used when check_exact is False.
New in version 1.1.0.
atol [float, default 1e-8] Absolute tolerance. Only used when check_exact is False.
New in version 1.1.0.
obj [str, default ‘Series’] Specify object name being compared, internally used to show ap-
propriate assertion message.
Examples
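Illustrative usage:
>>> from pandas.testing import assert_series_equal
>>> a = pd.Series([1, 2, 3, 4])
>>> b = pd.Series([1, 2, 3, 4])
>>> assert_series_equal(a, b)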
pandas.testing.assert_index_equal
Deprecated since version 1.1.0: Use rtol and atol instead to define relative/absolute
tolerance, respectively. Similar to math.isclose().
check_exact [bool, default True] Whether to compare number exactly.
check_categorical [bool, default True] Whether to compare internal Categorical exactly.
check_order [bool, default True] Whether to compare the order of index entries as well as
their values. If True, both indexes must contain the same elements, in the same order.
If False, both indexes must contain the same elements, but in any order.
New in version 1.2.0.
rtol [float, default 1e-5] Relative tolerance. Only used when check_exact is False.
New in version 1.1.0.
atol [float, default 1e-8] Absolute tolerance. Only used when check_exact is False.
New in version 1.1.0.
obj [str, default ‘Index’] Specify object name being compared, internally used to show ap-
propriate assertion message.
Examples
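Illustrative usage:
>>> from pandas.testing import assert_index_equal
>>> a = pd.Index([1, 2, 3])
>>> b = pd.Index([1, 2, 3])
>>> assert_index_equal(a, b)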
pandas.testing.assert_extension_array_equal
atol [float, default 1e-8] Absolute tolerance. Only used when check_exact is False.
New in version 1.1.0.
Notes
Missing values are checked separately from valid values. A mask of missing values is computed for each and
checked to match. The remaining all-valid values are cast to object dtype and checked.
Examples
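Illustrative usage with two nullable integer arrays:
>>> from pandas.testing import assert_extension_array_equal
>>> a = pd.array([1, 2, 3, 4])
>>> b = pd.array([1, 2, 3, 4])
>>> assert_extension_array_equal(a, b)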
pandas.errors.AccessorRegistrationWarning
exception pandas.errors.AccessorRegistrationWarning
Warning for attribute conflicts in accessor registration.
pandas.errors.DtypeWarning
exception pandas.errors.DtypeWarning
Warning raised when reading different dtypes in a column from a file.
Raised for a dtype incompatibility. This can happen whenever read_csv or read_table encounter non-uniform
dtypes in a column(s) of a given CSV file.
See also:
read_csv Read CSV (comma-separated) file into a DataFrame.
read_table Read general delimited file into a DataFrame.
Notes
This warning is issued when dealing with larger files because the dtype checking happens per chunk read.
Despite the warning, the CSV file is read with mixed types in a single column which will be an object type. See
the examples below to better understand this issue.
Examples
This example creates and reads a large CSV file with a column that contains int and str.
>>> df = pd.DataFrame({'a': (['1'] * 100000 + ['X'] * 100000 +
... ['1'] * 100000),
... 'b': ['b'] * 300000})
>>> df.to_csv('test.csv', index=False)
>>> df2 = pd.read_csv('test.csv')
... # DtypeWarning: Columns (0) have mixed types
It is important to notice that df2 will contain both str and int for the same input, ‘1’.
>>> df2.iloc[262140, 0]
'1'
>>> type(df2.iloc[262140, 0])
<class 'str'>
>>> df2.iloc[262150, 0]
1
>>> type(df2.iloc[262150, 0])
<class 'int'>
One way to solve this issue is using the dtype parameter in the read_csv and read_table functions to make the
conversion explicit:
>>> df2 = pd.read_csv('test.csv', sep=',', dtype={'a': str})
pandas.errors.DuplicateLabelError
exception pandas.errors.DuplicateLabelError
Error raised when an operation would introduce duplicate labels.
New in version 1.2.0.
Examples
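A sketch of how the error arises, assuming the allows_duplicate_labels flag introduced in 1.2:
>>> s = pd.Series([0, 1, None], index=['a', 'b', 'c']).set_flags(
...     allows_duplicate_labels=False)
>>> s.reindex(['a', 'a', 'b'])  # the reindexed result would carry duplicate labels
Traceback (most recent call last):
...
DuplicateLabelError: Index has duplicates.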
pandas.errors.EmptyDataError
exception pandas.errors.EmptyDataError
Exception that is thrown in pd.read_csv (by both the C and Python engines) when empty data or header is
encountered.
pandas.errors.InvalidIndexError
exception pandas.errors.InvalidIndexError
Exception raised when attempting to use an invalid index key.
New in version 1.1.0.
pandas.errors.MergeError
exception pandas.errors.MergeError
Error raised when merging fails due to problems with the input data. Subclass of ValueError.
pandas.errors.NullFrequencyError
exception pandas.errors.NullFrequencyError
Error raised when a null freq attribute is used in an operation that needs a non-null frequency, particularly
DatetimeIndex.shift, TimedeltaIndex.shift, PeriodIndex.shift.
pandas.errors.NumbaUtilError
exception pandas.errors.NumbaUtilError
Error raised for unsupported Numba engine routines.
pandas.errors.OutOfBoundsDatetime
exception pandas.errors.OutOfBoundsDatetime
pandas.errors.OutOfBoundsTimedelta
exception pandas.errors.OutOfBoundsTimedelta
Raised when encountering a timedelta value that cannot be represented as a timedelta64[ns].
pandas.errors.ParserError
exception pandas.errors.ParserError
Exception that is raised by an error encountered in parsing file contents.
This is a generic error raised for errors encountered when functions like read_csv or read_html are parsing
contents of a file.
See also:
read_csv Read CSV (comma-separated) file into a DataFrame.
read_html Read HTML table into a DataFrame.
pandas.errors.ParserWarning
exception pandas.errors.ParserWarning
Warning raised when reading a file that doesn’t use the default ‘c’ parser.
Raised by pd.read_csv and pd.read_table when it is necessary to change parsers, generally from the default ‘c’
parser to ‘python’.
It happens due to a lack of support or functionality for parsing a particular attribute of a CSV file with the
requested engine.
Currently, ‘c’ unsupported options include the following parameters:
1. sep other than a single character (e.g. regex separators)
2. skipfooter higher than 0
3. sep=None with delim_whitespace=False
The warning can be avoided by adding engine=’python’ as a parameter in the pd.read_csv and pd.read_table methods.
See also:
pd.read_csv Read CSV (comma-separated) file into DataFrame.
pd.read_table Read general delimited file into DataFrame.
Examples
>>> import io
>>> csv = '''a;b;c
... 1;1,8
... 1;2,1'''
>>> df = pd.read_csv(io.StringIO(csv), sep='[;,]')
... # ParserWarning: Falling back to the 'python' engine...
pandas.errors.PerformanceWarning
exception pandas.errors.PerformanceWarning
Warning raised when there is a possible performance impact.
pandas.errors.UnsortedIndexError
exception pandas.errors.UnsortedIndexError
Error raised when attempting to get a slice of a MultiIndex, and the index has not been lexsorted. Subclass of
KeyError.
pandas.errors.UnsupportedFunctionCall
exception pandas.errors.UnsupportedFunctionCall
Exception raised when attempting to call a numpy function on a pandas object, but that function is not supported
by the object e.g. np.cumsum(groupby_object).
pandas.api.types.union_categoricals
Notes
Examples
If you want to combine categoricals that do not necessarily have the same categories, union_categoricals will
combine a list-like of categoricals. The new categories will be the union of the categories being combined.
By default, the resulting categories will be ordered as they appear in the categories of the data. If you want the
categories to be lexsorted, use sort_categories=True argument.
union_categoricals also works with the case of combining two categoricals of the same categories and order information (i.e. the same situation in which you could also use append).
Combining ordered categoricals with different categories raises TypeError, because the categories are ordered and not identical.
>>> a = pd.Categorical(["a", "b"], ordered=True)
>>> b = pd.Categorical(["a", "b", "c"], ordered=True)
>>> union_categoricals([a, b])
Traceback (most recent call last):
...
TypeError: to union ordered Categoricals, all categories must be the same
union_categoricals also works with a CategoricalIndex, or Series containing categorical data, but note that the
resulting array will always be a plain Categorical
>>> a = pd.Series(["b", "c"], dtype='category')
>>> b = pd.Series(["a", "b"], dtype='category')
>>> union_categoricals([a, b])
['b', 'c', 'a', 'b']
Categories (3, object): ['b', 'c', 'a']
pandas.api.types.infer_dtype
pandas.api.types.infer_dtype()
Efficiently infer the type of a passed val, or list-like array of values. Return a string describing the type.
Parameters
value [scalar, list, ndarray, or pandas type]
skipna [bool, default True] Ignore NaN values when inferring the type.
Returns
str Describing the common type of the input data.
Results can include:
• string
• bytes
• floating
• integer
• mixed-integer
• mixed-integer-float
• decimal
• complex
• categorical
• boolean
• datetime64
• datetime
• date
• timedelta64
• timedelta
• time
• period
• mixed
Raises
TypeError If ndarray-like but cannot infer the dtype
Notes
Examples
>>> infer_dtype([pd.Timestamp('20130101')])
'datetime'
>>> infer_dtype([np.datetime64('2013-01-01')])
'datetime64'
>>> infer_dtype(pd.Series(list('aabc')).astype('category'))
'categorical'
pandas.api.types.pandas_dtype
pandas.api.types.pandas_dtype(dtype)
Convert input into a pandas only dtype object or a numpy dtype object.
Parameters
dtype [object to be converted]
Returns
np.dtype or a pandas dtype
Raises
TypeError if not a dtype
Dtype introspection
pandas.api.types.is_bool_dtype
pandas.api.types.is_bool_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a boolean dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a boolean dtype.
Notes
Examples
>>> is_bool_dtype(str)
False
>>> is_bool_dtype(int)
False
>>> is_bool_dtype(bool)
True
>>> is_bool_dtype(np.bool_)
True
>>> is_bool_dtype(np.array(['a', 'b']))
False
>>> is_bool_dtype(pd.Series([1, 2]))
False
>>> is_bool_dtype(np.array([True, False]))
True
>>> is_bool_dtype(pd.Categorical([True, False]))
True
>>> is_bool_dtype(pd.arrays.SparseArray([True, False]))
True
pandas.api.types.is_categorical_dtype
pandas.api.types.is_categorical_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the Categorical dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the Categorical dtype.
Examples
>>> is_categorical_dtype(object)
False
>>> is_categorical_dtype(CategoricalDtype())
True
>>> is_categorical_dtype([1, 2, 3])
False
>>> is_categorical_dtype(pd.Categorical([1, 2, 3]))
True
>>> is_categorical_dtype(pd.CategoricalIndex([1, 2, 3]))
True
pandas.api.types.is_complex_dtype
pandas.api.types.is_complex_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a complex dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a complex dtype.
Examples
>>> is_complex_dtype(str)
False
>>> is_complex_dtype(int)
False
>>> is_complex_dtype(np.complex_)
True
>>> is_complex_dtype(np.array(['a', 'b']))
False
>>> is_complex_dtype(pd.Series([1, 2]))
False
>>> is_complex_dtype(np.array([1 + 1j, 5]))
True
pandas.api.types.is_datetime64_any_dtype
pandas.api.types.is_datetime64_any_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the datetime64 dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
bool Whether or not the array or dtype is of the datetime64 dtype.
Examples
>>> is_datetime64_any_dtype(str)
False
>>> is_datetime64_any_dtype(int)
False
>>> is_datetime64_any_dtype(np.datetime64) # can be tz-naive
True
>>> is_datetime64_any_dtype(DatetimeTZDtype("ns", "US/Eastern"))
True
>>> is_datetime64_any_dtype(np.array(['a', 'b']))
False
>>> is_datetime64_any_dtype(np.array([1, 2]))
False
>>> is_datetime64_any_dtype(np.array([], dtype="datetime64[ns]"))
True
pandas.api.types.is_datetime64_dtype
pandas.api.types.is_datetime64_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the datetime64 dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the datetime64 dtype.
Examples
>>> is_datetime64_dtype(object)
False
>>> is_datetime64_dtype(np.datetime64)
True
>>> is_datetime64_dtype(np.array([], dtype=int))
False
>>> is_datetime64_dtype(np.array([], dtype=np.datetime64))
True
>>> is_datetime64_dtype([1, 2, 3])
False
pandas.api.types.is_datetime64_ns_dtype
pandas.api.types.is_datetime64_ns_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the datetime64[ns] dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
bool Whether or not the array or dtype is of the datetime64[ns] dtype.
Examples
>>> is_datetime64_ns_dtype(str)
False
>>> is_datetime64_ns_dtype(int)
False
>>> is_datetime64_ns_dtype(np.datetime64) # no unit
False
>>> is_datetime64_ns_dtype(DatetimeTZDtype("ns", "US/Eastern"))
True
>>> is_datetime64_ns_dtype(np.array(['a', 'b']))
False
pandas.api.types.is_datetime64tz_dtype
pandas.api.types.is_datetime64tz_dtype(arr_or_dtype)
Check whether an array-like or dtype is of a DatetimeTZDtype dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of a DatetimeTZDtype dtype.
Examples
>>> is_datetime64tz_dtype(object)
False
>>> is_datetime64tz_dtype([1, 2, 3])
False
>>> is_datetime64tz_dtype(pd.DatetimeIndex([1, 2, 3])) # tz-naive
False
>>> is_datetime64tz_dtype(pd.DatetimeIndex([1, 2, 3], tz="US/Eastern"))
True
pandas.api.types.is_extension_type
pandas.api.types.is_extension_type(arr)
Check whether an array-like is of a pandas extension class instance.
Deprecated since version 1.0.0: Use is_extension_array_dtype instead.
Extension classes include categoricals, pandas sparse objects (i.e. classes represented within the pandas library
and not ones external to it like scipy sparse matrices), and datetime-like arrays.
Parameters
arr [array-like] The array-like to check.
Returns
Examples
pandas.api.types.is_extension_array_dtype
pandas.api.types.is_extension_array_dtype(arr_or_dtype)
Check if an object is a pandas extension array type.
See the User Guide for more.
Parameters
arr_or_dtype [object] For array-like input, the .dtype attribute will be extracted.
Returns
bool Whether the arr_or_dtype is an extension array type.
Notes
This checks whether an object implements the pandas extension array interface. In pandas, this includes:
• Categorical
• Sparse
• Interval
• Period
• DatetimeArray
• TimedeltaArray
Third-party libraries may implement arrays or types satisfying this interface as well.
Examples
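For illustration, a minimal sketch:
>>> from pandas.api.types import is_extension_array_dtype
>>> arr = pd.Categorical(['a', 'b'])
>>> is_extension_array_dtype(arr)
True
>>> is_extension_array_dtype(arr.dtype)
True
>>> is_extension_array_dtype(np.array(['a', 'b']))
False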
pandas.api.types.is_float_dtype
pandas.api.types.is_float_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a float dtype.
This function is internal and should not be exposed in the public API.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a float dtype.
Examples
>>> is_float_dtype(str)
False
>>> is_float_dtype(int)
False
>>> is_float_dtype(float)
True
>>> is_float_dtype(np.array(['a', 'b']))
False
>>> is_float_dtype(pd.Series([1, 2]))
False
>>> is_float_dtype(pd.Index([1, 2.]))
True
pandas.api.types.is_int64_dtype
pandas.api.types.is_int64_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the int64 dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of the int64 dtype.
Notes
Depending on system architecture, the return value of is_int64_dtype(int) will be True if the OS uses 64-bit integers and False if the OS uses 32-bit integers.
Examples
>>> is_int64_dtype(str)
False
>>> is_int64_dtype(np.int32)
False
>>> is_int64_dtype(np.int64)
True
>>> is_int64_dtype('int8')
False
>>> is_int64_dtype('Int8')
False
>>> is_int64_dtype(pd.Int64Dtype)
True
>>> is_int64_dtype(float)
False
>>> is_int64_dtype(np.uint64) # unsigned
False
>>> is_int64_dtype(np.array(['a', 'b']))
False
>>> is_int64_dtype(np.array([1, 2], dtype=np.int64))
True
>>> is_int64_dtype(pd.Index([1, 2.])) # float
False
>>> is_int64_dtype(np.array([1, 2], dtype=np.uint32)) # unsigned
False
pandas.api.types.is_integer_dtype
pandas.api.types.is_integer_dtype(arr_or_dtype)
Check whether the provided array or dtype is of an integer dtype.
Unlike in is_any_int_dtype, timedelta64 instances will return False.
Changed in version 0.24.0: The nullable Integer dtypes (e.g. pandas.Int64Dtype) are also considered as integer
by this function.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of an integer dtype and not an instance of
timedelta64.
Examples
>>> is_integer_dtype(str)
False
>>> is_integer_dtype(int)
True
>>> is_integer_dtype(float)
False
>>> is_integer_dtype(np.uint64)
True
>>> is_integer_dtype('int8')
True
>>> is_integer_dtype('Int8')
True
>>> is_integer_dtype(pd.Int8Dtype)
True
>>> is_integer_dtype(np.datetime64)
False
>>> is_integer_dtype(np.timedelta64)
False
>>> is_integer_dtype(np.array(['a', 'b']))
False
>>> is_integer_dtype(pd.Series([1, 2]))
True
>>> is_integer_dtype(np.array([], dtype=np.timedelta64))
False
>>> is_integer_dtype(pd.Index([1, 2.])) # float
False
pandas.api.types.is_interval_dtype
pandas.api.types.is_interval_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the Interval dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the Interval dtype.
Examples
>>> is_interval_dtype(object)
False
>>> is_interval_dtype(IntervalDtype())
True
>>> is_interval_dtype([1, 2, 3])
False
>>>
>>> interval = pd.Interval(1, 2, closed="right")
>>> is_interval_dtype(interval)
False
>>> is_interval_dtype(pd.IntervalIndex([interval]))
True
pandas.api.types.is_numeric_dtype
pandas.api.types.is_numeric_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a numeric dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a numeric dtype.
Examples
>>> is_numeric_dtype(str)
False
>>> is_numeric_dtype(int)
True
>>> is_numeric_dtype(float)
True
>>> is_numeric_dtype(np.uint64)
True
>>> is_numeric_dtype(np.datetime64)
False
>>> is_numeric_dtype(np.timedelta64)
False
>>> is_numeric_dtype(np.array(['a', 'b']))
False
>>> is_numeric_dtype(pd.Series([1, 2]))
True
>>> is_numeric_dtype(pd.Index([1, 2.]))
True
>>> is_numeric_dtype(np.array([], dtype=np.timedelta64))
False
pandas.api.types.is_object_dtype
pandas.api.types.is_object_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the object dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the object dtype.
Examples
>>> is_object_dtype(object)
True
>>> is_object_dtype(int)
False
>>> is_object_dtype(np.array([], dtype=object))
True
>>> is_object_dtype(np.array([], dtype=int))
False
>>> is_object_dtype([1, 2, 3])
False
pandas.api.types.is_period_dtype
pandas.api.types.is_period_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the Period dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the Period dtype.
Examples
>>> is_period_dtype(object)
False
>>> is_period_dtype(PeriodDtype(freq="D"))
True
>>> is_period_dtype([1, 2, 3])
False
>>> is_period_dtype(pd.Period("2017-01-01"))
False
>>> is_period_dtype(pd.PeriodIndex([], freq="A"))
True
pandas.api.types.is_signed_integer_dtype
pandas.api.types.is_signed_integer_dtype(arr_or_dtype)
Check whether the provided array or dtype is of a signed integer dtype.
Unlike in is_any_int_dtype, timedelta64 instances will return False.
Changed in version 0.24.0: The nullable Integer dtypes (e.g. pandas.Int64Dtype) are also considered as integer
by this function.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of a signed integer dtype and not an instance of
timedelta64.
Examples
>>> is_signed_integer_dtype(str)
False
>>> is_signed_integer_dtype(int)
True
>>> is_signed_integer_dtype(float)
False
>>> is_signed_integer_dtype(np.uint64) # unsigned
False
>>> is_signed_integer_dtype('int8')
True
>>> is_signed_integer_dtype('Int8')
True
>>> is_signed_integer_dtype(pd.Int8Dtype)
True
>>> is_signed_integer_dtype(np.datetime64)
False
>>> is_signed_integer_dtype(np.timedelta64)
False
>>> is_signed_integer_dtype(np.array(['a', 'b']))
False
>>> is_signed_integer_dtype(pd.Series([1, 2]))
True
>>> is_signed_integer_dtype(np.array([], dtype=np.timedelta64))
False
>>> is_signed_integer_dtype(pd.Index([1, 2.])) # float
False
>>> is_signed_integer_dtype(np.array([1, 2], dtype=np.uint32)) # unsigned
False
pandas.api.types.is_string_dtype
pandas.api.types.is_string_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the string dtype.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of the string dtype.
Examples
>>> is_string_dtype(str)
True
>>> is_string_dtype(object)
True
>>> is_string_dtype(int)
False
>>>
>>> is_string_dtype(np.array(['a', 'b']))
True
pandas.api.types.is_timedelta64_dtype
pandas.api.types.is_timedelta64_dtype(arr_or_dtype)
Check whether an array-like or dtype is of the timedelta64 dtype.
Parameters
arr_or_dtype [array-like] The array-like or dtype to check.
Returns
boolean Whether or not the array-like or dtype is of the timedelta64 dtype.
Examples
>>> is_timedelta64_dtype(object)
False
>>> is_timedelta64_dtype(np.timedelta64)
True
>>> is_timedelta64_dtype([1, 2, 3])
False
>>> is_timedelta64_dtype(pd.Series([], dtype="timedelta64[ns]"))
True
>>> is_timedelta64_dtype('0 days')
False
pandas.api.types.is_timedelta64_ns_dtype
pandas.api.types.is_timedelta64_ns_dtype(arr_or_dtype)
Check whether the provided array or dtype is of the timedelta64[ns] dtype.
This is a very specific dtype, so generic ones like np.timedelta64 will return False if passed into this function.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of the timedelta64[ns] dtype.
Examples
>>> is_timedelta64_ns_dtype(np.dtype('m8[ns]'))
True
>>> is_timedelta64_ns_dtype(np.dtype('m8[ps]')) # Wrong frequency
False
>>> is_timedelta64_ns_dtype(np.array([1, 2], dtype='m8[ns]'))
True
>>> is_timedelta64_ns_dtype(np.array([1, 2], dtype=np.timedelta64))
False
pandas.api.types.is_unsigned_integer_dtype
pandas.api.types.is_unsigned_integer_dtype(arr_or_dtype)
Check whether the provided array or dtype is of an unsigned integer dtype.
Changed in version 0.24.0: The nullable Integer dtypes (e.g. pandas.UInt64Dtype) are also considered as integer
by this function.
Parameters
arr_or_dtype [array-like] The array or dtype to check.
Returns
boolean Whether or not the array or dtype is of an unsigned integer dtype.
Examples
>>> is_unsigned_integer_dtype(str)
False
>>> is_unsigned_integer_dtype(int) # signed
False
>>> is_unsigned_integer_dtype(float)
False
>>> is_unsigned_integer_dtype(np.uint64)
True
>>> is_unsigned_integer_dtype('uint8')
True
>>> is_unsigned_integer_dtype('UInt8')
True
>>> is_unsigned_integer_dtype(pd.UInt8Dtype)
True
>>> is_unsigned_integer_dtype(np.array(['a', 'b']))
False
>>> is_unsigned_integer_dtype(pd.Series([1, 2])) # signed
False
>>> is_unsigned_integer_dtype(pd.Index([1, 2.])) # float
False
>>> is_unsigned_integer_dtype(np.array([1, 2], dtype=np.uint32))
True
pandas.api.types.is_sparse
pandas.api.types.is_sparse(arr)
Check whether an array-like is a 1-D pandas sparse array.
Check that the one-dimensional array-like is a pandas sparse array. Returns True if it is a pandas sparse array,
not another type of sparse array.
Parameters
arr [array-like] Array-like to check.
Returns
bool Whether or not the array-like is a pandas sparse array.
Examples
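A minimal sketch:
>>> from pandas.api.types import is_sparse
>>> is_sparse(pd.arrays.SparseArray([0, 0, 1]))
True
>>> is_sparse(np.array([0, 0, 1]))
False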
Iterable introspection
pandas.api.types.is_dict_like
pandas.api.types.is_dict_like(obj)
Check if the object is dict-like.
Parameters
obj [The object to check]
Returns
is_dict_like [bool] Whether obj has dict-like properties.
Examples
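A minimal sketch:
>>> from pandas.api.types import is_dict_like
>>> is_dict_like({1: 2})
True
>>> is_dict_like([1, 2, 3])
False
>>> is_dict_like(dict)
False
>>> is_dict_like(dict())
True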
pandas.api.types.is_file_like
pandas.api.types.is_file_like(obj)
Check if the object is a file-like object.
For objects to be considered file-like, they must be an iterator AND have either a read and/or write method as
an attribute.
Note: file-like objects must be iterable, but iterable objects need not be file-like.
Parameters
obj [The object to check]
Returns
is_file_like [bool] Whether obj has file-like properties.
Examples
>>> import io
>>> buffer = io.StringIO("data")
>>> is_file_like(buffer)
True
>>> is_file_like([1, 2, 3])
False
pandas.api.types.is_list_like
pandas.api.types.is_list_like()
Check if the object is list-like.
Objects that are considered list-like are for example Python lists, tuples, sets, NumPy arrays, and Pandas Series.
Strings and datetime objects, however, are not considered list-like.
Parameters
obj [object] Object to check.
allow_sets [bool, default True] If this parameter is False, sets will not be considered list-like.
New in version 0.24.0.
Returns
bool Whether obj has list-like properties.
Examples
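A minimal sketch:
>>> from pandas.api.types import is_list_like
>>> is_list_like([1, 2, 3])
True
>>> is_list_like({1, 2, 3})
True
>>> is_list_like("foo")
False
>>> is_list_like(1)
False
>>> is_list_like(np.array([2]))
True
>>> is_list_like(np.array(2))
False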
pandas.api.types.is_named_tuple
pandas.api.types.is_named_tuple(obj)
Check if the object is a named tuple.
Parameters
obj [The object to check]
Returns
is_named_tuple [bool] Whether obj is a named tuple.
Examples
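A minimal sketch:
>>> from collections import namedtuple
>>> from pandas.api.types import is_named_tuple
>>> Point = namedtuple("Point", ["x", "y"])
>>> is_named_tuple(Point(1, 2))
True
>>> is_named_tuple((1, 2))
False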
pandas.api.types.is_iterator
pandas.api.types.is_iterator()
Check if the object is an iterator.
This is intended for generators, not list-like objects.
Parameters
obj [The object to check]
Returns
is_iter [bool] Whether obj is an iterator.
Examples
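A minimal sketch:
>>> from pandas.api.types import is_iterator
>>> is_iterator((x for x in []))
True
>>> is_iterator([1, 2, 3])
False
>>> is_iterator("foobar")
False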
Scalar introspection
pandas.api.types.is_bool
pandas.api.types.is_bool()
Return True if given object is boolean.
Returns
bool
pandas.api.types.is_categorical
pandas.api.types.is_categorical(arr)
Check whether an array-like is a Categorical instance.
Parameters
arr [array-like] The array-like to check.
Returns
boolean Whether or not the array-like is of a Categorical instance.
Examples
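A minimal sketch:
>>> from pandas.api.types import is_categorical
>>> is_categorical([1, 2, 3])
False
>>> is_categorical(pd.Categorical([1, 2, 3]))
True
>>> is_categorical(pd.CategoricalIndex([1, 2, 3]))
True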
pandas.api.types.is_complex
pandas.api.types.is_complex()
Return True if given object is complex.
Returns
bool
pandas.api.types.is_float
pandas.api.types.is_float()
Return True if given object is float.
Returns
bool
pandas.api.types.is_hashable
pandas.api.types.is_hashable(obj)
Return True if hash(obj) will succeed, False otherwise.
Some types will pass a test against collections.abc.Hashable but fail when they are actually hashed with hash().
Distinguish between these and other types by trying the call to hash() and seeing if they raise TypeError.
Returns
bool
Examples
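A minimal sketch: strings are hashable, while mutable containers such as lists are not.
>>> from pandas.api.types import is_hashable
>>> is_hashable("a string")
True
>>> is_hashable([])
False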
pandas.api.types.is_integer
pandas.api.types.is_integer()
Return True if given object is integer.
Returns
bool
pandas.api.types.is_interval
pandas.api.types.is_interval()
pandas.api.types.is_number
pandas.api.types.is_number(obj)
Check if the object is a number.
Returns True when the object is a number, and False if it is not.
Parameters
obj [any type] The object to check if is a number.
Returns
is_number [bool] Whether obj is a number or not.
See also:
api.types.is_integer Checks a subgroup of numbers.
Examples
>>> pd.api.types.is_number(1)
True
>>> pd.api.types.is_number(7.15)
True
>>> pd.api.types.is_number(False)
True
>>> pd.api.types.is_number("foo")
False
>>> pd.api.types.is_number("5")
False
pandas.api.types.is_re
pandas.api.types.is_re(obj)
Check if the object is a regex pattern instance.
Parameters
obj [The object to check]
Returns
is_regex [bool] Whether obj is a regex pattern.
Examples
>>> is_re(re.compile(".*"))
True
>>> is_re("foo")
False
pandas.api.types.is_re_compilable
pandas.api.types.is_re_compilable(obj)
Check if the object can be compiled into a regex pattern instance.
Parameters
obj [The object to check]
Returns
is_regex_compilable [bool] Whether obj can be compiled as a regex pattern.
Examples
>>> is_re_compilable(".*")
True
>>> is_re_compilable(1)
False
pandas.api.types.is_scalar
pandas.api.types.is_scalar()
Return True if given object is scalar.
Parameters
val [object] This includes:
• numpy array scalar (e.g. np.int64)
• Python builtin numerics
• Python builtin byte arrays and strings
• None
• datetime.datetime
• datetime.timedelta
• Period
• decimal.Decimal
• Interval
• DateOffset
• Fraction
• Number.
Returns
bool Return True if given object is scalar.
Examples
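A minimal sketch:
>>> from pandas.api.types import is_scalar
>>> is_scalar(3.0)
True
>>> is_scalar([1, 2])
False
>>> is_scalar(pd.Timestamp("2020-09-01"))
True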
pandas.show_versions
pandas.show_versions(as_json=False)
Provide useful information, important for bug reports.
It comprises info about the host operating system, the pandas version, and the versions of other installed related packages.
Parameters
as_json [str or bool, default False]
• If False, outputs info in a human readable form to the console.
• If str, it will be considered as a path to a file. Info will be written to that file in
JSON format.
• If True, outputs info in JSON format to the console.
3.15 Extensions
These are primarily intended for library authors looking to extend pandas objects.
3.15.1 pandas.api.extensions.register_extension_dtype
pandas.api.extensions.register_extension_dtype(cls)
Register an ExtensionType with pandas as class decorator.
New in version 0.24.0.
This enables operations like .astype(name) for the name of the ExtensionDtype.
Returns
callable A class decorator.
Examples
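A minimal sketch (the dtype name "myextension" is hypothetical):
>>> from pandas.api.extensions import register_extension_dtype, ExtensionDtype
>>> @register_extension_dtype
... class MyExtensionDtype(ExtensionDtype):
...     name = "myextension"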
3.15.2 pandas.api.extensions.register_dataframe_accessor
pandas.api.extensions.register_dataframe_accessor(name)
Register a custom accessor on DataFrame objects.
Parameters
name [str] Name under which the accessor should be registered. A warning is issued if this
name conflicts with a preexisting attribute.
Returns
callable A class decorator.
See also:
register_dataframe_accessor Register a custom accessor on DataFrame objects.
register_series_accessor Register a custom accessor on Series objects.
Notes
When accessed, your accessor will be initialized with the pandas object the user is interacting with. So the signature must be
def __init__(self, pandas_object):  # noqa: E999
    ...
For consistency with pandas methods, you should raise an AttributeError if the data passed to your accessor has an incorrect dtype.
Examples
import pandas as pd

@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map, e.g., using Cartopy
        pass
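Back in an interactive session, the accessor is then available on any DataFrame (a sketch, assuming the latitude/longitude column names used above):
>>> ds = pd.DataFrame({"longitude": np.linspace(0, 10),
...                    "latitude": np.linspace(0, 20)})
>>> ds.geo.center
(5.0, 10.0)
>>> ds.geo.plot()  # would plot this data on a map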
3.15.3 pandas.api.extensions.register_series_accessor
pandas.api.extensions.register_series_accessor(name)
Register a custom accessor on Series objects.
Parameters
name [str] Name under which the accessor should be registered. A warning is issued if this
name conflicts with a preexisting attribute.
Returns
callable A class decorator.
See also:
register_dataframe_accessor Register a custom accessor on DataFrame objects.
register_series_accessor Register a custom accessor on Series objects.
register_index_accessor Register a custom accessor on Index objects.
Notes
When accessed, your accessor will be initialized with the pandas object the user is interacting with. So the
signature must be
def __init__(self, pandas_object):  # noqa: E999
    ...
For consistency with pandas methods, you should raise an AttributeError if the data passed to your accessor has an incorrect dtype.
>>> pd.Series(['a', 'b']).dt
Traceback (most recent call last):
...
AttributeError: Can only use .dt accessor with datetimelike values
Examples
@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map, e.g., using Cartopy
        pass
3.15.4 pandas.api.extensions.register_index_accessor
pandas.api.extensions.register_index_accessor(name)
Register a custom accessor on Index objects.
Parameters
name [str] Name under which the accessor should be registered. A warning is issued if this
name conflicts with a preexisting attribute.
Returns
callable A class decorator.
See also:
register_dataframe_accessor Register a custom accessor on DataFrame objects.
register_series_accessor Register a custom accessor on Series objects.
register_index_accessor Register a custom accessor on Index objects.
Notes
When accessed, your accessor will be initialized with the pandas object the user is interacting with. So the
signature must be
def __init__(self, pandas_object):  # noqa: E999
    ...
For consistency with pandas methods, you should raise an AttributeError if the data passed to your accessor has an incorrect dtype.
>>> pd.Series(['a', 'b']).dt
Traceback (most recent call last):
...
AttributeError: Can only use .dt accessor with datetimelike values
Examples
@pd.api.extensions.register_dataframe_accessor("geo")
class GeoAccessor:
    def __init__(self, pandas_obj):
        self._obj = pandas_obj

    @property
    def center(self):
        # return the geographic center point of this DataFrame
        lat = self._obj.latitude
        lon = self._obj.longitude
        return (float(lon.mean()), float(lat.mean()))

    def plot(self):
        # plot this array's data on a map, e.g., using Cartopy
        pass
3.15.5 pandas.api.extensions.ExtensionDtype
class pandas.api.extensions.ExtensionDtype
A custom data type, to be paired with an ExtensionArray.
See also:
extensions.register_extension_dtype Register an ExtensionType with pandas as class decorator.
extensions.ExtensionArray Abstract base class for custom 1-D array types.
Notes
The interface includes the following abstract methods that must be implemented by subclasses:
• type
• name
The following attributes and methods influence the behavior of the dtype in pandas operations
• _is_numeric
• _is_boolean
• _get_common_dtype
Optionally one can override construct_array_type for construction with the name of this dtype via the Registry.
See extensions.register_extension_dtype().
• construct_array_type
The na_value class attribute can be used to set the default NA value for this type. numpy.nan is used by
default.
ExtensionDtypes are required to be hashable. The base class provides a default implementation, which relies
on the _metadata class attribute. _metadata should be a tuple containing the strings that define your data
type. For example, with PeriodDtype that’s the freq attribute.
If you have a parametrized dtype you should set the ``_metadata`` class property.
Ideally, the attributes in _metadata will match the parameters to your ExtensionDtype.__init__ (if
any). If any of the attributes in _metadata don’t implement the standard __eq__ or __hash__, the default
implementations here will not work.
Changed in version 0.24.0: Added _metadata, __hash__, and changed the default definition of __eq__.
For interaction with Apache Arrow (pyarrow), a __from_arrow__ method can be implemented: this method receives a pyarrow Array or ChunkedArray as its only argument and is expected to return the appropriate pandas ExtensionArray for this dtype and the passed values:
class ExtensionDtype:
    def __from_arrow__(
        self, array: Union[pyarrow.Array, pyarrow.ChunkedArray]
    ) -> ExtensionArray:
        ...
This class does not inherit from ‘abc.ABCMeta’ for performance reasons. Methods and properties required by
the interface raise pandas.errors.AbstractMethodError and no register method is provided for
registering virtual subclasses.
Attributes
pandas.api.extensions.ExtensionDtype.kind
property ExtensionDtype.kind
A character code (one of ‘biufcmMOSUV’), default ‘O’
This should match the NumPy dtype used when the array is converted to an ndarray, which is probably
‘O’ for object if the extension type cannot be represented as a built-in NumPy type.
See also:
numpy.dtype.kind
pandas.api.extensions.ExtensionDtype.na_value
property ExtensionDtype.na_value
Default NA value to use for this type.
This is used in e.g. ExtensionArray.take. This should be the user-facing “boxed” version of the NA value,
not the physical NA value for storage. e.g. for JSONArray, this is an empty dictionary.
pandas.api.extensions.ExtensionDtype.name
property ExtensionDtype.name
A string identifying the data type.
Will be used for display in, e.g. Series.dtype
pandas.api.extensions.ExtensionDtype.names
property ExtensionDtype.names
Ordered list of field names, or None if there are no fields.
This is for compatibility with NumPy arrays, and may be removed in the future.
pandas.api.extensions.ExtensionDtype.type
property ExtensionDtype.type
The scalar type for the array, e.g. int
It’s expected ExtensionArray[item] returns an instance of ExtensionDtype.type for scalar
item, assuming that value is valid (not NA). NA values do not need to be instances of type.
Methods
pandas.api.extensions.ExtensionDtype.construct_array_type
classmethod ExtensionDtype.construct_array_type()
Return the array type associated with this dtype.
Returns
type
pandas.api.extensions.ExtensionDtype.construct_from_string
classmethod ExtensionDtype.construct_from_string(string)
Construct this type from a string.
This is useful mainly for data types that accept parameters. For example, a period dtype accepts a fre-
quency parameter that can be set as period[H] (where H means hourly frequency).
By default, in the abstract class, just the name of the type is expected. But subclasses can overwrite this
method to accept parameters.
Parameters
string [str] The name of the type, for example category.
Returns
ExtensionDtype Instance of the dtype.
Raises
TypeError If a class cannot be constructed from this ‘string’.
Examples
For extension dtypes with arguments the following may be an adequate implementation.
>>> @classmethod
... def construct_from_string(cls, string):
... pattern = re.compile(r"^my_type\[(?P<arg_name>.+)\]$")
... match = pattern.match(string)
... if match:
... return cls(**match.groupdict())
... else:
... raise TypeError(
... f"Cannot construct a '{cls.__name__}' from '{string}'"
... )
pandas.api.extensions.ExtensionDtype.is_dtype
classmethod ExtensionDtype.is_dtype(dtype)
Check if we match ‘dtype’.
Parameters
dtype [object] The object to check.
Returns
bool
Notes
3.15.6 pandas.api.extensions.ExtensionArray
class pandas.api.extensions.ExtensionArray
Abstract base class for custom 1-D array types.
pandas will recognize instances of this class as proper arrays with a custom type and will not attempt to coerce
them to objects. They may be stored directly inside a DataFrame or Series.
Notes
The interface includes the following abstract methods that must be implemented by subclasses:
• _from_sequence
• _from_factorized
• __getitem__
• __len__
• __eq__
• dtype
• nbytes
• isna
• take
• copy
• _concat_same_type
A default repr displaying the type, (truncated) data, length, and dtype is provided. It can be customized or
replaced by overriding:
• __repr__ : A default repr for the ExtensionArray.
• _formatter : Print scalars inside a Series or DataFrame.
Some methods require casting the ExtensionArray to an ndarray of Python objects with self.astype(object), which may be expensive. When performance is a concern, we highly recommend overriding the following methods:
• fillna
• dropna
• unique
• factorize / _values_for_factorize
• argsort / _values_for_argsort
• searchsorted
The remaining methods implemented on this class should be performant, as they only compose abstract methods.
Still, a more efficient implementation may be available, and these methods can be overridden.
One can implement methods to handle array reductions.
• _reduce
One can implement methods to handle parsing from strings that will be used in methods such as pandas.io.
parsers.read_csv.
• _from_sequence_of_strings
This class does not inherit from ‘abc.ABCMeta’ for performance reasons. Methods and properties required by
the interface raise pandas.errors.AbstractMethodError and no register method is provided for
registering virtual subclasses.
ExtensionArrays are limited to 1 dimension.
They may be backed by none, one, or many NumPy arrays. For example, pandas.Categorical is an extension array backed by two arrays, one for codes and one for categories. An array of IPv6 addresses may be backed by a NumPy structured array with two fields, one for the lower 64 bits and one for the upper 64 bits. Or they may be backed by some other storage type, like Python lists. Pandas makes no assumptions on how the data are stored, just that it can be converted to a NumPy array. The ExtensionArray interface does not impose any rules on how this data is stored. However, currently, the backing data cannot be stored in attributes called .values or ._values to ensure full compatibility with pandas internals. But other names such as .data, ._data, ._items, ... can be freely used.
If implementing NumPy’s __array_ufunc__ interface, pandas expects that
1. You defer by returning NotImplemented when any Series are present in inputs. Pandas will extract
the arrays and call the ufunc again.
2. You define a _HANDLED_TYPES tuple as an attribute on the class. Pandas inspects this to determine whether the ufunc is valid for the types present.
See NumPy universal functions for more.
By default, ExtensionArrays are not hashable. Immutable subclasses may override this behavior.
Attributes
pandas.api.extensions.ExtensionArray.dtype
property ExtensionArray.dtype
An instance of ‘ExtensionDtype’.
pandas.api.extensions.ExtensionArray.nbytes
property ExtensionArray.nbytes
The number of bytes needed to store this object in memory.
pandas.api.extensions.ExtensionArray.ndim
property ExtensionArray.ndim
Extension Arrays are only allowed to be 1-dimensional.
pandas.api.extensions.ExtensionArray.shape
property ExtensionArray.shape
Return a tuple of the array dimensions.
Methods
argsort([ascending, kind, na_position]) Return the indices that would sort this array.
astype(dtype[, copy]) Cast to a NumPy array with ‘dtype’.
copy() Return a copy of the array.
dropna() Return ExtensionArray without NA values.
factorize([na_sentinel]) Encode the extension array as an enumerated type.
fillna([value, method, limit]) Fill NA/NaN values using the specified method.
equals(other) Return if another array is equivalent to this array.
isna() A 1-D array indicating if each value is missing.
ravel([order]) Return a flattened view on this array.
repeat(repeats[, axis]) Repeat elements of an ExtensionArray.
searchsorted(value[, side, sorter]) Find indices where elements should be inserted to maintain order.
shift([periods, fill_value]) Shift values by desired number.
take(indices, *[, allow_fill, fill_value]) Take elements from an array.
unique() Compute the ExtensionArray of unique values.
view([dtype]) Return a view on the array.
_concat_same_type(to_concat) Concatenate multiple arrays of this dtype.
_formatter([boxed]) Formatting function for scalar values.
_from_factorized(values, original) Reconstruct an ExtensionArray after factorization.
_from_sequence(scalars, *[, dtype, copy]) Construct a new ExtensionArray from a sequence of scalars.
_from_sequence_of_strings(strings, *[, ...]) Construct a new ExtensionArray from a sequence of strings.
_reduce(name, *[, skipna]) Return a scalar result of performing the reduction operation.
_values_for_argsort() Return values for sorting.
_values_for_factorize() Return an array and missing value suitable for factorization.
pandas.api.extensions.ExtensionArray.argsort
pandas.api.extensions.ExtensionArray.astype
ExtensionArray.astype(dtype, copy=True)
Cast to a NumPy array with ‘dtype’.
Parameters
dtype [str or dtype] Typecode or data-type to which the array is cast.
copy [bool, default True] Whether to copy the data, even if not necessary. If False, a
copy is made only if the old dtype does not match the new dtype.
Returns
array [ndarray] NumPy ndarray with ‘dtype’ for its dtype.
pandas.api.extensions.ExtensionArray.copy
ExtensionArray.copy()
Return a copy of the array.
Returns
ExtensionArray
pandas.api.extensions.ExtensionArray.dropna
ExtensionArray.dropna()
Return ExtensionArray without NA values.
Returns
valid [ExtensionArray]
pandas.api.extensions.ExtensionArray.factorize
ExtensionArray.factorize(na_sentinel=-1)
Encode the extension array as an enumerated type.
Parameters
na_sentinel [int, default -1] Value to use in the codes array to indicate missing values.
Returns
codes [ndarray] An integer NumPy array that’s an indexer into the original ExtensionArray.
uniques [ExtensionArray] An ExtensionArray containing the unique values of self.
Note: uniques will not contain an entry for the NA value of the ExtensionArray if
there are any missing values present in self.
See also:
Notes
pandas.api.extensions.ExtensionArray.fillna
pandas.api.extensions.ExtensionArray.equals
ExtensionArray.equals(other)
Return if another array is equivalent to this array.
Equivalent means that both arrays have the same shape and dtype, and all values compare equal. Missing
values in the same location are considered equal (in contrast with normal equality).
Parameters
other [ExtensionArray] Array to compare to this Array.
Returns
boolean Whether the arrays are equivalent.
pandas.api.extensions.ExtensionArray.isna
ExtensionArray.isna()
A 1-D array indicating if each value is missing.
Returns
na_values [Union[np.ndarray, ExtensionArray]] In most cases, this should return a
NumPy ndarray. For exceptional cases like SparseArray, where returning an
ndarray would be expensive, an ExtensionArray may be returned.
Notes
pandas.api.extensions.ExtensionArray.ravel
ExtensionArray.ravel(order='C')
Return a flattened view on this array.
Parameters
order [{None, ‘C’, ‘F’, ‘A’, ‘K’}, default ‘C’]
Returns
ExtensionArray
Notes
pandas.api.extensions.ExtensionArray.repeat
ExtensionArray.repeat(repeats, axis=None)
Repeat elements of an ExtensionArray.
Returns a new ExtensionArray where each element of the current ExtensionArray is repeated consecutively a given number of times.
Parameters
repeats [int or array of ints] The number of repetitions for each element. This should
be a non-negative integer. Repeating 0 times will return an empty ExtensionArray.
axis [None] Must be None. Has no effect but is accepted for compatibility with numpy.
Returns
repeated_array [ExtensionArray] Newly created ExtensionArray with repeated elements.
See also:
Examples
pandas.api.extensions.ExtensionArray.searchsorted
Parameters
value [array_like] Values to insert into self.
side [{‘left’, ‘right’}, optional] If ‘left’, the index of the first suitable location found
is given. If ‘right’, return the last such index. If there is no suitable index, return
either 0 or N (where N is the length of self).
sorter [1-D array_like, optional] Optional array of integer indices that sort array a into
ascending order. They are typically the result of argsort.
Returns
array of ints Array of insertion points with the same shape as value.
See also:
pandas.api.extensions.ExtensionArray.shift
ExtensionArray.shift(periods=1, fill_value=None)
Shift values by desired number.
Newly introduced missing values are filled with self.dtype.na_value.
New in version 0.24.0.
Parameters
periods [int, default 1] The number of periods to shift. Negative values are allowed for
shifting backwards.
fill_value [object, optional] The scalar value to use for newly introduced missing values. The default is self.dtype.na_value.
New in version 0.24.0.
Returns
ExtensionArray Shifted.
Notes
pandas.api.extensions.ExtensionArray.take
Notes
Examples
Here’s an example implementation, which relies on casting the extension array to object dtype. This uses
the helper method pandas.api.extensions.take().
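A sketch of such an implementation, following the description above (details such as the comments are illustrative, not the exact original):
def take(self, indices, allow_fill=False, fill_value=None):
    from pandas.api.extensions import take

    # If the ExtensionArray is backed by an ndarray, then
    # just pass that here instead of coercing to object.
    data = self.astype(object)

    if allow_fill and fill_value is None:
        fill_value = self.dtype.na_value

    # The fill value should be translated from the scalar type
    # for the array to the physical storage type for the data
    # before passing it to take.
    result = take(data, indices, fill_value=fill_value, allow_fill=allow_fill)
    return self._from_sequence(result, dtype=self.dtype)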
pandas.api.extensions.ExtensionArray.unique
ExtensionArray.unique()
Compute the ExtensionArray of unique values.
Returns
uniques [ExtensionArray]
pandas.api.extensions.ExtensionArray.view
ExtensionArray.view(dtype=None)
Return a view on the array.
Parameters
dtype [str, np.dtype, or ExtensionDtype, optional] Default None.
Returns
ExtensionArray or np.ndarray A view on the ExtensionArray’s data.
pandas.api.extensions.ExtensionArray._concat_same_type
classmethod ExtensionArray._concat_same_type(to_concat)
Concatenate multiple array of this dtype.
Parameters
to_concat [sequence of this type]
Returns
ExtensionArray
pandas.api.extensions.ExtensionArray._formatter
ExtensionArray._formatter(boxed=False)
Formatting function for scalar values.
This is used in the default ‘__repr__’. The returned formatting function receives instances of your scalar
type.
Parameters
boxed [bool, default False] An indicator for whether or not your array is being printed
within a Series, DataFrame, or Index (True), or just by itself (False). This may be
useful if you want scalar values to appear differently within a Series versus on its
own (e.g. quoted or not).
Returns
Callable[[Any], str] A callable that gets instances of the scalar type and returns a
string. By default, repr() is used when boxed=False and str() is used
when boxed=True.
pandas.api.extensions.ExtensionArray._from_factorized
pandas.api.extensions.ExtensionArray._from_sequence
pandas.api.extensions.ExtensionArray._from_sequence_of_strings
pandas.api.extensions.ExtensionArray._reduce
pandas.api.extensions.ExtensionArray._values_for_argsort
ExtensionArray._values_for_argsort()
Return values for sorting.
Returns
ndarray The transformed values should maintain the ordering between values within
the array.
See also:
pandas.api.extensions.ExtensionArray._values_for_factorize
ExtensionArray._values_for_factorize()
Return an array and missing value suitable for factorization.
Returns
values [ndarray] An array suitable for factorization. This should maintain order and
be a supported dtype (Float64, Int64, UInt64, String, Object). By default, the
extension array is cast to object dtype.
na_value [object] The value in values to consider missing. This will be treated as NA
in the factorization routines, so it will be coded as na_sentinel and not included in
uniques. By default, np.nan is used.
Notes
3.15.7 pandas.arrays.PandasArray
Attributes
None
Methods
None
Additionally, we have some utility methods for ensuring your object behaves correctly.
3.15.8 pandas.api.indexers.check_array_indexer
pandas.api.indexers.check_array_indexer(array, indexer)
Check if indexer is a valid array indexer for array.
For a boolean mask, array and indexer are checked to have the same length. The dtype is validated, and if it is
an integer or boolean ExtensionArray, it is checked if there are missing values present, and it is converted to the
appropriate numpy array. Other dtypes will raise an error.
Non-array indexers (integer, slice, Ellipsis, tuples, ...) are passed through as is.
New in version 1.0.0.
Parameters
array [array-like] The array that is being indexed (only used for the length).
indexer [array-like or list-like] The array-like that’s used to index. List-like input that is
not yet a numpy array or an ExtensionArray is converted to one. Other input types are
passed through as is.
Returns
numpy.ndarray The validated indexer as a numpy array that can be used to index.
Raises
IndexError When the lengths don’t match.
ValueError When indexer cannot be converted to a numpy ndarray to index (e.g. presence
of missing values).
See also:
api.types.is_bool_dtype Check if key is of boolean dtype.
Examples
When checking a boolean mask, a boolean ndarray is returned when the arguments are all valid. A numpy boolean mask will get passed through (if the length is correct). Similarly for integer indexers, an integer ndarray is returned when it is a valid indexer; otherwise an error is raised (for integer indexers, a matching length is not required).
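A minimal sketch of the boolean-mask case (not the original docstring examples):
>>> mask = pd.array([True, False], dtype="boolean")
>>> arr = pd.array([1, 2])
>>> pd.api.indexers.check_array_indexer(arr, mask)
array([ True, False])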
The sentinel pandas.api.extensions.no_default is used as the default value in some methods. Use an is
comparison to check if the user provides a non-default value.
FOUR
DEVELOPMENT
Table of contents:
• Where to start?
• Bug reports and enhancement requests
• Working with the code
– Version control, Git, and GitHub
– Getting started with Git
– Forking
– Creating a development environment
* Requirements
* Building the documentation
* Building master branch documentation
• Contributing to the code base
– Code standards
– Pre-commit
– Optional dependencies
* C (cpplint)
* Python (PEP8 / black)
* Import formatting
* Backwards compatibility
– Type hints
* Style guidelines
* pandas-specific types
* Validating type hints
– Testing with continuous integration
– Test-driven development/code writing
* Writing tests
* Transitioning to pytest
* Using pytest
* Using hypothesis
* Testing warnings
– Running the test suite
– Running the performance test suite
– Documenting your code
• Contributing your changes to pandas
– Committing your code
– Pushing your changes
– Review your code
– Finally, make the pull request
– Updating your pull request
– Delete your merged branch (optional)
• Tips for a successful pull request
All contributions, bug reports, bug fixes, documentation improvements, enhancements, and ideas are welcome.
If you are brand new to pandas or open-source development, we recommend going through the GitHub “issues” tab to
find issues that interest you. There are a number of issues listed under Docs and good first issue where you could start
out. Once you’ve found an interesting issue, you can return here to get your development environment setup.
When you start working on an issue, it’s a good idea to assign the issue to yourself, so nobody else duplicates the
work on it. GitHub restricts assigning issues to maintainers of the project only. In most projects, and until recently in
pandas, contributors added a comment letting others know they are working on an issue. While this is ok, you need to
check each issue individually, and it’s not possible to find the unassigned ones.
For this reason, we implemented a workaround consisting of adding a comment with the exact text take. When you
do it, a GitHub action will automatically assign you the issue (this will take seconds, and may require refreshing the
page to see it). By doing this, it’s possible to filter the list of issues and find only the unassigned ones.
So, a good way to find an issue to start contributing to pandas is to check the list of unassigned good first issues and
assign yourself one you like by writing a comment with the exact text take.
If for whatever reason you are not able to continue working on the issue, please try to unassign it, so other people know it’s available again. You can check the list of assigned issues, since people may not be working on them anymore. If you want to work on one that is assigned, feel free to kindly ask the current assignee if you can take it (please allow at least a week of inactivity before considering work on the issue discontinued).
Feel free to ask questions on the mailing list or on Gitter.
Bug reports are an important part of making pandas more stable. Having a complete bug report will allow others to
reproduce the bug and provide insight into fixing. See this stackoverflow article and this blogpost for tips on writing a
good bug report.
Trying the bug-producing code out on the master branch is often a worthwhile exercise to confirm the bug still exists.
It is also worth searching existing bug reports and pull requests to see if the issue has already been reported and/or
fixed.
Bug reports must:
1. Include a short, self-contained Python snippet reproducing the problem. You can format the code nicely by
using GitHub Flavored Markdown:
```python
>>> from pandas import DataFrame
>>> df = DataFrame(...)
...
```
2. Include the full version string of pandas and its dependencies. You can use the built-in function shown in the snippet after this list.
3. Explain why the current behavior is wrong/not desired and what you expect instead.
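For point 2, a minimal snippet using the built-in helper:
>>> import pandas as pd
>>> pd.show_versions()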
The issue will then show up to the pandas community and be open to comments/ideas from others.
Now that you have an issue you want to fix, enhancement to add, or documentation to improve, you need to learn how
to work with GitHub and the pandas code base.
To the new user, working with Git is one of the more daunting aspects of contributing to pandas. It can very quickly
become overwhelming, but sticking to the guidelines below will help keep the process straightforward and mostly
trouble free. As always, if you are having difficulties please feel free to ask for help.
The code is hosted on GitHub. To contribute you will need to sign up for a free GitHub account. We use Git for
version control to allow many people to work together on the project.
Some great resources for learning Git:
• the GitHub help pages.
• the NumPy’s documentation.
• Matthew Brett’s Pydagogue.
GitHub has instructions for installing git, setting up your SSH key, and configuring git. All these steps need to be
completed before you can work seamlessly between your local repository and GitHub.
Forking
You will need your own fork to work on the code. Go to the pandas project page and hit the Fork button. You will
want to clone your fork to your machine:
This creates the directory pandas-yourname and connects your repository to the upstream (main project) pandas
repository.
Note that performing a shallow clone (with --depth=N, for some N greater than or equal to 1) might break some tests and features such as pd.show_versions(), as the version number can no longer be computed.
To test out code changes, you’ll need to build pandas from source, which requires a C/C++ compiler and Python
environment. If you’re making documentation changes, you can skip to Contributing to the documentation but if you
skip creating the development environment you won’t be able to build the documentation locally before pushing your
changes.
Instead of manually setting up a development environment, you can use Docker to automatically create the environment with just a few commands. pandas provides a DockerFile in the root directory to build a Docker image with a full pandas development environment.
Docker Commands
Pass your GitHub username in the DockerFile to use your own fork:
Even easier, you can integrate Docker with the following IDEs:
Visual Studio Code
You can use the DockerFile to launch a remote session with Visual Studio Code, a popular free IDE, using the .devcontainer.json file. See https://code.visualstudio.com/docs/remote/containers for details.
PyCharm (Professional)
Enable Docker support and use the Services tool window to build and manage images as well as run and interact with
containers. See https://www.jetbrains.com/help/pycharm/docker.html for details.
Note that you might need to rebuild the C extensions if/when you merge with upstream/master using:
Installing a C compiler
pandas uses C extensions (mostly written using Cython) to speed up certain operations. To install pandas from source,
you need to compile these C extensions, which means you need a C compiler. This process depends on which platform
you’re using.
If you have set up your environment using conda, the packages c-compiler and cxx-compiler will install a
fitting compiler for your platform that is compatible with the remaining conda packages. On Windows and macOS,
you will also need to install the SDKs as they have to be distributed separately. These packages will be automatically
installed by using pandas’s environment.yml.
Windows
You will need Build Tools for Visual Studio 2017.
Warning: You DO NOT need to install Visual Studio 2019. You only need “Build Tools for Visual Studio 2019”
found by scrolling down to “All downloads” -> “Tools for Visual Studio 2019”. In the installer, select the “C++
build tools” workload.
You can install the necessary components on the commandline using vs_buildtools.exe:
macOS
To use the conda-based compilers, you will need to install the Developer Tools using xcode-select --install. Otherwise, information about compiler installation can be found here: https://devguide.python.org/setup/#macos
Linux
For Linux-based conda installations, you won’t have to install any additional components outside of the conda
environment. The instructions below are only needed if your setup isn’t based on conda environments.
Some Linux distributions will come with a pre-installed C compiler. To find out which compilers (and versions) are
installed on your system:
# for Debian/Ubuntu:
dpkg --list | grep compiler
# for Red Hat/RHEL/CentOS/Fedora:
yum list installed | grep -i --color compiler
GCC (GNU Compiler Collection) is a widely used compiler which supports C and a number of other languages. If GCC is listed as an installed compiler, nothing more is required. If no C compiler is installed (or you wish to install a newer version) you can install a compiler (GCC in the example code below) with:
For other Linux distributions, consult your favourite search engine for compiler installation instructions.
Let us know if you have any difficulties by opening an issue or reaching out on Gitter.
At this point you should be able to import pandas from your locally built version:
This will create the new environment, and not touch any of your existing environments, nor any existing Python
installation.
To view your environments:
conda info -e
conda deactivate
If you aren’t using conda for your development environment, follow these instructions. You’ll need to have at least
Python 3.6.1 installed on your system.
Unix/macOS with virtualenv
# For instance:
pyenv virtualenv 3.7.6 pandas-dev
Windows
Below is a brief overview on how to set-up a virtual environment with Powershell under Windows. For details please
refer to the official virtualenv user guide
Use an ENV_DIR of your choice. We’ll use ~\virtualenvs\pandas-dev where ‘~’ is the folder pointed to by either
$env:USERPROFILE (Powershell) or %USERPROFILE% (cmd.exe) environment variable. Any parent directories
should already exist.
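A sketch of the usual commands (paths and options are illustrative):
python -m venv $env:USERPROFILE\virtualenvs\pandas-dev
& "$env:USERPROFILE\virtualenvs\pandas-dev\Scripts\Activate.ps1"
python -m pip install -r requirements-dev.txt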
Creating a branch
You want your master branch to reflect only production-ready code, so create a feature branch for making your changes.
For example:
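# create and switch to the feature branch (reconstructed example)
git branch shiny-new-feature
git checkout shiny-new-feature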
This changes your working directory to the shiny-new-feature branch. Keep any changes in this branch specific to one
bug or feature so it is clear what the branch brings to pandas. You can have many shiny-new-features and switch in
between them using the git checkout command.
When creating this branch, make sure your master branch is up to date with the latest upstream master version. To
update your local master branch, you can do:
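# reconstructed example of syncing master with upstream
git checkout master
git fetch upstream
git merge upstream/master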
When you want to update the feature branch with changes in master after you created the branch, check the section on
updating a PR.
Contributing to the documentation benefits everyone who uses pandas. We encourage you to help us improve the
documentation, and you don’t have to be an expert on pandas to do so! In fact, there are sections of the docs that are
worse off after being written by experts. If something in the docs doesn’t make sense to you, updating the relevant
section after you figure it out is a great way to ensure it will help the next person.
Documentation:
The documentation is written in reStructuredText, which is almost like writing in plain English, and built using
Sphinx. The Sphinx Documentation has an excellent introduction to reST. Review the Sphinx docs to perform more
complex changes to the documentation as well.
Some other important things to know about the docs:
• The pandas documentation consists of two parts: the docstrings in the code itself and the docs in this folder
doc/.
The docstrings provide a clear explanation of the usage of the individual functions, while the documentation
in this folder consists of tutorial-like overviews per topic together with some other information (what’s new,
installation, etc).
• The docstrings follow a pandas convention, based on the Numpy Docstring Standard. Follow the pandas
docstring guide for detailed instructions on how to write a correct docstring.
A Python docstring is a string used to document a Python module, class, function or method, so programmers
can understand what it does without having to read the details of the implementation.
Also, it is a common practice to generate online (html) documentation automatically from docstrings. Sphinx
serves this purpose.
The next example gives an idea of what a docstring looks like:
def add(num1, num2):
"""
Add up two integer numbers.
This function simply wraps the ``+`` operator, and does not
do anything interesting, except for illustrating what the
docstring of a very simple function looks like.
Parameters
----------
num1 : int
First number to add.
num2 : int
Second number to add.
Returns
-------
int
The sum of ``num1`` and ``num2``.
See Also
--------
subtract : Subtract one integer from another.
Examples
--------
>>> add(2, 2)
4
>>> add(25, 0)
25
>>> add(10, -10)
0
"""
return num1 + num2
Some standards regarding docstrings exist, which make them easier to read, and allow them be easily exported
to other formats such as html or pdf.
The first conventions every Python docstring should follow are defined in PEP-257.
As PEP-257 is quite broad, other more specific standards also exist. In the case of pandas, the NumPy docstring
convention is followed. These conventions are explained in this document:
– numpydoc docstring guide (which is based on the original Guide to NumPy/SciPy documentation)
numpydoc is a Sphinx extension to support the NumPy docstring convention.
The standard uses reStructuredText (reST). reStructuredText is a markup language that allows encoding styles
in plain text files. Documentation about reStructuredText can be found in:
– Sphinx reStructuredText primer
– Quick reStructuredText reference
– Full reStructuredText specification
pandas has some helpers for sharing docstrings between related classes, see Sharing docstrings.
The rest of this document will summarize all the above guidelines, and will provide additional conventions
specific to the pandas project.
Writing a docstring
General rules
Docstrings must be defined with three double-quotes. No blank lines should be left before or after the docstring.
The text starts in the next line after the opening quotes. The closing quotes have their own line (meaning that
they are not at the end of the last sentence).
On rare occasions reST styles like bold text or italics will be used in docstrings, but it is common to have inline
code, which is presented between backticks. The following are considered inline code:
– The name of a parameter
– Python code, a module, function, built-in, type, literal... (e.g. os, list, numpy.abs, datetime.date, True)
– A pandas class (in the form :class:`pandas.Series`)
– A pandas method (in the form :meth:`pandas.Series.sum`)
– A pandas function (in the form :func:`pandas.to_datetime`)
Note: To display only the last component of the linked class, method or function, prefix it with ~. For example,
:class:`~pandas.Series` will link to pandas.Series but only display the last part, Series as the
link text. See Sphinx cross-referencing syntax for details.
Good:
def add_values(arr):
"""
Add the values in ``arr``.
Bad:
def func():
"""Some function.
There is a blank line between the docstring and the first line
of code ``foo = 1``.
The closing quotes should be in the next line, not in this one."""
foo = 1
Section 1: short summary
The short summary is a single sentence that expresses what the function does in a concise way.
The short summary must start with a capital letter, end with a dot, and fit in a single line. It needs to express
what the object does without providing details. For functions and methods, the short summary must start with
an infinitive verb.
Good:
def astype(dtype):
"""
Cast Series type.
Bad:
def astype(dtype):
"""
Casts Series type.
def astype(dtype):
"""
Method to cast Series type.
def astype(dtype):
"""
Cast Series type
def astype(dtype):
"""
Cast Series type from its current type to the new type defined in
the parameter dtype.
Section 2: extended summary
The extended summary provides details on what the function does. It should not go into the details of the
parameters, or discuss implementation notes, which go in other sections.
A blank line is left between the short summary and the extended summary. Every paragraph in the extended
summary ends with a dot.
The extended summary should provide details on why the function is useful and its use cases, if it is not too
generic.
def unstack():
"""
Pivot a row index to columns.
The index level will be automatically removed from the index when added
as columns.
"""
pass
Section 3: parameters
The details of the parameters will be added in this section. This section has the title “Parameters”, followed by
a line with a hyphen under each letter of the word “Parameters”. A blank line is left before the section title, but
not after, and not between the line with the word “Parameters” and the one with the hyphens.
After the title, each parameter in the signature must be documented, including *args and **kwargs, but not
self.
The parameters are defined by their name, followed by a space, a colon, another space, and the type (or
types). Note that the space between the name and the colon is important. Types are not defined for *args
and **kwargs, but must be defined for all other parameters. After the parameter definition, it is required to
have a line with the parameter description, which is indented, and can have multiple lines. The description must
start with a capital letter, and finish with a dot.
For keyword arguments with a default value, the default will be listed after a comma at the end of the type. The
exact form of the type in this case will be “int, default 0”. In some cases it may be useful to explain what the
default argument means, which can be added after a comma “int, default -1, meaning all cpus”.
When the default value is None, it often means that the value will not be used. In that case, instead of "str, default
None", it is preferred to write "str, optional". When None is a value being used, we will keep the form
“str, default None”. For example, in df.to_csv(compression=None), None is not a value being used,
but means that compression is optional, and no compression is being used if not provided. In this case we will
use "str, optional". Only in cases like func(value=None) and None is being used in the same way
as 0 or foo would be used, then we will specify “str, int or None, default None”.
Good:
class Series:
def plot(self, kind, color='blue', **kwargs):
"""
Generate a plot.
Parameters
----------
kind : str
Kind of matplotlib plot.
color : str, default 'blue'
Color name or rgb code.
**kwargs
These parameters will be passed to the matplotlib plotting
function.
"""
pass
Bad:
class Series:
def plot(self, kind, **kwargs):
"""
Generate a plot.
Note the blank line between the parameters title and the first
parameter. Also, note that after the name of the parameter ``kind``
and before the colon, a space is missing.
Parameters
----------
kind: str
kind of matplotlib plot
"""
pass
Parameter types
When specifying the parameter types, Python built-in data types can be used directly (the Python type is preferred
to the more verbose string, integer, boolean, etc.):
– int
– float
– str
– bool
For complex types, define the subtypes. For dict and tuple, as more than one type is present, we use the
brackets to help read the type (curly brackets for dict and normal brackets for tuple):
– list of int
– dict of {str : int}
– tuple of (str, int, int)
– tuple of (str,)
– set of str
In cases where only a certain set of values is allowed, list them in curly brackets, separated by commas
(followed by a space). If the values are ordinal, list them in that order. Otherwise, list the
default value first, if there is one:
– {0, 10, 25}
– {‘simple’, ‘advanced’}
– {‘low’, ‘medium’, ‘high’}
– {‘cat’, ‘dog’, ‘bird’}
If the type is defined in a Python module, the module must be specified:
– datetime.date
– datetime.datetime
– decimal.Decimal
If the type is in a package, the module must be also specified:
– numpy.ndarray
– scipy.sparse.coo_matrix
If the type is a pandas type, also specify pandas except for Series and DataFrame:
– Series
– DataFrame
– pandas.Index
– pandas.Categorical
– pandas.arrays.SparseArray
If the exact type is not relevant, but must be compatible with a NumPy array, array-like can be specified. If any
type that can be iterated is accepted, iterable can be used:
– array-like
– iterable
If more than one type is accepted, separate them by commas, except the last two types, that need to be separated
by the word ‘or’:
– int or float
– float, decimal.Decimal or None
– str or list of str
If None is one of the accepted values, it always needs to be the last in the list.
For axis, the convention is to use something like ``axis : {0 or 'index', 1 or 'columns'}, default 0``.
Section 4: returns and yields
If the method returns a value, it will be documented in this section. Also if the method yields its output.
The title of the section will be defined in the same way as the “Parameters”. With the names “Returns” or
“Yields” followed by a line with as many hyphens as the letters in the preceding word.
The documentation of the return is also similar to the parameters. But in this case, no name will be provided,
unless the method returns or yields more than one value (a tuple of values).
The types for “Returns” and “Yields” are the same as the ones for the “Parameters”. Also, the description must
finish with a dot.
For example, with a single value:
def sample():
"""
Generate and return a random number.
Returns
-------
float
Random number generated.
"""
return np.random.random()
With more than one value:
def random_letters():
"""
Generate and return a sequence of random letters.
Returns
-------
length : int
Length of the returned string.
letters : str
String of random letters.
"""
length = np.random.randint(1, 10)
letters = ''.join(np.random.choice(string.ascii_lowercase)
for i in range(length))
return length, letters
If the method yields its values instead of returning them:
def random_numbers():
"""
Generate an infinite sequence of random numbers.
Yields
------
float
Random number generated.
"""
while True:
yield np.random.random()
Section 5: see also
This section is used to let users know about pandas functionality related to the one being documented. In rare
cases, if no related methods or functions can be found at all, this section can be skipped.
An obvious example would be the head() and tail() methods. As tail() does the equivalent of head()
but at the end of the Series or DataFrame instead of at the beginning, it is good to let the users know about
it.
To give an intuition on what can be considered related, here there are some examples:
– loc and iloc, as they do the same, but in one case providing indices and in the other positions
– max and min, as they do the opposite
– iterrows, itertuples and items, as a user looking for the method to iterate over
columns can easily end up in the method to iterate over rows, and vice-versa
– fillna and dropna, as both methods are used to handle missing values
– read_csv and to_csv, as they are complementary
– merge and join, as one is a generalization of the other
– astype and pandas.to_datetime, as users may be reading the documentation of astype to know
how to cast as a date, and the way to do it is with pandas.to_datetime
– where is related to numpy.where, as its functionality is based on it
When deciding what is related, you should mainly use your common sense and think about what can be useful
for the users reading the documentation, especially the less experienced ones.
When relating to other libraries (mainly numpy), use the name of the module first (not an alias like np). If the
function is in a module which is not the main one, like scipy.sparse, list the full module (e.g.
scipy.sparse.coo_matrix).
This section has a header, “See Also” (note the capital S and A), followed by the line with hyphens and preceded
by a blank line.
After the header, we will add a line for each related method or function, followed by a space, a colon, another
space, and a short description that illustrates what this method or function does, why it is relevant in this context,
and what the key differences are between the documented function and the one being referenced. The description
must also end with a dot.
Note that in “Returns” and “Yields”, the description is located on the line after the type. In this section, however,
it is located on the same line, with a colon in between. If the description does not fit on the same line, it can
continue onto other lines which must be further indented.
For example:
class Series:
def head(self):
"""
Return the first 5 elements of the Series.
Returns
-------
Series
Subset of the original series with the 5 first values.
See Also
--------
Series.tail : Return the last 5 elements of the Series.
Series.iloc : Return a slice of the elements in the Series,
which can also be used to return the first or last n.
"""
return self.iloc[:5]
Section 6: notes
This is an optional section used for notes about the implementation of the algorithm, or to document technical
aspects of the function behavior.
Feel free to skip it, unless you are familiar with the implementation of the algorithm, or you discover some
counter-intuitive behavior while writing the examples for the function.
This section follows the same format as the extended summary section.
Section 7: examples
This is one of the most important sections of a docstring, despite being placed in the last position, as people
often understand concepts better by example than through accurate explanations.
Examples in docstrings, besides illustrating the usage of the function or method, must be valid Python code, that
returns the given output in a deterministic way, and that can be copied and run by users.
Examples are presented as a session in the Python terminal. >>> is used to present code. ... is used for code
continuing from the previous line. Output is presented immediately after the last line of code generating the
output (no blank lines in between). Comments describing the examples can be added with blank lines before
and after them.
The way to present examples is as follows:
1. Import required libraries (except numpy and pandas)
2. Create the data required for the example
3. Show a very basic example that gives an idea of the most common use case
4. Add examples with explanations that illustrate how the parameters can be used for extended functionality
A simple example could be:
class Series:
def head(self, n=5):
"""
Return the first n elements of the Series.
Parameters
----------
n : int
Number of values to return.
Returns
-------
pandas.Series
Subset of the original series with the n first values.
See Also
--------
tail : Return the last n elements of the Series.
Examples
--------
>>> s = pd.Series(['Ant', 'Bear', 'Cow', 'Dog', 'Falcon',
... 'Lion', 'Monkey', 'Rabbit', 'Zebra'])
>>> s.head()
0 Ant
1 Bear
2 Cow
3 Dog
4 Falcon
dtype: object
With the ``n`` parameter, we can change the number of returned rows:
>>> s.head(n=3)
0 Ant
1 Bear
2 Cow
dtype: object
"""
return self.iloc[:n]
The examples should be as concise as possible. In cases where the complexity of the function requires long
examples, it is recommended to use blocks with headers in bold. Use double star ** to make text bold, like in
**this example**.
Code in examples is assumed to always start with these two lines which are not shown:
import numpy as np
import pandas as pd
Any other module used in the examples must be explicitly imported, one per line (as recommended in PEP 8),
avoiding aliases. Avoid excessive imports, but if needed, imports from the standard library go
first, followed by third-party libraries (like matplotlib).
When illustrating examples with a single Series use the name s, and if illustrating with a single DataFrame
use the name df. For indices, idx is the preferred name. If a set of homogeneous Series or DataFrame
is used, name them s1, s2, s3. . . or df1, df2, df3. . . If the data is not homogeneous, and more than one
structure is needed, name them with something meaningful, for example df_main and df_to_join.
Data used in the example should be as compact as possible. The number of rows is recommended to be around
4, but make it a number that makes sense for the specific example. For example, in the head method it needs
to be higher than 5, to show the example with the default values. If doing the mean, we could use something
like [1, 2, 3], so it is easy to see that the value returned is the mean.
For more complex examples (grouping for example), avoid using data without interpretation, like a matrix of
random numbers with columns A, B, C, D... Instead, use a meaningful example, which makes it easier to
understand the concept. Unless required by the example, use names of animals and their numerical properties,
to keep examples consistent.
When calling the method, keyword arguments head(n=3) are preferred to positional arguments head(3).
Good:
class Series:
def mean(self):
"""
Compute the mean of the input.
Examples
--------
>>> s = pd.Series([1, 2, 3])
>>> s.mean()
2
"""
pass
def fillna(self, other):
"""
Replace missing values by ``other``.
Examples
--------
>>> s = pd.Series([1, np.nan, 3])
>>> s.fillna(0)
[1, 0, 3]
"""
pass
def groupby_mean(self):
"""
Group by index and return mean.
Examples
--------
>>> s = pd.Series([380., 370., 24., 26],
... name='max_speed',
... index=['falcon', 'falcon', 'parrot', 'parrot'])
>>> s.groupby_mean()
index
falcon 375.0
parrot 25.0
Name: max_speed, dtype: float64
"""
pass
def contains(self, pattern, case_sensitive=True, na=np.nan):
"""
Return whether each value contains ``pattern``.
Examples
--------
>>> s = pd.Series(['Antelope', 'Lion', 'Zebra', np.nan])
>>> s.contains(pattern='a')
0 False
1 False
2 True
3 NaN
dtype: bool
**Case sensitivity**
**Missing values**
We can fill missing values in the output using the ``na`` parameter:
Bad:
Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(np.random.randn(3, 3),
... columns=('a', 'b', 'c'))
>>> df.method(1)
21
>>> df.method(bar=14)
123
"""
pass
Getting the examples to pass the doctests in the validation script can sometimes be tricky. Here are some points
to keep in mind:
– Import all needed libraries (except for pandas and NumPy, those are already imported as import
pandas as pd and import numpy as np) and define all variables you use in the example.
– Try to avoid using random data. However random data might be OK in some cases, like if the function you
are documenting deals with probability distributions, or if the amount of data needed to make the function
result meaningful is too much, such that creating it manually is very cumbersome. In those cases, always
use a fixed random seed to make the generated examples predictable. Example:
>>> np.random.seed(42)
>>> df = pd.DataFrame({'normal': np.random.normal(100, 5, 20)})
– If you have a code snippet that wraps multiple lines, you need to use '...' on the continued lines:
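>>> df = pd.DataFrame([[1, 2, 3], [4, 5, 6]],
...                   columns=['a', 'b', 'c'])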
– If you want to show a case where an exception is raised, you can do:
>>> pd.to_datetime(["712-01-01"])
Traceback (most recent call last):
OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 712-01-01 00:00:00
It is essential to include the “Traceback (most recent call last):”, but for the actual error only the error
name is sufficient.
– If there is a small part of the result that can vary (e.g. a hash in an object representation), you can use ...
to represent this part.
If you want to show that s.plot() returns a matplotlib AxesSubplot object, this will fail the doctest
>>> s.plot()
<matplotlib.axes._subplots.AxesSubplot at 0x7efd0c0b0690>
Instead, replace the varying part with ...:
>>> s.plot()
<matplotlib.axes._subplots.AxesSubplot at ...>
Plots in examples
There are some methods in pandas returning plots. To render the plots generated by the examples in the docu-
mentation, the .. plot:: directive exists.
To use it, place the next code after the “Examples” header as shown below. The plot will be generated automat-
ically when building the documentation.
class Series:
def plot(self):
"""
Generate a plot with the ``Series`` data.
Examples
--------
.. plot::
:context: close-figs
>>> s = pd.Series([1, 2, 3])
>>> s.plot()
"""
pass
Sharing docstrings
pandas has a system for sharing docstrings, with slight variations, between classes. This helps us keep docstrings
consistent, while keeping things clear for the user reading. It comes at the cost of some complexity when writing.
Each shared docstring will have a base template with variables, like {klass}. The variables are filled in later on
using the doc decorator. Finally, docstrings can also be appended to with the doc decorator.
In this example, we'll create a parent docstring normally (this is like pandas.core.generic.NDFrame).
Then we'll have two children (like pandas.core.series.Series and pandas.core.frame.DataFrame).
We'll substitute the class names in this docstring.
class Parent:
@doc(klass="Parent")
def my_function(self):
"""Apply my function to {klass}."""
...
class ChildA(Parent):
@doc(Parent.my_function, klass="ChildA")
def my_function(self):
...
class ChildB(Parent):
@doc(Parent.my_function, klass="ChildB")
def my_function(self):
...
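With these definitions, the resulting docstrings would read something like this (a sketch of what the doc decorator produces; the printed output is reconstructed, not taken from this guide):
>>> print(Parent.my_function.__doc__)
Apply my function to Parent.
>>> print(ChildA.my_function.__doc__)
Apply my function to ChildA.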
Notice:
1. We “append” the parent docstring to the children docstrings, which are initially empty.
Our files will often contain a module-level _shared_doc_kwargs with some common substitution values
(things like klass, axes, etc).
You can substitute and append in one shot with something like
@doc(template, **_shared_doc_kwargs)
def my_function(self):
...
where template may come from a module-level _shared_docs dictionary mapping function names to
docstrings. Wherever possible, we prefer using doc, since the docstring-writing process is slightly closer to
normal.
See pandas.core.generic.NDFrame.fillna for an example template, and pandas.core.series.Series.fillna and
pandas.core.frame.DataFrame.fillna for the filled versions.
• The tutorials make heavy use of the IPython directive sphinx extension. This directive lets you put code in the
documentation which will be run during the doc build. For example:
.. ipython:: python
x = 2
x**3
Almost all code examples in the docs are run (and the output saved) during the doc build. This approach means
that code examples will always be up to date, but it does make the doc building a bit more complex.
• Our API documentation files in doc/source/reference house the auto-generated documentation from the
docstrings. For classes, there are a few subtleties around controlling which methods and attributes have pages
auto-generated.
We have two autosummary templates for classes.
1. _templates/autosummary/class.rst. Use this when you want to automatically generate a
page for every public method and attribute on the class. The Attributes and Methods sections will
be automatically added to the class’ rendered documentation by numpydoc. See DataFrame for an
example.
2. _templates/autosummary/class_without_autosummary. Use this when you want to pick
a subset of methods / attributes to auto-generate pages for. When using this template, you should in-
clude an Attributes and Methods section in the class docstring. See CategoricalIndex for an
example.
Every method should be included in a toctree in one of the documentation files in doc/source/
reference, else Sphinx will emit a warning.
Note: The .rst files are used to automatically generate Markdown and HTML versions of the docs. For this reason,
please do not edit CONTRIBUTING.md directly, but instead make any changes to doc/source/development/
contributing.rst. Then, to generate CONTRIBUTING.md, use pandoc with the following command:
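# reconstructed example of the pandoc invocation
pandoc doc/source/development/contributing.rst -t markdown_github > CONTRIBUTING.md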
The utility script scripts/validate_docstrings.py can be used to get a csv summary of the API documentation,
and also to validate common errors in the docstring of a specific class, function or method. The summary
also compares the list of methods documented in the files in doc/source/reference (which is used to generate
the API Reference page) and the actual public methods. This will identify methods documented in doc/source/
reference that are not actually class methods, and existing methods that are not documented in doc/source/
reference.
When improving a single function or method's docstring, it is not necessary to build the full documentation
(see next section). However, there is a script that checks a docstring (for example for the DataFrame.mean method):
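# reconstructed example invocation
python scripts/validate_docstrings.py pandas.DataFrame.mean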
This script will indicate some formatting errors if present, and will also run and test the examples included in the
docstring. Check the pandas docstring guide for a detailed guide on how to format the docstring.
The examples in the docstring ('doctests') must be valid Python code that deterministically returns the presented
output, and that can be copied and run by users. This can be checked with the script above, and is also tested on Travis.
A failing doctest will be a blocker for merging a PR. Check the examples section in the docstring guide for some tips
and tricks to get the doctests passing.
When doing a PR with a docstring update, it is good to post the output of the validation script in a comment on github.
Requirements
First, you need to have a development environment to be able to build pandas (see the docs on creating a development
environment above).
So how do you build the docs? Navigate to your local doc/ directory in the console and run:
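# reconstructed example; builds the HTML docs
python make.py html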
Then you can find the HTML output in the folder doc/build/html/.
The first time you build the docs, it will take quite a while because it has to run all the code examples and build all the
generated docstring pages. In subsequent invocations, sphinx will try to only build the pages that have been modified.
If you want to do a full clean build, do:
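# reconstructed example of a full clean build
python make.py clean
python make.py html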
You can tell make.py to compile only a single section of the docs, greatly reducing the turn-around time for checking
your changes.
# compile the docs with only a single section, relative to the "source" folder.
# For example, compiling only this guide (doc/source/development/contributing.rst)
python make.py clean
python make.py --single development/contributing.rst
For comparison, a full documentation build may take 15 minutes, but a single section may take 15 seconds. Subsequent
builds, which only process portions you have changed, will be faster.
You can also specify to use multiple cores to speed up the documentation build:
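# reconstructed example; the --num-jobs option is assumed from make.py's usual interface
python make.py html --num-jobs 4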
Open the following file in a web browser to see the full documentation you just built:
doc/build/html/index.html
And you’ll have the satisfaction of seeing your new and improved documentation!
When pull requests are merged into the pandas master branch, the main parts of the documentation are also built by
Travis-CI. These docs are then hosted here, see also the Continuous Integration section.
Code Base:
• Code standards
• Pre-commit
• Optional dependencies
– C (cpplint)
– Python (PEP8 / black)
– Import formatting
– Backwards compatibility
• Type hints
– Style guidelines
– pandas-specific types
– Validating type hints
• Testing with continuous integration
• Test-driven development/code writing
– Writing tests
– Transitioning to pytest
– Using pytest
– Using hypothesis
– Testing warnings
• Running the test suite
• Running the performance test suite
• Documenting your code
Code standards
Writing good code is not just about what you write. It is also about how you write it. During Continuous Integration
testing, several tools will be run to check your code for stylistic errors. Generating any warnings will cause the test to
fail. Thus, good style is a requirement for submitting code to pandas.
There is a tool in pandas to help contributors verify their changes before contributing them to the project:
./ci/code_checks.sh
The script verifies the linting of code files, looks for common mistake patterns (like missing spaces around sphinx
directives that make the documentation not render properly) and also validates the doctests. It is possible to run
the checks independently by using the parameters lint, patterns and doctests (e.g. ./ci/code_checks.sh lint).
In addition, because a lot of people use our library, it is important that we do not make sudden changes to the code
that could break a lot of user code; that is, we need it to be as backwards compatible as
possible to avoid mass breakages.
In addition to ./ci/code_checks.sh, some extra checks are run by pre-commit - see here for how to run
them.
Additional standards are outlined on the pandas code style guide.
Pre-commit
You can run many of these styling checks manually as we have described above. However, we encourage you to use
pre-commit hooks instead to automatically run black, flake8, isort when you make a git commit. This can be
done by installing pre-commit:
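pip install pre-commit
and then running: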
pre-commit install
from the root of the pandas repository. Now all of the styling checks will be run each time you commit changes without
your needing to run each one manually. In addition, using pre-commit will also allow you to more easily remain
up-to-date with our code checks as they change.
Note that if needed, you can skip these checks with git commit --no-verify.
If you don’t want to use pre-commit as part of your workflow, you can still use it to run its checks with:
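# reconstructed example; runs the hooks only on the files you have modified
pre-commit run --files <files you have modified>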
Note: If you have conflicting installations of virtualenv, then you may get an error - see here.
Also, due to a bug in virtualenv, you may run into issues if you’re using conda. To solve this, you can downgrade
virtualenv to version 20.0.33.
Optional dependencies
Optional dependencies (e.g. matplotlib) should be imported with the private helper
pandas.compat._optional.import_optional_dependency. This ensures a consistent error message when the dependency
is not met.
All methods using an optional dependency should include a test asserting that an ImportError is raised when the
optional dependency is not found. This test should be skipped if the library is present.
All optional dependencies should be documented in Optional dependencies and the minimum required version should
be set in the pandas.compat._optional.VERSIONS dict.
C (cpplint)
pandas uses the Google standard. Google provides an open source style checker called cpplint, but we use a fork
of it that can be found here. Here are some of the more common cpplint issues:
• we restrict line-length to 80 characters to promote readability
• every header file must include a header guard to avoid name collisions if re-included
Continuous Integration will run the cpplint tool and report any stylistic errors in your code. Therefore, it is helpful
before submitting code to run the check yourself:
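# a sketch; the exact filter list pandas uses is an assumption
cpplint --extensions=c,h --headers=h --filter=-readability/casting,-runtime/int,-build/include_subdir modified-c-file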
To make your commits compliant with this standard, you can install the ClangFormat tool, which can be downloaded
here. To configure, in your home directory, run the following command:
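# reconstructed example; dumps a Google-style configuration to ~/.clang-format
clang-format -style=google -dump-config > .clang-format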
Then modify the file to ensure that any indentation width parameters are at least four. Once configured, you can run
the tool as follows:
clang-format modified-c-file
This will output what your file will look like if the changes are made, and to apply them, run the following command:
clang-format -i modified-c-file
To run the tool on an entire directory, you can run the following analogous commands:
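# illustrative sketch (the file globs are an assumption)
clang-format modified-c-directory/*.c modified-c-directory/*.h
clang-format -i modified-c-directory/*.c modified-c-directory/*.h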
Do note that this tool is best-effort, meaning that it will try to correct as many errors as possible, but it may not correct
all of them. Thus, it is recommended that you run cpplint to double check and make any other style fixes manually.
Python (PEP8 / black)
pandas follows the PEP8 standard and uses Black and Flake8 to ensure a consistent code format throughout the project.
We encourage you to use pre-commit.
Continuous Integration will run those tools and report any stylistic errors in your code. Therefore, it is helpful before
submitting code to run the check yourself:
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
The black command auto-formats your code. Additionally, many editors have plugins that will apply black as you edit files.
You should use black version 20.8b1, as previous versions are not compatible with the pandas codebase.
One caveat about git diff upstream/master -u -- "*.py" | flake8 --diff: this command will
catch any stylistic errors in your changes specifically, but beware that it may not catch all of them. For example, if
you delete the only usage of an imported function, it is stylistically incorrect to import an unused function. However,
style-checking the diff will not catch this because the actual import is not part of the diff. Thus, for completeness, you
should run this command, though it may take longer:
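# reconstructed example
git diff upstream/master --name-only -- "*.py" | xargs -r flake8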
Note that on OSX, the -r flag is not available, so you have to omit it and run this slightly modified command:
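git diff upstream/master --name-only -- "*.py" | xargs flake8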
Windows does not support the xargs command (unless installed for example via the MinGW toolchain), but one can
imitate the behaviour as follows:
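# reconstructed cmd.exe example
for /f %i in ('git diff upstream/master --name-only -- "*.py"') do flake8 %i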
This will get all the files being changed by the PR (and ending with .py), and run flake8 on them, one after the
other.
Note that these commands can be run analogously with black.
Import formatting
pandas uses isort to standardise the order of imports; running it on the files you have changed will format the imports
correctly. This will modify your local copy of the files.
Alternatively, you can run a command similar to what was suggested for black and flake8 right above:
git diff upstream/master --name-only -- "*.py" | xargs -r isort
Backwards compatibility
Please try to maintain backward compatibility. pandas has lots of users with lots of existing code, so don’t break it if
at all possible. If you think breakage is required, clearly state why as part of the pull request. Also, be careful when
changing method signatures and add deprecation warnings where needed. Also, add the deprecated sphinx directive
to the deprecated functions or methods.
If a function with the same arguments as the one being deprecated exists, you can use
pandas.util._decorators.deprecate:
from pandas.util._decorators import deprecate
old_func = deprecate('old_func', new_func, '1.1.0')
Otherwise, you need to do it manually:
import warnings
def old_func():
"""Summary of the function.
.. deprecated:: 1.1.0
Use new_func instead.
"""
warnings.warn('Use new_func instead.', FutureWarning, stacklevel=2)
new_func()
def new_func():
pass
Type hints
pandas strongly encourages the use of PEP 484 style type hints. New development should contain type hints and pull
requests to annotate existing code are accepted as well!
Style guidelines
Imports of types should follow the from typing import ... convention. So rather than
import typing
primes: typing.List[int] = []
you should write:
from typing import List, Optional
primes: List[int] = []
Optional should be used where applicable:
maybe_primes: List[Optional[int]] = []
In some cases in the code base classes may define class variables that shadow builtins. This causes an issue as described
in Mypy 1775. The defensive solution here is to create an unambiguous alias of the builtin and use that within your
annotation. For example, if you come across a definition like
class SomeClass1:
str = None
The appropriate way to annotate this would be as follows:
str_type = str
class SomeClass2:
str: str_type = None
In some cases you may be tempted to use cast from the typing module when you know better than the analyzer. This
occurs particularly when using custom inference functions. For example
from typing import cast
if is_number(obj):
...
else: # Reasonably only str objects would reach this but...
obj = cast(str, obj) # Mypy complains without this!
return obj.upper()
The limitation here is that while a human can reasonably understand that is_number would catch the int and
float types, mypy cannot make that same inference just yet (see mypy #5206). While the above works, the use of
cast is strongly discouraged. Where applicable, a refactor of the code to appease static analysis is preferable:
if isinstance(obj, str):
return obj.upper()
else:
...
With custom types and inference this is not always possible so exceptions are made, but every effort should be ex-
hausted to avoid cast before going down such paths.
pandas-specific types
Commonly used types specific to pandas will appear in pandas._typing and you should use these where applicable.
This module is private for now but ultimately this should be exposed to third party libraries that want to implement
type checking against pandas.
For example, quite a few functions in pandas accept a dtype argument. This can be expressed as a
string like "object", a numpy.dtype like np.int64 or even a pandas ExtensionDtype like
pd.CategoricalDtype. Rather than burden the user with having to constantly annotate all of those options, this
can simply be imported and reused from the pandas._typing module
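A minimal sketch, assuming the Dtype alias provided by pandas._typing (the function name as_type is purely illustrative):
from pandas._typing import Dtype

def as_type(dtype: Dtype):
    ...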
This module will ultimately house types for repeatedly used concepts like "path-like", "array-like", "numeric", etc.,
and can also hold aliases for commonly appearing parameters like axis. Development of this module is active so be
sure to refer to the source for the most up to date list of available types.
Validating type hints
pandas uses mypy to statically analyze the code base and type hints. After making any change you can ensure your
type hints are correct by running
mypy pandas
Testing with continuous integration
The pandas test suite will run automatically on Travis-CI and Azure Pipelines continuous integration services, once
your pull request is submitted. However, if you wish to run the test suite on a branch prior to submitting the pull
request, then the continuous integration services need to be hooked to your GitHub repository. Instructions are here
for Travis-CI and Azure Pipelines.
A pull-request will be considered for merging when you have an all ‘green’ build. If any tests are failing, then you
will get a red ‘X’, where you can click through to see the individual failed tests. This is an example of a green build.
Note: Each time you push to your fork, a new run of the tests will be triggered on the CI. You can enable the
auto-cancel feature, which removes any non-currently-running tests for that same pull-request, for Travis-CI here.
Test-driven development/code writing
pandas is serious about testing and strongly encourages contributors to embrace test-driven development (TDD). This
development process “relies on the repetition of a very short development cycle: first the developer writes an (initially
failing) automated test case that defines a desired improvement or new function, then produces the minimum amount
of code to pass that test.” So, before actually writing any code, you should write your tests. Often the test can be taken
from the original GitHub issue. However, it is always worth considering additional use cases and writing corresponding
tests.
Adding tests is one of the most common requests after code is pushed to pandas. Therefore, it is worth getting in the
habit of writing tests ahead of time so this is never an issue.
Like many packages, pandas uses pytest and the convenient extensions in numpy.testing.
Writing tests
All tests should go into the tests subdirectory of the specific package. This folder contains many current examples of
tests, and we suggest looking to these for inspiration. If your test requires working with files or network connectivity,
there is more information on the testing page of the wiki.
The pandas._testing module has many special assert functions that make it easier to make statements about
whether Series or DataFrame objects are equivalent. The easiest way to verify that your code is correct is to explicitly
construct the result you expect, then compare the actual result to the expected correct result:
def test_pivot(self):
data = {
'index' : ['A', 'B', 'C', 'C', 'B', 'A'],
'columns' : ['One', 'One', 'One', 'Two', 'Two', 'Two'],
'values' : [1., 2., 3., 3., 2., 1.]
}
frame = DataFrame(data)
pivoted = frame.pivot(index='index', columns='columns', values='values')
expected = DataFrame({
'One' : {'A' : 1., 'B' : 2., 'C' : 3.},
'Two' : {'A' : 1., 'B' : 2., 'C' : 3.}
})
assert_frame_equal(pivoted, expected)
Please remember to add the GitHub issue number as a comment to a new test, e.g. "# brief comment, see GH#28907".
Transitioning to pytest
pandas' existing test structure is mostly class-based, meaning that you will typically find tests wrapped in a class.
class TestReallyCoolFeature:
pass
Going forward, we are moving to a more functional style using the pytest framework, which offers a richer testing
framework that will facilitate testing and developing. Thus, instead of writing test classes, we will write test functions
like this:
def test_really_cool_feature():
pass
Using pytest
Here is an example of a self-contained set of tests that illustrate multiple features that we like to use.
• functional style: tests are like test_* and only take arguments that are either fixtures or parameters
• pytest.mark can be used to set metadata on test functions, e.g. skip or xfail.
• using parametrize: allow testing of multiple cases
• to set a mark on a parameter, pytest.param(..., marks=...) syntax should be used
• fixture, code for object construction, on a per-test basis
• using bare assert for scalars and truth-testing
• tm.assert_series_equal (and its counter part tm.assert_frame_equal), for pandas object com-
parisons.
• the typical pattern of constructing an expected and comparing versus the result
We would name this file test_cool_feature.py and put in an appropriate place in the pandas/tests/
structure.
import pytest
import numpy as np
import pandas as pd
@pytest.mark.parametrize(
'dtype', ['float32', pytest.param('int16', marks=pytest.mark.skip),
pytest.param('int32', marks=pytest.mark.xfail(
reason='to show how it works'))])
def test_mark(dtype):
assert str(np.dtype(dtype)) == 'float32'
@pytest.fixture
def series():
return pd.Series([1, 2, 3])
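A sketch of the parametrized fixture and the tests that the verbose output below refers to (reconstructed from the test names test_dtypes and test_series; the exact bodies may differ from the original example) might look like:
import pandas._testing as tm

@pytest.fixture(params=['int8', 'int16', 'int32', 'int64'])
def dtype(request):
    return request.param

def test_dtypes(dtype):
    assert str(np.dtype(dtype)) == dtype

def test_series(series, dtype):
    result = series.astype(dtype)
    assert result.dtype == dtype
    expected = pd.Series([1, 2, 3], dtype=dtype)
    tm.assert_series_equal(result, expected)
A verbose run of such a module, saved as tester.py, would then yield output like: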
tester.py::test_dtypes[int8] PASSED
tester.py::test_dtypes[int16] PASSED
tester.py::test_dtypes[int32] PASSED
tester.py::test_dtypes[int64] PASSED
tester.py::test_mark[float32] PASSED
tester.py::test_mark[int16] SKIPPED
tester.py::test_mark[int32] xfail
tester.py::test_series[int8] PASSED
tester.py::test_series[int16] PASSED
tester.py::test_series[int32] PASSED
tester.py::test_series[int64] PASSED
Tests that we have parametrized are now accessible via the test name, for example we could run these with -k
int8 to sub-select only those tests which match int8.
test_cool_feature.py::test_dtypes[int8] PASSED
test_cool_feature.py::test_series[int8] PASSED
Using hypothesis
Hypothesis is a library for property-based testing. Instead of explicitly parametrizing a test, you can describe all
valid inputs and let Hypothesis try to find a failing input. Even better, no matter how many random examples it tries,
Hypothesis always reports a single minimal counterexample to your assertions - often an example that you would
never have thought to test.
See Getting Started with Hypothesis for more of an introduction, then refer to the Hypothesis documentation for
details.
import json
from hypothesis import given, strategies as st
any_json_value = st.deferred(
    lambda: st.one_of(
        st.none(), st.booleans(), st.floats(allow_nan=False), st.text(),
        st.lists(any_json_value), st.dictionaries(st.text(), any_json_value)
    )
)
@given(value=any_json_value)
def test_json_roundtrip(value):
result = json.loads(json.dumps(value))
assert value == result
This test shows off several useful features of Hypothesis, as well as demonstrating a good use-case: checking properties
that should hold over a large or complicated domain of inputs.
To keep the pandas test suite running quickly, parametrized tests are preferred if the inputs or logic are simple, with
Hypothesis tests reserved for cases with complex logic or where there are too many combinations of options or subtle
interactions to test (or think of!) all of them.
Testing warnings
By default, one of pandas' CI workers will fail if any unhandled warnings are emitted.
If your change involves checking that a warning is actually emitted, use tm.
assert_produces_warning(ExpectedWarning).
import pandas._testing as tm
df = pd.DataFrame()
with tm.assert_produces_warning(FutureWarning):
df.some_operation()
We prefer this to the pytest.warns context manager because ours checks that the warning's stacklevel is set correctly.
The stacklevel is what ensures the user's file name and line number are printed in the warning, rather than something
internal to pandas. It represents the number of function calls from user code (e.g. df.some_operation())
to the function that actually emits the warning. Our linter will fail the build if you use pytest.warns in a test.
If you have a test that would emit a warning, but you aren’t actually testing the warning itself (say because it’s going
to be removed in the future, or because we're matching a 3rd-party library's behavior), then use
pytest.mark.filterwarnings to ignore the warning.
@pytest.mark.filterwarnings("ignore:msg:category")
def test_thing(self):
...
If the test generates a warning of class category whose message starts with msg, the warning will be ignored and
the test will pass.
If you need finer-grained control, you can use Python’s usual warnings module to control whether a warning is ignored
/ raised at different places within a single test.
with warnings.catch_warnings():
warnings.simplefilter("ignore", FutureWarning)
# Or use warnings.filterwarnings(...)
Running the test suite
The tests can then be run directly inside your Git clone (without having to install pandas) by typing:
pytest pandas
The test suite is exhaustive and takes around 20 minutes to run. Often it is worth running only a subset of tests first
around your changes before running the entire suite.
The easiest way to do this is with:
pytest pandas/tests/[test-module].py
pytest pandas/tests/[test-module].py::[TestClass]
pytest pandas/tests/[test-module].py::[TestClass]::[test_method]
Using pytest-xdist, one can speed up local testing on multicore machines. To use this feature, you will need to install
pytest-xdist via:
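pip install pytest-xdist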
Two scripts are provided to assist with this. These scripts distribute testing across 4 threads.
On Unix variants, one can type:
test_fast.sh
On Windows, one can type:
test_fast.bat
This can significantly reduce the time it takes to locally run tests before submitting a pull request.
For more, see the pytest documentation.
Furthermore, one can run
pd.test()
with an imported pandas to run tests similarly.
Running the performance test suite
Performance matters and it is worth considering whether your code has introduced performance regressions. pandas
is in the process of migrating to asv benchmarks to enable easy monitoring of the performance of critical pandas
operations. These benchmarks are all found in the pandas/asv_bench directory, and the test results can be found
here.
To use all features of asv, you will need either conda or virtualenv. For more details please check the asv
installation webpage.
To install asv:
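# reconstructed example; the guide may pin a specific source
pip install asv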
If you need to run a benchmark, change your directory to asv_bench/ and run:
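# reconstructed example comparing your branch against upstream/master
asv continuous -f 1.1 upstream/master HEAD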
You can replace HEAD with the name of the branch you are working on, and report benchmarks that changed by
more than 10%. The command uses conda by default for creating the benchmark environments. If you want to use
virtualenv instead, write:
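asv continuous -f 1.1 -E virtualenv upstream/master HEAD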
The -E virtualenv option should be added to all asv commands that run benchmarks. The default value is
defined in asv.conf.json.
Running the full benchmark suite can be an all-day process, depending on your hardware and its resource utiliza-
tion. However, usually it is sufficient to paste only a subset of the results into the pull request to show that the
committed changes do not cause unexpected performance regressions. You can run specific benchmarks using the -b
flag, which takes a regular expression. For example, this will only run benchmarks from a pandas/asv_bench/
benchmarks/groupby.py file:
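# reconstructed example
asv continuous -f 1.1 upstream/master HEAD -b ^groupby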
If you want to only run a specific group of benchmarks from a file, you can do it using . as a separator. For example:
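# reconstructed example; runs only the GroupByMethods benchmarks from groupby.py
asv continuous -f 1.1 upstream/master HEAD -b groupby.GroupByMethods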
You can also run specific benchmarks using the asv dev command, which is equivalent to asv run --quick --show-stderr --python=same. This will display stderr from the benchmarks, and use your local python that comes from your $PATH.
Information on how to write a benchmark and how to use asv can be found in the asv documentation.
Documenting your code
Changes should be reflected in the release notes located in doc/source/whatsnew/vx.y.z.rst. This file
contains an ongoing change log for each release. Add an entry to this file to document your fix, enhancement or
(unavoidable) breaking change. Make sure to include the GitHub issue number when adding your entry (using
:issue:`1234` where 1234 is the issue/pull request number).
If your code is an enhancement, it is most likely necessary to add usage examples to the existing documentation. This
can be done following the section regarding documentation above. Further, to let users know when this feature was
added, the versionadded directive is used. The sphinx syntax for that is:
.. versionadded:: 1.1.0
This will put the text New in version 1.1.0 wherever you put the sphinx directive. This should also be put in the
docstring when adding a new function or method (example) or a new keyword argument (example).
Keep style fixes to a separate commit to make your pull request more readable.
Once you’ve made changes, you can see them by typing:
git status
If you have created a new file, it is not being tracked by git. Add it by typing:
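# reconstructed example
git add path/to/file-to-be-added.py
Doing git status again should then give something like: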
# On branch shiny-new-feature
#
# modified: /relative/path/to/file-you-added.py
#
Finally, commit your changes to your local repository with an explanatory message. pandas uses a convention for
commit message prefixes and layout. Here are some common prefixes along with general guidelines for when to use
them:
• ENH: Enhancement, new functionality
• BUG: Bug fix
• DOC: Additions/updates to documentation
• TST: Additions/updates to tests
• BLD: Updates to the build process/scripts
• PERF: Performance improvement
• TYP: Type annotations
• CLN: Code cleanup
The following defines how a commit message should be structured. Please reference the relevant GitHub issues in
your commit message using GH1234 or #1234. Either style is fine, but the former is generally preferred:
• a subject line with < 80 chars.
• One blank line.
• Optionally, a commit message body.
Now you can commit your changes in your local repository:
git commit -m "your commit message goes here"
When you want your changes to appear publicly on your GitHub page, push your forked feature branch’s commits:
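git push origin shiny-new-feature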
Here origin is the default name given to your remote repository on GitHub. You can see the remote repositories:
git remote -v
If you added the upstream repository as described above you will see something like:
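origin  git@github.com:yourname/pandas.git (fetch)
origin  git@github.com:yourname/pandas.git (push)
upstream        https://github.com/pandas-dev/pandas.git (fetch)
upstream        https://github.com/pandas-dev/pandas.git (push)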
Now your code is on GitHub, but it is not yet a part of the pandas project. For that to happen, a pull request needs to
be submitted on GitHub.
When you’re ready to ask for a code review, file a pull request. Before you do, once again make sure that you have
followed all the guidelines outlined in this document regarding code style, tests, performance tests, and documentation.
You should also double check your branch changes against the branch it was based on:
1. Navigate to your repository on GitHub – https://github.com/your-user-name/pandas
2. Click on Branches
3. Click on the Compare button for your feature branch
4. Select the base and compare branches, if necessary. This will be master and shiny-new-feature,
respectively.
If everything looks good, you are ready to make a pull request. A pull request is how code from a local repository
becomes available to the GitHub community and can be looked at and eventually merged into the master version. This
pull request and its associated changes will eventually be committed to the master branch and available in the next
release. To submit a pull request:
1. Navigate to your repository on GitHub
2. Click on the Pull Request button
3. You can then click on Commits and Files Changed to make sure everything looks okay one last time
4. Write a description of your changes in the Preview Discussion tab
5. Click Send Pull Request.
This request then goes to the repository maintainers, and they will review the code.
Based on the review you get on your pull request, you will probably need to make some changes to the code. In that
case, you can make them in your branch, add a new commit to that branch, push it to GitHub, and the pull request will
be automatically updated. Pushing them to GitHub again is done by:
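git push origin shiny-new-feature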
This will automatically update your pull request with the latest code and restart the Continuous Integration tests.
Another reason you might need to update your pull request is to solve conflicts with changes that have been merged
into the master branch since you opened your pull request.
To do this, you need to “merge upstream master” in your branch:
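# reconstructed example
git checkout shiny-new-feature
git fetch upstream
git merge upstream/master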
If there are no conflicts (or they could be fixed automatically), a file with a default commit message will open, and you
can simply save and quit this file.
If there are merge conflicts, you need to solve those conflicts. See for example at https://help.github.com/articles/
resolving-a-merge-conflict-using-the-command-line/ for an explanation on how to do this. Once the conflicts are
merged and the files where the conflicts were solved are added, you can run git commit to save those fixes.
If you have uncommitted changes at the moment you want to update the branch with master, you will need to stash
them prior to updating (see the stash docs). This will effectively store your changes and they can be reapplied after
updating.
After the feature branch has been updated locally, you can now update your pull request by pushing to the branch on
GitHub:
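git push origin shiny-new-feature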
Once your feature branch is accepted into upstream, you’ll probably want to get rid of the branch. First, merge
upstream master into your branch so git knows it is safe to delete your branch:
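# reconstructed example
git fetch upstream
git checkout master
git merge upstream/master
Then you can do:
git branch -d shiny-new-feature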
Make sure you use a lower-case -d, or else git won’t warn you if your feature branch has not actually been merged.
The branch will still exist on GitHub, so to delete it there do:
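git push origin --delete shiny-new-feature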
If you have made it to the Review your code phase, one of the core contributors may take a look. Please note however
that a handful of people are responsible for reviewing all of the contributions, which can often lead to bottlenecks.
To improve the chances of your pull request being reviewed, you should:
• Reference an open issue for non-trivial changes to clarify the PR’s purpose
• Ensure you have appropriate tests. These should be the first part of any PR
• Keep your pull requests as simple as possible. Larger PRs take longer to review
• Ensure that CI is in a green state. Reviewers may not even look otherwise
• Keep updating your pull request, either by request or every few days
Table of contents:
• Patterns
– Using foo.__class__
• String formatting
– Concatenated strings
* Using f-strings
* White spaces
– Representation function (aka ‘repr()’)
• Imports (aim for absolute)
• Miscellaneous
– Reading from a url
pandas follows the PEP8 standard and uses Black and Flake8 to ensure a consistent code format throughout the project.
We encourage you to use pre-commit to automatically run black, flake8, isort, and related code checks when
you make a git commit.
4.2.1 Patterns
Using foo.__class__
pandas uses 'type(foo)' instead of 'foo.__class__', as it makes the code more readable. For example:
Good:
foo = "bar"
type(foo)
Bad:
foo = "bar"
foo.__class__
4.2.2 String formatting
Concatenated strings
Using f-strings
pandas uses f-string formatting instead of '%' and '.format()' string formatters.
The convention when using f-strings for a string that is concatenated over several lines is to prefix only the lines
containing values which need to be interpreted.
For example:
Good:
foo = "old_function"
bar = "new_function"
my_warning_message = (
f"Warning, {foo} is deprecated, "
"please use the new and way better "
f"{bar}"
)
Bad:
foo = "old_function"
bar = "new_function"
my_warning_message = (
f"Warning, {foo} is deprecated, "
f"please use the new and way better "
f"{bar}"
)
White spaces
Only put white space at the end of the previous line, so there is no whitespace at the beginning of the concatenated
string.
For example:
Good:
example_string = (
"Some long concatenated string, "
"with good placement of the "
"whitespaces"
)
Bad:
example_string = (
"Some long concatenated string,"
" with bad placement of the"
" whitespaces"
)
Representation function (aka ‘repr()’)
pandas uses ‘repr()’ instead of ‘%r’ and ‘!r’; it is only used when the value is not an obvious string.
For example:
Good:
value = str
f"Unknown received value, got: {repr(value)}"
Good:
value = str
f"Unknown received type, got: '{type(value).__name__}'"
4.2.3 Imports (aim for absolute)
In Python 3, absolute imports are recommended. With absolute imports, a statement like import string imports the
string module from the standard library rather than a string.py in the same directory. As much as possible, you should
write absolute imports that show the whole import chain from the top-level pandas package.
Explicit relative imports are also supported in Python 3, but their use is discouraged. Implicit relative imports should
never be used and are not supported in Python 3.
For example:
# preferred
import pandas.core.common as com
# not preferred
from .common import test_base
# wrong
from common import test_base
4.2.4 Miscellaneous
Reading from a url
Good:
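# Sketch of the recommended pattern for reading from a url; it assumes pandas'
# internal pandas.io.common.urlopen helper (a thin wrapper around
# urllib.request.urlopen). The URL below is only a placeholder.
from pandas.io.common import urlopen

with urlopen("https://www.example.com") as url:
    raw_text = url.read()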
4.3 pandas maintenance
This guide is for pandas’ maintainers. It may also be interesting to contributors looking to understand the pandas
development process and what steps are necessary to become a maintainer.
The main contributing guide is available at Contributing to pandas.
4.3.1 Roles
pandas uses two levels of permissions: triage and core team members.
Triage members can label and close issues and pull requests.
Core team members can label and close issues and pull requests, and can merge pull requests.
GitHub publishes the full list of permissions.
4.3.2 Tasks
pandas is largely a volunteer project, so these tasks shouldn’t be read as “expectations” of triage and maintainers.
Rather, they’re general descriptions of what it means to be a maintainer.
• Triage newly filed issues (see Issue triage)
• Review newly opened pull requests
• Respond to updates on existing issues and pull requests
• Drive discussion and decisions on stalled issues and pull requests
• Provide experience / wisdom on API design questions to ensure consistency and maintainability
• Project organization (run / attend developer meetings, represent pandas)
https://matthewrocklin.com/blog/2019/05/18/maintainer may be interesting background reading.
We’ll need a discussion from several pandas maintainers before deciding whether the proposal is in scope for
pandas.
6. Is this a usage question?
We prefer that usage questions are asked on StackOverflow with the pandas tag:
https://stackoverflow.com/questions/tagged/pandas
If it’s easy to answer, feel free to link to the relevant documentation section, let them know that in the future this
kind of question should be on StackOverflow, and close the issue.
7. What labels and milestones should I add?
Apply the relevant labels. This is a bit of an art, and comes with experience. Look at similar issues to get a feel
for how things are labeled.
If the issue is clearly defined and the fix seems relatively straightforward, label the issue as “Good first issue”.
Typically, new issues will be assigned the “Contributions welcome” milestone, unless it’s known that this issue
should be addressed in a specific release (say because it’s a large regression).
4.3.4 Closing issues
Be delicate here: many people interpret closing an issue as us saying that the conversation is over. It’s typically best
to give the reporter some time to respond or self-close their issue if it’s determined that the behavior is not a bug, or
the feature is out of scope. Sometimes reporters just go away though, and we’ll close the issue after the conversation
has died.
4.3.5 Reviewing pull requests
Anybody can review a pull request: regular contributors, triagers, or core-team members. But only core-team members
can merge pull requests when they’re ready.
Here are some things to check when reviewing a pull request.
• Tests should be in a sensible location: in the same file as closely related tests.
• New public APIs should be included somewhere in doc/source/reference/.
• New / changed API should use the versionadded or versionchanged directives in the docstring (see the sketch
after this list).
• User-facing changes should have a whatsnew in the appropriate file.
• Regression tests should reference the original GitHub issue number like # GH-1234.
• The pull request should be labeled and assigned the appropriate milestone (the next patch release for regression
fixes and small bug fixes, the next minor milestone otherwise)
• Changes should comply with our Version policy.
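For reference, the versionadded / versionchanged directives mentioned above look like this in a numpydoc-style
docstring; the method name, parameter, and version numbers here are made up purely for illustration:
def some_new_method(self, arg=None):
    """
    Brief description of this (hypothetical) method.

    .. versionadded:: 1.2.0

    Parameters
    ----------
    arg : int, optional
        Description of ``arg``.

        .. versionchanged:: 1.2.0
            ``arg`` may now be ``None``.
    """
    return arg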
4.3.6 Cleaning up old issues
Every open issue in pandas has a cost. Open issues make finding duplicates harder, and can make it harder to know
what needs to be done in pandas. That said, closing issues isn’t a goal on its own. Our goal is to make pandas the best
it can be, and that’s best done by ensuring that the quality of our open issues is high.
Occasionally, bugs are fixed but the issue isn’t linked to in the Pull Request. In these cases, comment that “This has
been fixed, but could use a test.” and label the issue as “Good first issue” and “Needs Test”.
If an older issue doesn’t follow our issue template, edit the original post to include a minimal example, the actual
output, and the expected output. Uniformity in issue reports is valuable.
If an older issue lacks a reproducible example, label it as “Needs Info” and ask the reporter to provide one (or write
one yourself if possible). If one isn’t provided reasonably soon, close it according to the policies in Closing issues.
4.3.7 Cleaning up old pull requests
Occasionally, contributors are unable to finish off a pull request. If some time has passed (two weeks, say) since the
last review requesting changes, gently ask if they’re still interested in working on this. If another two weeks or so
passes with no response, thank them for their work and close the pull request. Comment on the original issue that
“There’s a stalled PR at #1234 that may be helpful.”, and perhaps label the issue as “Good first issue” if the PR was
relatively close to being accepted.
Additionally, core-team members can push to contributors’ branches. This can be helpful for pushing an important PR
across the line, or for fixing a small merge conflict.
4.3.8 Becoming a pandas maintainer
The full process is outlined in our governance documents. In summary, we’re happy to give triage permissions to
anyone who shows interest by being helpful on the issue tracker.
The current list of core-team members is at https://github.com/pandas-dev/pandas-governance/blob/master/people.md
4.3.9 Merging pull requests
Only core team members can merge pull requests. We have a few guidelines.
1. You should typically not self-merge your own pull requests. Exceptions include things like small changes to fix
CI (e.g. pinning a package version).
2. You should not merge pull requests that have an active discussion, or pull requests that have any -1 votes from a
core maintainer. pandas operates by consensus.
3. For larger changes, it’s good to have a +1 from at least two core team members.
In addition to the items listed in Closing issues, you should verify that the pull request is assigned the correct milestone.
Pull requests merged with a patch-release milestone will typically be backported by our bot. Verify that the bot
noticed the merge (it will leave a comment within a minute typically). If a manual backport is needed please do that,
and remove the “Needs backport” label once you’ve done it manually. If you forget to assign a milestone before
tagging, you can request the bot to backport it with:
4.4 Internals
This section will provide a look into some of pandas internals. It’s primarily intended for developers of pandas itself.
4.4.1 Indexing
In pandas there are a few objects implemented which can serve as valid containers for the axis labels:
• Index: the generic “ordered set” object, an ndarray of object dtype assuming nothing about its contents. The
labels must be hashable (and likely immutable) and unique. Populates a dict of label to location in Cython to do
O(1) lookups.
• Int64Index: a version of Index highly optimized for 64-bit integer data, such as time stamps
• Float64Index: a version of Index highly optimized for 64-bit float data
• MultiIndex: the standard hierarchical index object
• DatetimeIndex: An Index object with Timestamp boxed elements (implemented internally as int64 values)
• TimedeltaIndex: An Index object with Timedelta boxed elements (implemented internally as int64 values)
• PeriodIndex: An Index object with Period elements
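As a quick, non-exhaustive illustration of these containers (the reprs in the comments are indicative of pandas 1.2
behavior):
import pandas as pd

pd.Index(["a", "b", "c"])    # Index(['a', 'b', 'c'], dtype='object')
pd.Index([1, 2, 3])          # Int64Index([1, 2, 3], dtype='int64')
pd.Index([1.5, 2.5])         # Float64Index([1.5, 2.5], dtype='float64')
pd.MultiIndex.from_tuples([("a", 1), ("a", 2)], names=["key", "num"])
pd.to_datetime(["2021-01-01", "2021-01-02"])    # DatetimeIndex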
There are functions that make the creation of a regular index easy:
• date_range: fixed frequency date range generated from a time rule or DateOffset. An ndarray of Python
datetime objects
• period_range: fixed frequency date range generated from a time rule or DateOffset. An ndarray of Period
objects, representing timespans
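For example (reprs abbreviated; the frequency strings are the usual pandas offset aliases):
import pandas as pd

pd.date_range("2021-01-01", periods=3, freq="D")
# DatetimeIndex(['2021-01-01', '2021-01-02', '2021-01-03'], dtype='datetime64[ns]', freq='D')

pd.period_range("2021-01", periods=3, freq="M")
# PeriodIndex(['2021-01', '2021-02', '2021-03'], dtype='period[M]', freq='M')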
The motivation for having an Index class in the first place was to enable different implementations of indexing.
This means that it’s possible for you, the user, to implement a custom Index subclass that may be better suited to a
particular application than the ones provided in pandas.
From an internal implementation point of view, the relevant methods that an Index must define are one or more of
the following (depending on how incompatible the new object internals are with the Index functions):
• get_loc: returns an “indexer” (an integer, or in some cases a slice object) for a label
• slice_locs: returns the “range” to slice between two labels
• get_indexer: Computes the indexing vector for reindexing / data alignment purposes. See the source /
docstrings for more on this
• get_indexer_non_unique: Computes the indexing vector for reindexing / data alignment purposes when
the index is non-unique. See the source / docstrings for more on this
• reindex: Does any pre-conversion of the input index then calls get_indexer
• union, intersection: computes the union or intersection of two Index objects
• insert: Inserts a new label into an Index, yielding a new object
• delete: Delete a label, yielding a new object
• drop: Deletes a set of labels
• take: Analogous to ndarray.take
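As a rough illustration of what a few of these return on a plain Index (the values in the comments are indicative):
import pandas as pd

idx = pd.Index(["a", "b", "c", "d"])

idx.get_loc("b")                  # 1 -- integer position of the label
idx.slice_locs("b", "d")          # (1, 4) -- positional range between two labels
idx.get_indexer(["c", "a", "z"])  # array([ 2,  0, -1]) -- -1 marks a missing label
idx.insert(1, "a2")               # Index(['a', 'a2', 'b', 'c', 'd'], dtype='object')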
MultiIndex
Internally, the MultiIndex consists of a few things: the levels, the integer codes (until version 0.24 named labels),
and the level names:
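The transcript below assumes the index was built roughly as follows (the exact construction is an assumption inferred
from the output):
import pandas as pd

index = pd.MultiIndex.from_product(
    [range(3), ["one", "two"]], names=["first", "second"]
)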
In [2]: index
Out[2]:
MultiIndex([(0, 'one'),
(0, 'two'),
(1, 'one'),
(1, 'two'),
(2, 'one'),
(2, 'two')],
names=['first', 'second'])
In [3]: index.levels
Out[3]: FrozenList([[0, 1, 2], ['one', 'two']])
In [4]: index.codes
Out[4]: FrozenList([[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])
In [5]: index.names
Out[5]: FrozenList(['first', 'second'])
You can probably guess that the codes determine which unique element is identified with that location at each layer
of the index. It’s important to note that sortedness is determined solely from the integer codes and does not check
(or care) whether the levels themselves are sorted. Fortunately, the constructors from_tuples and from_arrays
ensure that this is true, but if you compute the levels and codes yourself, please be careful.
Values
pandas extends NumPy’s type system with custom types, like Categorical or datetimes with a timezone, so we
have multiple notions of “values”. For 1-D containers (Index classes and Series) we have the following convention:
• cls._values is the “best possible” array. This could be an ndarray or ExtensionArray.
So, for example, Series[category]._values is a Categorical.
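A short illustration of the convention (._values is internal and shown here only to make the convention concrete):
import pandas as pd

ser = pd.Series(["a", "b", "a"], dtype="category")
type(ser._values)   # <class 'pandas.core.arrays.categorical.Categorical'>

idx = pd.Index([1, 2, 3])
type(idx._values)   # <class 'numpy.ndarray'>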
Ideally, there should be one, and only one, obvious place for a test to reside. Until we reach that ideal, these are some
rules of thumb for where a test should be located.
1. Does your test depend only on code in pd._libs.tslibs? This test likely belongs in one of:
• tests.tslibs
Note: No file in tests.tslibs should import from any pandas modules outside of pd._libs.tslibs
• tests.scalar
• tests.tseries.offsets
2. Does your test depend only on code in pd._libs? This test likely belongs in one of:
• tests.libs
• tests.groupby.test_libgroupby
3. Is your test for an arithmetic or comparison method? This test likely belongs in one of:
• tests.arithmetic
Note: These are intended for tests that can be shared to test the behavior of
DataFrame/Series/Index/ExtensionArray using the box_with_array fixture.
• tests.frame.test_arithmetic
• tests.series.test_arithmetic
4. Is your test for a reduction method (min, max, sum, prod, . . . )? This test likely belongs in one of:
• tests.reductions
Note: These are intended for tests that can be shared to test the behavior of
DataFrame/Series/Index/ExtensionArray.
• tests.frame.test_reductions
• tests.series.test_reductions
• tests.test_nanops
5. Is your test for an indexing method? This is the most difficult case for deciding where a test belongs, because
there are many of these tests, and many of them test more than one method (e.g. both Series.__getitem__
and Series.loc.__getitem__)
A) Is the test specifically testing an Index method (e.g. Index.get_loc, Index.get_indexer)? This
test likely belongs in one of:
• tests.indexes.test_indexing
• tests.indexes.fooindex.test_indexing
Within those files there should be a method-specific test class, e.g. TestGetLoc (see the sketch below).
In most cases, neither Series nor DataFrame objects should be needed in these tests.
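A sketch of that layout (the module path and test body are illustrative, not taken verbatim from the test suite):
# e.g. pandas/tests/indexes/test_indexing.py

import pandas as pd


class TestGetLoc:
    def test_get_loc_existing_label(self):
        index = pd.Index(["a", "b", "c"])
        assert index.get_loc("b") == 1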
B) Is the test for a Series or DataFrame indexing method other than __getitem__ or __setitem__, e.g.
xs, where, take, mask, lookup, or insert? This test likely belongs in one of:
• tests.frame.indexing.test_methodname
• tests.series.indexing.test_methodname
C) Is the test for any of loc, iloc, at, or iat? This test likely belongs in one of:
• tests.indexing.test_loc
• tests.indexing.test_iloc
• tests.indexing.test_at
• tests.indexing.test_iat
Within the appropriate file, test classes correspond to either types of indexers (e.g.
TestLocBooleanMask) or major use cases (e.g. TestLocSetitemWithExpansion).
See the note in section D) about tests that test multiple