Learning pandas - Second Edition
About this ebook
- Get comfortable using pandas and Python as an effective data exploration and analysis tool
- Explore pandas through a framework of data analysis, with an explanation of how pandas is well suited for the various stages in a data analysis process
- A comprehensive guide to pandas with many clear and practical examples to help you get up and running with pandas
This book is ideal for data scientists, data analysts, Python programmers who want to plunge into data analysis using pandas, and anyone with a curiosity about analyzing data. Some knowledge of statistics and programming will be helpful to get the most out of this book but not strictly required. Prior exposure to pandas is also not required.
Learning pandas
Second Edition
High-performance data manipulation and analysis in Python
Michael Heydt
BIRMINGHAM - MUMBAI
Learning pandas
Second Edition
Copyright © 2017 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the author, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.
First published: April 2015
Second edition: June 2017
Production reference: 1300617
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.
ISBN 978-1-78712-313-7
www.packtpub.com
Credits
About the Author
Michael Heydt is a technologist, entrepreneur, and educator with decades of professional software development and financial and commodities trading experience. He has worked extensively on Wall Street, specializing in the development of distributed, actor-based, high-performance, and high-availability trading systems. He is currently the founder of Micro Trading Services, a company that focuses on creating cloud and microservice-based software solutions for finance and commodities trading. He holds a master of science in mathematics and computer science from Drexel University, and an executive master's of technology management from the University of Pennsylvania School of Applied Science and the Wharton School of Business.
I would really like to thank the team at Packt for continuously pushing me to create and revise this and my other books. I would also like to greatly thank my family for putting up with me disappearing for months on end during my sparse free time to indulge in creating this content. They are my true inspiration.
About the Reviewers
Sonali Dayal is a freelance data scientist in the San Francisco Bay Area. Her work on building analytical models and data pipelines influences major product and financial decisions for clients. Previously, she has worked as a freelance software and data science engineer for early stage startups, where she built supervised and unsupervised machine learning models, as well as interactive data analytics dashboards. She received her BS in biochemistry from Virginia Tech in 2011.
I'd like to thank the team at Packt for the opportunity to review this book and their support throughout the process.
Nicola Rainiero is a civil geotechnical engineer with a background in the construction industry as a self-employed design engineer. He also specializes in renewable energy and has collaborated with the Sant Anna University of Pisa on two European projects, REGEOCITIES and PRISCA, using qualitative and quantitative data analysis techniques.
He aims to simplify his work with open software, both by using existing tools and by developing new ones, sometimes with good results and sometimes with less good ones.
A special thanks to Packt Publishing for this opportunity to participate in the review of this book. I thank my family, especially my parents, for their physical and moral support.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com. Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy. Get in touch with us at service@packtpub.com for more details. At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.
https://www.packtpub.com/mapt
Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial process. To help us improve, please leave us an honest review on this book's Amazon page at https://www.amazon.com/dp/1787123138.
If you'd like to join our team of regular reviewers, you can e-mail us at customerreviews@packtpub.com. We award our regular reviewers with free eBooks and videos in exchange for their valuable feedback. Help us be relentless in improving our products!
Table of Contents
Preface
What this book covers
What you need for this book
Who this book is for
Conventions
Reader feedback
Customer support
Downloading the example code
Errata
Piracy
Questions
pandas and Data Analysis
Introducing pandas
Data manipulation, analysis, science, and pandas
Data manipulation
Data analysis
Data science
Where does pandas fit?
The process of data analysis
The process
Ideation
Retrieval
Preparation
Exploration
Modeling
Presentation
Reproduction
A note on being iterative and agile
Relating the book to the process
Concepts of data and analysis in our tour of pandas
Types of data
Structured
Unstructured
Semi-structured
Variables
Categorical
Continuous
Discrete
Time series data
General concepts of analysis and statistics
Quantitative versus qualitative data/analysis
Single and multivariate analysis
Descriptive statistics
Inferential statistics
Stochastic models
Probability and Bayesian statistics
Correlation
Regression
Other Python libraries of value with pandas
Numeric and scientific computing - NumPy and SciPy
Statistical analysis – StatsModels
Machine learning – scikit-learn
PyMC - stochastic Bayesian modeling
Data visualization - matplotlib and seaborn
Matplotlib
Seaborn
Summary
Up and Running with pandas
Installation of Anaconda
IPython and Jupyter Notebook
IPython
Jupyter Notebook
Introducing the pandas Series and DataFrame
Importing pandas
The pandas Series
The pandas DataFrame
Loading data from files into a DataFrame
Visualization
Summary
Representing Univariate Data with the Series
Configuring pandas
Creating a Series
Creating a Series using Python lists and dictionaries
Creation using NumPy functions
Creation using a scalar value
The .index and .values properties
The size and shape of a Series
Specifying an index at creation
Heads, tails, and takes
Retrieving values in a Series by label or position
Lookup by label using the [] operator and the .ix[] property
Explicit lookup by position with .iloc[]
Explicit lookup by labels with .loc[]
Slicing a Series into subsets
Alignment via index labels
Performing Boolean selection
Re-indexing a Series
Modifying a Series in-place
Summary
Representing Tabular and Multivariate Data with the DataFrame
Configuring pandas
Creating DataFrame objects
Creating a DataFrame using NumPy function results
Creating a DataFrame using a Python dictionary and pandas Series objects
Creating a DataFrame from a CSV file
Accessing data within a DataFrame
Selecting the columns of a DataFrame
Selecting rows of a DataFrame
Scalar lookup by label or location using .at[] and .iat[]
Slicing using the [ ] operator
Selecting rows using Boolean selection
Selecting across both rows and columns
Summary
Manipulating DataFrame Structure
Configuring pandas
Renaming columns
Adding new columns with [] and .insert()
Adding columns through enlargement
Adding columns using concatenation
Reordering columns
Replacing the contents of a column
Deleting columns
Appending new rows
Concatenating rows
Adding and replacing rows via enlargement
Removing rows using .drop()
Removing rows using Boolean selection
Removing rows using a slice
Summary
Indexing Data
Configuring pandas
The importance of indexes
The pandas index types
The fundamental type - Index
Integer index labels using Int64Index and RangeIndex
Floating-point labels using Float64Index
Representing discrete intervals using IntervalIndex
Categorical values as an index - CategoricalIndex
Indexing by date and time using DatetimeIndex
Indexing periods of time using PeriodIndex
Working with Indexes
Creating and using an index with a Series or DataFrame
Selecting values using an index
Moving data to and from the index
Reindexing a pandas object
Hierarchical indexing
Summary
Categorical Data
Configuring pandas
Creating Categoricals
Renaming categories
Appending new categories
Removing categories
Removing unused categories
Setting categories
Descriptive information of a Categorical
Munging school grades
Summary
Numerical and Statistical Methods
Configuring pandas
Performing numerical methods on pandas objects
Performing arithmetic on a DataFrame or Series
Getting the counts of values
Determining unique values (and their counts)
Finding minimum and maximum values
Locating the n-smallest and n-largest values
Calculating accumulated values
Performing statistical processes on pandas objects
Retrieving summary descriptive statistics
Measuring central tendency: mean, median, and mode
Calculating the mean
Finding the median
Determining the mode
Calculating variance and standard deviation
Measuring variance
Finding the standard deviation
Determining covariance and correlation
Calculating covariance
Determining correlation
Performing discretization and quantiling of data
Calculating the rank of values
Calculating the percent change at each sample of a series
Performing moving-window operations
Executing random sampling of data
Summary
Accessing Data
Configuring pandas
Working with CSV and text/tabular format data
Examining the sample CSV data set
Reading a CSV file into a DataFrame
Specifying the index column when reading a CSV file
Data type inference and specification
Specifying column names
Specifying specific columns to load
Saving DataFrame to a CSV file
Working with general field-delimited data
Handling variants of formats in field-delimited data
Reading and writing data in Excel format
Reading and writing JSON files
Reading HTML data from the web
Reading and writing HDF5 format files
Accessing CSV data on the web
Reading and writing from/to SQL databases
Reading data from remote data services
Reading stock data from Yahoo! and Google Finance
Retrieving options data from Google Finance
Reading economic data from the Federal Reserve Bank of St. Louis
Accessing Kenneth French's data
Reading from the World Bank
Summary
Tidying Up Your Data
Configuring pandas
What is tidying your data?
How to work with missing data
Determining NaN values in pandas objects
Selecting out or dropping missing data
Handling of NaN values in mathematical operations
Filling in missing data
Forward and backward filling of missing values
Filling using index labels
Performing interpolation of missing values
Handling duplicate data
Transforming data
Mapping data into different values
Replacing values
Applying functions to transform data
Summary
Combining, Relating, and Reshaping Data
Configuring pandas
Concatenating data in multiple objects
Understanding the default semantics of concatenation
Switching axes of alignment
Specifying join type
Appending versus concatenation
Ignoring the index labels
Merging and joining data
Merging data from multiple pandas objects
Specifying the join semantics of a merge operation
Pivoting data to and from value and indexes
Stacking and unstacking
Stacking using non-hierarchical indexes
Unstacking using hierarchical indexes
Melting data to and from long and wide format
Performance benefits of stacked data
Summary
Data Aggregation
Configuring pandas
The split, apply, and combine (SAC) pattern
Data for the examples
Splitting data
Grouping by a single column's values
Accessing the results of a grouping
Grouping using multiple columns
Grouping using index levels
Applying aggregate functions, transforms, and filters
Applying aggregation functions to groups
Transforming groups of data
The general process of transformation
Filling missing values with the mean of the group
Calculating normalized z-scores with a transformation
Filtering groups from aggregation
Summary
Time-Series Modelling
Setting up the IPython notebook
Representation of dates, time, and intervals
The datetime, day, and time objects
Representing a point in time with a Timestamp
Using a Timedelta to represent a time interval
Introducing time-series data
Indexing using DatetimeIndex
Creating time-series with specific frequencies
Calculating new dates using offsets
Representing data intervals with date offsets
Anchored offsets
Representing durations of time using Period
Modelling an interval of time with a Period
Indexing using the PeriodIndex
Handling holidays using calendars
Normalizing timestamps using time zones
Manipulating time-series data
Shifting and lagging
Performing frequency conversion on a time-series
Up and down resampling of a time-series
Time-series moving-window operations
Summary
Visualization
Configuring pandas
Plotting basics with pandas
Creating time-series charts
Adorning and styling your time-series plot
Adding a title and changing axes labels
Specifying the legend content and position
Specifying line colors, styles, thickness, and markers
Specifying tick mark locations and tick labels
Formatting axes' tick date labels using formatters
Common plots used in statistical analyses
Showing relative differences with bar plots
Picturing distributions of data with histograms
Depicting distributions of categorical data with box and whisker charts
Demonstrating cumulative totals with area plots
Relationships between two variables with scatter plots
Estimates of distribution with the kernel density plot
Correlations between multiple variables with the scatter plot matrix
Strengths of relationships in multiple variables with heatmaps
Manually rendering multiple plots in a single chart
Summary
Historical Stock Price Analysis
Setting up the IPython notebook
Obtaining and organizing stock data from Google
Plotting time-series prices
Plotting volume-series data
Calculating the simple daily percentage change in closing price
Calculating simple daily cumulative returns of a stock
Resampling data from daily to monthly returns
Analyzing distribution of returns
Performing a moving-average calculation
Comparison of average daily returns across stocks
Correlation of stocks based on the daily percentage change of the closing price
Calculating the volatility of stocks
Determining risk relative to expected returns
Summary
Preface
pandas is a popular Python package used for practical, real-world data analysis. It provides fast, high-performance data structures that make data exploration and analysis much easier. This learner's guide walks you through a comprehensive set of features provided by the pandas library for efficient data manipulation and analysis.
What this book covers
Chapter 1, pandas and Data Analysis, is a hands-on introduction to the key features of pandas. The aim of this chapter is to provide context for using pandas in statistics and data science. The chapter covers several concepts in data science and shows how pandas supports them. This sets the context for each of the subsequent chapters by describing how each one relates to both data science and the data analysis process.
Chapter 2, Up and Running with pandas, instructs the reader on how to obtain and install pandas, and introduces a few of the basic concepts in pandas. We will also look at how the examples are presented using IPython and Jupyter Notebook.
Chapter 3, Representing Univariate Data with the Series, walks the reader through the use of the pandas Series, which provides one-dimensional, indexed data representations. The reader will learn how to create Series objects and how to manipulate the data held within them. They will also learn about indexes and alignment of data, and about how the Series can be used to slice data.
Chapter 4, Representing Tabular and Multivariate Data with the DataFrame, walks the reader through the basic use of the pandas DataFrame, which provides indexed, multivariate data representations. This chapter shows the reader how to create DataFrame objects from various sets of static data, and how to select specific columns and rows within them. Complex queries, manipulation, and indexing are handled in the following chapters.
Chapter 5, Manipulating DataFrame Structure, expands on the previous chapter and instructs you on how to perform more complex manipulations of a DataFrame. We start by learning how to rename, add, reorder, replace, and delete columns, and then how to append, concatenate, replace, and remove rows.
Chapter 6, Indexing Data, explains why indexes matter and surveys the index types that pandas provides, including integer, floating-point, interval, categorical, date-time, and period indexes. The chapter also covers creating and using an index with a Series or DataFrame, selecting values through an index, moving data to and from the index, reindexing, and hierarchical indexing.
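Chapter 7, Categorical Data, instructs the reader on how to represent categorical variables with the pandas Categorical. It covers creating Categoricals; renaming, appending, removing, and setting categories; and retrieving descriptive information about a Categorical. It closes with an example of munging school grades.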
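Chapter 8, Numerical and Statistical Methods, covers numerical operations on pandas objects, such as arithmetic, counting values, finding minimum and maximum values, and calculating accumulated values. It then demonstrates statistical processes, including summary descriptive statistics, measures of central tendency, variance and standard deviation, covariance and correlation, discretization and quantiling, ranking, percent change, moving-window operations, and random sampling.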
Chapter 9, Accessing Data, shows how data can be loaded from external sources into both Series and DataFrame objects and saved back out again. The chapter covers data access from multiple sources such as files, HTTP servers, database systems, and web services. Also covered is the processing of data in CSV, Excel, JSON, HTML, and HDF5 formats.
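Chapter 10, Tidying Up Your Data, explains how to organize data into a tidy form that is usable for data analysis. It instructs the reader on how to use the tools provided by pandas for managing missing and duplicate data and for transforming values.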
Chapter 11, Combining, Relating, and Reshaping Data, shows readers how to take data held in multiple pandas objects and combine it through concatenation, merges, and joins. It then demonstrates how to reshape data using concepts such as pivots, stacking, and melting.
Chapter 12, Data Aggregation, talks about grouping and performing aggregate data analysis. In pandas, this is often referred to as the split-apply-combine pattern. The reader will learn to use this pattern to group data in various configurations, to apply aggregate functions that calculate results for each group of data, and to transform and filter groups.
Chapter 13, Time-Series Modeling, covers representing time series data in pandas. This chapter will cover the extensive capabilities provided by pandas for facilitating analysis of time series data.
Chapter 14, Visualization, teaches you how to create data visualizations based upon data stored in pandas data structures. We start with the basics, learning how to create a simple chart from data and control several of the attributes of the chart (such as legends, labels, and colors). We examine the creation of several common types of plot used to represent different types of data, and use those plot types to convey meaning in the underlying data. We also learn how to integrate pandas with D3.js so that we can create rich web-based visualizations.
Chapter 15, Historical Stock Price Analysis, shows you how to apply pandas to basic financial problems. It focuses on stock data obtained from Google Finance and demonstrates a number of financial concepts, such as calculating returns, moving averages, and volatility. The reader will also learn how to apply data visualization to these financial concepts.
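Three of the ideas named in the chapter summaries above (alignment of Series data by index label, selection of rows in a DataFrame, and the split-apply-combine pattern) can be previewed in a few lines. The following is only a minimal sketch, not an example taken from the book: the data, the column names sensor and reading, and the variable names are all made up for illustration, and it assumes pandas is already installed.

```python
import pandas as pd

# Two Series whose index labels only partly overlap (illustrative data).
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

# Arithmetic aligns values by index label, not by position.
# Labels found in only one Series produce NaN in the result.
aligned_sum = s1 + s2

# A small DataFrame built from a dictionary of columns (hypothetical names).
df = pd.DataFrame({
    "sensor": ["x", "x", "y", "y"],
    "reading": [1.0, 3.0, 10.0, 30.0],
})

# Boolean selection: keep only the rows whose reading exceeds 2.
high = df[df["reading"] > 2]

# Split-apply-combine: split rows by sensor, apply the mean to each group,
# and combine the per-group results into a single Series.
mean_by_sensor = df.groupby("sensor")["reading"].mean()

print(aligned_sum)
print(high)
print(mean_by_sensor)
```

Labels a and d appear in only one of the two Series, so their sums come out as NaN, while b and c are added label by label. The grouped mean returns one value per sensor.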
What you need for this book
This book assumes some familiarity with programming concepts, but those without programming experience, or specifically Python programming experience, will be comfortable with the examples, as they focus on pandas constructs more than on Python or programming. The examples are based on Anaconda Python 2.7 and pandas 0.15.1. If you do not have either installed, Chapter 2, Up and Running with pandas, gives guidance on installing both on Windows, OS X, and Ubuntu systems. For those not interested in installing any software, instruction is also given on using the Wakari.io online Python data analysis service.
Who this book is for
This book is ideal for data scientists, data analysts, and Python programmers who want to plunge into data analysis using pandas, and anyone curious about analyzing data. Some knowledge of statistics and programming will help you