
LEARN DATA SCIENCE ONLINE

Start Learning For Free - www.dataquest.io

Data Science Cheat Sheet


Pandas

KEY IMPORTS
Import these to start:
import pandas as pd
import numpy as np

We'll use shorthand in this cheat sheet:
df - A pandas DataFrame object
s - A pandas Series object

IMPORTING DATA
pd.read_csv(filename) - From a CSV file
pd.read_table(filename) - From a delimited text file (like TSV)
pd.read_excel(filename) - From an Excel file
pd.read_sql(query, connection_object) - Read from a SQL table/database
pd.read_json(json_string) - Read from a JSON formatted string, URL or file
pd.read_html(url) - Parses an HTML URL, string or file and extracts tables to a list of DataFrames
pd.read_clipboard() - Takes the contents of your clipboard and passes it to read_table()
pd.DataFrame(dict) - From a dict; keys for column names, values for data as lists
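
For example, a minimal import sketch ('sales.csv' and the dict contents are hypothetical placeholders, not part of the original sheet):

    import pandas as pd

    # From a CSV file ('sales.csv' is a hypothetical path)
    df = pd.read_csv('sales.csv')

    # From a dict: keys become column names, list values become column data
    df = pd.DataFrame({'city': ['Oslo', 'Lima'], 'pop': [0.7, 10.9]})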

EXPORTING DATA
df.to_csv(filename) - Write to a CSV file
df.to_excel(filename) - Write to an Excel file
df.to_sql(table_name, connection_object) - Write to a SQL table
df.to_json(filename) - Write to a file in JSON format
df.to_html(filename) - Save as an HTML table
df.to_clipboard() - Write to the clipboard
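
A quick round-trip sketch; the file names are placeholders, and index=False is an optional argument that omits the row index from the output:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
    df.to_csv('out.csv', index=False)  # index=False drops the row index column
    df.to_json('out.json')             # same frame, JSON format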

CREATE TEST OBJECTS
Useful for testing
pd.DataFrame(np.random.rand(20,5)) - 5 columns and 20 rows of random floats
pd.Series(my_list) - Create a series from an iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0]) - Add a date index
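
Putting the three entries together (shapes and dates follow the entries above; the series values are arbitrary):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(20, 5))  # 20 rows x 5 columns of floats
    df.index = pd.date_range('1900/1/30', periods=df.shape[0])  # daily date index
    s = pd.Series([2, 4, 8])                  # series from an iterable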

VIEWING/INSPECTING DATA
df.head(n) - First n rows of the DataFrame
df.tail(n) - Last n rows of the DataFrame
df.shape - Number of rows and columns (an attribute, not a method)
df.info() - Index, datatype and memory information
df.describe() - Summary statistics for numerical columns
s.value_counts(dropna=False) - View unique values and counts
df.apply(pd.Series.value_counts) - Unique values and counts for all columns
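
A short inspection sketch, reusing the random test frame from the previous section; note that shape takes no parentheses:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame(np.random.rand(20, 5))
    df.shape       # (20, 5); an attribute, so no parentheses
    df.head(3)     # first three rows
    df.info()      # index, dtypes, memory usage
    df.describe()  # count/mean/std/min/quartiles/max per column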

SELECTION
df[col] - Return column with label col as Series
df[[col1, col2]] - Return columns as a new DataFrame
s.iloc[0] - Selection by position
s.loc[0] - Selection by index label
df.iloc[0,:] - First row
df.iloc[0,0] - First element of first column
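
An illustrative selection sketch on a hypothetical two-column frame:

    import pandas as pd

    df = pd.DataFrame({'name': ['Ann', 'Bo', 'Cy'], 'age': [34, 29, 41]})
    df['name']           # one column as a Series
    df[['name', 'age']]  # list of labels -> new DataFrame
    df.iloc[0, :]        # first row, by position
    df.iloc[0, 0]        # 'Ann': first element of first column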

DATA CLEANING
df.columns = ['a','b','c'] - Rename columns
pd.isnull() - Checks for null values, returns Boolean array
pd.notnull() - Opposite of pd.isnull()
df.dropna() - Drop all rows that contain null values
df.dropna(axis=1) - Drop all columns that contain null values
df.dropna(axis=1,thresh=n) - Drop all columns that have fewer than n non-null values
df.fillna(x) - Replace all null values with x
s.fillna(s.mean()) - Replace all null values with the mean (mean can be replaced with almost any function from the statistics section)
s.astype(float) - Convert the datatype of the series to float
s.replace(1,'one') - Replace all values equal to 1 with 'one'
s.replace([1,3],['one','three']) - Replace all 1 with 'one' and 3 with 'three'
df.rename(columns=lambda x: x + 1) - Mass renaming of columns
df.rename(columns={'old_name': 'new_name'}) - Selective renaming
df.set_index('column_one') - Change the index
df.rename(index=lambda x: x + 1) - Mass renaming of the index
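
A small cleaning sketch on a hypothetical frame with missing values, showing how thresh interacts with axis=1:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, np.nan, 6.0]})
    pd.isnull(df)                # Boolean frame marking null positions
    df.dropna()                  # keeps only the fully non-null last row
    df.dropna(axis=1, thresh=2)  # keeps 'a' (2 non-null values), drops 'b'
    df.fillna(df.mean())         # fill nulls with each column's mean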

FILTER, SORT, & GROUPBY
df[df[col] > 0.5] - Rows where the col column is greater than 0.5
df[(df[col] > 0.5) & (df[col] < 0.7)] - Rows where 0.7 > col > 0.5
df.sort_values(col1) - Sort values by col1 in ascending order
df.sort_values(col2,ascending=False) - Sort values by col2 in descending order
df.sort_values([col1,col2],ascending=[True,False]) - Sort values by col1 in ascending order, then col2 in descending order
df.groupby(col) - Return a groupby object for values from one column
df.groupby([col1,col2]) - Return a groupby object for values from multiple columns
df.groupby(col1)[col2].mean() - Return the mean of the values in col2, grouped by the values in col1 (mean can be replaced with almost any function from the statistics section)
df.pivot_table(index=col1,values=[col2,col3],aggfunc=max) - Create a pivot table that groups by col1 and calculates the max of col2 and col3
df.groupby(col1).agg(np.mean) - Find the average across all columns for every unique col1 group
df.apply(np.mean) - Apply a function across each column
df.apply(np.max, axis=1) - Apply a function across each row
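
A grouping sketch on a hypothetical team/score frame:

    import pandas as pd

    df = pd.DataFrame({'team': ['x', 'x', 'y'], 'score': [0.6, 0.9, 0.4]})
    df[df['score'] > 0.5]               # boolean-mask filtering
    df.sort_values(['team', 'score'],
                   ascending=[True, False])  # multi-column sort
    df.groupby('team')['score'].mean()  # x: 0.75, y: 0.40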

JOIN/COMBINE
df1.append(df2) - Add the rows of df2 to the end of df1 (columns should be identical)
pd.concat([df1, df2],axis=1) - Add the columns of df2 to the end of df1 (rows should be identical)
df1.join(df2,on=col1,how='inner') - SQL-style join of the columns in df1 with the columns of df2, matching rows on col1; how can be one of 'left', 'right', 'outer', 'inner'
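
Two caveats worth noting: DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, with pd.concat as the usual replacement, and df1.join(df2, on=col1) matches col1 against df2's index, so merge is often the simpler choice for column-to-column joins. A sketch with hypothetical frames:

    import pandas as pd

    df1 = pd.DataFrame({'key': [1, 2], 'a': ['x', 'y']})
    df2 = pd.DataFrame({'key': [1, 2], 'b': ['u', 'v']})
    pd.concat([df1, df2])                  # stack rows (replacement for append)
    pd.concat([df1, df2], axis=1)          # columns side by side
    df1.merge(df2, on='key', how='inner')  # SQL-style join on the key column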

STATISTICS
These can all be applied to a series as well.
df.describe() - Summary statistics for numerical columns
df.mean() - Return the mean of all columns
df.corr() - Return the correlation between columns in a DataFrame
df.count() - Return the number of non-null values in each DataFrame column
df.max() - Return the highest value in each column
df.min() - Return the lowest value in each column
df.median() - Return the median of each column
df.std() - Return the standard deviation of each column
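
Applied to a hypothetical numeric frame:

    import pandas as pd

    df = pd.DataFrame({'a': [1, 2, 3], 'b': [2, 4, 7]})
    df.mean()   # column means: a 2.0, b 4.33...
    df.corr()   # pairwise correlation matrix
    df.count()  # non-null count per column
    df.std()    # sample standard deviation per column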
