0% found this document useful (0 votes)

34 views1 page

Customer Data Outliers Pyspark

The document discusses outlier detection using PySpark. It shows how to load a dataset from Google Drive, define functions to identify outliers, and apply the outlier detection to identify outliers in the dataset. Key steps include converting columns to numeric types, calculating quartiles, and flagging outliers outside the inter-quartile range.

Uploaded by

lilo cireneu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

34 views1 page

Customer Data Outliers Pyspark

Uploaded by

lilo cireneu

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 1

Outlier Detection using PYSPARK

Installing the PYSPARK module

In [ ]:
!pip install pyspark

Requirement already satisfied: pyspark in /usr/local/lib/python3.7/dist-packages (3.2.0)

Requirement already satisfied: py4j==0.10.9.2 in /usr/local/lib/python3.7/dist-packages (from pyspark) (0.10.9.2)

Mounting the google drive datasets

In [ ]:
from google.colab import drive
drive.mount('/content/datasets')

Drive already mounted at /content/datasets; to attempt to forcibly remount, call drive.mount("/content/datasets", force_
remount=True).

In [ ]:
# This code creates a filterable and interactive data table like excel for pandas dataframes
from google.colab import data_table
data_table.enable_dataframe_formatter()

Importing the necessary libraries

In [ ]:
import seaborn as sns
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession

Creating a Spark Session

In [ ]:
spark = SparkSession.builder.appName('OUTLIER REMOVAL').getOrCreate()
spark

Out[ ]: SparkSession - in-memory

SparkContext

Spark UI

Version v3.2.0
Master local[*]
AppName ucla_project

Reading the dataset from Google Drive using spark session

In [ ]:
df = spark.read.csv('/content/datasets/MyDrive/Datasets/UCLA_Module2_FinalProject/customer_data.csv',header = True)
df.show()

+-------+------+-----+-----+-------+------+----------------+------------+
|Channel|Region|Fresh| Milk|Grocery|Frozen|Detergents_Paper|Delicatessen|
+-------+------+-----+-----+-------+------+----------------+------------+
| 2| 3|12669| 9656| 7561| 214| 2674| 1338|
| 2| 3| 7057| 9810| 9568| 1762| 3293| 1776|
| 2| 3| 6353| 8808| 7684| 2405| 3516| 7844|
| 1| 3|13265| 1196| 4221| 6404| 507| 1788|
| 2| 3|22615| 5410| 7198| 3915| 1777| 5185|
| 2| 3| 9413| 8259| 5126| 666| 1795| 1451|
| 2| 3|12126| 3199| 6975| 480| 3140| 545|
| 2| 3| 7579| 4956| 9426| 1669| 3321| 2566|
| 1| 3| 5963| 3648| 6192| 425| 1716| 750|
| 2| 3| 6006|11093| 18881| 1159| 7425| 2098|
| 2| 3| 3366| 5403| 12974| 4400| 5977| 1744|
| 2| 3|13146| 1124| 4523| 1420| 549| 497|
| 2| 3|31714|12319| 11757| 287| 3881| 2931|
| 2| 3|21217| 6208| 14982| 3095| 6707| 602|
| 2| 3|24653| 9465| 12091| 294| 5058| 2168|
| 1| 3|10253| 1114| 3821| 397| 964| 412|
| 2| 3| 1020| 8816| 12121| 134| 4508| 1080|
| 1| 3| 5876| 6157| 2933| 839| 370| 4478|
| 2| 3|18601| 6327| 10099| 2205| 2767| 3181|
| 1| 3| 7780| 2495| 9464| 669| 2518| 501|
+-------+------+-----+-----+-------+------+----------------+------------+
only showing top 20 rows

Counting the number of records

In [ ]:
df.count()

Out[ ]: 440

Checking the number of columns

In [ ]:
len(df.columns)

Out[ ]: 8

Printing the schema, to check the datatypes

1. All the features are of string data type

2. The columns Channel & Region are categorical features
3. Other columns are to be converted into numerical data types

In [ ]:
df.printSchema()

Segregating the desired columns to convert the data type ino IntegerType
In [ ]:
numeric_cols = df.columns[2:]
numeric_cols

Out[ ]: ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']

In [ ]:
from pyspark.sql import functions as f
from pyspark.sql.types import IntegerType

In [ ]:
for column in numeric_cols:
df = df.withColumn(column,f.col(column).cast(IntegerType()))

Checking the schema after converting the numerical columns into IntegerType
In [ ]:
df.printSchema()

Creating a variable to store all the numerical columns into a list

In [ ]:
numeric_columns = [column[0] for column in df.dtypes if column[1]=='int']
numeric_columns

Out[ ]: ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']

Creating a customized function

In [ ]:
def find_outliers(df):

# Identifying the numerical columns in a spark dataframe

numeric_columns = [column[0] for column in df.dtypes if column[1]=='int']

# Using the `for` loop to create new columns by identifying the outliers for each feature
for column in numeric_cols:

less_Q1 = 'less_Q1_{}'.format(column)
more_Q3 = 'more_Q3_{}'.format(column)
Q1 = 'Q1_{}'.format(column)
Q3 = 'Q3_{}'.format(column)

# Q1 : First Quartile ., Q3 : Third Quartile

Q1 = df.approxQuantile(column,[0.25],relativeError=0)
Q3 = df.approxQuantile(column,[0.75],relativeError=0)

# IQR : Inter Quantile Range

# We need to define the index [0], as Q1 & Q3 are a set of lists., to perform a mathematical operation
# Q1 & Q3 are defined seperately so as to have a clear indication on First Quantile & 3rd Quantile
IQR = Q3[0] - Q1[0]

#selecting the data, with -1.5*IQR to + 1.5*IQR., where param = 1.5 default value
less_Q1 = Q1[0] - 1.5*IQR
more_Q3 = Q3[0] + 1.5*IQR

isOutlierCol = 'is_outlier_{}'.format(column)

df = df.withColumn(isOutlierCol,f.when((df[column] > more_Q3) | (df[column] < less_Q1), 1).otherwise(0))

# Selecting the specific columns which we have added above, to check if there are any outliers
selected_columns = [column for column in df.columns if column.startswith("is_outlier")]

# Adding all the outlier columns into a new colum "total_outliers", to see the total number of outliers
df = df.withColumn('total_outliers',sum(df[column] for column in selected_columns))

# Dropping the extra columns created above, just to create nice dataframe., without extra columns
df = df.drop(*[column for column in df.columns if column.startswith("is_outlier")])

return df

Using the customized Outlier function and applying it to a spark dataframe

In [ ]:
new_df = find_outliers(df)
new_df.show()

+-------+------+-----+-----+-------+------+----------------+------------+--------------+
|Channel|Region|Fresh| Milk|Grocery|Frozen|Detergents_Paper|Delicatessen|total_outliers|
+-------+------+-----+-----+-------+------+----------------+------------+--------------+
| 2| 3|12669| 9656| 7561| 214| 2674| 1338| 0|
| 2| 3| 7057| 9810| 9568| 1762| 3293| 1776| 0|
| 2| 3| 6353| 8808| 7684| 2405| 3516| 7844| 1|
| 1| 3|13265| 1196| 4221| 6404| 507| 1788| 0|
| 2| 3|22615| 5410| 7198| 3915| 1777| 5185| 1|
| 2| 3| 9413| 8259| 5126| 666| 1795| 1451| 0|
| 2| 3|12126| 3199| 6975| 480| 3140| 545| 0|
| 2| 3| 7579| 4956| 9426| 1669| 3321| 2566| 0|
| 1| 3| 5963| 3648| 6192| 425| 1716| 750| 0|
| 2| 3| 6006|11093| 18881| 1159| 7425| 2098| 0|
| 2| 3| 3366| 5403| 12974| 4400| 5977| 1744| 0|
| 2| 3|13146| 1124| 4523| 1420| 549| 497| 0|
| 2| 3|31714|12319| 11757| 287| 3881| 2931| 0|
| 2| 3|21217| 6208| 14982| 3095| 6707| 602| 0|
| 2| 3|24653| 9465| 12091| 294| 5058| 2168| 0|
| 1| 3|10253| 1114| 3821| 397| 964| 412| 0|
| 2| 3| 1020| 8816| 12121| 134| 4508| 1080| 0|
| 1| 3| 5876| 6157| 2933| 839| 370| 4478| 1|
| 2| 3|18601| 6327| 10099| 2205| 2767| 3181| 0|
| 1| 3| 7780| 2495| 9464| 669| 2518| 501| 0|
+-------+------+-----+-----+-------+------+----------------+------------+--------------+
only showing top 20 rows

Fitering the above dataframe, to select only those records where the outlier count is < = 1
In [ ]:
new_df_with_no_outliers = new_df.filter(new_df['total_Outliers']<=1)
new_df_with_no_outliers = new_df_with_no_outliers.select(*df.columns)

new_df_with_no_outliers.show()

The count of the dataframe, after removing the outliers

In [ ]:
new_df_with_no_outliers.count()

Out[ ]: 399

The dataset, which contains 2 or more outliers in each record

In [ ]:
data_with_outliers = new_df.filter(new_df['total_Outliers']>=2)
data_with_outliers.show()

+-------+------+-----+-----+-------+------+----------------+------------+--------------+
|Channel|Region|Fresh| Milk|Grocery|Frozen|Detergents_Paper|Delicatessen|total_outliers|
+-------+------+-----+-----+-------+------+----------------+------------+--------------+
| 1| 3|31276| 1917| 4469| 9408| 2381| 4334| 2|
| 2| 3|26373|36423| 22019| 5154| 4337| 16523| 2|
| 2| 3| 4113|20484| 25957| 1158| 8604| 5206| 3|
| 1| 3|56159| 555| 902| 10002| 212| 2916| 2|
| 1| 3|24025| 4332| 4757| 9510| 1145| 5864| 2|
| 2| 3| 630|11095| 23998| 787| 9529| 72| 2|
| 2| 3| 5181|22044| 21531| 1740| 7353| 4985| 2|
| 2| 3|44466|54259| 55571| 7782| 24171| 6465| 6|
| 2| 3| 4967|21412| 28921| 1798| 13583| 1163| 3|
| 2| 3| 4098|29892| 26866| 2616| 17740| 1340| 3|
| 2| 3|35942|38369| 59598| 3254| 26701| 2017| 3|
| 2| 3| 85|20959| 45828| 36| 24231| 1423| 3|
| 2| 3|12205|12697| 28540| 869| 12034| 1009| 2|
| 2| 3|16117|46197| 92780| 1026| 40827| 2944| 3|
| 2| 3|22925|73498| 32114| 987| 20070| 903| 3|
| 1| 3|43265| 5025| 8117| 6312| 1579| 14351| 2|
| 2| 3| 9198|27472| 32034| 3232| 18906| 5130| 4|
| 1| 3|56082| 3504| 8906| 18028| 1480| 2498| 2|
| 2| 3| 1406|16729| 28986| 673| 836| 3| 2|
| 1| 3|76237| 3473| 7102| 16538| 778| 918| 2|
+-------+------+-----+-----+-------+------+----------------+------------+--------------+
only showing top 20 rows

In [ ]:
# Selecting the numerical columns from the original dataframe and converting into pandas

numeric_columns

Out[ ]: ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']

Converting a spark dataframe into pandas dataframe, which enables us to plot the graphs using
seaborn and matplotlib
In [ ]:
original_numerical_df = df.select(*numeric_columns).toPandas()

original_numerical_df.head(10)

Out[ ]: Fresh Milk Grocery Frozen Detergents_Paper Delicatessen

0 12669 9656 7561 214 2674 1338

1 7057 9810 9568 1762 3293 1776

2 6353 8808 7684 2405 3516 7844

3 13265 1196 4221 6404 507 1788

4 22615 5410 7198 3915 1777 5185

5 9413 8259 5126 666 1795 1451

6 12126 3199 6975 480 3140 545

7 7579 4956 9426 1669 3321 2566

8 5963 3648 6192 425 1716 750

9 6006 11093 18881 1159 7425 2098

In [ ]:
# Plotting the box for the dataset after removing the outliers

dataset_after_removing_outliers = new_df_with_no_outliers.toPandas()
dataset_after_removing_outliers.head(10)

Out[ ]: Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen

0 2 3 12669 9656 7561 214 2674 1338

1 2 3 7057 9810 9568 1762 3293 1776

2 2 3 6353 8808 7684 2405 3516 7844

3 1 3 13265 1196 4221 6404 507 1788

4 2 3 22615 5410 7198 3915 1777 5185

5 2 3 9413 8259 5126 666 1795 1451

6 2 3 12126 3199 6975 480 3140 545

7 2 3 7579 4956 9426 1669 3321 2566

8 1 3 5963 3648 6192 425 1716 750

9 2 3 6006 11093 18881 1159 7425 2098

In [ ]:
numeric_columns

Out[ ]: ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']

Plotting the box plot, to check the outliers in the original dataframe., and comaring it with the new
dataframe after removing outliers
In [ ]:
fig,ax = plt.subplots(2,6,figsize=(15,8))
for i,df in enumerate([original_numerical_df,dataset_after_removing_outliers]):

for j, col in enumerate(numeric_columns):

sns.boxplot(data = df, y=col,ax=ax[i][j])

Conclusion / Observations :
1. The initial dataset contains 440 records with 8 features
2. There are total 6 numerical features & 2 categorical features
3. The original spark dataframe has the datatypes of strings
4. The numerical features were converted into IntegerType using cast function
5. Created a customized function to identify outliers in each record
6. Applyng the above customized function, enables us to identify total outliers in each record, based on each feature
7. Filtering the dataset based on the total outliers which are <=1, to eliminate the records with more than 2 outliers
8. The new dataframe, contains 399 records after removing the outliers against 440 records in the inital data frame
9. Comparing the outliers from the original dataset to the new dataset after outlier removal using a box plot
10. There are still some outliers available in the dataset., even after removing majority of the outliers.
11. This shows that the data is skewed, however, the majority of the outliers are removed

In [ ]:

Python Cheat Sheet 2.0
100% (1)
Python Cheat Sheet 2.0
10 pages
Practical-Guide-in-Using - e-RTU.
No ratings yet
Practical-Guide-in-Using - e-RTU.
8 pages
References
No ratings yet
References
3 pages
Cse2005 Operating-Systems Eth 1.0 37 Cse2005
No ratings yet
Cse2005 Operating-Systems Eth 1.0 37 Cse2005
2 pages
Mental Maths Tricks
No ratings yet
Mental Maths Tricks
10 pages
ML LAB manual-1
No ratings yet
ML LAB manual-1
33 pages
Instant Download Your Brain On Exercise Gary L. Wenk PDF All Chapter
100% (4)
Instant Download Your Brain On Exercise Gary L. Wenk PDF All Chapter
64 pages
Spark SQLpdf 20 jan
No ratings yet
Spark SQLpdf 20 jan
4 pages
Gene Expression Lab 7
No ratings yet
Gene Expression Lab 7
4 pages
BUS1136 Syllabus
No ratings yet
BUS1136 Syllabus
4 pages
Please Keep These Details For Reference Purposes
No ratings yet
Please Keep These Details For Reference Purposes
1 page
Almond Quinoa
No ratings yet
Almond Quinoa
3 pages
Foreign Exchange Consensus Forecasts by Consensus Economics Inc.
No ratings yet
Foreign Exchange Consensus Forecasts by Consensus Economics Inc.
36 pages
q1
No ratings yet
q1
2 pages
Practical File ANKIT RAJ CLASS 12-F
No ratings yet
Practical File ANKIT RAJ CLASS 12-F
48 pages
Zomato Rating Prediction
No ratings yet
Zomato Rating Prediction
11 pages
Bigdata - Ipynb - Colab
No ratings yet
Bigdata - Ipynb - Colab
28 pages
PMT2 24
No ratings yet
PMT2 24
56 pages
Ip Practical
No ratings yet
Ip Practical
23 pages
Serato DJ Reloop Beatmix 2 Quickstart Guide
No ratings yet
Serato DJ Reloop Beatmix 2 Quickstart Guide
8 pages
Adobe Scan 25 Nov 2023
No ratings yet
Adobe Scan 25 Nov 2023
17 pages
ANNEX A: SAR Test Setup
No ratings yet
ANNEX A: SAR Test Setup
6 pages
Project_Prog
No ratings yet
Project_Prog
6 pages
YANMAR 3TNV70-GGE-Spec
No ratings yet
YANMAR 3TNV70-GGE-Spec
17 pages
Pyspark_MLlib
No ratings yet
Pyspark_MLlib
4 pages
Programs of Python Pandas
No ratings yet
Programs of Python Pandas
15 pages
B.a. History Syllabus (1)
No ratings yet
B.a. History Syllabus (1)
9 pages
JS Foundations
No ratings yet
JS Foundations
41 pages
DHP Journal
No ratings yet
DHP Journal
29 pages
Methods & Function in Databricks
No ratings yet
Methods & Function in Databricks
34 pages
sowmi DS
No ratings yet
sowmi DS
27 pages
Python Cheat Sheet: Pandas - Numpy - Sklearn Matplotlib - Seaborn BS4 - Selenium - Scrapy
100% (4)
Python Cheat Sheet: Pandas - Numpy - Sklearn Matplotlib - Seaborn BS4 - Selenium - Scrapy
11 pages
Data Wrangling- Jupyter Notebook
No ratings yet
Data Wrangling- Jupyter Notebook
5 pages
PySpark Transformations
No ratings yet
PySpark Transformations
18 pages
De Interview Raamashaamy Qna Bank
No ratings yet
De Interview Raamashaamy Qna Bank
11 pages
Protovision Nad306 .CKT
No ratings yet
Protovision Nad306 .CKT
2 pages
GR 17 004 en
No ratings yet
GR 17 004 en
62 pages
journal
No ratings yet
journal
47 pages
Day 10 Pandasdatacleaning
No ratings yet
Day 10 Pandasdatacleaning
6 pages
ML Practical 4D
No ratings yet
ML Practical 4D
11 pages
Cot 2 Science-7
No ratings yet
Cot 2 Science-7
12 pages
Shared Bike Demand Analysis.ipynb
No ratings yet
Shared Bike Demand Analysis.ipynb
390 pages
Pyspark Interview Questions
No ratings yet
Pyspark Interview Questions
4 pages
Document (4)
No ratings yet
Document (4)
15 pages
P1) Code Uber
No ratings yet
P1) Code Uber
6 pages
PRACTICAL FILE IP - Copy (1)
No ratings yet
PRACTICAL FILE IP - Copy (1)
27 pages
Quarterback Passer Rating Worksheets
No ratings yet
Quarterback Passer Rating Worksheets
8 pages
Khaleel CV
No ratings yet
Khaleel CV
3 pages
IP Practic MINE
No ratings yet
IP Practic MINE
30 pages
ML lab manual 1-10
No ratings yet
ML lab manual 1-10
58 pages
Task 6
No ratings yet
Task 6
14 pages
Informatics Practicals 12th (Personal)
No ratings yet
Informatics Practicals 12th (Personal)
89 pages
1
No ratings yet
1
12 pages
A926534728_28953_8_2025_SPARK MLLIB
No ratings yet
A926534728_28953_8_2025_SPARK MLLIB
8 pages
PDF&Rendition=1
No ratings yet
PDF&Rendition=1
47 pages
EXP-3
No ratings yet
EXP-3
10 pages
Document (4)-1
No ratings yet
Document (4)-1
15 pages
Lab Programmes Adwaith
No ratings yet
Lab Programmes Adwaith
18 pages
Ip Practical 123
No ratings yet
Ip Practical 123
24 pages
Dragonfly Patterns - Plush and Medium Weight Yarn
100% (3)
Dragonfly Patterns - Plush and Medium Weight Yarn
14 pages
Course Book 1
No ratings yet
Course Book 1
135 pages
Assignment 2
No ratings yet
Assignment 2
2 pages
unit 6 Pyspark_MLlib
No ratings yet
unit 6 Pyspark_MLlib
6 pages
Brosura Compass
No ratings yet
Brosura Compass
44 pages
32210 - Tapered Roller Bearings | SKF
No ratings yet
32210 - Tapered Roller Bearings | SKF
11 pages
PRACTICALS
No ratings yet
PRACTICALS
52 pages
Python Cheat Sheet For Excel Users
100% (2)
Python Cheat Sheet For Excel Users
5 pages
Sergei Prokofiev 349: Concerto No. 1 in D Major For Violin and Orchestra, Opus 19
No ratings yet
Sergei Prokofiev 349: Concerto No. 1 in D Major For Violin and Orchestra, Opus 19
3 pages
Meaning and Definition of Indian Contract Act 1872
No ratings yet
Meaning and Definition of Indian Contract Act 1872
186 pages
Practical File Python
No ratings yet
Practical File Python
25 pages
Big Data Analysis
No ratings yet
Big Data Analysis
38 pages
Unit3_3) Pandas.ipynb - Colab
No ratings yet
Unit3_3) Pandas.ipynb - Colab
11 pages
IP Practical PRGM
No ratings yet
IP Practical PRGM
41 pages
Car Price Prediction
No ratings yet
Car Price Prediction
67 pages
Eslprintables 2009127141915749711852
No ratings yet
Eslprintables 2009127141915749711852
1 page
IP (12) Proj File Pandas&Matplotlib
No ratings yet
IP (12) Proj File Pandas&Matplotlib
12 pages
Statement of Purpose
No ratings yet
Statement of Purpose
2 pages
Journal 12
No ratings yet
Journal 12
54 pages
Overview of Data Cleaning
No ratings yet
Overview of Data Cleaning
17 pages
Manual Solar Charger
No ratings yet
Manual Solar Charger
24 pages
Practical File IP
No ratings yet
Practical File IP
27 pages
Big Data Analytics in Apache Spark
No ratings yet
Big Data Analytics in Apache Spark
79 pages
Predicting Home Prices in Bangalore
No ratings yet
Predicting Home Prices in Bangalore
18 pages
Data Cleaning
No ratings yet
Data Cleaning
13 pages
12 Pandas
No ratings yet
12 Pandas
9 pages
Practical File Questions With Answers
No ratings yet
Practical File Questions With Answers
7 pages
Dataframe in Pandas
No ratings yet
Dataframe in Pandas
23 pages
Sensata Catalog
No ratings yet
Sensata Catalog
16 pages
Beginning C# and .NET
From Everand
Beginning C# and .NET
Benjamin Perkins
No ratings yet
IBM System 360 RPG Debugging Template and Keypunch Card
From Everand
IBM System 360 RPG Debugging Template and Keypunch Card
Archive Classics
No ratings yet

Note: This service is not intended for secure transactions such as banking, social media, email, or purchasing. Use at your own risk. We assume no liability whatsoever for broken pages.

Customer Data Outliers Pyspark

Uploaded by

Customer Data Outliers Pyspark

Uploaded by

Outlier Detection using PYSPARK

Installing the PYSPARK module

Requirement already satisfied: pyspark in /usr/local/lib/python3.7/dist-packages (3.2.0)

Mounting the google drive datasets

Importing the necessary libraries

Creating a Spark Session

Out[ ]: SparkSession - in-memory

Reading the dataset from Google Drive using spark session

Counting the number of records

Checking the number of columns

Printing the schema, to check the datatypes

1. All the features are of string data type

Out[ ]: ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']

Creating a variable to store all the numerical columns into a list

Out[ ]: ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']

Creating a customized function

# Identifying the numerical columns in a spark dataframe

# Q1 : First Quartile ., Q3 : Third Quartile

# IQR : Inter Quantile Range

df = df.withColumn(isOutlierCol,f.when((df[column] > more_Q3) | (df[column] < less_Q1), 1).otherwise(0))

Using the customized Outlier function and applying it to a spark dataframe

The count of the dataframe, after removing the outliers

The dataset, which contains 2 or more outliers in each record

Out[ ]: ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']

Out[ ]: Fresh Milk Grocery Frozen Detergents_Paper Delicatessen

0 12669 9656 7561 214 2674 1338

1 7057 9810 9568 1762 3293 1776

2 6353 8808 7684 2405 3516 7844

3 13265 1196 4221 6404 507 1788

4 22615 5410 7198 3915 1777 5185

5 9413 8259 5126 666 1795 1451

6 12126 3199 6975 480 3140 545

7 7579 4956 9426 1669 3321 2566

8 5963 3648 6192 425 1716 750

9 6006 11093 18881 1159 7425 2098

Out[ ]: Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicatessen

0 2 3 12669 9656 7561 214 2674 1338

1 2 3 7057 9810 9568 1762 3293 1776

2 2 3 6353 8808 7684 2405 3516 7844

3 1 3 13265 1196 4221 6404 507 1788

4 2 3 22615 5410 7198 3915 1777 5185

5 2 3 9413 8259 5126 666 1795 1451

6 2 3 12126 3199 6975 480 3140 545

7 2 3 7579 4956 9426 1669 3321 2566

8 1 3 5963 3648 6192 425 1716 750

9 2 3 6006 11093 18881 1159 7425 2098

Out[ ]: ['Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', 'Delicatessen']

for j, col in enumerate(numeric_columns):

You might also like

Pfad - The Proxy pFad of © 2024 Garber Painting. All rights reserved.