Cognizant's Artificial Intelligence Task 1
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
#Section 2 - Data loading
Now that Google Drive is mounted, you can store the CSV file anywhere in your Drive and update the path variable below to access it within this notebook. Once we've updated the path, let's read this CSV file into a pandas dataframe and see what it looks like.
path = "/content/drive/MyDrive/sample_sales_data.csv"
df = pd.read_csv(path)
df.drop(columns=["Unnamed: 0"], inplace=True, errors='ignore')
df.head(5)
transaction_id timestamp \
0 a1c82654-c52c-45b3-8ce8-4c2a1efe63ed 2022-03-02 09:51:38
1 931ad550-09e8-4da6-beaa-8c9d17be9c60 2022-03-06 10:33:59
2 ae133534-6f61-4cd6-b6b8-d1c1d8d90aea 2022-03-04 17:20:21
3 157cebd9-aaf0-475d-8a11-7c8e0f5b76e4 2022-03-02 17:23:58
4 a81a6cd3-5e0c-44a2-826c-aea43e46c514 2022-03-05 14:32:43
To get you started, an explanation of what the column names mean is provided below:
transaction_id = a unique ID that is assigned to each transaction
timestamp = the datetime at which the transaction was made
product_id = an ID that is assigned to the product that was sold. Each product has a unique ID
category = the category that the product is contained within
customer_type = the type of customer that made the transaction
unit_price = the price that 1 unit of this item sells for
quantity = the number of units sold for this product within this transaction
total = the total amount payable by the customer
payment_type = the payment method used by the customer
After this, you should try to compute some descriptive statistics of the numerical columns within the dataset, such as:
#Data types:
df.dtypes
transaction_id object
timestamp object
product_id object
category object
customer_type object
unit_price float64
quantity int64
total float64
payment_type object
dtype: object
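The `timestamp` column is loaded as `object` (plain strings), so any time-based analysis needs a conversion first. The following is a minimal sketch, assuming the `YYYY-MM-DD HH:MM:SS` format seen in the `df.head()` output; the two-row `sample` frame and the derived `hour` column are hypothetical illustrations, not part of the original notebook.

```python
import pandas as pd

# Hypothetical two-row sample mirroring the timestamp format shown in df.head()
sample = pd.DataFrame({
    "timestamp": ["2022-03-02 09:51:38", "2022-03-06 10:33:59"],
})

# Parse the strings into datetime64 values; the explicit format is an assumption
sample["timestamp"] = pd.to_datetime(sample["timestamp"], format="%Y-%m-%d %H:%M:%S")

# Derived column (hypothetical name): hour of day of each transaction
sample["hour"] = sample["timestamp"].dt.hour

print(sample.dtypes)
```

On the real data, the same `pd.to_datetime` call would be applied to `df["timestamp"]`.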
#Missing values:
df.isna().sum()
transaction_id 0
timestamp 0
product_id 0
category 0
customer_type 0
unit_price 0
quantity 0
total 0
payment_type 0
dtype: int64
df.describe()
unit_price quantity total
count 7829.000000 7829.000000 7829.000000
mean 7.819480 2.501597 19.709905
std 5.388088 1.122722 17.446680
min 0.190000 1.000000 0.190000
25% 3.990000 1.000000 6.570000
50% 7.190000 3.000000 14.970000
75% 11.190000 4.000000 28.470000
max 23.990000 4.000000 95.960000
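The summary above is consistent with `total` being `unit_price * quantity` (for example, the maximum 23.99 x 4 = 95.96), and `transaction_id` is described as unique. Both can be checked directly. This is a sketch under those assumptions; the `sample` rows are hypothetical, and on the real data `df` would take the place of `sample`.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows; on the real data, use df instead of sample
sample = pd.DataFrame({
    "transaction_id": ["t1", "t2", "t3"],
    "unit_price": [3.99, 7.19, 23.99],
    "quantity": [2, 1, 4],
    "total": [7.98, 7.19, 95.96],
})

# Assumption: total should equal unit_price * quantity (up to float rounding)
expected_total = sample["unit_price"] * sample["quantity"]
mismatches = sample[~np.isclose(sample["total"], expected_total)]

# transaction_id is described as unique, so duplicates would signal a problem
duplicate_ids = int(sample["transaction_id"].duplicated().sum())

print(f"Rows where total != unit_price * quantity: {len(mismatches)}")
print(f"Duplicate transaction IDs: {duplicate_ids}")
```

If either count is non-zero on the real data, those rows are worth inspecting before any modelling.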
#Data visualization
Distribution of the numeric columns
plt.figure(figsize=(10, 6))
sns.histplot(df['quantity'])
plt.title('Quantity Distribution')
plt.xlabel('Quantity')
plt.ylabel('Count')
plt.show()
plt.figure(figsize=(10, 6))
sns.histplot(df['unit_price'])
plt.title('Unit Price Distribution')
plt.xlabel('Unit Price')
plt.ylabel('Count')
plt.show()
plt.figure(figsize=(10, 6))
sns.histplot(df['total'])
plt.title('Total Price Distribution')
plt.xlabel('Total Price')
plt.ylabel('Count')
plt.show()
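The histograms above cover only the numeric columns. For the categorical columns (`category`, `customer_type`, `payment_type`), frequency counts give the equivalent view. A minimal sketch follows; the values in `sample` are hypothetical placeholders, not taken from the dataset.

```python
import pandas as pd

# Hypothetical sample; the values below are placeholders, not real dataset values
sample = pd.DataFrame({
    "category": ["fruit", "fruit", "dairy"],
    "customer_type": ["member", "guest", "member"],
    "payment_type": ["cash", "card", "cash"],
})

# Count how often each value occurs in every categorical column
counts = {col: sample[col].value_counts() for col in sample.columns}

for col, series in counts.items():
    print(f"--- {col} ---")
    print(series)
```

On the real data, the same counts could be plotted with `sns.countplot(data=df, x='category')`.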
plt.figure(figsize=(10, 6))
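# Correlate numeric columns only; corr() on string columns raises an error in recent pandas
sns.heatmap(df.select_dtypes(include='number').corr(), annot=True)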
plt.title('Correlation Matrix')
plt.show()
"How to better stock the items that they sell" From this dataset, it is impossible to answer that
question. To take the next step on this project with the client, it is clear that:
We need more rows of data. The current sample covers only 1 store and 1 week's worth of data.
We need to frame the specific problem statement that we want to solve. The current business problem is too broad; we should narrow the focus in order to deliver a valuable end product.
We need more features. Based on the problem statement that we move forward with, we need more columns (features) that may help us to understand the outcome that we're solving for.