INTERNSHIP
An Internship Report submitted at the end of the seventh semester
BACHELOR OF TECHNOLOGY
IN
COMPUTER SCIENCE AND ENGINEERING
Submitted By
VISWANADHA NIKHIL
(21981A05I5)
SKILL DZIRE
2024-2025
CERTIFICATE
This is to certify that this project entitled “DATA SCIENCE”, done by “VISWANADHA NIKHIL
(21981A05I5)”, a student of B.Tech in the Department of Computer Science and Engineering, Raghu
Engineering College, during the period 2021-2025, in partial fulfillment for the award of the Degree of Bachelor of
Technology in Computer Science and Engineering from Jawaharlal Nehru Technological University, Gurajada
Vizianagaram, is a record of bonafide work carried out under my guidance and supervision. The results embodied in
this internship report have not been submitted to any other University or Institute for the award of any Degree.
EXTERNAL EXAMINER
DISSERTATION APPROVAL SHEET
This is to certify that the dissertation titled
E-COMMERCE SALES USING EDA
BY
VISWANADHA NIKHIL (21981A05I5)
PROJECT GUIDE
(Assistant Professor)
Internal Examiner
External Examiner
HOD
(Professor)
Date:
DECLARATION
This is to certify that this internship titled “DATA SCIENCE” is bonafide work done by
me, in partial fulfillment of the requirements for the award of the degree B.Tech, and submitted
to the Department of Computer Science and Engineering, Raghu Engineering College,
Dakamarri.
I also declare that this internship is a result of my own effort, that it has not been
copied from anyone, and that I have taken citations only from the sources mentioned in
the references.
This work was not submitted earlier at any other University or Institute for the award of
any degree.
Date:
Place:
VISWANADHA NIKHIL
(21981A05I5)
CERTIFICATE
ACKNOWLEDGEMENT
I would like to thank the Principal, Dr. Ch. Srinivasu, of “Raghu Engineering College”,
for providing the requisite facilities to carry out projects on campus. Your expertise in the
subject matter and dedication towards our project have been a source of inspiration for all of us.
I would like to thank the Skill Dzire professionals for providing the technical guidance to
carry out the module assigned. Your expertise in the subject matter and dedication towards our
project have been a source of inspiration for all of us.
I extend my heartfelt thanks to all faculty members of the Computer Science
department for imparting the theoretical and practical knowledge that was applied in
the project.
Regards
VISWANADHA NIKHIL
(21981A05I5)
TABLE OF CONTENTS
Section 1: Python
1.1. Introduction to Python
1.2. Basic Syntax and Data Types
1.3. Control Statements
1.4. Data Structures
1.5. Object Oriented Programming
Section 2: DBMS & SQL
2.1. Introduction of DBMS
2.2. Overview of DBMS
Section 3: Data Science
Section 4: Implementation
4.1. Problem Statement
4.2. Objective
4.3. Data Analysis
Conclusion
PYTHON
Python is a versatile and widely used programming language, created by Guido van
Rossum and first released in 1991. Known for its simple, English-like syntax, Python allows developers to
write programs efficiently with fewer lines of code. It supports multiple programming
paradigms, including procedural, object-oriented, and functional programming. Python is
highly flexible and can be used for a wide range of applications, such as web
development, software automation, data analysis, and mathematical computations. Its
platform independence (running on Windows, Mac, Linux, etc.) and rapid prototyping
capabilities make it a popular choice for both beginners and professionals.
FEATURES OF PYTHON:
1. Simple, readable, English-like syntax
2. Interpreted and dynamically typed
3. Platform independent (runs on Windows, Mac, Linux, etc.)
4. Free and open source
5. Extensive standard library
6. Supports procedural, object-oriented, and functional programming
APPLICATIONS OF PYTHON:
1. Web Development
2. Data Science and Analytics
3. Machine Learning and Artificial Intelligence
4. Scientific Computing
5. Automation and Scripting
6. Game Development
7. Desktop Applications
8. Network Programming
9. Embedded Systems
10. Cybersecurity
11. Internet of Things (IoT)
12. Finance and Trading
13. Cloud Computing
14. Education
15. Content Management Systems (CMS)
1.2. BASIC SYNTAX:
1. Comments
Single-line comments: Use the # symbol.
For Example:
print("Hello, World!") # This prints a message
DATA TYPES:
Python data types are the classification or categorization of data items. A data type represents
the kind of value a variable holds and determines what operations can be performed on it.
Since everything is an object in Python, data types are classes and variables are instances
(objects) of these classes. The following are the standard or built-in data types in Python:
Numeric
Sequence Type
Boolean
Set
Dictionary
Binary Types
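For example, one value of each category (the variable names here are illustrative):
num = 42                        # Numeric (int)
pi = 3.14                       # Numeric (float)
name = "Nikhil"                 # Sequence type (str)
scores = [85, 90, 78]           # Sequence type (list)
passed = True                   # Boolean
unique_ids = {101, 102, 103}    # Set
student = {"name": "Nikhil"}    # Dictionary
raw = b"bytes"                  # Binary type (bytes)
print(type(num), type(student)) # Output: <class 'int'> <class 'dict'>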
print("x is greater than 5") # Output: x is greater than 5
elif Statement: Allows multiple conditions. If the first condition is False, the elif
statement is checked..
Example:
x = 10
if x > 15:
    print("x is greater than 15")
elif x > 5:
    print("x is greater than 5 but less than or equal to 15")
else Statement: Executes a block of code if none of the conditions are True.
Example:
x = 3
if x > 5:
    print("x is greater than 5")
else:
    print("x is less than or equal to 5") # Output: x is less than or equal to 5
1.4. DATA STRUCTURES:
1. Lists
Definition: Lists are ordered, mutable collections of elements that allow duplicates,
making them the general-purpose container in Python.
Syntax: Use square brackets [].
Example:
numbers = [10, 20, 30]
numbers.append(40) # Add an element
print(numbers[0]) # Access the first element
2. Tuples
Definition: Tuples are ordered, immutable collections of elements, making them
useful when you want to ensure data remains unchanged.
Syntax: Use parentheses ().
Example:
coordinates = (10.5, 20.3, "N")
print(coordinates[2]) # Access the third element
3. Dictionaries
Definition: Dictionaries store data as key-value pairs, where each key is unique.
They are unordered and mutable, ideal for fast lookups.
Syntax: Use curly braces {}.
Example:
student = {"name": "Nikhil", "grade": "A", "age": 20}
print(student["name"]) # Access value by key
student["age"] = 21 # Modify the value
4. Sets
Definition: Sets are unordered collections of unique elements. They are mutable
but do not allow duplicates, making them useful for membership tests and
removing duplicates.
Syntax: Use curly braces {} or the set() function.
Example:
colors = {"red", "blue", "green", "red"} # Duplicate "red" is removed
print(colors) # Output: {'red', 'blue', 'green'}
colors.add("yellow") # Add an element
colors.discard("blue") # Remove an element
5. Strings
Definition: Strings are ordered, immutable sequences of characters, commonly
used to store and manipulate text.
Syntax: Use single or double quotes (' ', " ").
Example:
greeting = "Hello, Python!"
print(greeting[7]) # Access a specific character
print(greeting.upper()) # Convert to uppercase
1.5. OBJECT ORIENTED PROGRAMMING:
1. Classes and Objects
A class is a blueprint that bundles attributes and methods; an object is an instance of a class.
Example:
class Car:
    def __init__(self, brand, model):
        self.brand = brand
        self.model = model

    def start(self):
        print(f"{self.brand} {self.model} is starting...")

my_car = Car("Toyota", "Corolla")
my_car.start() # Output: Toyota Corolla is starting...
2. Encapsulation
Encapsulation is the bundling of data (attributes) and methods (functions) into a single
unit, or class. It also controls access to the internal state of an object, typically using
private or protected members.
Access Modifiers:
Private: Prefix the attribute with __ (double underscores), restricting access
from outside the class.
Protected: Prefix with _ (single underscore), indicating that it should not be
accessed directly outside the class.
Example:
class Employee:
    def __init__(self, name, salary):
        self.name = name        # Public attribute
        self.__salary = salary  # Private attribute

    def get_salary(self):
        return self.__salary    # Access private attribute via method
3. Inheritance
Inheritance allows one class (child class) to inherit the properties and methods of
another class (parent class), enabling code reuse and hierarchical relationships.
Example:
class Animal:
    def sound(self):
        print("Animal makes a sound")

class Dog(Animal):
    def sound(self):
        print("Dog barks")

my_dog = Dog()
my_dog.sound() # Output: Dog barks
4. Polymorphism
Polymorphism allows different objects to respond to the same method or function in a
way that is specific to their class. This can happen through method overriding or
operator overloading.
Example:
class Cat:
    def sound(self):
        print("Cat meows")

class Dog:
    def sound(self):
        print("Dog barks")

# The same call behaves differently depending on the object's class
for animal in (Cat(), Dog()):
    animal.sound() # Output: Cat meows, then Dog barks
5. Abstraction
Abstraction hides the complex implementation details and only exposes the necessary
functionality. In Python, abstraction is implemented using abstract base classes (ABC)
from the abc module.
Example:
from abc import ABC, abstractmethod

class Shape(ABC):
    @abstractmethod
    def area(self):
        pass
class Circle(Shape):
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return 3.14 * self.radius * self.radius

circle = Circle(5)
print(circle.area()) # Output: 78.5
6. Constructors and Destructors
The constructor __init__ initializes an object when it is created, and the destructor
__del__ runs when the object is destroyed.
Example:
class Book:
    def __init__(self, title):
        self.title = title

    def __del__(self):
        print(f"Book '{self.title}' deleted.")

book = Book("Python Basics")
del book # Output: Book 'Python Basics' deleted.
DBMS & SQL
2.1. INTRODUCTION OF DBMS:
A database is an organized collection of interrelated data that facilitates the efficient retrieval,
insertion, and deletion of information. It stores data in structured forms such as tables, views,
and schemas, enabling better management and organization. For example, a university
database manages information about students, faculty, and administrative staff, helping in
efficient operations andnd data handling. A Database Management System (DBMS) is
specialized software that manages these databases, allowing users to create, modify, and
query data while ensuring data security, integrity, and controlled access, making data storage
and retrieval more efficient.
FEATURES OF DBMS:
Data Abstraction
Data Independence
Data Security
Data Integrity
Efficient Query Processing
Multi-User Access
Backup and Recovery
Data Redundancy Control
Concurrency Control
Support for Data Relationships and Constraints
APPLICATIONS OF DBMS:
Banking: managing customer accounts and transactions
Universities: storing student registrations, grades, and staff records
Airlines: reservation and schedule information
Telecommunications: call records and billing
E-commerce: product catalogs, orders, and customer data
2.2. OVERVIEW OF DBMS:
Types of DBMS:
1. Relational Database Management System (RDBMS): Data is organized into tables
(relations) with rows and columns, and the relationships between the data are
managed through primary and foreign keys. SQL (Structured Query Language) is
used to query and manipulate the data.
2. NoSQL DBMS: Designed for high-performance scenarios and large-scale data,
NoSQL databases store data in various non-relational formats such as key-value pairs,
documents, graphs, or columns.
3. Object-Oriented DBMS (OODBMS): Stores data as objects, similar to those used in
object-oriented programming, allowing for complex data representations and
relationships
Database Languages:
1.Data Definition Language (DDL):
DDL is the short name for Data Definition Language; it deals with database schemas
and descriptions of how the data should reside in the database.
CREATE: create a database and its objects (tables, indexes, views, stored
procedures, functions, and triggers)
ALTER: alters the structure of the existing database
DROP: delete objects from the database
TRUNCATE: remove all records from a table, including all space allocated for
the records
COMMENT: add comments to the data dictionary
RENAME: rename an object
2. Data Manipulation Language (DML):
DML deals with the manipulation of data present in the database.
INSERT: insert data into a table
UPDATE: update existing data within a table
DELETE: delete records from a database table
MERGE: UPSERT operation (insert or update)
CALL: call a PL/SQL or Java subprogram
EXPLAIN PLAN: interpret the data access path
LOCK TABLE: concurrency control
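As a minimal sketch of how these commands look in practice, the following uses Python's
built-in sqlite3 module, which supports the core DDL and DML statements (the table and
column names here are illustrative):
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
cur = conn.cursor()

# DDL: define and alter the schema
cur.execute("CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, grade TEXT)")
cur.execute("ALTER TABLE students ADD COLUMN age INTEGER")

# DML: manipulate the data
cur.execute("INSERT INTO students (name, grade, age) VALUES (?, ?, ?)", ("Nikhil", "A", 20))
cur.execute("UPDATE students SET age = 21 WHERE name = 'Nikhil'")
print(cur.execute("SELECT * FROM students").fetchall())

# DDL: remove the table
cur.execute("DROP TABLE students")
conn.close()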
ADVANTAGES OF DBMS:
Data Abstraction
Data Integrity and Security
Data Redundancy Reduction
Consistent Data Management
Multi-user Environment
Backup and Recovery
Efficient Query Processing
Concurrency Control
DISADVANTAGES OF DBMS:
Complexity
Cost of Hardware and Software
Large Storage Requirements
Performance Overhead
Frequent Updates and Maintenance
Training and Technical Expertise
COMPONENTS OF SQL:
1. Databases: Databases are structured collections of data organized into tables, rows,
and columns. They serve as repositories for efficiently storing information, allowing
users to manage and access data seamlessly.
2. Tables: Tables are fundamental building blocks of a database, consisting of rows
(records) and columns (attributes or fields). They define the structure and
relationships of the stored information, ensuring data integrity and consistency.
3. Queries: Queries are SQL commands used to interact with databases. They enable
users to retrieve, update, insert, or delete data from tables, allowing for efficient data
manipulation and retrieval.
4. Constraints: Constraints are rules applied to tables to maintain data integrity. They
define conditions that data must meet to be stored in the database, ensuring accuracy
and consistency.
5. Stored Procedures: Stored procedures are pre-compiled SQL statements stored in the
database. They can accept parameters, execute complex operations, and return results,
enhancing efficiency, reusability, and security in database management.
6. Transactions: Transactions are groups of SQL statements executed as a single unit of
work. They ensure data consistency and integrity by allowing for the rollback of
changes if any part of the transaction fails.
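A small sketch of this all-or-nothing behavior, again using sqlite3 (the table and values
are illustrative):
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance REAL)")
conn.execute("INSERT INTO accounts (balance) VALUES (100.0)")
conn.commit()

try:
    conn.execute("UPDATE accounts SET balance = balance - 50 WHERE id = 1")
    raise RuntimeError("simulated failure before the matching credit")
except RuntimeError:
    conn.rollback()  # the debit above is undone

print(conn.execute("SELECT balance FROM accounts WHERE id = 1").fetchone())  # (100.0,)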
SQL JOINS:
A JOIN clause combines rows from two or more tables based on a related column.
1. INNER JOIN: This join returns only the rows where there is a match in both tables.
Syntax:
SELECT table1.column1, table1.column2, table2.column1, ....
FROM table1
INNER JOIN table2
ON table1.matching_column = table2.matching_column;
2. LEFT JOIN (or LEFT OUTER JOIN): This join returns all rows from the left table
and the matched rows from the right table. If there is no match, NULL values are
returned for columns from the right table.
Syntax:
SELECT table1.column1, table1.column2, table2.column1, ....
FROM table1
LEFT JOIN table2
ON table1.matching_column = table2.matching_column;
3. RIGHT JOIN (or RIGHT OUTER JOIN): This join is the opposite of the LEFT
JOIN. It returns all rows from the right table and the matched rows from the left table.
If there is no match, NULL values are returned for columns from the left table.
Syntax:
SELECT table1.column1, table1.column2, table2.column1, ....
FROM table1
RIGHT JOIN table2
ON table1.matching_column = table2.matching_column;
4. FULL JOIN (or FULL OUTER JOIN): This join returns all rows when there is a
match in either the left or right table. Rows that do not have a match in one of
the tables will have NULL values for the columns of that table.
Syntax:
SELECT table1.column1, table1.column2, table2.column1, ....
FROM table1
FULL JOIN table2
ON table1.matching_column = table2.matching_column;
DATA SCIENCE
Data science combines statistics, programming, and domain knowledge to extract insights
from data. Statistics is its foundation, and statistical methods are commonly divided into
two types.
TYPES OF STATISTICS:
1. Descriptive Statistics: Descriptive statistics are tools used to summarize and organize
large sets of data, making complex information more comprehensible. They involve the
calculation of measures such as:
Measures of Central Tendency:
Mean: The average of a data set.
Median: The middle value when data is arranged in order.
Mode: The most frequently occurring value in a dataset.
Measures of Dispersion:
Variance: The measure of how much the data varies from the mean.
Standard Deviation: The square root of variance, indicating the spread of
data points.
Graphical Representations: Tools like histograms, bar charts, and box plots help
visualize data trends and distributions.
These statistics help present data in a clear and concise manner, allowing for easier
interpretation of trends and patterns within the data.
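A short sketch computing these measures with Python's built-in statistics module (the
data is invented):
import statistics as st

data = [12, 15, 12, 18, 20, 15, 12]

print(st.mean(data))     # Mean: about 14.86
print(st.median(data))   # Median: 15
print(st.mode(data))     # Mode: 12
print(st.variance(data)) # Sample variance
print(st.stdev(data))    # Sample standard deviation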
2. Inferential Statistics: Inferential statistics use a sample of data to draw conclusions
about a larger population. Techniques such as hypothesis testing, confidence intervals, and
regression analysis allow analysts to generalize findings beyond the observed data and
quantify the uncertainty of those generalizations.
PYTHON LIBRARIES FOR DATA SCIENCE:
1. NumPy: NumPy is the foundational library for numerical computing in Python. It
provides support for large, multi-dimensional arrays and matrices, along with a collection
of mathematical functions to operate on them efficiently.
2. Pandas: Built on top of NumPy, Pandas offers data structures like Series and
DataFrames, which simplify data manipulation and analysis. It provides tools for data
cleaning, filtering, merging, and reshaping, making it a staple for data wrangling tasks.
3. Matplotlib: This widely-used library is essential for data visualization. Matplotlib
provides a flexible interface for creating static, animated, and interactive plots in Python.
It enables users to create a variety of visualizations, such as line plots, bar charts, and
histograms.
4. Seaborn: Built on Matplotlib, Seaborn provides a high-level interface for creating
informative and attractive statistical graphics. It simplifies complex visualizations and
supports beautiful default styles and color palettes.
5. SciPy: This library builds on NumPy and provides additional functionality for scientific and
technical computing. It includes modules for optimization, integration, interpolation, eigenvalue
problems, and more, making it useful for complex mathematical calculations.
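To make this concrete, a minimal sketch touching the first three libraries (the numbers are
invented for illustration):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

sales = np.array([120, 135, 150, 160, 180])           # NumPy array math
df = pd.DataFrame({"month": ["Jan", "Feb", "Mar", "Apr", "May"],
                   "sales": sales})                   # Pandas DataFrame
print(df.describe())                                  # quick summary statistics

plt.plot(df["month"], df["sales"], marker="o")        # Matplotlib line plot
plt.title("Monthly Sales")
plt.xlabel("Month")
plt.ylabel("Sales")
plt.show()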
1. SIMPLE LINEAR REGRESSION
Simple Linear Regression models the relationship between a single independent variable
and a dependent variable by fitting a straight line to the data.
The general form of the equation is:
$y = a_0 + a_1 x + \epsilon$
Where:
a0 is the intercept, representing the value of y when x = 0.
a1 is the slope, showing how much y changes with a change in x.
ε represents the error term, which accounts for the discrepancies between the
predicted and actual values.
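A brief sketch of fitting such a model with scikit-learn (assuming scikit-learn is
installed; the data is invented):
import numpy as np
from sklearn.linear_model import LinearRegression

x = np.array([[1], [2], [3], [4], [5]])  # independent variable
y = np.array([2.1, 4.3, 6.2, 8.1, 9.9])  # dependent variable

model = LinearRegression().fit(x, y)
print(model.intercept_, model.coef_[0])  # estimates of a0 and a1
print(model.predict([[6]]))              # predicted y for x = 6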
2. MULTIPLE LINEAR REGRESSION
Multiple Linear Regression is an extension of simple linear regression that models the
relationship between one dependent variable and two or more independent variables.
The general form of the equation is:
$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \epsilon$
Where:
y is the predicted value (dependent variable).
x1, x2, ..., xn are independent variables (predictors).
β0, β1, ..., βn are the coefficients representing the contribution of each
independent variable.
3. POLYNOMIAL REGRESSION
Polynomial Regression generalizes the concept of linear regression to model more
complex, non-linear relationships between the independent variable and the dependent
variable using higher-order polynomial terms.
The general equation for polynomial regression:
$y = a_0 + a_1 x + a_2 x^2 + \dots + a_n x^n + \epsilon$
This allows for curves that better fit non-linear data patterns.
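For instance, a quick polynomial fit with NumPy (data invented to follow roughly
y = x^2 + x + 1):
import numpy as np

x = np.array([0, 1, 2, 3, 4, 5])
y = np.array([1.0, 3.1, 6.9, 13.2, 20.8, 31.1])

coeffs = np.polyfit(x, y, deg=2)  # fit a degree-2 polynomial
print(coeffs)                     # roughly [1, 1, 1]
print(np.polyval(coeffs, 6))      # prediction at x = 6, about 43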
4. EXPONENTIAL REGRESSION
Exponential Regression is useful for modeling data where the dependent variable
exhibits exponential growth or decay in relation to the independent variable.
The general form is:
$y = a \cdot e^{bx}$
Where the dependent variable either grows or decays at a rate proportional to its
current value.
5. LOGISTIC REGRESSION
Logistic Regression is a classification algorithm that models the probability of a
binary outcome (two possible classes: 0 or 1). Instead of a linear relationship, logistic
regression uses the logit function to predict the probability.
The equation for logistic regression is:
$p = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x)}}$
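A minimal classification sketch with scikit-learn's LogisticRegression (invented data):
import numpy as np
from sklearn.linear_model import LogisticRegression

x = np.array([[1], [2], [3], [4], [5], [6]])  # e.g., hours studied
y = np.array([0, 0, 0, 1, 1, 1])              # fail (0) or pass (1)

clf = LogisticRegression().fit(x, y)
print(clf.predict([[3.5]]))                   # predicted class
print(clf.predict_proba([[3.5]]))             # probability of each class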
Feature Engineering: Creating or transforming features for better model
performance.
Dimensionality Reduction: Reducing the number of features while retaining key
information (e.g., using PCA).
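For instance, a sketch of dimensionality reduction with PCA in scikit-learn (random data
for illustration):
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features

pca = PCA(n_components=2)             # keep the 2 strongest directions
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # variance retained per component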
IMPLEMENTATION
4.1. PROBLEM STATEMENT:
E-commerce businesses face significant challenges in analyzing large volumes of sales data,
making it difficult to identify trends and customer preferences. Poor data insights can lead to
ineffective marketing strategies, lower customer satisfaction, and lost revenue opportunities.
Additionally, the dynamic nature of online shopping requires businesses to continuously
adapt their strategies based on real-time data, which can overwhelm teams lacking the
necessary analytical tools. Consequently, organizations may struggle to maintain
competitiveness in an increasingly crowded marketplace.
4.2. OBJECTIVE:
This project aims to perform exploratory data analysis (EDA) to reveal critical insights from
sales data. The analysis will help businesses understand sales performance, customer
behavior, and market trends, enabling data-driven decision-making to enhance sales strategies
and optimize overall business performance. Additionally, the findings will assist in
identifying key performance indicators (KPIs) and areas for growth, allowing businesses to
tailor their offerings to meet customer demands effectively. Ultimately, the goal is to foster a
culture of continuous improvement and adaptability within the organization.
Data set:
A data set is chosen from Kaggle. The data set includes the following:
Amazon Sale Report
Cloud Warehouse
Expense IIGF
International sale Report
May-2022
PL March 2021
Sale Report
4.3. Importing Libraries and Exploring Input Data:
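The original report presents this step as notebook screenshots. A representative sketch of
the kind of code involved (the file name follows the dataset list above; the exact columns
in the file are not reproduced here):
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

amazon = pd.read_csv("Amazon Sale Report.csv")  # load the Amazon sales data

print(amazon.head())          # first five rows
amazon.info()                 # column names, types, non-null counts
print(amazon.describe())      # summary statistics for numeric columns
print(amazon.isnull().sum())  # missing values per column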
Visualizing Amazon Sales Report Trends:
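Shown as chart screenshots in the original; a hedged sketch of a typical trend plot
('Date' and 'Amount' are assumed column names in the Kaggle file):
import pandas as pd
import matplotlib.pyplot as plt

amazon = pd.read_csv("Amazon Sale Report.csv")
amazon["Date"] = pd.to_datetime(amazon["Date"], errors="coerce")

monthly = amazon.groupby(amazon["Date"].dt.to_period("M"))["Amount"].sum()
monthly.plot(kind="line", marker="o", title="Monthly Amazon Sales")
plt.xlabel("Month")
plt.ylabel("Total Sales Amount")
plt.tight_layout()
plt.show()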
CLOUD WAREHOUSE COMPARISON:
Compare inventory levels and sales data across different cloud warehouses, and identify
the most efficient warehouse in terms of sales turnover and inventory management:
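One way this comparison might be approached (the file name follows the dataset list;
'Warehouse', 'Sales', and 'Inventory' are assumed column names):
import pandas as pd

warehouse = pd.read_csv("Cloud Warehouse.csv")

summary = warehouse.groupby("Warehouse").agg(
    total_sales=("Sales", "sum"),
    avg_inventory=("Inventory", "mean"),
)
summary["turnover"] = summary["total_sales"] / summary["avg_inventory"]
print(summary.sort_values("turnover", ascending=False))  # most efficient first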
EXPENSE IIGF:
Analyze monthly expenses related to inventory, infrastructure, and general facilities, and
create visualizations to identify trends and suggest areas for cost reduction:
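A representative sketch (file name per the dataset list; 'Expense Category' and 'Amount'
are assumed column names):
import pandas as pd
import matplotlib.pyplot as plt

expenses = pd.read_csv("Expense IIGF.csv")

by_category = expenses.groupby("Expense Category")["Amount"].sum().sort_values()
by_category.plot(kind="barh", title="Expenses by Category")
plt.xlabel("Total Amount")
plt.tight_layout()
plt.show()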
MAY-2022:
Perform a detailed analysis of sales transactions for May 2022, and compare with data
from other months to identify seasonal trends or anomalies:
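A sketch of the comparison (file names per the dataset list; 'Date' and 'Amount' are
assumed columns):
import pandas as pd

may = pd.read_csv("May-2022.csv")
print(may.describe())  # profile the May 2022 transactions

amazon = pd.read_csv("Amazon Sale Report.csv")
amazon["Date"] = pd.to_datetime(amazon["Date"], errors="coerce")
monthly_totals = amazon.groupby(amazon["Date"].dt.to_period("M"))["Amount"].sum()
print(monthly_totals)  # compare May 2022 against neighbouring months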
PL MARCH 2021:
Conduct a profit and loss analysis for March 2021, and correlate the profit margins
with specific sales campaigns or product launches during that period:
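A sketch of the profit and loss computation ('Revenue' and 'Cost' are assumed column
names):
import pandas as pd

pl = pd.read_csv("PL March 2021.csv")

pl["profit"] = pl["Revenue"] - pl["Cost"]
pl["margin"] = pl["profit"] / pl["Revenue"]
print(pl[["profit", "margin"]].describe())  # overall profitability picture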
SALE REPORT:
Conduct an overall sales report analysis, segment sales by product categories, and
analyze the sales performance of each category:
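A sketch of the category segmentation ('Category' and 'Amount' are assumed column names):
import pandas as pd
import matplotlib.pyplot as plt

sale = pd.read_csv("Sale Report.csv")

by_cat = sale.groupby("Category")["Amount"].sum().sort_values(ascending=False)
by_cat.plot(kind="bar", title="Sales by Product Category")
plt.ylabel("Total Sales")
plt.tight_layout()
plt.show()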
CONCLUSION
The Data Science Course provides an in-depth exploration of how data analysis techniques
and machine learning models can be applied to solve real-world problems. Utilizing powerful
libraries such as Pandas, NumPy, and Matplotlib, we facilitated effective data manipulation
and visualization, allowing complex datasets to become more comprehensible and actionable.
This foundational understanding is crucial for extracting meaningful insights and driving
informed decision-making in various domains.
Exploratory data analysis (EDA) emerged as a key component of the course, highlighting its
importance in uncovering hidden insights within data. By employing statistical methods,
including hypothesis testing and regression analysis, we were able to establish a strong basis
for making data-driven decisions. This analytical approach not only bolstered the reliability
of our findings but also provided a clear roadmap for future investigations into data patterns
and trends.
In addition to traditional statistical methods, the course extensively covered the application of
various machine learning algorithms, such as simple and multiple linear regression,
polynomial regression, and logistic regression. These techniques enable us to create
predictive models that effectively analyze trends and identify underlying patterns in the data.
Evaluating model performance through metrics like confusion matrices and ROC curves
ensures that we can accurately assess the effectiveness of our models and refine them as
necessary.
Thorough validation and rigorous testing of our methodologies confirmed that our
approaches adhere to best practices in the field of data science. This commitment to
excellence not only enhances the credibility of our results but also reveals opportunities for
further improvement and innovation. By adopting a meticulous testing process, we ensure
that our analyses and models remain robust and reliable.
Overall, the Data Science Course highlights the transformative potential of data science
techniques in fostering innovation and informed decision-making across diverse industries.
By combining statistical analysis, machine learning models, and effective data visualization,
we have built a solid foundation for future advancements in data-driven applications. This
empowers us to tackle more complex challenges and leverage the full potential of data
science to drive positive outcomes.