
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

School of Engineering and Technology


Hemvati Nandan Bahuguna Garhwal University
(A Central University) Srinagar Garhwal, Uttarakhand 249161

An Internship report on
DATA SCIENCE

Submitted By

Sushil Meher
[ Roll No : 21134501032 ]
[ B.Tech (C.S.E) VIIth ]

Under the supervision and guidance of


Dr. Prem Nath
Professor, Dept. of Computer Science & Engineering, School of Engineering and Technology

Conducted at ‘UNIFIED MENTOR’

In partial fulfilment of the requirements for the award of the Degree of


Bachelor of Technology
Session 2024-2025
STUDENT DECLARATION

I, SUSHIL MEHER, bearing the roll number 21134501032, a student of Computer
Science at Hemvati Nandan Bahuguna Garhwal University (A Central University),
Srinagar, submit this work for the award of the Bachelor of Technology degree in
COMPUTER SCIENCE and declare that the work done is genuine and was produced
under the guidance of Prof. M P Thapliyal, Department of Computer Science and
Engineering, Hemvati Nandan Bahuguna Garhwal University.

Date-17/12/2024 SIGNATURE
Sushil Meher

CERTIFICATE

This is to certify that this mini-project report submitted by SUSHIL MEHER,
bearing the roll no. 21134501032, is a bonafide record of the work carried out by
him in partial fulfilment of the requirements for the award of the BACHELOR OF
COMPUTER SCIENCE AND ENGINEERING degree from Hemvati Nandan Bahuguna
Garhwal University (A Central University) at Srinagar (Garhwal), Uttarakhand.

ACKNOWLEDGEMENT

We would like to express our deepest gratitude to everyone whose help and
kindness made the completion of this project possible. We would like to begin by
offering our sincerest gratitude to Prof. M P Thapliyal, Department of
Computer Science and Engineering, Hemvati Nandan Bahuguna Garhwal
University (A Central University), Srinagar (Garhwal), Uttarakhand, our
project instructor.

The completion of this report would not have been possible without his
expertise and invaluable guidance at every phase of the work.

CERTIFICATE

ABSTRACT

Present-day computer applications require the representation of huge amounts of
complex knowledge and data in programs and thus require tremendous amounts of
work. Our ability to program computers falls short of the demand for
applications. If computers are endowed with the ability to learn, then our
burden of coding the machine is eased (or at least reduced). This is particularly
true for developing expert systems, where the "bottleneck" is to extract the
expert's knowledge and feed it to computers. Present-day computer programs in
general (with the exception of some Machine Learning programs) cannot correct
their own errors, improve from past mistakes, or learn to perform a new task by
analogy to a previously seen task. In contrast, human beings are capable of all
of the above. Machine Learning will produce smarter computers capable of such
intelligent behavior.

The area of Machine Learning deals with the design of programs that can learn
rules from data, adapt to changes, and improve performance with experience. In
addition to being one of the initial dreams of Computer Science, Machine
Learning has become crucial as computers are expected to solve increasingly
complex problems and become more integrated into our daily lives. This is a
hard problem, since making a machine learn from its computational tasks
requires work at several levels, and complexities and ambiguities arise at each of
those levels.

So, here we study how machine learning takes place and what its methods are,
discuss the various projects implemented during the training and their
applications, and consider the present and future status of machine learning.

CHAPTER 1: INTRODUCTION

Training in data science with artificial intelligence (AI) and machine learning
(ML) is an exciting and dynamic field that equips individuals with the skills to
extract valuable insights, automate decision-making processes, and unlock the
potential of data. This training encompasses a wide array of knowledge and
practical expertise.

Data science involves collecting, cleaning, and analyzing data to derive
actionable insights. AI and ML are subsets of data science that focus on creating
algorithms and models capable of learning patterns from data and making
predictions or decisions. These technologies are used in diverse applications,
from recommendation systems in e-commerce to predictive maintenance in
manufacturing.

A comprehensive data science, AI, and ML training program typically covers
fundamental concepts such as data manipulation, statistical analysis, and
programming in languages like Python. It delves into advanced topics like deep
learning, natural language processing, and reinforcement learning.

Hands-on experience is vital in this training, involving real-world projects,
model development, and evaluation. Familiarity with popular libraries and tools
like TensorFlow, scikit-learn, and Jupyter notebooks is essential.

Moreover, understanding the ethical implications, such as bias and fairness, is
increasingly crucial in data science with AI and ML. Interdisciplinary skills in
communication and domain knowledge enhance the effectiveness of data-driven
solutions.

TECHNICAL TRAINING PLATFORM

VS Code

Visual Studio Code (VS Code) is increasingly important in data science due to
its versatility, ease of use, and extensive extension ecosystem. Data scientists
can leverage VS Code for several critical tasks. It supports various
programming languages commonly used in data science, such as Python and R,
making it a unified environment for coding, data manipulation, and analysis. VS
Code's extensions enable integration with Jupyter notebooks, version control
systems, and data visualization libraries. It offers a streamlined interface for
writing code, running experiments, and collaborating with teams, making it an
invaluable tool for data scientists seeking efficiency and productivity in their
workflow.

Jupyter Notebook

Jupyter Notebook is indispensable in data science due to its interactive and
collaborative nature. It provides an interactive environment where data
scientists can blend code, data, visualizations, and narrative explanations
seamlessly. This flexibility is crucial for exploratory data analysis, modeling,
and sharing insights. Jupyter Notebook supports various programming
languages, with Python being the most popular. It facilitates reproducibility by
allowing researchers to document their workflow step by step. Moreover, it's
instrumental in education, enabling instructors to teach data science concepts
effectively. Its ability to create and share interactive reports makes it a vital tool
for data scientists, researchers, and educators across various domains.

Kaggle

Kaggle is a pivotal platform in the data science community, offering several
vital contributions. It hosts data science competitions that push the boundaries
of innovation, allowing practitioners to apply their skills to real-world
challenges. Kaggle Kernels provide a collaborative environment for code
sharing and learning. Datasets and notebooks shared by the community facilitate
knowledge sharing and learning. Its Learn section provides extensive resources
and courses on data science topics. For those entering the field, Kaggle serves as
a practical, hands-on learning playground, while for experienced practitioners,
it's a hub for showcasing expertise and collaborating on impactful projects.
CHAPTER 2: DATA SCIENCE WITH AI AND ML

Data Science, Artificial Intelligence (AI), and Machine Learning (ML) are
transformative fields that have revolutionized how businesses, organizations,
and researchers analyze and extract insights from data. In this introduction, we'll
explore the fundamental concepts and their interplay in these domains.

1. Data Science: Data Science is an interdisciplinary field that combines
domain knowledge, statistics, programming, and data analysis to extract
valuable insights and knowledge from data. It involves collecting, cleaning, and
transforming data, followed by the application of various techniques to uncover
patterns, trends, and correlations.

2. Machine Learning (ML): Machine Learning is a subset of AI that focuses
on creating algorithms and models that enable computers to learn from data and
make predictions or decisions without explicit programming. ML algorithms are
categorized into supervised, unsupervised, and reinforcement learning,
depending on the learning process.

3. Artificial Intelligence (AI): Artificial Intelligence is a broader field that
encompasses the development of intelligent agents capable of performing tasks
that typically require human intelligence. AI can involve rule-based systems,
expert systems, natural language processing, and computer vision in addition to
Machine Learning.

2.1 Several Aspects of Data Science

When undertaking a data science project, one should keep several aspects in mind, such as:
● Data Collection: Gathering relevant data from various sources, such as
databases, APIs, and sensors.
● Data Cleaning and Preprocessing: Ensuring data is accurate, complete, and
ready for analysis.
● Exploratory Data Analysis (EDA): Examining data visually and statistically
to discover patterns.
● Feature Engineering: Selecting or creating relevant variables for analysis.
● Machine Learning: Building predictive models and making data-driven
decisions.
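The aspects listed above can be sketched end-to-end with pandas. The tiny DataFrame, the column names, and the pass mark of 40 below are illustrative assumptions, not data from the training:

```python
import pandas as pd

# Illustrative data standing in for a collected dataset
df = pd.DataFrame({
    "subject": ["CS", "CS", "Math", "Math", "CS", "Math"],
    "marks": [35, 42, None, 51, 48, 42],
})

# Data cleaning: fill the missing mark with the column mean
df["marks"] = df["marks"].fillna(df["marks"].mean())

# Exploratory data analysis: summary statistics per subject
print(df.groupby("subject")["marks"].agg(["mean", "count"]))

# Feature engineering: derive a new variable for later modelling
df["passed"] = df["marks"] >= 40
print(df["passed"].sum())
```

Each comment corresponds to one aspect in the list; a real project would follow this with a machine-learning step on the engineered features.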

CHAPTER 3: HARDWARE AND SOFTWARE
REQUIREMENT

HARDWARE REQUIRED:

1. Pentium 4, Windows XP/Windows 7: The minimum system requirements
for learning data science include a Pentium 4 processor, Windows XP or
Windows 7, and 256 MB of RAM. These specifications provide basic
functionality but may limit performance for more advanced data science
tasks.
2. 256 MB RAM: A minimum of 256 MB of RAM is required for basic
data science learning, providing enough memory for running lightweight
applications and simple data processing tasks, though performance may
be constrained for more resource-intensive operations.

SOFTWARE REQUIRED:

1. Windows XP/7: These older operating systems provide a basic
environment for learning data science, though they may have limitations
in running modern software and handling large datasets.
2. VSCode: Visual Studio Code is a lightweight, versatile IDE that supports
multiple programming languages and extensions, making it ideal for data
science development and Python programming.
3. Jupyter Notebook: Jupyter Notebook is an interactive environment for
writing and running Python code, ideal for data analysis, visualization,
and creating reproducible research.
4. Matplotlib: Matplotlib is a powerful Python library for creating static,
animated, and interactive visualizations, commonly used in data science
for plotting charts and graphs.
5. IDE: An Integrated Development Environment (IDE) provides essential
tools like code editing, debugging, and execution, streamlining the coding
workflow for data science tasks.

CHAPTER 4: TOOLS

4.1 Introduction
Fundamental tools in data science serve as the cornerstone for various tasks in
data analysis and machine learning. Python, a versatile programming language,
is the linchpin of the data science toolkit. It's complemented by Jupyter
Notebook, an interactive environment perfect for data exploration and
documentation. Pandas, a robust library, takes center stage in data manipulation
and analysis, particularly suited for structured data. Data visualization is
achieved through Matplotlib, a versatile plotting library, and Seaborn, which
simplifies creating appealing statistical graphics. For machine learning
endeavors, Scikit-Learn provides essential algorithms and tools, making it
accessible for beginners and powerful for experts. Version control is essential,
and Git is the industry standard for tracking code changes and collaboration.
These core tools empower data scientists to clean, explore, and analyze data, as
well as develop machine learning models. While more specialized tools may be
necessary for certain projects, these foundational tools remain indispensable and
are the starting point for anyone venturing into the field of data science.

4.2 Features
Data science involves collecting and analyzing data to derive insights,
employing machine learning for predictions, and using visualization to
communicate results.

CHAPTER 5: PYTHON

5.1 Introduction

Python is a versatile and widely used programming language in the field of data
science. Its rich ecosystem of libraries and tools, such as NumPy, Pandas,
Matplotlib, Seaborn, Scikit-Learn, and more, makes it a popular choice for data
analysis, manipulation, visualization, and machine learning. Python's simplicity
and readability make it accessible to both beginners and experienced data
scientists, enabling them to work with structured and unstructured data, develop
predictive models, and create informative data visualizations. Python's strong
community support, extensive documentation, and active development continue
to drive its prominence as the go-to language for data science projects.

5.2 Why Python?

➢ Rich Ecosystem: Python boasts a vast ecosystem of libraries and
frameworks tailored for data manipulation, analysis, and machine
learning, such as Pandas, NumPy, Matplotlib, and Scikit-Learn.

➢ Ease of Learning: Python's clean and straightforward syntax makes it
accessible to beginners, allowing for a smooth learning curve.

➢ Cross-Platform Compatibility: Python runs on various platforms,
making it versatile for different operating systems.

➢ Integration: Python seamlessly integrates with other languages like C
and Java, making it a preferred choice for integrating data science
solutions into existing applications.

➢ Open Source: Python is open-source, making it cost-effective and
allowing for extensive customization and collaboration.

Operators

Python supports a variety of operators, including arithmetic (+, -, *, /),
comparison (==, !=, <, >, <=, >=), logical (and, or, not), assignment (=), and
more.
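A few of these operators in action:

```python
a, b = 7, 3

print(a + b, a - b, a * b, a / b)   # arithmetic operators
print(a > b, a == b, a != b)        # comparison operators
print(a > 0 and b > 0, not False)   # logical operators
a += 1                              # assignment combined with addition
print(a)
```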

Data Types

Python has several built-in data types, such as:

● int: for integers (e.g., 5)

● float: for floating-point numbers (e.g., 3.14)

● str: for strings (e.g., "Hello, World!")

● bool: for Boolean values (True or False)

● list: for ordered, mutable sequences

● tuple: for ordered, immutable sequences

● dict: for key-value mappings

● set: for unordered collections of unique elements
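One value of each built-in type, in a short sketch:

```python
count = 5                    # int
pi = 3.14                    # float
greeting = "Hello, World!"   # str
flag = True                  # bool
scores = [10, 20, 30]        # list: ordered and mutable
point = (1, 2)               # tuple: ordered and immutable
ages = {"Alice": 30}         # dict: key-value mapping
unique = {1, 2, 2, 3}        # set: duplicates are dropped

scores.append(40)            # lists can be modified in place
print(len(unique), scores, ages["Alice"])
```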

Variables:

Variables are used to store data. In Python, you can create a variable by
assigning a value to a name, like x = 10.

Conditional Statements:

Conditional statements allow you to make decisions in your code using if, elif
(else if), and else.

For example:

if x > 10:
    print("x is greater than 10")
elif x == 10:
    print("x is equal to 10")
else:
    print("x is less than 10")

Loops:

Python supports for and while loops. For example:

for i in range(5):
    print(i)

x = 0
while x < 5:
    print(x)
    x += 1

Functions:

Functions allow you to group code into reusable blocks. You can define a
function using the def keyword. For example:

def greet(name):
    return f"Hello, {name}!"

message = greet("Alice")
print(message)

CHAPTER 6: STATISTICS

6.1 What is Statistics?

Statistics is a branch of mathematics focused on collecting, organizing,
analyzing, interpreting, and presenting data. It encompasses a range of
techniques for summarizing data, assessing relationships, and drawing
meaningful conclusions from information.

Statistics plays a vital role in various fields, including science, economics, and
the social sciences, enabling researchers and analysts to make informed decisions,
test hypotheses, and build predictive models based on empirical evidence and
data patterns. In data science, statistics forms the basis for deriving actionable
insights from large datasets.

6.2 Role of Statistics

➢ Descriptive Statistics: Summarizes data using measures like mean, median,
and variance.

➢ Inferential Statistics: Makes predictions and inferences about populations
based on samples.

➢ Probability Distributions: Models data characteristics using distributions
like the normal and binomial.

➢ Hypothesis Testing: Determines if observed differences are statistically
significant.

➢ Regression Analysis: Models relationships between variables and makes
predictions.

➢ ANOVA: Compares group means, useful in experiments.

➢ Non-parametric Statistics: Suitable for non-standard data distributions.

➢ Bayesian Statistics (Optional): Deals with uncertainty and probabilistic
modeling.

➢ Time Series Analysis (If applicable): Models and forecasts time-dependent
data.

➢ Statistical Software: Proficiency in R or Python for analysis.
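As a small illustration of the descriptive and inferential roles above, the sketch below summarizes made-up marks with Python's standard statistics module and then forms a rough 95% confidence interval for the population mean; the 1.96 factor is the usual normal-approximation critical value:

```python
import statistics as st

sample = [62, 71, 58, 66, 74, 69, 61, 70]  # made-up marks

# Descriptive statistics: summarize the sample itself
mean = st.mean(sample)
median = st.median(sample)
spread = st.stdev(sample)
print(mean, median, round(spread, 2))

# Inferential statistics (informal): estimate the population mean
# with a normal-approximation 95% confidence interval
margin = 1.96 * spread / len(sample) ** 0.5
print(round(mean - margin, 2), round(mean + margin, 2))
```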

6.3 Descriptive Statistics

Mode

It is the number that occurs most frequently in the data series. It is robust and
is not generally affected much by the addition of a couple of new values.

Code

import pandas as pd

data = pd.read_csv("Mode.csv")      # read data from the CSV file
data.head()                         # show the first five rows
mode_data = data['Subject'].mode()  # mode of the Subject column
print(mode_data)

Mean

It is the average of all the values in the data series.

import pandas as pd

data = pd.read_csv("mean.csv")           # read data from the CSV file
data.head()                              # show the first five rows
mean_data = data['Overallmarks'].mean()  # mean of the Overallmarks column
print(mean_data)

Median

It is the middle value of the sorted data set.

import pandas as pd

data = pd.read_csv("data.csv")               # read data from the CSV file
data.head()                                  # show the first five rows
median_data = data['Overallmarks'].median()  # median of the Overallmarks column
print(median_data)

6.4 Probability Distribution

In data science, a probability distribution is a mathematical representation of
how likely different values or outcomes are in a dataset or random process.
These distributions describe the inherent uncertainty in data and help make
data-driven decisions. Common distributions include the normal, binomial, and
Poisson distributions, each with its unique characteristics.

Parameters, such as the mean and standard deviation, define these distributions'
shapes. Data scientists use probability distributions to model data, conduct
hypothesis testing, make predictions, and simulate scenarios. The Central Limit
Theorem is crucial, as it asserts that the sample mean of sufficiently large
samples follows a normal distribution, underpinning many statistical techniques
and analyses in data science. Understanding probability distributions is essential
for harnessing the power of data.
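The Central Limit Theorem mentioned above can be illustrated numerically with nothing but the standard library: a single die roll is uniform on 1 to 6 with mean 3.5, yet the means of many 50-roll samples cluster tightly around 3.5 in a roughly normal shape:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def sample_mean(n):
    """Mean of n simulated die rolls (uniform on 1..6)."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Means of 2000 samples of 50 rolls each: by the Central Limit
# Theorem these are approximately normal, centered near 3.5
means = [sample_mean(50) for _ in range(2000)]
grand_mean = sum(means) / len(means)
print(round(grand_mean, 2))
```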

CHAPTER 7: GRAPHS

7.1 What is a Graph?

Graphs in data science refer to visual representations of data that help analysts
and data scientists better understand patterns, relationships, and insights within
datasets. They are powerful tools for data exploration, communication, and
analysis.

7.2 Types of Graphs

Here are some common types of graphs used in data science:

➢ Bar Charts: Bar charts are used to display and compare categorical data.
They represent categories on one axis and the corresponding values on the
other, typically using vertical or horizontal bars.

➢ Histograms: Histograms are used to visualize the distribution of continuous
data. They group data into bins and display the frequency or density of values
within each bin.

➢ Scatter Plots: Scatter plots show individual data points as dots on a
two-dimensional plane. They are used to visualize the relationship between two
continuous variables and identify patterns or correlations.

➢ Line Charts: Line charts display data points as connected lines, often used
to show trends or changes in data over time.

➢ Pie Charts: Pie charts represent parts of a whole, where each slice
corresponds to a percentage of the total. They are used to visualize the
composition of a dataset.

FINAL PROJECT

SNAPSHOTS:

Reading the CSV file and computing the mean and standard deviation:

Splitting the data into training and testing sets:

Helper function:

Gradient descent and the normal equation:

Regularization parameter:

OUTPUT:

OUTPUT:

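Since the snapshots are screenshots and do not reproduce here, the following is only a sketch of the pipeline they describe — standardizing with the mean and standard deviation, splitting into training and testing sets, gradient descent, and an L2 regularization parameter — applied to simple linear regression on made-up data. The actual project code, dataset, and hyperparameters may differ:

```python
import random

random.seed(0)

# Made-up data standing in for the project's CSV: y ≈ 3x + 4 plus noise
xs = [i / 10 for i in range(50)]
ys = [3 * x + 4 + random.gauss(0, 0.1) for x in xs]

# Standardize x using its mean and standard deviation
mean_x = sum(xs) / len(xs)
std_x = (sum((x - mean_x) ** 2 for x in xs) / len(xs)) ** 0.5
xs = [(x - mean_x) / std_x for x in xs]

# Split the data into training and testing sets (80/20)
split = int(0.8 * len(xs))
x_train, x_test = xs[:split], xs[split:]
y_train, y_test = ys[:split], ys[split:]

# Gradient descent on mean squared error with an L2 regularization term
w, b, lr, lam = 0.0, 0.0, 0.1, 0.01
n = len(x_train)
for _ in range(500):
    grad_w = sum((w * x + b - y) * x for x, y in zip(x_train, y_train)) * 2 / n
    grad_b = sum(w * x + b - y for x, y in zip(x_train, y_train)) * 2 / n
    w -= lr * (grad_w + lam * w)   # the regularization term shrinks the weight
    b -= lr * grad_b

# Evaluate on the held-out test set
test_mse = sum((w * x + b - y) ** 2 for x, y in zip(x_test, y_test)) / len(x_test)
print(round(w, 2), round(b, 2), round(test_mse, 3))
```

The same fit could be obtained in closed form via the normal equation; gradient descent is shown because the snapshots name both approaches.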
CONCLUSION

In conclusion, training in data science equips individuals with valuable skills to
extract insights and make informed decisions from data. It encompasses a wide
range of topics, including data collection, cleaning, analysis, machine learning,
and data visualization. A well-rounded data science education often involves
learning programming languages like Python, mastering essential libraries and
tools, and gaining expertise in statistical and machine learning techniques. Data
science training also emphasizes the importance of critical thinking and
problem-solving, as well as effective communication of findings to
stakeholders. As the demand for data-driven insights continues to grow across
industries, data science training provides a pathway to exciting and rewarding
career opportunities. Ultimately, data science training is an ongoing journey, as
the field evolves rapidly with emerging technologies and new data challenges.
Continuous learning and staying up-to-date with the latest developments are key
to success in this dynamic and high-demand profession.

FUTURE SCOPE

The future of data science, AI, and ML is poised for significant expansion and
impact. AI and ML will increasingly underpin advanced applications in various
sectors, from healthcare to finance. As data generation continues to explode,
data scientists will be at the forefront of managing and deriving insights from
big data. Ethical considerations surrounding AI fairness, transparency, and
accountability will become even more crucial.

Furthermore, edge computing will gain prominence, enabling real-time data
processing closer to the source, particularly with the proliferation of IoT
devices. Automation tools, like AutoML, will democratize data science,
allowing more individuals to harness its power. Interdisciplinary collaboration
will become the norm, as domain experts work hand-in-hand with data scientists
to apply AI solutions effectively in their fields.

Security and privacy challenges will intensify with increased reliance on AI and
data, demanding innovative solutions. Research and development will continue
to drive breakthroughs in AI algorithms and architectures. In summary, the
future of data science, AI, and ML holds immense potential for transforming
industries, improving decision-making, and shaping a technologically advanced
future.

REFERENCES

➢ https://www.kaggle.com/learn/overview
➢ https://www.edx.org/micromasters/data-science
➢ https://www.fast.ai/
➢ https://towardsdatascience.com/
➢ https://www.youtube.com/user/joshstarmer
➢ https://github.com/josephmisiti/awesome-machine-learning
➢ https://github.com/campusx-official/book-recommender-system/commit/678c7ab5a67adfcafaadf5b2924e4d04acafe9ac#diff5983284b94671de74632c367234334917d7e2de10e4be9c255afb37e33f5352e
➢ https://www.youtube.com/user/sentdex
➢ https://www.coursera.org/specializations/deep-learning
➢ https://github.com/ChristosChristofidis/awesome-deep-learning
➢ https://youtu.be/1YoD0fg3_EM?feature=shared
