
DEPARTMENT OF COMPUTER SCIENCE ENGINEERING

School of Engineering and Technology


Hemvati Nandan Bahuguna Garhwal University
(A Central University) Srinagar Garhwal, Uttarakhand 249161

An Internship report on
DATA SCIENCE

Submitted By

Sushil Meher
[ Roll No : 21134501032 ]
[ B.Tech (C.S.E) VIIth ]

Under the supervision and guidance of


Dr. Prem Nath
Professor, Dept. of Computer Science & Engineering, School of Engineering and Technology

Conducted at ‘UNIFIED MENTOR’

In partial fulfilment of the requirements for the award of the Degree of


Bachelor of Technology
Session 2024-2025
STUDENT DECLARATION

I, SUSHIL MEHER, bearing the roll number 21134501032, a student of Computer
Science at Hemvati Nandan Bahuguna Garhwal University (A Central University),
Srinagar, submit this work for the award of the Bachelor of Technology degree in
COMPUTER SCIENCE and declare that the work done is genuine and was produced
under the guidance of Prof. M P Thapliyal, Department of Computer Science and
Engineering, Hemvati Nandan Bahuguna Garhwal University.

Date-17/12/2024 SIGNATURE
Sushil Meher

CERTIFICATE

This is to certify that this mini-project report submitted by SUSHIL MEHER,
bearing the roll no. 21134501032, is a bonafide record of the work carried out by
him in partial fulfilment of the requirements for the award of the BACHELOR OF
COMPUTER SCIENCE AND ENGINEERING degree from Hemvati Nandan Bahuguna
Garhwal University (A Central University) at Srinagar (Garhwal), Uttarakhand.

ACKNOWLEDGEMENT

We would like to express our deepest gratitude to everyone whose help and
kindness made the completion of this project possible. We would like to begin by
offering our sincerest gratitude to Prof. M P Thapliyal, Department of
Computer Science and Engineering, Hemvati Nandan Bahuguna Garhwal
University (A Central University), Srinagar (Garhwal), Uttarakhand, our
project instructor.

The completion of this report would not have been possible without his
expertise and invaluable guidance at every phase of the work.

CERTIFICATE

ABSTRACT

Present-day computer applications require the representation of huge amounts of
complex knowledge and data in programs and thus require tremendous amounts of
work. Our ability to program computers falls short of the demand for
applications. If computers are endowed with the ability to learn, then our
burden of coding the machine is eased (or at least reduced). This is particularly
true for developing expert systems, where the "bottleneck" is to extract the
expert's knowledge and feed it to computers. Present-day computer programs in
general (with the exception of some Machine Learning programs) cannot correct
their own errors, improve from past mistakes, or learn to perform a new task by
analogy to a previously seen task. In contrast, human beings are capable of all
of the above. Machine Learning will produce smarter computers capable of such
intelligent behavior.

The area of Machine Learning deals with the design of programs that can learn
rules from data, adapt to changes, and improve performance with experience. In
addition to being one of the initial dreams of Computer Science, Machine
Learning has become crucial as computers are expected to solve increasingly
complex problems and become more integrated into our daily lives. This is a
hard problem, since making a machine learn from its computational tasks
requires work at several levels, and complexities and ambiguities arise at each of
those levels.

So, here we study how machine learning takes place and what its methods are,
discuss the various projects implemented during the training and their
applications, and consider the present and future status of machine learning.

CHAPTER 1: INTRODUCTION

Training in data science with artificial intelligence (AI) and machine learning
(ML) is an exciting and dynamic field that equips individuals with the skills to
extract valuable insights, automate decision-making processes, and unlock the
potential of data. This training encompasses a wide array of knowledge and
practical expertise.

Data science involves collecting, cleaning, and analyzing data to derive
actionable insights. AI and ML are subsets of data science that focus on creating
algorithms and models capable of learning patterns from data and making
predictions or decisions. These technologies are used in diverse applications,
from recommendation systems in e-commerce to predictive maintenance in
manufacturing.

A comprehensive data science, AI, and ML training program typically covers
fundamental concepts such as data manipulation, statistical analysis, and
programming in languages like Python. It delves into advanced topics like deep
learning, natural language processing, and reinforcement learning.

Hands-on experience is vital in this training, involving real-world projects,
model development, and evaluation. Familiarity with popular libraries and tools
like TensorFlow, scikit-learn, and Jupyter notebooks is essential.

Moreover, understanding the ethical implications, such as bias and fairness, is
increasingly crucial in data science with AI and ML. Interdisciplinary skills in
communication and domain knowledge enhance the effectiveness of data-driven
solutions.

TECHNICAL TRAINING PLATFORM

VS Code

Visual Studio Code (VS Code) is increasingly important in data science due to
its versatility, ease of use, and extensive extension ecosystem. Data scientists
can leverage VS Code for several critical tasks. It supports various
programming languages commonly used in data science, such as Python and R,
making it a unified environment for coding, data manipulation, and analysis. VS
Code's extensions enable integration with Jupyter notebooks, version control
systems, and data visualization libraries. It offers a streamlined interface for
writing code, running experiments, and collaborating with teams, making it an
invaluable tool for data scientists seeking efficiency and productivity in their
workflow.

Jupyter Notebook

Jupyter Notebook is indispensable in data science due to its interactive and
collaborative nature. It provides an interactive environment where data
scientists can blend code, data, visualizations, and narrative explanations
seamlessly. This flexibility is crucial for exploratory data analysis, modeling,
and sharing insights. Jupyter Notebook supports various programming
languages, with Python being the most popular. It facilitates reproducibility by
allowing researchers to document their workflow step by step. Moreover, it's
instrumental in education, enabling instructors to teach data science concepts
effectively. Its ability to create and share interactive reports makes it a vital tool
for data scientists, researchers, and educators across various domains.

Kaggle

Kaggle is a pivotal platform in the data science community, offering several
vital contributions. It hosts data science competitions that push the boundaries
of innovation, allowing practitioners to apply their skills to real-world
challenges. Kaggle Kernels provide a collaborative environment for code
sharing and learning. Datasets and notebooks shared by the community facilitate
knowledge sharing and learning. Its Learn section provides extensive resources
and courses on data science topics. For those entering the field, Kaggle serves as
a practical, hands-on learning playground, while for experienced practitioners,
it's a hub for showcasing expertise and collaborating on impactful projects.
CHAPTER 2: DATA SCIENCE WITH AI AND ML

Data Science, Artificial Intelligence (AI), and Machine Learning (ML) are
transformative fields that have revolutionized how businesses, organizations,
and researchers analyze and extract insights from data. In this introduction, we'll
explore the fundamental concepts and their interplay in these domains.

1. Data Science: Data Science is an interdisciplinary field that combines
domain knowledge, statistics, programming, and data analysis to extract
valuable insights and knowledge from data. It involves collecting, cleaning, and
transforming data, followed by the application of various techniques to uncover
patterns, trends, and correlations.

2. Machine Learning (ML): Machine Learning is a subset of AI that focuses
on creating algorithms and models that enable computers to learn from data and
make predictions or decisions without explicit programming. ML algorithms are
categorized into supervised, unsupervised, and reinforcement learning,
depending on the learning process.

3. Artificial Intelligence (AI): Artificial Intelligence is a broader field that
encompasses the development of intelligent agents capable of performing tasks
that typically require human intelligence. AI can involve rule-based systems,
expert systems, natural language processing, and computer vision in addition to
Machine Learning.

2.1 Several Aspects of Data Science

When undertaking a data science project, one should keep several aspects in mind, such as:
● Data Collection: Gathering relevant data from various sources, such as
databases, APIs, and sensors.
● Data Cleaning and Preprocessing: Ensuring data is accurate, complete, and
ready for analysis.
● Exploratory Data Analysis (EDA): Examining data visually and statistically
to discover patterns.
● Feature Engineering: Selecting or creating relevant variables for analysis.
● Machine Learning: Building predictive models and making data-driven
decisions.
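The aspects listed above can be sketched end-to-end with pandas. The tiny DataFrame, the column names, and the pass mark of 40 below are illustrative assumptions, not data from the training:

```python
import pandas as pd

# Illustrative data standing in for a collected dataset
df = pd.DataFrame({
    "subject": ["CS", "CS", "Math", "Math", "CS", "Math"],
    "marks": [35, 42, None, 51, 48, 42],
})

# Data cleaning: fill the missing mark with the column mean
df["marks"] = df["marks"].fillna(df["marks"].mean())

# Exploratory data analysis: summary statistics per subject
print(df.groupby("subject")["marks"].agg(["mean", "count"]))

# Feature engineering: derive a new variable for later modelling
df["passed"] = df["marks"] >= 40
print(df["passed"].sum())
```

Each comment corresponds to one aspect in the list; a real project would follow this with a machine-learning step on the engineered features.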

CHAPTER 3: HARDWARE AND SOFTWARE
REQUIREMENT

HARDWARE REQUIRED:

1. Pentium 4, Windows XP/Windows 7: The minimum system requirements
for learning data science include a Pentium 4 processor, Windows XP or
Windows 7, and 256 MB of RAM. These specifications provide basic
functionality but may limit performance for more advanced data science
tasks.
2. 256 MB RAM: A minimum of 256 MB of RAM is required for basic
data science learning, providing enough memory for running lightweight
applications and simple data processing tasks, though performance may
be constrained for more resource-intensive operations.

SOFTWARE REQUIRED:

1. Windows XP/7: These older operating systems provide a basic
environment for learning data science, though they may have limitations
in running modern software and handling large datasets.
2. VSCode: Visual Studio Code is a lightweight, versatile IDE that supports
multiple programming languages and extensions, making it ideal for data
science development and Python programming.
3. Jupyter Notebook: Jupyter Notebook is an interactive environment for
writing and running Python code, ideal for data analysis, visualization,
and creating reproducible research.
4. Matplotlib: Matplotlib is a powerful Python library for creating static,
animated, and interactive visualizations, commonly used in data science
for plotting charts and graphs.
5. IDE: An Integrated Development Environment (IDE) provides essential
tools like code editing, debugging, and execution, streamlining the coding
workflow for data science tasks.

CHAPTER 4: TOOLS

4.1 Introduction
Fundamental tools in data science serve as the cornerstone for various tasks in
data analysis and machine learning. Python, a versatile programming language,
is the linchpin of the data science toolkit. It's complemented by Jupyter
Notebook, an interactive environment perfect for data exploration and
documentation. Pandas, a robust library, takes center stage in data manipulation
and analysis, particularly suited for structured data. Data visualization is
achieved through Matplotlib, a versatile plotting library, and Seaborn, which
simplifies creating appealing statistical graphics. For machine learning
endeavors, Scikit-Learn provides essential algorithms and tools, making it
accessible for beginners and powerful for experts. Version control is essential,
and Git is the industry standard for tracking code changes and collaboration.
These core tools empower data scientists to clean, explore, and analyze data, as
well as develop machine learning models. While more specialized tools may be
necessary for certain projects, these foundational tools remain indispensable and
are the starting point for anyone venturing into the field of data science.

4.2 Features
Data science involves collecting and analyzing data to derive insights,
employing machine learning for predictions, and using visualization to
communicate results.

CHAPTER 5: PYTHON

5.1 Introduction

Python is a versatile and widely used programming language in the field of data
science. Its rich ecosystem of libraries and tools, such as NumPy, Pandas,
Matplotlib, Seaborn, Scikit-Learn, and more, makes it a popular choice for data
analysis, manipulation, visualization, and machine learning. Python's simplicity
and readability make it accessible to both beginners and experienced data
scientists, enabling them to work with structured and unstructured data, develop
predictive models, and create informative data visualizations. Python's strong
community support, extensive documentation, and active development continue
to drive its prominence as the go-to language for data science projects.

5.2 Why Python?

➢ Rich Ecosystem: Python boasts a vast ecosystem of libraries and
frameworks tailored for data manipulation, analysis, and machine
learning, such as Pandas, NumPy, Matplotlib, and Scikit-Learn.

➢ Ease of Learning: Python's clean and straightforward syntax makes it
accessible to beginners, allowing for a smooth learning curve.

➢ Cross-Platform Compatibility: Python runs on various platforms,
making it versatile for different operating systems.

➢ Integration: Python seamlessly integrates with other languages like C
and Java, making it a preferred choice for integrating data science
solutions into existing applications.

➢ Open Source: Python is open-source, making it cost-effective and
allowing for extensive customization and collaboration.

Operators

Python supports a variety of operators, including arithmetic (+, -, *, /),
comparison (==, !=, <, >, <=, >=), logical (and, or, not), assignment (=), and
more.
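A few of these operators in action:

```python
a, b = 7, 3

print(a + b, a - b, a * b, a / b)   # arithmetic operators
print(a > b, a == b, a != b)        # comparison operators
print(a > 0 and b > 0, not False)   # logical operators
a += 1                              # assignment combined with addition
print(a)
```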

Data Types

Python has several built-in data types, such as:

● int: for integers (e.g., 5)

● float: for floating-point numbers (e.g., 3.14)

● str: for strings (e.g., "Hello, World!")

● bool: for Boolean values (True or False)

● list: for ordered, mutable sequences

● tuple: for ordered, immutable sequences

● dict: for key-value mappings

● set: for unordered collections of unique elements
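One value of each built-in type, in a short sketch:

```python
count = 5                    # int
pi = 3.14                    # float
greeting = "Hello, World!"   # str
flag = True                  # bool
scores = [10, 20, 30]        # list: ordered and mutable
point = (1, 2)               # tuple: ordered and immutable
ages = {"Alice": 30}         # dict: key-value mapping
unique = {1, 2, 2, 3}        # set: duplicates are dropped

scores.append(40)            # lists can be modified in place
print(len(unique), scores, ages["Alice"])
```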

Variables:

Variables are used to store data. In Python, you can create a variable by
assigning a value to a name, like x = 10.

Conditional Statements:

Conditional statements allow you to make decisions in your code using if, elif
(else if), and else.

For example:

if x > 10:
    print("x is greater than 10")
elif x == 10:
    print("x is equal to 10")
else:
    print("x is less than 10")

Loops:

Python supports for and while loops. For example:

for i in range(5):
    print(i)

x = 0
while x < 5:
    print(x)
    x += 1

Functions:

Functions allow you to group code into reusable blocks. You can define a
function using the def keyword. For example:

def greet(name):
    return f"Hello, {name}!"

message = greet("Alice")
print(message)

CHAPTER 6: STATISTICS

6.1 What is Statistics?

Statistics is a branch of mathematics focused on collecting, organizing,
analyzing, interpreting, and presenting data. It encompasses a range of
techniques for summarizing data, assessing relationships, and drawing
meaningful conclusions from information.

Statistics plays a vital role in various fields, including science, economics, and
the social sciences, enabling researchers and analysts to make informed decisions,
test hypotheses, and build predictive models based on empirical evidence and
data patterns. In data science, statistics forms the basis for deriving actionable
insights from large datasets.

6.2 Role of Statistics

➢ Descriptive Statistics: Summarizes data using measures like mean, median,
and variance.

➢ Inferential Statistics: Makes predictions and inferences about populations
based on samples.

➢ Probability Distributions: Models data characteristics using distributions
like the normal and binomial.

➢ Hypothesis Testing: Determines if observed differences are statistically
significant.

➢ Regression Analysis: Models relationships between variables and makes
predictions.

➢ ANOVA: Compares group means, useful in experiments.

➢ Non-parametric Statistics: Suitable for non-standard data distributions.

➢ Bayesian Statistics (Optional): Deals with uncertainty and probabilistic
modeling.

➢ Time Series Analysis (If applicable): Models and forecasts time-dependent
data.

➢ Statistical Software: Proficiency in R or Python for analysis.
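As a small illustration of the descriptive and inferential roles above, the sketch below summarizes made-up marks with Python's standard statistics module and then forms a rough 95% confidence interval for the population mean; the 1.96 factor is the usual normal-approximation critical value:

```python
import statistics as st

sample = [62, 71, 58, 66, 74, 69, 61, 70]  # made-up marks

# Descriptive statistics: summarize the sample itself
mean = st.mean(sample)
median = st.median(sample)
spread = st.stdev(sample)
print(mean, median, round(spread, 2))

# Inferential statistics (informal): estimate the population mean
# with a normal-approximation 95% confidence interval
margin = 1.96 * spread / len(sample) ** 0.5
print(round(mean - margin, 2), round(mean + margin, 2))
```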

6.3 Descriptive Statistics

Mode

It is the number that occurs most frequently in the data series. It is robust and
is not generally affected much by the addition of a couple of new values.

Code

import pandas as pd

data = pd.read_csv("Mode.csv")      # read data from the CSV file
data.head()                         # show the first five rows
mode_data = data['Subject'].mode()  # mode of the Subject column
print(mode_data)

Mean

It is the average of all the values in the data series.

import pandas as pd

data = pd.read_csv("mean.csv")           # read data from the CSV file
data.head()                              # show the first five rows
mean_data = data['Overallmarks'].mean()  # mean of the Overallmarks column
print(mean_data)

Median

It is the middle value of the sorted data set.

import pandas as pd

data = pd.read_csv("data.csv")               # read data from the CSV file
data.head()                                  # show the first five rows
median_data = data['Overallmarks'].median()  # median of the Overallmarks column
print(median_data)

6.4 Probability Distribution

In data science, a probability distribution is a mathematical representation of
how likely different values or outcomes are in a dataset or random process.
These distributions describe the inherent uncertainty in data and help make
data-driven decisions. Common distributions include the normal, binomial, and
Poisson distributions, each with its unique characteristics.

Parameters, such as the mean and standard deviation, define these distributions'
shapes. Data scientists use probability distributions to model data, conduct
hypothesis testing, make predictions, and simulate scenarios. The Central Limit
Theorem is crucial, as it asserts that the sample mean of sufficiently large
samples follows a normal distribution, underpinning many statistical techniques
and analyses in data science. Understanding probability distributions is essential
for harnessing the power of data.
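The Central Limit Theorem mentioned above can be illustrated numerically with nothing but the standard library: a single die roll is uniform on 1 to 6 with mean 3.5, yet the means of many 50-roll samples cluster tightly around 3.5 in a roughly normal shape:

```python
import random

random.seed(42)  # fixed seed so the illustration is reproducible

def sample_mean(n):
    """Mean of n simulated die rolls (uniform on 1..6)."""
    return sum(random.randint(1, 6) for _ in range(n)) / n

# Means of 2000 samples of 50 rolls each: by the Central Limit
# Theorem these are approximately normal, centered near 3.5
means = [sample_mean(50) for _ in range(2000)]
grand_mean = sum(means) / len(means)
print(round(grand_mean, 2))
```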

CHAPTER 7: GRAPHS

7.1 What is a Graph?

Graphs in data science refer to visual representations of data that help analysts
and data scientists better understand patterns, relationships, and insights within
datasets. They are powerful tools for data exploration, communication, and
analysis.

7.2 Types of Graphs

Here are some common types of graphs used in data science:

➢ Bar Charts: Bar charts are used to display and compare categorical data.
They represent categories on one axis and the corresponding values on the
other, typically using vertical or horizontal bars.

➢ Histograms: Histograms are used to visualize the distribution of continuous
data. They group data into bins and display the frequency or density of values
within each bin.

➢ Scatter Plots: Scatter plots show individual data points as dots on a
two-dimensional plane. They are used to visualize the relationship between two
continuous variables and identify patterns or correlations.

➢ Line Charts: Line charts display data points as connected lines, often used
to show trends or changes in data over time.

➢ Pie Charts: Pie charts represent parts of a whole, where each slice
corresponds to a percentage of the total. They are used to visualize the
composition of a dataset.

FINAL PROJECT

SNAPSHOTS:

Reading the CSV file and computing the mean and standard deviation:

Splitting the data into training and testing sets:

Helper function:

Gradient descent and the normal equation:

Regularization parameter:

OUTPUT:

OUTPUT:

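Since the snapshots are screenshots and do not reproduce here, the following is only a sketch of the pipeline they describe — standardizing with the mean and standard deviation, splitting into training and testing sets, gradient descent, and an L2 regularization parameter — applied to simple linear regression on made-up data. The actual project code, dataset, and hyperparameters may differ:

```python
import random

random.seed(0)

# Made-up data standing in for the project's CSV: y ≈ 3x + 4 plus noise
xs = [i / 10 for i in range(50)]
ys = [3 * x + 4 + random.gauss(0, 0.1) for x in xs]

# Standardize x using its mean and standard deviation
mean_x = sum(xs) / len(xs)
std_x = (sum((x - mean_x) ** 2 for x in xs) / len(xs)) ** 0.5
xs = [(x - mean_x) / std_x for x in xs]

# Split the data into training and testing sets (80/20)
split = int(0.8 * len(xs))
x_train, x_test = xs[:split], xs[split:]
y_train, y_test = ys[:split], ys[split:]

# Gradient descent on mean squared error with an L2 regularization term
w, b, lr, lam = 0.0, 0.0, 0.1, 0.01
n = len(x_train)
for _ in range(500):
    grad_w = sum((w * x + b - y) * x for x, y in zip(x_train, y_train)) * 2 / n
    grad_b = sum(w * x + b - y for x, y in zip(x_train, y_train)) * 2 / n
    w -= lr * (grad_w + lam * w)   # the regularization term shrinks the weight
    b -= lr * grad_b

# Evaluate on the held-out test set
test_mse = sum((w * x + b - y) ** 2 for x, y in zip(x_test, y_test)) / len(x_test)
print(round(w, 2), round(b, 2), round(test_mse, 3))
```

The same fit could be obtained in closed form via the normal equation; gradient descent is shown because the snapshots name both approaches.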
CONCLUSION

In conclusion, training in data science equips individuals with valuable skills to
extract insights and make informed decisions from data. It encompasses a wide
range of topics, including data collection, cleaning, analysis, machine learning,
and data visualization. A well-rounded data science education often involves
learning programming languages like Python, mastering essential libraries and
tools, and gaining expertise in statistical and machine learning techniques. Data
science training also emphasizes the importance of critical thinking and
problem-solving, as well as effective communication of findings to
stakeholders. As the demand for data-driven insights continues to grow across
industries, data science training provides a pathway to exciting and rewarding
career opportunities. Ultimately, data science training is an ongoing journey, as
the field evolves rapidly with emerging technologies and new data challenges.
Continuous learning and staying up-to-date with the latest developments are key
to success in this dynamic and high-demand profession.

FUTURE SCOPE

The future of data science, AI, and ML is poised for significant expansion and
impact. AI and ML will increasingly underpin advanced applications in various
sectors, from healthcare to finance. As data generation continues to explode,
data scientists will be at the forefront of managing and deriving insights from
big data. Ethical considerations surrounding AI fairness, transparency, and
accountability will become even more crucial.

Furthermore, edge computing will gain prominence, enabling real-time data
processing closer to the source, particularly with the proliferation of IoT
devices. Automation tools, like AutoML, will democratize data science,
allowing more individuals to harness its power. Interdisciplinary collaboration
will become the norm, as domain experts work hand-in-hand with data scientists
to apply AI solutions effectively in their fields.

Security and privacy challenges will intensify with increased reliance on AI and
data, demanding innovative solutions. Research and development will continue
to drive breakthroughs in AI algorithms and architectures. In summary, the
future of data science, AI, and ML holds immense potential for transforming
industries, improving decision-making, and shaping a technologically advanced
future.

REFERENCES

➢ https://www.kaggle.com/learn/overview
➢ https://www.edx.org/micromasters/data-science
➢ https://www.fast.ai/
➢ https://towardsdatascience.com/
➢ https://www.youtube.com/user/joshstarmer
➢ https://github.com/josephmisiti/awesome-machine-learning
➢ https://github.com/campusx-official/book-recommender-system/commit/678c7ab5a67adfcafaadf5b2924e4d04acafe9ac#diff5983284b94671de74632c367234334917d7e2de10e4be9c255afb37e33f5352e
➢ https://www.youtube.com/user/sentdex
➢ https://www.coursera.org/specializations/deep-learning
➢ https://github.com/ChristosChristofidis/awesome-deep-learning
➢ https://youtu.be/1YoD0fg3_EM?feature=shared
