Harsh Synopsis


An

Industrial Synopsis Report

Statistics for data Science

Submitted in partial fulfilment of the degree


By

Harshwardhan Singh

Reg.No:21BCAN178

Under the Guidance of

Faculty Internship Guide Industry Guide


Name: Vartika Name:
Internship Organisation Details

Internshala is a dot com business with the heart of dot org. It is a technology company on a
mission to equip students with relevant skills & practical exposure through internships and
trainings.
Internshala's offerings include Placement Guarantee Courses, ensuring students acquire the vital
skills demanded by current competitive job market. These courses are complemented by short-
term online trainings meticulously crafted to address specific skill gaps and industry requirements.
Furthermore, Internshala plays a pivotal role in guiding students through their first steps into the
professional realm by facilitating internships and fresher job opportunities. This hands-on
experience is invaluable, providing students with the real-world exposure necessary for a
successful career launch.

In August 2016, Telangana's not-for-profit organisation, Telangana Academy for Skill and
Knowledge (TASK) partnered with Internshala to help students with internship resources and
career services.
Through Internshala, students can complete courses and earn certificates on completion, which
help when applying for jobs.
INDEX
Sr. No. Topic
1 Title
2 Index
3 Declaration
4 Acknowledgement
5 About Training
6 About Internshala
7 Objectives
8 Data Science
9 My Learnings
10 Reason for choosing Data Science
11 Learning Outcome
12 Scope in Data Science
13 Results
1. ABOUT TRAINING
• NAME OF TRAINING: DATA SCIENCE
• HOSTING INSTITUTION: INTERNSHALA
• DATES: From 1st July 2021 to 12th August 2021

2. ABOUT INTERNSHALA
Internshala is an internship and online training platform based in Gurgaon, India, founded in
2011 by Sarvesh Agrawal, an IIT Madras alumnus. The site lets users search for and post
internships, and offers other career services such as counselling, cover-letter writing, resume
building and training programs to students.

3. OBJECTIVES
To explore, sort and analyse large volumes of data from various sources in order to reach
conclusions that optimize business processes and support decision making.
Examples include predictive maintenance of machines and, in the fields of marketing and
sales, sales forecasting based on weather.

4. DATA SCIENCE
Data Science is a multi-disciplinary subject that uses mathematics, statistics, and computer
science to study and evaluate data. The key objective of Data Science is to extract valuable
information for use in strategic decision making, product development, trend analysis, and
forecasting.
Data Science concepts and processes are mostly derived from data engineering, statistics,
programming, social engineering, data warehousing, machine learning, and natural language
processing. The key techniques in use are data mining, big data analysis, data extraction and
data retrieval.
Data science is the field of study that combines domain expertise, programming skills, and
knowledge of mathematics and statistics to extract meaningful insights from data. Data
science practitioners apply machine learning algorithms to numbers, text, images, video,
audio, and more to produce artificial intelligence (AI) systems to perform tasks that
ordinarily require human intelligence. In turn, these systems generate insights which analysts
and business users can translate into tangible business value.

DATA SCIENCE PROCESS:


1. The first step of this process is setting a research goal. The main purpose here is making
sure all the stakeholders understand the what, how, and why of the project.
2. The second phase is data retrieval. You want to have data available for analysis, so this
step includes finding suitable data and getting access to the data from the data owner.
The result is data in its raw form, which probably needs polishing and transformation
before it becomes usable.
3. Now that you have the raw data, it’s time to prepare it. This includes transforming the
data from a raw form into data that’s directly usable in your models. To achieve this,
you’ll detect and correct different kinds of errors in the data, combine data from
different data sources, and transform it. If you have successfully completed this step,
you can progress to data visualization and modeling.
4. The fourth step is data exploration. The goal of this step is to gain a deep
understanding of the data. You’ll look for patterns, correlations, and deviations based
on visual and descriptive techniques. The insights you gain from this phase will enable
you to start modeling.
5. Finally, we get to the most appealing part: model building (often referred to as “data
modeling”). It is now that you attempt to gain the insights or make the
predictions stated in your project charter. Now is the time to bring out the heavy guns,
but remember research has taught us that often (but not always) a combination of
simple models tends to outperform one complicated model. If you’ve done this phase
right, you’re almost done.
6. The last step of the data science process is presenting your results and automating the
analysis, if needed. One goal of a project is to change a process and/or make better
decisions. You may still need to convince the business that your findings will indeed
change the business process as expected. This is where you can shine in your influencer
role. The importance of this step is more apparent in projects on a strategic and tactical
level. Certain projects require you to perform the business process over and over again,
so automating the project will save time.

5. MY LEARNINGS
1) INTRODUCTION TO DATA SCIENCE
• Overview & Terminologies in Data Science
• Applications of Data Science
  • Anomaly detection (fraud, disease, etc.)
  • Automation and decision-making (credit worthiness, etc.)
  • Classification (classifying emails as “important” or “junk”)
  • Forecasting (sales, revenue, etc.)
  • Pattern detection (weather patterns, financial market patterns, etc.)
  • Recognition (facial, voice, text, etc.)
  • Recommendations (based on learned preferences, recommendation engines can
    refer you to movies, restaurants and books you may like)

2) PYTHON FOR DATA SCIENCE


Introduction to Python, Understanding Operators, Variables and Data Types, Conditional
Statements, Looping Constructs, Functions, Data Structure, Lists, Dictionaries, Understanding
Standard Libraries in Python, reading a CSV File in Python, Data Frames and basic operations
with Data Frames, Indexing Data Frame.

3) UNDERSTANDING THE STATISTICS FOR DATA SCIENCE


Introduction to Statistics, Measures of Central Tendency, Understanding the spread of data,
Data Distribution, Introduction to Probability, Probabilities of Discrete and Continuous
Variables, Normal Distribution, Introduction to Inferential Statistics, Understanding the
Confidence Interval and margin of error, Hypothesis Testing, Various Tests, Correlation.
4) PREDICTIVE MODELING AND BASICS OF MACHINE LEARNING
Introduction to Predictive Modeling, Types and Stages of Predictive Models, Hypothesis
Generation, Data Extraction and Exploration, Variable Identification, Univariate Analysis for
Continuous Variables and Categorical Variables, Bivariate Analysis, Treating Missing Values
and Outliers, Transforming the Variables, Basics of Model Building, Linear and Logistic
Regression, Decision Trees, K-means Algorithms in Python.
Summary of Procedure of Analyzing Data:
Data science generally has a five-stage life cycle that consists of:
• Capture: data entry, signal reception, data extraction
• Maintain: Data cleansing, data staging, data processing.
• Process: Data mining, clustering/classification, data modelling
• Analyse: Predictive analysis, regression
• Communicate: Data reporting, data visualization

Applications of Data Science

• Recommendation systems
Example: On Amazon, recommendations differ from user to user according to their past searches.

• Social media
1. Recommendation engine
2. Ad placement
3. Sentiment analysis
• Deciding the right credit limit for credit card customers.
• Suggesting the right products on e-commerce sites
1. Recommendation system
2. Past search data
3. Discount price optimization
• How do Google and other search engines know which results are most relevant to our search query?
1. Applying ML and data science
2. Fraud detection
3. Ad placement
4. Personalized search results
Python Introduction
Python is an interpreted, high-level, general-purpose programming language. It has efficient high-level data
structures and a simple but effective approach to object-oriented programming. Python’s elegant syntax and
dynamic typing, together with its interpreted nature, make it an ideal language for scripting and rapid
application development in many areas on most platforms.

Python for Data science:

Why Python?

1. Python is an open-source language.
2. Its syntax is as simple as English.
3. It has a very large and collaborative developer community.
4. It has extensive packages.
• UNDERSTANDING OPERATORS:
Theory of operators: Operators are symbolic representations of mathematical tasks.
• VARIABLES AND DATATYPES:
Variables are names bound to objects. Basic data types in Python are int (integer), float, bool (Boolean)
and str (string).
• CONDITIONAL STATEMENTS:
if-else statements (single condition)
if-elif-else statements (multiple conditions)
• LOOPING CONSTRUCTS:
for loop
• FUNCTIONS:
Functions are reusable pieces of code, created to solve a specific problem.
Two types: built-in functions and user-defined functions.
Once defined, a function can be called as many times as needed.
• DATA STRUCTURES:

Two types of data structures:

LISTS: A list is an ordered data structure with elements separated by commas and enclosed within
square brackets [].

DICTIONARY: A dictionary is a data structure whose elements are stored as key: value pairs, separated by
commas and enclosed within curly braces {}. Items are looked up by key rather than by position (since
Python 3.7 a dictionary also preserves insertion order).
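The constructs listed above can be put together in one short script. This is only a minimal sketch: the names grade, scores and grades and the pass marks 75 and 50 are made up for illustration and are not part of the training material.

```python
# Variables are names bound to objects of different types.
marks = 72          # int
ratio = 0.72        # float
passed = True       # bool
name = "student"    # str


def grade(score):
    """User-defined function: map a score to a grade with if/elif/else."""
    if score >= 75:
        return "A"
    elif score >= 50:
        return "B"
    else:
        return "C"


# A list is ordered and enclosed in square brackets.
scores = [82, 47, 65]

# A dictionary stores key: value pairs in curly braces.
grades = {}
for s in scores:              # for loop over the list
    grades[s] = grade(s)      # the same function is reused on every iteration

print(grades)
```

The function is written once and called four times here, which is the main reason for defining functions at all.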
Statistics
Descriptive Statistics
Mode
It is the value which occurs most frequently in the data series.
It is robust and is generally not affected much by the addition of a couple of new values.
Code
import pandas as pd
data = pd.read_csv("Mode.csv")        # read data from the CSV file
data.head()                           # show the first five rows
mode_data = data['Subject'].mode()    # mode of the Subject column
print(mode_data)
Mean
It is the arithmetic average: the sum of all values divided by the number of values.
import pandas as pd
data = pd.read_csv("mean.csv")             # read data from the CSV file
data.head()                                # show the first five rows
mean_data = data['Overallmarks'].mean()    # mean of the Overallmarks column
print(mean_data)
Median
It is the middle value of the data set once the values are sorted.
import pandas as pd
data = pd.read_csv("data.csv")                 # read data from the CSV file
data.head()                                    # show the first five rows
median_data = data['Overallmarks'].median()    # median of the Overallmarks column
print(median_data)
Types of variables
• Continuous: takes continuous numeric values. Eg- marks
• Categorical: takes discrete values. Eg- gender
• Ordinal: ordered categorical variable. Eg- teacher feedback
• Nominal: unordered categorical variable. Eg- gender
Outliers
Any value which falls far outside the usual range of the data is termed an outlier. Eg- 9700 instead of 97.
Reasons of Outliers
• Typos: made during data collection. Eg- adding an extra zero by mistake.
• Measurement error: outliers caused by a faulty measuring instrument or operator.
• Intentional error: errors which are introduced intentionally. Eg- claiming a smaller amount of alcohol
consumed than actual.
• Legit outlier: values which are not actually errors but appear in the data for legitimate
reasons. Eg- a CEO’s salary might actually be high compared to other employees.
Interquartile Range (IQR)
It is the difference between the third and first quartiles (IQR = Q3 - Q1). It is robust to outliers.
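To see what "robust to outliers" means in practice, the sketch below compares the mean, median, range and IQR of a small set of marks before and after the typo mentioned earlier (97 entered as 9700). The marks are made-up values, and the quartiles come from Python's standard statistics.quantiles with its default method; other tools (pandas, for example) use a slightly different quartile convention, so exact quartile values may differ.

```python
import statistics

marks = [55, 61, 67, 70, 72, 75, 78, 83, 97]   # hypothetical marks
with_typo = marks[:-1] + [9700]                 # 97 mistyped as 9700


def iqr(values):
    """Interquartile range: third quartile minus first quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


for label, values in [("clean", marks), ("with typo", with_typo)]:
    print(label,
          "mean:", round(statistics.mean(values), 2),
          "median:", statistics.median(values),
          "range:", max(values) - min(values),
          "IQR:", iqr(values))
```

The mean and the range change drastically because of one bad value, while the median and the IQR stay the same.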
Histograms
Histograms depict the underlying frequency distribution of a set of discrete or continuous data measured
on an interval scale.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline                             # Jupyter notebook only: show plots inline
histogram = pd.read_csv("histogram.csv")       # read data from the CSV file
plt.hist(x='Overall Marks', data=histogram)    # histogram of the Overall Marks column
plt.show()
Inferential Statistics
Inferential statistics allows us to make inferences about the population from sample data.
Hypothesis Testing
Hypothesis testing is a kind of statistical inference that involves asking a question, collecting data, and then
examining what the data tells us about how to proceed. The hypothesis to be tested is called the null
hypothesis and given the symbol H0. We test the null hypothesis against an alternative hypothesis, which is
given the symbol Ha.

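T Tests
Used when we have only a sample and the population statistics are not known.
The sample standard deviation is used to estimate the population standard deviation.
Because it relies on a sample estimate, the t test carries more uncertainty than a test based on known
population values, especially for small samples.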
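As an illustration, the one-sample t statistic can be computed with the standard library alone. This is a sketch under assumptions: the sample marks and the hypothesised mean of 70 are invented, and only the statistic is computed (turning it into a p-value needs the t distribution with n - 1 degrees of freedom, which is not shown here).

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with s the sample standard deviation."""
    n = len(sample)
    mean = statistics.mean(sample)
    s = statistics.stdev(sample)      # n - 1 in the denominator
    return (mean - mu0) / (s / math.sqrt(n))


sample = [68, 72, 75, 70, 71, 74, 69, 73]   # hypothetical marks
t_stat = one_sample_t(sample, mu0=70)        # H0: population mean is 70
print("t statistic:", round(t_stat, 3), "degrees of freedom:", len(sample) - 1)
```

A t statistic far from 0 (in either direction) is evidence against the null hypothesis.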
Z Score
The distance, measured in number of standard deviations, of an observed value from the mean is called the
standard score or z score.
+Z: the value is above the mean.
-Z: the value is below the mean.
Converting a distribution to z scores does not change its shape; the result has mean 0 and standard
deviation 1.
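The definition above translates directly into code. A minimal sketch, assuming the marks are treated as the whole population (so the population standard deviation is used); the values are made up.

```python
import statistics


def z_scores(values):
    """Return (value - mean) / standard deviation for every value."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)    # population standard deviation
    return [(v - mean) / sd for v in values]


marks = [50, 60, 70, 80, 90]          # hypothetical marks
z = z_scores(marks)
print([round(v, 3) for v in z])
```

Values below the mean get a negative z score, values above it a positive one, and the mean itself maps to 0.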

Chi Squared Test


Used to test whether two categorical variables are related to each other.
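A sketch of how the chi-squared statistic is computed for a contingency table, assuming the usual test of independence: the expected count of each cell is (row total * column total) / grand total. The table of counts is invented, and converting the statistic to a p-value (using the chi-squared distribution) is left out.

```python
def chi_squared(observed):
    """Chi-squared statistic and degrees of freedom for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return stat, dof


# Hypothetical counts: rows = gender, columns = passed / failed
table = [[30, 10],
         [20, 40]]
stat, dof = chi_squared(table)
print("chi-squared:", round(stat, 3), "degrees of freedom:", dof)
```

When the rows have exactly the same proportions, observed and expected counts match and the statistic is 0.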
Correlation
It determines the strength and direction of the linear relationship between two variables.
It is denoted by r. The value ranges from -1 to +1; 0 means no linear relationship.
Syntax
import pandas as pd
import numpy as np
data=pd.read_csv("data.csv")
data.corr()
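To show what the correlation coefficient measures, here is a sketch of the Pearson r written out by hand (the standard formula: covariance divided by the product of the standard deviations). The three small data sets are made up.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sy = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sx * sy)


hours = [1, 2, 3, 4, 5]
print(pearson_r(hours, [2, 4, 6, 8, 10]))     # perfectly increasing line
print(pearson_r(hours, [10, 8, 6, 4, 2]))     # perfectly decreasing line
print(pearson_r([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4]))   # y = x squared
```

The last case gives r = 0 although y depends completely on x, which is why r = 0 should be read as "no linear relationship" and not "no relationship".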
Predictive Modelling

Using past data and its attributes, we predict future outcomes.
Eg-
Past: Horror movies
Future: Unwatched horror movies

Predicting stock price movement


1. Analysing past stock prices.
2. Analysing similar stocks.
3. Future stock price required.
Types
1. Supervised Learning
Supervised learning is a type of algorithm that uses a known dataset (called the training dataset) to
make predictions. The training dataset includes input data and response values.
• Regression: the response has continuous possible values. Eg- marks
• Classification: the response takes one of a fixed set of classes, often just two. Eg- cancer prediction
is either 0 or 1.
2. Unsupervised Learning
Unsupervised learning is the training of a machine using information that is neither classified nor
labelled. Here the task of the machine is to group unsorted information according to similarities,
patterns and differences without any prior training on the data.
• Clustering: A clustering problem is where you want to discover the inherent groupings in the
data, such as grouping customers by purchasing behaviour.
• Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.

Stages of Predictive Modelling


1. Problem definition
2. Hypothesis Generation
3. Data Extraction/Collection
4. Data Exploration and Transformation
5. Predictive Modelling
6. Model Development/Implementation

Problem Definition
Identify the right problem statement, ideally formulate the problem mathematically.
Hypothesis Generation
List down all possible variables, which might influence problem objective. These variables should be free
from personal bias and preferences.
Quality of model is directly proportional to quality of hypothesis.
Data Extraction/Collection
Collect data from different sources and combine those for exploration and model building.
While looking at the data we might come across new hypotheses.
Data Exploration and Transformation
Data extraction is a process that involves retrieval of data from various sources for further data processing or
data storage.
Steps of Data Exploration
• Reading the data
Eg- from a CSV file
• Variable identification
• Univariate analysis
• Bivariate analysis
• Missing value treatment
• Outlier treatment
• Variable transformation

Variable Identification
It is the process of identifying whether a variable is
1. Independent or dependent
2. Continuous or categorical
Why do we perform variable identification?
1. Techniques like supervised learning require identification of the dependent variable.
2. Different data processing techniques are used for categorical and continuous data.
Categorical variable: stored as object.
Continuous variable: stored as int or float.
Univariate Analysis
1. Explore one variable at a time.
2. Summarize the variable.
3. Make sense of that summary to discover insights, anomalies, etc.
Bivariate Analysis
• Two variables are studied together for their empirical relationship.
• It shows whether the two variables are associated with each other.
• It helps in prediction and in detecting anomalies.
Missing Value Treatment
Reasons of missing value
1. Non-response. Eg- when you collect data on people’s income and many choose not to answer.
2. Error in data collection. Eg- faulty data
3. Error in data reading.
Types
1. MCAR (Missing completely at random): Missing values have no relation to the variable in which the
missing values exist or to the other variables in the dataset.
2. MAR (Missing at random): Missing values have no relation to the variable in which the missing values
exist, but are related to other variables in the dataset.
3. MNAR (Missing not at random): Missing values are related to the variable in which the missing values
exist.
Identifying
Syntax: -
1. describe()
2. isnull()
The output of isnull() is True or False for each value.
Different methods to deal with missing values
1. Imputation
Continuous: impute with the mean, the median or a regression model.
Categorical: impute with the mode or a classification model.
2. Deletion
Row-wise or column-wise deletion. However, it leads to loss of data.
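The simple imputation strategies above can be sketched in plain Python, with None standing in for a missing value. The helper name impute and the example columns are hypothetical; in pandas the same idea is usually written with fillna(), and model-based imputation (regression or classification) is not shown.

```python
import statistics


def impute(values, strategy):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError("strategy must be 'mean', 'median' or 'mode'")
    return [fill if v is None else v for v in values]


income = [30, None, 50, 40, None]    # continuous variable (hypothetical)
gender = ["F", "M", None, "F"]       # categorical variable (hypothetical)

print(impute(income, "mean"))
print(impute(gender, "mode"))

# Deletion: keep only the rows where the value is present (data is lost).
print([v for v in income if v is not None])
```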
Outlier Treatment
Reasons of Outliers
1. Data entry Errors
2. Measurement Errors
3. Processing Errors
4. Change in underlying population
Types of Outlier
Univariate
Analysing only one variable for outliers.
Eg- In a box plot of height and weight, weight alone will be analysed for outliers.
Bivariate
Analysing both variables together for outliers.
Eg- In a scatter plot of height and weight, both will be analysed.
Identifying Outlier
Graphical Method
• Box plot
• Scatter plot

Formula Method
Using the box plot rule, a value is an outlier if it is
< Q1 - 1.5 * IQR or > Q3 + 1.5 * IQR
where IQR = Q3 - Q1
Q3 = value of the 3rd quartile
Q1 = value of the 1st quartile
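The formula method can be sketched as a small function. The quartiles here come from the standard library's statistics.quantiles with its default method, which is an assumption of this sketch; pandas and box plot tools may place the quartiles slightly differently. The marks are made up.

```python
import statistics


def iqr_outliers(values):
    """Return the values lying below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < low or v > high]


marks = [55, 61, 67, 70, 72, 75, 78, 83, 9700]   # 9700 is a typo for 97
print(iqr_outliers(marks))
```

Once flagged, the value can be deleted, transformed or imputed as described in the next section.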
Treating Outlier
1. Deleting observations
2. Transforming and binning values
3. Imputing outliers like missing values
4. Treat them as a separate group
Variable Transformation
It is the process by which
1. We replace a variable with some function of that variable. Eg- replacing a variable x with its log.
2. We change the distribution or relationship of a variable with others.
Used to
1. Change the scale of a variable
2. Transform non-linear relationships into linear relationships
3. Create a symmetric distribution from a skewed distribution.
Common methods of variable transformation: logarithm, square root, cube root, binning, etc.
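A small sketch of the log transformation, using made-up values that double at every step. The skewness function is the ordinary moment-based coefficient (third central moment divided by the 1.5th power of the second), chosen here only as a simple way to measure asymmetry.

```python
import math


def skewness(values):
    """Moment-based skewness: 0 for a symmetric distribution, > 0 for a right skew."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


raw = [1, 2, 4, 8, 16, 32, 64]             # hypothetical right-skewed values
logged = [math.log(v) for v in raw]        # replace x with log(x)

print("skewness before:", round(skewness(raw), 3))
print("skewness after: ", round(skewness(logged), 3))
```

After the transformation the values are evenly spaced, so the distribution becomes symmetric and the exponential growth turns into a straight line.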
Model Building
It is a process to create a mathematical model for estimating / predicting the future based on past data.
Eg-
A retailer wants to know the default behaviour of its credit card customers. It wants to predict the
probability of default for each customer in the next three months.
• The probability of default lies between 0 and 1.
• Assume every customer starts with a 10% default rate.
Probability of default for each customer in the next 3 months = 0.1
The model moves this probability towards one of the extremes based on attributes in the past information.
A customer with volatile income is more likely to default (closer to 1).
A customer with a healthy credit history over the last years has a low chance of default (closer to 0).

Steps in Model Building


1. Algorithm Selection
2. Training Model
3. Prediction / Scoring

Algorithm Selection
Example-

Have dependent variable?
  Yes: Supervised Learning
    Is the dependent variable continuous?
      Yes: Regression
      No: Classification
  No: Unsupervised Learning

Eg- Predict whether the customer will buy a product or not (classification).
Algorithms
• Logistic Regression
• Decision Tree
• Random Forest

Training Model
It is the process of learning the relationship / correlation between the independent and dependent variables.
We use the known dependent variable of the train data set to learn this relationship.
Dataset
• Train
Past data (known dependent variable).
Used to train the model.
• Test
Future data (unknown dependent variable).
Used to score.
Prediction / Scoring
It is the process of estimating/predicting the dependent variable of the test data set by applying the
model rules.
We apply what was learned in training to the test data set for prediction/estimation.

Algorithm of Machine Learning


Linear Regression
Linear regression is a statistical approach for modelling the relationship between a dependent variable
and a given set of independent variables.
It is assumed that the two variables are linearly related. Hence, we try to find a linear function that
predicts the response value (y) as accurately as possible as a function of the feature or independent
variable (x).

[Figure: scatter plot of Y-values (axis 0 to 14) against x (axis 0 to 9) with the fitted regression line, shown alongside the equation of the regression line and the squared error (cost function) J.]
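The equations themselves are not reproduced in the text, so the sketch below assumes the usual textbook form for one feature: a regression line h(x) = b0 + b1 * x and a cost J = (1 / 2n) * sum of (y - h(x))^2. The names b0, b1, fit_line and cost, and the hours/marks data, are illustrative assumptions. The coefficients are found with the closed-form least squares solution.

```python
def fit_line(x, y):
    """Least squares estimates of intercept b0 and slope b1 for y = b0 + b1 * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1


def cost(x, y, b0, b1):
    """Squared error cost J: half the mean squared difference between y and the line."""
    n = len(x)
    return sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y)) / (2 * n)


hours = [1, 2, 3, 4, 5]            # hypothetical feature (x)
marks = [52, 55, 61, 64, 68]       # hypothetical response (y)

b0, b1 = fit_line(hours, marks)
print("line: y =", round(b0, 2), "+", round(b1, 2), "* x")
print("cost J:", round(cost(hours, marks, b0, b1), 3))
print("prediction for 6 hours:", round(b0 + b1 * 6, 2))
```

Any other choice of b0 and b1 gives a larger value of J on the same data, which is what "as accurately as possible" means here.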
Logistic Regression

Logistic regression is a statistical model that in its basic form uses a logistic function to model a binary
dependent variable, although many more complex extensions exist.

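The cost (log loss) for a single observation is:
C = -(y log(p) + (1 - y) log(1 - p))
where y is the actual label (0 or 1) and p is the predicted probability that the label is 1.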
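A minimal sketch of the two pieces involved, assuming the standard definitions: the logistic (sigmoid) function, which maps any number to a probability between 0 and 1, and the log loss (cross-entropy) cost for one observation with true label y and predicted probability p. The function names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function: squashes any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, p):
    """Cost for one observation: y is the true label (0 or 1), p the predicted P(y = 1)."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


print(sigmoid(0))                    # a score of 0 corresponds to probability 0.5
print(round(log_loss(1, 0.9), 4))    # confident and correct: small cost
print(round(log_loss(1, 0.1), 4))    # confident and wrong: large cost
```

The cost grows quickly as the predicted probability moves away from the true label, which pushes the model towards well-calibrated predictions.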

K-Means Clustering (Unsupervised learning)

K-means clustering is a type of unsupervised learning, which is used when you have unlabelled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the data,
with the number of groups represented by the variable K. The algorithm works iteratively to assign each
data point to one of K groups based on the features that are provided. Data points are clustered based on
feature similarity.
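The iterative assignment described above can be sketched in a few lines. This is a simplified illustration, not a production implementation: it assumes the first K points are used as the starting centroids (real implementations normally pick them at random), runs a fixed number of iterations, and uses made-up 2-D points.

```python
import math


def k_means(points, k, iterations=10):
    """Very small K-means: returns the final centroids and the clusters of points."""
    centroids = list(points[:k])     # assumption: first k points as initial centroids
    for _ in range(iterations):
        # Assignment step: put every point in the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters


points = [(1, 1), (9, 9), (1, 2), (2, 1), (8, 9), (9, 8)]   # hypothetical data
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

No labels are given to the algorithm; the two groups emerge only from the distances between the points.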

7. SCOPE IN DATA SCIENCE FIELD


A few factors that point to the future of data science, and show why it is crucial to today’s
business needs, are listed below:
• Companies’ inability to handle data
Data is being regularly collected by businesses and companies for transactions and through
website interactions. Many companies face a common challenge – to analyze and categorize
the data that is collected and stored. A data scientist becomes the savior in a situation of
mayhem like this. Companies can progress a lot with proper and efficient handling of data,
which results in higher productivity.
• Revised Data Privacy Regulations
Countries of the European Union saw the General Data Protection Regulation (GDPR) take
effect in May 2018. A similar data protection regulation, the California Consumer Privacy
Act (CCPA), took effect in California in 2020. This creates co-dependency between companies
and data scientists, as data must be stored adequately and responsibly. Nowadays, people are
generally more cautious about sharing data with businesses and giving up a certain amount of
control to them, as there is rising awareness of data breaches and their harmful
consequences. Companies can no longer afford to be careless and irresponsible about their
data. Regulations such as the GDPR are intended to ensure some amount of data privacy in
the future.
• Data Science is constantly evolving
Career areas that do not carry any growth potential in them run the risk of stagnating. This
indicates that the respective fields need to constantly evolve and undergo a change for
opportunities to arise and flourish in the industry. Data science is a broad career path that is
undergoing developments and thus promises abundant opportunities in the future. Data
science job roles are likely to get more specific, which in turn will lead to specializations in
the field. People inclined towards this stream can exploit their opportunities and pursue what
suits them best through these specifications and specializations.
• An astonishing rise in data growth
Data is generated by everyone on a daily basis with and without our notice. The interaction we
have with data daily will only keep increasing as time passes. In addition, the amount of data
existing in the world will increase at lightning speed. As data production will be on the rise,
data scientists will be in high demand to help enterprises use and manage it well.
• Virtual Reality will be friendlier
In today’s world, we can witness and are in fact witnessing how Artificial Intelligence is
spreading across the globe and companies’ reliance on it. Big data prospects with its current
innovations will flourish more with advanced concepts like Deep Learning and neural
networking. Currently, machine learning is being introduced and implemented in almost every
application. Virtual Reality (VR) and Augmented Reality (AR) are undergoing
monumental modifications too. In addition, human and machine interaction, as well as
dependency, is likely to improve and increase drastically.
• Blockchain updating with Data science
The most popular technology behind cryptocurrencies like Bitcoin is Blockchain. Data
security plays its part here, as detailed transactions are secured and recorded. If big
data flourishes, then IoT will grow and gain popularity too. Edge computing will be
responsible for dealing with data issues and addressing them.

8. RESULTS
During this 6-week training I successfully learnt about DATA SCIENCE, and I am now able to
perform data analysis using Python. I also attempted the various quizzes and assignments
provided for periodic evaluation during the 6 weeks, and completed the training with a
100% score in the Final Test.
