Data Science Bcs A


What is Data Science?


1) Data science is an interdisciplinary field that involves the use of statistical and
computational methods to extract insightful information and knowledge from
data.
2) Data science is the field of study that combines domain expertise, programming
skills, and knowledge of mathematics and statistics to extract meaningful insights
from data.
Python, a popular and versatile programming language, has become a
favourite choice among data scientists for its ease of use, extensive libraries, and
flexibility. Python provides an efficient and streamlined approach to handling
complex data structures and extracting insights.

How to Learn Data Science?


Usually, there are four areas to master in data science.
1. Industry Knowledge: Domain knowledge of the field in which you are going to
work is necessary. For example, if you want to be a data scientist in the blogging
domain, you should know a lot about the blogging sector, such as SEO, keywords,
and serialization. It will be beneficial in your data science journey.
2. Models and Logic Knowledge: All machine learning systems are built on
models or algorithms, so it is an important prerequisite to have basic knowledge
of the models used in data science.
3. Computer and Programming Knowledge: Master-level programming
knowledge is not required in data science, but you should know the basics:
variables, constants, loops, conditional statements, input/output, and functions.
4. Mathematics: This is an important part of data science. You should have
knowledge of topics such as mean, median, mode, variance, percentiles,
distributions, probability, Bayes' theorem, and statistical tests such as hypothesis
testing, ANOVA, chi-square, and p-values.
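These statistical basics can be tried out directly in Python. Below is a minimal sketch using only the standard library's statistics module; the sample values are invented for illustration:

```python
# Descriptive statistics named above, computed with Python's standard
# library. The sample values are made up for illustration.
import statistics

samples = [12, 15, 15, 18, 20, 22, 22, 22, 25, 30]

mean = statistics.mean(samples)          # arithmetic average
median = statistics.median(samples)      # middle value of the sorted data
mode = statistics.mode(samples)          # most frequent value
variance = statistics.variance(samples)  # sample variance

print(mean, median, mode, variance)
```

The same module also offers pstdev, quantiles, and related helpers for several of the other topics listed above.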

1|Page
Page |2

Data Science Jobs:

As per various surveys, the data scientist role is becoming the most in-demand job
of the 21st century due to the increasing demand for data science. Some people also
call it "the hottest job title of the 21st century". Data scientists are the experts
who can use various statistical tools and machine learning algorithms to understand
and analyze the data.

The average salary range for a data scientist is approximately $95,000 to
$165,000 per annum, and as per different studies, about 11.5 million jobs
will be created by the year 2026.

Types of Data Science Jobs

If you learn data science, you get the opportunity to explore a variety of exciting
job roles in this domain. The main job roles are given below:

1. Data Scientist
2. Data Analyst
3. Machine learning expert
4. Data engineer
5. Data Architect
6. Data Administrator
7. Business Analyst
8. Statistician

Below is an explanation of some of the key data science job titles.

1. Data Analyst:

A data analyst is an individual who mines huge amounts of data, models the data,
and looks for patterns, relationships, trends, and so on. At the end of the day, he or
she produces visualizations and reports for analyzing the data to support
decision-making and problem-solving.

Skill required: To become a data analyst, you must have a good background
in mathematics, business intelligence, data mining, and basic knowledge
of statistics. You should also be familiar with computer languages and tools
such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, Spark, etc.


2. Machine Learning Expert:

A machine learning expert is one who works with the various machine learning
algorithms used in data science, such as regression, clustering, classification,
decision trees, random forests, etc.

Skill Required: Computer programming languages such as Python, C++, R, Java,
and Hadoop. You should also have an understanding of various algorithms,
problem-solving and analytical skills, probability, and statistics.

3. Data Engineer:

A data engineer works with massive amounts of data and is responsible for
building and maintaining the data architecture of a data science project. A data
engineer also works on the creation of data set processes used in modeling,
mining, acquisition, and verification.

Skill required: A data engineer must have in-depth knowledge of SQL, MongoDB,
Cassandra, HBase, Apache Spark, Hive, and MapReduce, along with knowledge of
languages such as Python, C/C++, Java, Perl, etc.

4. Data Scientist:

A data scientist is a professional who works with an enormous amount of data to
come up with compelling business insights through the deployment of various
tools, techniques, methodologies, algorithms, etc.

Skill required: To become a data scientist, one should have technical language
skills such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data
scientists must also have an understanding of statistics, mathematics, and
visualization, along with communication skills.

5. Database Administrator:

Database administrators are responsible for the functioning of all the databases
of an enterprise and grant or revoke access to their services for the employees of
the company depending on their requirements.

Skill required: Some of the essential skills and talents of a database administrator
include database backup and recovery, data security, and data modeling and design.


6. Data Architect:

A data architect creates the blueprints for data management so that the databases
can be easily integrated, centralized, and protected with the best security measures.

Skill required: A career in data architecture requires expertise in data
warehousing, data modeling, extraction, transformation, etc.

7. Statistician

A statistician, as the name suggests, has a sound understanding of statistical
theories and data organization. Not only do they extract and offer valuable insights
from the data clusters, but they also help create new methodologies for the
engineers to apply.

8. Business Analyst

The role of business analysts is slightly different from other data science jobs.
While they do have a good understanding of how data-oriented technologies work
and how to handle large volumes of data, they also separate the high-value data
from the low-value data.


Life Cycle of a Typical Data Science Project Explained:

1) Understanding the Business Problem:

In order to build a successful business model, it is very important to first understand
the business problem that the client is facing. Suppose the client wants to predict the
customer churn rate of his retail business. You may first want to understand his
business, his requirements, and what he actually wants to achieve from the
prediction. In such cases, it is important to consult domain experts and understand
the underlying problems present in the system. A business analyst is generally
responsible for gathering the required details from the client and forwarding the
data to the data scientist team for further investigation.

Even a minute error in defining the problem and understanding the requirements
can be very costly for the project, hence this must be done with maximum precision.

After asking the required questions of the company stakeholders or clients, we
move to the next process, known as data collection.

2) Data Collection

After gaining clarity on the problem statement, we need to collect relevant data to
break the problem into small components.

The data science project starts with the identification of various data sources,
which may include web server logs, social media posts, data from digital libraries
such as the US Census datasets, data accessed through sources on the internet via
APIs, web scraping, or information that is already present in an Excel spreadsheet.
Data collection entails obtaining information from both known internal and
external sources that can assist in addressing the business issue.

Normally, the data analyst team is responsible for gathering the data. They need to
figure out proper ways to source data and collect the same to get the desired
results.

There are two ways to source the data:

1. Through web scraping with Python
2. Extracting data with the use of third-party APIs
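As a rough illustration of the first approach, here is a minimal web-scraping sketch using only Python's standard library html.parser. The HTML snippet and the "product" class name are hypothetical, and real projects typically reach for libraries such as requests and BeautifulSoup instead:

```python
# A toy scraper that pulls product names out of an HTML page.
# The page here is an inline string; a real scraper would fetch it first.
from html.parser import HTMLParser

PAGE = """
<html><body>
  <h2 class="product">Phone A</h2>
  <h2 class="product">Phone B</h2>
</body></html>
"""

class ProductScraper(HTMLParser):
    """Collects the text of every <h2 class="product"> tag."""
    def __init__(self):
        super().__init__()
        self.in_product = False
        self.products = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2" and ("class", "product") in attrs:
            self.in_product = True

    def handle_data(self, data):
        if self.in_product and data.strip():
            self.products.append(data.strip())
            self.in_product = False

scraper = ProductScraper()
scraper.feed(PAGE)
print(scraper.products)
```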


3) Data Preparation

This is a very important stage in the data science lifecycle. It includes data
cleaning, data reduction, data transformation, and data integration. This stage
takes a lot of time, and data scientists spend a significant share of a project
preparing the data.
Data cleaning is handling the missing values in the data, filling in these
missing values with appropriate values, and smoothing out the noisy data.
Data reduction is using various strategies to reduce the size of the data such that
the output remains the same while the processing time decreases.
Data transformation is transforming the data from one type to another so
that it can be used efficiently for analysis and visualization.
Data integration is resolving any conflicts in the data and handling redundancies.
Data preparation is the most time-consuming process, accounting for up to 90% of
the total project duration, and is the most crucial step in the entire life
cycle.
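The data-cleaning step above can be sketched in plain Python. This toy example fills missing values with the mean of the known values, one common strategy; the records and field names are made up, and in practice this is usually done with a library such as pandas:

```python
# Mean-imputation of missing values, a common data-cleaning step.
# The records are invented for illustration.
records = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},   # missing age
    {"age": 31, "income": None},      # missing income
]

def fill_missing(rows, field):
    """Replace missing values in `field` with the mean of the known values."""
    known = [r[field] for r in rows if r[field] is not None]
    mean = sum(known) / len(known)
    for r in rows:
        if r[field] is None:
            r[field] = mean
    return rows

fill_missing(records, "age")      # missing age becomes (25 + 31) / 2
fill_missing(records, "income")   # missing income becomes the income mean
print(records)
```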

4) Exploratory Data Analysis: This step involves forming an initial idea about the
answer and the factors affecting it before constructing the real model. The
distribution of data within the different variables is explored graphically using
bar graphs, and relations between distinct features are captured via graphical
representations such as scatter plots and heat maps. Many data visualization
techniques are used extensively to explore each feature individually and in
combination with other features.
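A scatter plot's numeric counterpart is a correlation coefficient, one of the relations this step explores. Here is a small, self-contained sketch computing the Pearson correlation between two hypothetical features:

```python
# Pearson correlation between two made-up features; values near 1 mean a
# strong positive linear relation, as a scatter plot would show visually.
import math

ages    = [23, 30, 35, 40, 48]
incomes = [30, 45, 40, 60, 65]   # in thousands, invented values

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(round(pearson(ages, incomes), 3))
```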

5) Data Modeling

In most cases of data analysis, data modeling is regarded as the core process. In
this process, we take the prepared data as input and try to produce the desired
output from it.

We first select the appropriate type of model to be implemented to acquire results,
depending on whether the problem is a regression, classification, or clustering
problem. Based on the type of data received, we choose the machine learning
algorithm that is best suited to the model. Once this is done, we tune the
hyperparameters of the chosen model to get a favorable outcome.

Finally, we evaluate the model by testing its accuracy and relevance. In
addition, we need to make sure there is a correct balance between
specificity and generalizability; that is, the created model must be unbiased.
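Model selection and hyperparameter tuning can be illustrated with a toy example: a one-dimensional k-nearest-neighbour classifier whose hyperparameter k is chosen by accuracy on a validation set. All data points, labels, and candidate k values here are invented:

```python
# Toy hyperparameter tuning: pick the k that scores best on validation data.
from collections import Counter

train = [(1.0, "A"), (1.5, "A"), (2.0, "A"), (8.0, "B"), (8.5, "B"), (9.0, "B")]
valid = [(1.2, "A"), (8.2, "B"), (2.5, "A"), (7.5, "B")]

def knn_predict(x, k):
    """Label of the majority among the k training points closest to x."""
    neighbours = sorted(train, key=lambda p: abs(p[0] - x))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

def accuracy(k):
    """Fraction of validation points the k-NN model labels correctly."""
    hits = sum(knn_predict(x, k) == y for x, y in valid)
    return hits / len(valid)

best_k = max([1, 3, 5], key=accuracy)   # tune the hyperparameter k
print(best_k, accuracy(best_k))
```

Real projects would use a library such as scikit-learn and cross-validation instead, but the loop is the same: try each hyperparameter value, score it on held-out data, keep the best.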

6) Model Evaluation

Once the model has been developed, data scientists need to evaluate its
performance on new data to check whether it meets the business requirements.
They also evaluate how well the model performs in relation to the KPIs and
business criteria established in the first step.
During this stage, data scientists may need to adjust the model or retrain it if it is
not up to the mark and does not meet the business requirements. This stage is very
crucial because it ensures that the model is accurate and meets the business
requirements.
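A minimal sketch of such an evaluation, computing accuracy together with specificity (the true-negative rate) from hypothetical predictions on held-out data:

```python
# Evaluation metrics from a confusion matrix; labels are invented.
actual    = ["pos", "pos", "neg", "neg", "neg", "pos", "neg", "neg"]
predicted = ["pos", "neg", "neg", "neg", "pos", "pos", "neg", "neg"]

tp = sum(a == p == "pos" for a, p in zip(actual, predicted))  # true positives
tn = sum(a == p == "neg" for a, p in zip(actual, predicted))  # true negatives
fp = sum(a == "neg" and p == "pos" for a, p in zip(actual, predicted))
fn = sum(a == "pos" and p == "neg" for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)   # overall fraction correct
specificity = tn / (tn + fp)         # true-negative rate
print(accuracy, specificity)
```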

7) Model Deployment

After a thorough evaluation, the model is finally deployed in the production
environment to solve the business problem. At this step, the model is tested in a
practical setting and its performance is monitored. It is also integrated with
existing systems.
During this stage, the data scientists need to ensure that the model is scalable,
robust, and secure. The data scientist also needs to check whether the model is
giving valuable input to the organization.


Applications of Data Science

1. In Search Engines
The most useful application of Data Science is in search engines. When we want
to search for something on the internet, we mostly use search engines such as
Google and Yahoo. Data Science is used to make these searches faster.
2. In Transport
Data Science has also entered the transport field, for example in driverless cars.
With the help of driverless cars, it becomes easier to reduce the number of accidents.
3. In Finance
Data Science plays a key role in the financial industry. Financial companies
always face issues of fraud and risk of losses, so they need to automate
risk-of-loss analysis in order to make strategic decisions. Financial companies
also use Data Science analytics tools to predict the future, which allows them to
predict customer lifetime value and stock market moves.
4. In E-Commerce
E-commerce websites like Amazon, Flipkart, etc. use Data Science to provide a
better user experience through personalized recommendations.
5. In Health Care
In the healthcare industry, Data Science acts as a boon. Data Science is used for:
 Detecting Tumor.
 Drug discoveries.
 Medical Image Analysis.
 Virtual Medical Bots.
 Genetics and Genomics.
 Predictive Modeling for Diagnosis etc.
6. Image Recognition
Currently, Data Science is also used in image recognition. For example, when we
upload a photo with a friend on Facebook, Facebook suggests tagging the people
in the picture. This is done with the help of machine learning and Data Science:
when an image is recognized, analysis is performed against one's Facebook
friends, and if a face in the picture matches someone's profile, Facebook
suggests auto-tagging.


7. Targeted Recommendation
Targeted recommendation is one of the most important applications of Data
Science. Whatever a user searches for on the internet, he or she will then see
related posts everywhere. This can be explained with an example: suppose I want
a mobile phone, so I search for it on Google, but afterwards I change my mind and
decide to buy it offline. Data Science helps the companies who are paying for
advertisements for that phone: everywhere on the internet, on social media, on
websites, and in apps, I will see recommendations for the mobile phone I searched
for, which nudges me to buy it online.
8. Airline Route Planning
With the help of Data Science, the airline sector is also growing; for example, it
becomes easy to predict flight delays. Data Science also helps in deciding
whether to fly directly to the destination or take a halt in between, such as a flight
taking a direct route from Delhi to the U.S.A. or halting in between before
reaching the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science
concepts are used together with machine learning: with the help of past data, the
computer improves its performance. Many games, such as chess and EA Sports
titles, use Data Science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming and has
to be carried out with full discipline, because it is a matter of someone's life.
Without Data Science, developing a new medicine or drug takes a lot of time,
resources, and money, but with the help of Data Science it becomes easier because
the success rate can be predicted based on biological data and factors. Data
science algorithms can also forecast how a compound will react in the human
body without lab experiments.
11. In Delivery Logistics
Various logistics companies like DHL, FedEx, etc. make use of Data Science.
Data Science helps these companies find the best routes for the shipment of
their products, the best times for delivery, the best modes of transport to
reach the destination, etc.
12. Autocomplete
The autocomplete feature is an important application of Data Science: the user
types just a few letters or words, and the system offers to complete the whole
line. For example, in Gmail, when we are writing a formal mail to someone, the
Data Science concept of autocompletion is used to suggest completing the whole
sentence. The autocomplete feature is also widely used in search engines, on
social media, and in various apps.

DATA SCIENCE APPLICATIONS AND EXAMPLES

 Healthcare: Data science can identify and predict disease, and personalize
healthcare recommendations.
 Transportation: Data science can optimize shipping routes in real-time.
 Sports: Data science can accurately evaluate athletes’ performance.
 Government: Data science can prevent tax evasion and predict incarceration rates.
 E-commerce: Data science can automate digital ad placement.
 Gaming: Data science can improve online gaming experiences.
 Social media: Data science can create algorithms to pinpoint compatible partners.
 Fintech: Data science can help create credit reports and financial profiles, run
accelerated underwriting and create predictive models based on historical payroll
data.

What is Data Collection?


Data Collection is the process of collecting information from relevant sources in
order to find a solution to a given statistical enquiry. The collection of data is the
first and foremost step in a statistical investigation.
Here, statistical enquiry means an investigation made by any agency on a topic in
which the investigator collects the relevant quantitative information. In simple
terms, a statistical enquiry is the search of truth by using statistical methods of
collection, compiling, analysis, interpretation, etc. The basic problem for any
statistical enquiry is the collection of facts and figures related to this specific
phenomenon that is being studied. Therefore, the basic purpose of data collection
is collecting evidence to reach a sound and clear solution to a problem.


Data Collection: Definition

Data collection is the process of measuring and gathering information on
desired variables in such a fashion that questions related to the data can be
answered and used in research of various types. Data collection is a common
feature of study in various disciplines, such as marketing, statistics, economics,
the sciences, etc. The methods of collecting data may vary between subjects, but
the ultimate aim of the study and honesty in data collection are of the same
importance in all fields of study.

Types of Data Collection

Depending on the nature of data collection, it can be divided into two major types,
namely:
 Primary data collection method
 Secondary data collection method

Important Terms related to Data Collection:

1. Investigator: An investigator is a person who conducts the statistical enquiry.
2. Enumerators: In order to collect information for a statistical enquiry, an
investigator needs the help of some people. These people are known as
enumerators.
3. Respondents: A respondent is a person from whom the statistical information
required for the enquiry is collected.
4. Survey: It is a method of collecting information from individuals. The basic
purpose of a survey is to collect data to describe different characteristics such as
usefulness, quality, price, kindness, etc. It involves asking questions about a
product or service from a large number of people.
Example:
The table below shows the production of rice in India.

The table contains the production of rice in India in different years. It can be
seen that these values vary from one year to another; therefore, they are known
as variables. A variable is a quantity or attribute whose value varies from
one investigation to another. In general, variables are represented by letters
such as X, Y, or Z. In the above example, years are represented by variable X, and
the production of rice is represented by variable Y. The values of variables X and
Y are the data from which an investigator and enumerator collect information
regarding the trends of rice production in India.
Thus, data is a tool that helps an investigator understand the problem by
providing the required information. Data can be classified into two types,
viz., Primary Data and Secondary Data. Primary data is the data collected by the
investigator from primary sources for the first time, from scratch. Secondary data
is data already in existence that has been previously collected by someone else for
other purposes. It does not include any real-time data, as the research has already
been done on that information.


Methods of Collecting Data

There are two different methods of collecting data: Primary Data Collection and
Secondary Data Collection.

A. Methods of Collecting Primary Data:

Primary Data Collection: Quantitative Data Collection

There are a number of methods of collecting primary data. Some of the common
methods are as follows:
1. Direct Personal Investigation: As the name suggests, the method of direct
personal investigation involves collecting data personally from the source of
origin. In simple words, the investigator makes direct contact with the person from
whom he/she wants to obtain information. This method can attain success only
when the investigator collecting data is efficient, diligent, tolerant and impartial.
For example, direct contact with the household women to obtain information about
their daily routine and schedule.
2. Indirect Oral Investigation: In this method of collecting primary data, the
investigator does not make direct contact with the person from whom he/she needs
information, instead they collect the data orally from some other person who has
the necessary required information. For example, collecting data of employees
from their superiors or managers.
3. Information from Local Sources or Correspondents: In this method, for the
collection of data, the investigator appoints correspondents or local persons at
various places, who collect the data and furnish it to the investigator. With the help
of correspondents and local persons, the investigator can cover a wide area.
4. Information through Questionnaires and Schedules: In this method of
collecting primary data, the investigator, while keeping in mind the motive of the
study, prepares a questionnaire. The investigator can collect data through the
questionnaire in two ways:
 Mailing Method: This method involves mailing the questionnaires to the
informants for the collection of data. The investigator attaches a letter with the
questionnaire in the mail to define the purpose of the study or research. The
investigator also assures the informants that their information would be kept
secret, and then the informants note the answers to the questionnaire and return
the completed file.


 Enumerator’s Method: This method involves the preparation of a
questionnaire according to the purpose of the study or research. However, in this
case, the enumerator reaches out to the informants himself with the prepared
questionnaire. Enumerators are not the investigators themselves; they are the
people who help the investigator in the collection of data.

Primary data is collected by researchers on their own and for the first time in a
study. There are various ways of collecting primary data, some of which are the
following:
 Interview: Interviews are the most used primary data collection method. In
interviews, a questionnaire may be used to collect data, or the researcher may
ask questions directly to the interviewee. The idea is to seek information on
the topics of concern from the answers of the respondent. Questionnaires
can be sent via email, or details can be asked for in telephonic interviews.
 Delphi Technique: In this method, the researcher asks for information from
the panel of experts. The researcher may choose in-person research or
questionnaires may be sent via email. At the end of the Delphi technique, all
data is collected according to the need of the research.
 Projective techniques: Projective techniques are used in research that is
private or confidential, where the researcher thinks that respondents won’t
reveal information if direct questions are asked. There are many types of
projective techniques, such as the Thematic Apperception Test (TAT),
role-playing, cartoon completion, word association, and sentence
completion.
 Focus Group Interview: Here a few people gather to discuss the problem at
hand. The number of participants is usually between six and twelve in such
interviews. Every participant expresses his or her own insights, and a
collective unanimous decision is reached.
 Questionnaire Method: Here a questionnaire is used for collecting data
from a diverse group population. A set of questions is used for the concerned
research and respondents answer queries related to the questionnaire directly
or indirectly. This method can either be open-ended or closed-ended.


B. Methods of Collecting Secondary Data or Qualitative Data

Secondary data can be collected through different published and unpublished
sources. Some of them are as follows:
1. Published Sources
 Government Publications: The government publishes different documents
consisting of a variety of information or data produced by the Ministries and the
Central and State Governments in India as part of their routine activity. As the
government publishes these statistics, they are fairly reliable for the investigator.
Examples of Government publications on statistics are the Annual Survey of
Industries, the Statistical Abstract of India, etc.
 Semi-Government Publications: Different semi-government bodies also
publish data related to health, education, deaths, and births. These kinds of data
are also reliable and used by different informants. Some examples of
semi-government bodies are metropolitan councils, municipalities, etc.
 Publications of Trade Associations: Various big trade associations collect and
publish data from their research and statistical divisions on different trading
activities and their aspects. For example, data published by the Sugar Mills
Association regarding different sugar mills in India.
 Journals and Papers: Different newspapers and magazines provide a variety of
statistical data in their writings, which are used by different investigators for
their studies.
 International Publications: Different international organizations like IMF,
UNO, ILO, World Bank, etc., publish a variety of statistical information which
are used as secondary data.
 Publications of Research Institutions: Research institutions and universities
also publish their research activities and findings, which are used by
different investigators as secondary data. For example, the National Council of
Applied Economic Research, the Indian Statistical Institute, etc.
2. Unpublished Sources
Another source of secondary data is unpublished sources. The data in
unpublished sources is collected by different government organizations and other
organizations. These organizations usually collect data for their own use, and it is
not published anywhere. For example, research work done by professors,
professionals, and teachers, and records maintained by business and private
enterprises.


What are Statistical Errors?

The errors that occur while collecting data are known as Statistical Errors.
They depend on the size of the sample selected for the study. There are
two types of Statistical Errors, viz., Sampling Errors and Non-Sampling
Errors.

1. Sampling Errors:
The errors related to the nature or size of the sample selected for the study are
known as Sampling Errors. If the size of the selected sample is very small, or the
nature of the sample is non-representative, then the estimated value may differ
from the actual value of a parameter. This kind of error is a sampling error. For
example, if the estimated value of a parameter is 10 while the actual value is 30,
then the sampling error is 10 − 30 = −20.
Sampling Error = Estimated Value – Actual Value
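This formula can be demonstrated with a tiny simulation: estimate a population mean from a very small sample and compare it with the actual value. The population values and the random seed below are arbitrary:

```python
# Sampling error = estimated value - actual value, shown on a made-up
# population whose true mean we know exactly.
import random

population = list(range(1, 101))            # actual values 1..100
actual = sum(population) / len(population)  # true mean = 50.5

random.seed(0)                              # fixed seed for reproducibility
sample = random.sample(population, 5)       # a very small sample
estimated = sum(sample) / len(sample)       # the sample's estimate of the mean

sampling_error = estimated - actual         # Estimated Value - Actual Value
print(estimated, actual, sampling_error)
```

A larger or better-designed sample would typically shrink this error, which is exactly the point the section makes about sample size and representativeness.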

2. Non-Sampling Errors:

The errors related to the collection of data are known as Non-Sampling Errors.
The different types of Non-Sampling Errors are Error of Measurement, Error of
Non-response, Error of Misinterpretation, Error of Calculation or Arithmetical
Error, and Error of Sampling Bias.


i) Error of Measurement:
The reasons behind the occurrence of errors of measurement may be differences
in the scale of measurement and differences in the rounding-off procedure
adopted by different investigators.
ii) Error of Non-response:
These errors arise when the respondents do not offer the information required for
the study.
iii) Error of Misinterpretation:
These errors arise when the respondents misinterpret the questions given in the
questionnaire.
iv) Error of Calculation or Arithmetical Error:
These errors occur while adding, subtracting, or multiplying figures of data.
v) Error of Sampling Bias:
These errors occur when, for one reason or another, a part of the target
population cannot be included in the chosen sample.

