Data Science Bcs A
According to various surveys, data scientist is becoming one of the most in-demand
jobs of the 21st century because of the growing demand for data science. Some people
have also called it "the hottest job title of the 21st century". Data scientists are
experts who use statistical tools and machine learning algorithms to understand
and analyze data.
The average salary for a data scientist ranges from approximately $95,000 to
$165,000 per annum, and according to various studies, about 11.5 million jobs
will be created by the year 2026.
If you learn data science, you get the opportunity to pursue various exciting
job roles in this domain. The main job roles are given below:
1. Data Scientist
2. Data Analyst
3. Machine Learning Expert
4. Data Engineer
5. Data Architect
6. Database Administrator
7. Business Analyst
8. Statistician
1. Data Analyst:
Skill required: To become a data analyst, you need a good background
in mathematics, business intelligence, and data mining, and a basic knowledge
of statistics. You should also be familiar with computer languages and tools
such as MATLAB, Python, SQL, Hive, Pig, Excel, SAS, R, JS, and Spark.
A machine learning expert works with the machine learning algorithms used in
data science, such as regression, clustering, classification, decision trees,
and random forests.
3. Data Engineer:
A data engineer works with massive amounts of data and is responsible for building
and maintaining the data architecture of a data science project. A data engineer also
creates the data set processes used in modeling, mining, acquisition,
and verification.
Skill required: A data engineer must have in-depth knowledge of SQL, MongoDB,
Cassandra, HBase, Apache Spark, Hive, and MapReduce, along with knowledge of
languages such as Python, C/C++, Java, and Perl.
4. Data Scientist:
Skill required: To become a data scientist, one should have technical language
skills such as R, SAS, SQL, Python, Hive, Pig, Apache Spark, and MATLAB. Data
scientists must also have an understanding of statistics, mathematics, and
visualization, as well as communication skills.
5. Database Administrator:
A database administrator is responsible for the functioning of all the databases
of an enterprise and grants or revokes access to their services for the employees
of the company depending on their requirements.
6. Data Architect:
A data architect creates the blueprints for data management so that the databases can
be easily integrated, centralized, and protected with the best security measures.
7. Statistician
8. Business Analyst
The role of a business analyst is slightly different from other data science jobs.
While business analysts have a good understanding of how data-oriented technologies
work and how to handle large volumes of data, they also separate the high-value data
from the low-value data.
In order to build a successful business model, it is very important to first understand
the business problem that the client is facing. Suppose the client wants to predict the
customer churn rate of a retail business. You would first want to understand the
business, the client's requirements, and what the client actually wants to achieve from
the prediction. In such cases, it is important to consult domain experts
and understand the underlying problems that are present in the system. A
Business Analyst is generally responsible for gathering the required details from
the client and forwarding them to the data science team for further analysis.
Even a minute error in defining the problem or understanding the requirements
can be very costly for the project, so this step must be done with maximum precision.
2) Data Collection
After gaining clarity on the problem statement, we need to collect relevant data to
break the problem into small components.
The data science project starts with the identification of various data sources,
which may include web server logs, social media posts, data from digital libraries
such as the US Census datasets, data accessed through internet sources via
APIs, web scraping, or information that is already present in an Excel spreadsheet.
Data collection entails obtaining information from both known internal and
external sources that can assist in addressing the business issue.
Normally, the data analyst team is responsible for gathering the data. They need to
figure out proper ways to source and collect the data to get the desired
results.
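As a minimal sketch of gathering data from an internal and an external source, the
Python snippet below reads two small CSV extracts and joins them on a shared key.
The column names, the sample rows, and the helpers read_rows and collect are
hypothetical and only illustrate the idea; a real project would read from files,
APIs, or databases instead of in-memory strings.

```python
import csv
import io

# Hypothetical internal source: customer records exported from a spreadsheet.
INTERNAL_CSV = """customer_id,region
1,North
2,South
3,East
"""

# Hypothetical external source: purchase counts obtained from an API or web log.
EXTERNAL_CSV = """customer_id,purchases
1,5
3,2
4,7
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def collect(internal_text, external_text, key="customer_id"):
    """Join rows from both sources on a shared key (inner join)."""
    external = {row[key]: row for row in read_rows(external_text)}
    combined = []
    for row in read_rows(internal_text):
        match = external.get(row[key])
        if match is not None:
            combined.append({**row, **match})
    return combined


records = collect(INTERNAL_CSV, EXTERNAL_CSV)
for record in records:
    print(record)
```

Only customers present in both sources survive the join, which is one simple way to
keep the collected data relevant to the business issue.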
3) Data Preparation
This is a very important stage in the data science lifecycle. It includes data
cleaning, data reduction, data transformation, and data integration. Data scientists
spend a significant amount of their time preparing the data.
Data cleaning is handling the missing values in the data by filling them in with
appropriate values, and smoothing out the noisy data.
Data reduction is using various strategies to reduce the size of the data such that the
output remains the same and the processing time is reduced.
Data transformation is converting the data from one type to another so
that it can be used efficiently for analysis and visualization.
Data integration is resolving any conflicts in the data and handling redundancies.
Data preparation is the most time-consuming process, accounting for up to 90% of
the total project duration, and it is the most crucial step in the entire life
cycle.
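The four preparation tasks can be sketched in a few lines of Python. This is only
an illustration under stated assumptions: the records, the column names, and the
helper functions are hypothetical, missing values are filled with the column mean
(one of several possible strategies), and reduction is shown as simply dropping
columns that the analysis does not need.

```python
from statistics import mean

# Hypothetical raw records; None marks a missing value.
raw = [
    {"id": 1, "age": "34", "income": 52000.0},
    {"id": 2, "age": None, "income": 61000.0},
    {"id": 3, "age": "45", "income": None},
    {"id": 3, "age": "45", "income": None},  # redundant duplicate record
    {"id": 4, "age": "29", "income": 48000.0},
]


def integrate(rows):
    """Data integration: drop redundant records that share the same id."""
    seen, unique = set(), []
    for row in rows:
        if row["id"] not in seen:
            seen.add(row["id"])
            unique.append(dict(row))
    return unique


def transform(rows):
    """Data transformation: convert age from text to an integer."""
    for row in rows:
        if row["age"] is not None:
            row["age"] = int(row["age"])
    return rows


def clean(rows, columns):
    """Data cleaning: fill each missing value with the column mean."""
    for col in columns:
        fill = mean(row[col] for row in rows if row[col] is not None)
        for row in rows:
            if row[col] is None:
                row[col] = fill
    return rows


def reduce_columns(rows, keep):
    """Data reduction: keep only the columns needed for the analysis."""
    return [{col: row[col] for col in keep} for row in rows]


prepared = reduce_columns(
    clean(transform(integrate(raw)), ["age", "income"]),
    keep=["age", "income"],
)
print(prepared)
```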
4) Exploratory Data Analysis
This step involves getting some idea about the solution and the factors affecting
it before building the actual model. The distribution of data within the different
variables is explored graphically using bar graphs. Relations between different
features are captured through graphical representations like scatter plots and
heat maps. Many data visualization techniques are used extensively to explore
every feature individually and in combination with other features.
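As a rough, text-only sketch of this step, the Python snippet below prints a bar
chart of one categorical feature and computes the Pearson correlation between two
numeric features, which is the number a scatter plot or heat map would visualize.
The feature names and values are hypothetical, and a real project would normally
use a plotting library instead of text output.

```python
from collections import Counter
from math import sqrt

# Hypothetical prepared data: one categorical and two numeric features.
plans = ["basic", "premium", "basic", "basic", "premium", "basic"]
monthly_usage = [10.0, 42.0, 12.0, 8.0, 39.0, 15.0]
monthly_bill = [20.0, 80.0, 25.0, 18.0, 76.0, 30.0]


def text_bar_chart(values):
    """Return one 'label | ####' line per category, most frequent first."""
    counts = Counter(values)
    return [f"{label:<8}| {'#' * n}" for label, n in counts.most_common()]


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Distribution of a single feature.
for line in text_bar_chart(plans):
    print(line)

# Relation between two features.
r = pearson(monthly_usage, monthly_bill)
print(f"correlation(usage, bill) = {r:.3f}")
```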
5) Data Modeling
In most data analysis projects, data modeling is regarded as the core
process. In this process, we take the prepared data as the input
and use it to produce the desired output.
We first tend to select the appropriate type of model that would be implemented to
acquire results, whether the problem is a regression problem or classification, or a
Finally, we evaluate the model by testing its accuracy and relevance. In
addition, we need to make sure there is a correct balance between
specificity and generalizability; that is, the created model must be unbiased.
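A minimal sketch of this process for a regression problem is shown below: part of
the prepared data is held out, a straight line is fitted to the rest by ordinary
least squares, and the error on both parts is compared. The data values and the
helper names are hypothetical, and simple linear regression is just one possible
model type chosen for illustration. A test error far above the training error
would suggest the model is too specific to the training data.

```python
# Hypothetical prepared data: y is roughly 2*x + 1 with small noise.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [3.1, 4.9, 7.2, 9.0, 10.8, 13.1, 15.0, 16.9]


def split(xs, ys, train_fraction=0.75):
    """Hold out the last part of the data for testing (no shuffling)."""
    cut = int(len(xs) * train_fraction)
    return (xs[:cut], ys[:cut]), (xs[cut:], ys[cut:])


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def mean_absolute_error(model, xs, ys):
    """Average absolute gap between predictions and observed values."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


(train_x, train_y), (test_x, test_y) = split(xs, ys)
model = fit_line(train_x, train_y)
train_error = mean_absolute_error(model, train_x, train_y)
test_error = mean_absolute_error(model, test_x, test_y)
print(f"slope={model[0]:.2f} intercept={model[1]:.2f}")
print(f"train MAE={train_error:.3f} test MAE={test_error:.3f}")
```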
6) Model Evaluation
Once the model has been developed, data scientists need to evaluate its
performance on new data to check whether it meets the business requirements.
They also evaluate how well the model performs in relation to the KPIs and
business criteria established in the first step.
During this stage, data scientists may need to adjust or retrain the model if it is
not up to the mark. This stage is crucial because it ensures that the model is
accurate and meets the business requirements.
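The check against a business criterion can be sketched as follows for a binary
churn classifier. The labels, the predictions, and the required recall of 0.75 are
hypothetical values chosen for illustration, and the sketch assumes the data
contains at least one actual and one predicted positive so that the ratios are
defined.

```python
def confusion_counts(actual, predicted):
    """Count outcomes for binary labels, where 1 means the customer churned."""
    pairs = list(zip(actual, predicted))
    return {
        "tp": sum(1 for a, p in pairs if a == 1 and p == 1),
        "tn": sum(1 for a, p in pairs if a == 0 and p == 0),
        "fp": sum(1 for a, p in pairs if a == 0 and p == 1),
        "fn": sum(1 for a, p in pairs if a == 1 and p == 0),
    }


def evaluate(actual, predicted):
    """Compute accuracy, precision, and recall from the confusion counts."""
    c = confusion_counts(actual, predicted)
    total = c["tp"] + c["tn"] + c["fp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / total,
        "precision": c["tp"] / (c["tp"] + c["fp"]),
        "recall": c["tp"] / (c["tp"] + c["fn"]),
    }


# Hypothetical labels for ten customers the model has not seen before.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 0, 1]

# Hypothetical KPI agreed in the business-understanding step.
REQUIRED_RECALL = 0.75

metrics = evaluate(actual, predicted)
meets_requirement = metrics["recall"] >= REQUIRED_RECALL
print(metrics)
print("meets business requirement:", meets_requirement)
```

If the requirement is not met, the model would go back to the modeling stage to be
adjusted or retrained.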
7) Model Deployment
1. In Search Engines
One of the most useful applications of data science is in search engines. When
we want to search for something on the internet, we mostly use search
engines like Google and Yahoo. Data science is used to return search results
faster.
2. In Transport
Data science has also entered the transport field, for example in driverless cars.
Driverless cars can help reduce the number of accidents.
3. In Finance
Data science plays a key role in the financial industry. Financial institutions always
face the issue of fraud and the risk of losses. Thus, they need to automate
risk-of-loss analysis in order to make strategic decisions for the
company. Financial institutions also use data science analytics tools to make
predictions, for example of customer lifetime value and
stock market movements.
4. In E-Commerce
E-commerce websites like Amazon and Flipkart use data science to create a
better user experience with personalized recommendations.
5. In Health Care
In the healthcare industry, data science acts as a boon. Data science is used for:
Tumor detection.
Drug discovery.
Medical image analysis.
Virtual medical bots.
Genetics and genomics.
Predictive modeling for diagnosis, etc.
6. Image Recognition
Data science is also used in image recognition. For example, when
we upload a picture with a friend on Facebook, Facebook suggests
tagging the people in the picture. This is done with the help of machine learning and
data science. When an image is recognized, it is analyzed against one's
Facebook friends, and if a face in the picture
matches someone's profile, Facebook suggests auto-tagging.
7. Targeting Recommendation
Targeting recommendation is one of the most important applications of data science.
Whatever the user searches for on the internet, he/she will then see numerous related
posts everywhere. This can be explained with an example: suppose I want a
mobile phone, so I search for it on Google, and after that I change my mind and
decide to buy offline. Data science helps the companies that are paying for
advertisements for that mobile phone. So everywhere on the internet, on social media,
on websites, and in apps, I will see recommendations for the mobile
phone I searched for, which nudges me to buy online.
8. Airline Routing Planning
Data science is also helping the airline sector grow; for example, it makes it
easier to predict flight delays. It also helps decide whether to
fly directly to the destination or to make a halt in between. For example, a flight
can take a direct route from Delhi to the U.S.A., or it can halt in between and then
reach the destination.
9. Data Science in Gaming
In most games where a user plays against a computer opponent, data science
concepts are used together with machine learning, so that the computer improves
its performance with the help of past data. Many
games, such as Chess and EA Sports titles, use data science concepts.
10. Medicine and Drug Development
The process of creating a medicine is very difficult and time-consuming, and it has to
be done with full discipline because it is a matter of someone's life. Without
data science, developing a new medicine or drug takes a lot of time, resources, and
money. With the help of data science, it becomes easier because the
success rate can be predicted from biological data or
factors. Algorithms based on data science can forecast how the drug will react in
the human body without lab experiments.
11. In Delivery Logistics
Various logistics companies like DHL and FedEx make use of data science.
Data science helps these companies find the best route for shipping
their products, the best time for delivery, the best mode of transport to
reach the destination, etc.
12. Autocomplete
Autocomplete is an important application of data science: the user types
just a few letters or words and gets the option of
auto-completing the line. In Google Mail, when we are writing formal mail to
Healthcare: Data science can identify and predict disease, and personalize
healthcare recommendations.
Transportation: Data science can optimize shipping routes in real-time.
Sports: Data science can accurately evaluate athletes’ performance.
Government: Data science can prevent tax evasion and predict incarceration rates.
E-commerce: Data science can automate digital ad placement.
Gaming: Data science can improve online gaming experiences.
Social media: Data science can create algorithms to pinpoint compatible partners.
Fintech: Data science can help create credit reports and financial profiles, run
accelerated underwriting and create predictive models based on historical payroll
data.
The above table contains the production of rice in India in different years. It can be
seen that these values vary from one year to another. Therefore, they are known
as variables. A variable is a quantity or attribute whose value varies from
one investigation to another. In general, variables are represented by letters
such as X, Y, or Z. In the above example, years are represented by variable X, and
the production of rice is represented by variable Y. The values of variable X and
variable Y are the data from which an investigator or enumerator collects information
regarding the trends of rice production in India.
Thus, data is a tool that helps an investigator understand the problem by
providing the required information. Data can be classified into two types,
viz., Primary Data and Secondary Data. Primary Data is the data collected by the
investigator from primary sources for the first time, from scratch. Secondary Data,
in contrast, is the data already in existence that has been previously collected
by someone else for other purposes. It does not include any real-time data, as the
research has already been done on that information.
There are a number of methods of collecting primary data. Some of the common
methods are as follows:
1. Direct Personal Investigation: As the name suggests, the method of direct
personal investigation involves collecting data personally from the source of
origin. In simple words, the investigator makes direct contact with the person from
whom he/she wants to obtain information. This method can attain success only
when the investigator collecting data is efficient, diligent, tolerant and impartial.
For example, making direct contact with the women of a household to obtain information
about their daily routine and schedule.
2. Indirect Oral Investigation: In this method of collecting primary data, the
investigator does not make direct contact with the person from whom he/she needs
information; instead, the data is collected orally from some other person who has
the required information. For example, collecting data about employees
from their superiors or managers.
3. Information from Local Sources or Correspondents: In this method, the
investigator appoints correspondents or local persons at various places, who
collect the data and furnish it to the investigator. With the help
of correspondents and local persons, the investigator can cover a wide area.
4. Information through Questionnaires and Schedules: In this method of
collecting primary data, the investigator, while keeping in mind the motive of the
study, prepares a questionnaire. The investigator can collect data through the
questionnaire in two ways:
Mailing Method: This method involves mailing the questionnaires to the
informants for the collection of data. The investigator attaches a letter with the
questionnaire in the mail to define the purpose of the study or research. The
investigator also assures the informants that their information would be kept
secret, and then the informants note the answers to the questionnaire and return
the completed file.
The errors that occur while collecting data are known as Statistical
Errors. These depend on the sample size selected for the study. There are
two types of Statistical Errors, viz., Sampling Errors and Non-Sampling
Errors.
1. Sampling Errors:
The errors that are related to the nature or size of the sample selected for the
study are known as Sampling Errors. If the size of the sample selected is very
small or the sample is non-representative, then the estimated value
may differ from the actual value of a parameter. This kind of error is a sampling
error. For example, if the estimated value of a parameter is 10 while the actual
value is 30, then the sampling error will be 10 - 30 = -20.
Sampling Error = Estimated Value – Actual Value
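The formula can be checked with a short Python sketch. The function name
sampling_error is hypothetical; the first call reproduces the example above, and
the second applies the same formula to a made-up population and a small sample
drawn from it.

```python
from statistics import mean


def sampling_error(estimated, actual):
    """Sampling Error = Estimated Value - Actual Value."""
    return estimated - actual


# The example from the text: estimate 10, actual value 30.
example_error = sampling_error(10, 30)
print(example_error)

# Hypothetical population and a small, non-representative sample of it.
population = [10, 20, 30, 40, 50]
sample = population[:2]
mean_error = sampling_error(mean(sample), mean(population))
print(mean_error)
```

A negative result means the sample underestimates the actual value of the
parameter.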
2. Non-Sampling Errors:
The errors related to the collection of data are known as Non-Sampling Errors.
The different types of Non-Sampling Errors are Error of Measurement, Error of
Non-response, Error of Misinterpretation, Error of Calculation or Arithmetical
Error, and Error of Sampling Bias.
i) Error of Measurement:
An Error of Measurement may occur because of differences in
the scale of measurement or differences in the rounding-off procedure
adopted by different investigators.
ii) Error of Non-response:
These errors arise when the respondents do not offer the information required for
the study.
iii) Error of Misinterpretation:
These errors arise when the respondents fail to interpret the question given in the
questionnaire.
iv) Error of Calculation or Arithmetical Error:
These errors occur while adding, subtracting, or multiplying figures of data.
v) Error of Sampling Bias:
These errors occur when, for one reason or another, a part of the target
population cannot be included in the sample.