Data Science, a new discovery paradigm, is potentially one of the most significant advances
of the early 21st century. Originating in scientific discovery, it is being applied to every
human endeavor for which there is adequate data. While remarkable successes have been
achieved, even greater claims have been made. Benefits, challenges, and risks abound. The
science underlying data science has yet to emerge. Maturity is more than a decade away. This
claim is based firstly on observing the centuries-long developments of its predecessor
paradigms – empirical, theoretical, and Jim Gray’s Fourth Paradigm of Scientific Discovery
(Hey, Tansley & Tolle, 2009) (aka eScience, data-intensive, computational, procedural); and
secondly on my studies of over 150 data science use cases, several data science-based
startups, and on my scientific advisory role for Insight, a Data Science Research Institute
(DSRI) that requires that I understand the opportunities, state of the art, and research
challenges for the emerging discipline of data science. This chapter addresses essential
questions for a DSRI: What is data science? and What is world-class data science research? A
companion chapter, On Developing Data Science (Brodie, 2018b), addresses the development
of data science applications and of the data science discipline itself.











What can data science do? What characteristics distinguish data science from previous
scientific discovery paradigms? What are the methods for conducting data science? What is
the impact of data science? This chapter offers initial answers to these and related questions.
A companion chapter (Brodie, 2018b) addresses the development of data science as a
discipline, as a methodology, as well as data science research and education. Let’s start with
some slightly provocative claims concerning data science. Data science has been used
successfully to accelerate discovery of probabilistic outcomes in many domains. Piketty’s
(2014) monumental result on wealth and income inequality was achieved through data
science. It used over 120 years of sporadic, incomplete, observational economic data,
collected over ten years from all over the world (Brodie, 2014b). What is now called
computational economics was used to establish the correlation, with a very high likelihood
(0.90), that wealth gained from labour could never keep up with wealth gained from assets.
What made front page news worldwide was a second, more dramatic correlation that there is
a perpetual and growing wealth gap between the rich and the poor. This second correlation
was not derived by data analysis but is a human interpretation of Piketty’s data analytic
result. It contributed to making Capital in the 21st Century the best-selling book on
economics, but possibly the least read. Within a year, the core result was verified by
independent analyses to a far greater likelihood (0.99). One might expect that further
confirmation of Piketty’s finding would be newsworthy; however, it was not, as the more
dramatic rich-poor correlation, while never analytically established, had far greater appeal.
This illustrates the benefits and risks of data science. Frequently, due to the lack of evidence,
economic theories fail. Matthew Weinzierl, a leading Harvard University economist,
questions such economic modelling in general saying, “that the world is too complicated to
be modelled with anything like perfect accuracy” and “Used in isolation, however, it can lead
to trouble” (Economist, February 2018). Reputedly, Einstein said: “Not everything that
counts can be counted. Not everything that’s counted, counts”. The hope is that data science
and computational economics will provide theories that are fact-based rather than based on
hypotheses of “expert” economists (Economist, January 2018) leading to demonstrably
provable economic theories, i.e., what really happened or will happen. This chapter suggests
that this hope will not be realized this year.


Data science is the study of data. Just as the biological sciences study biology and the
physical sciences study physical reactions, data science studies data. Data is real, data has
real properties, and we need to study them if we’re going to work on them. Data science
involves data and some science. It is a process, not an event: the process of using data to
understand many different things, to understand the world. Suppose that you have a model or
proposed explanation of a problem, and you try to validate that proposed explanation or
model with your data.
It is the skill of uncovering the insights and trends that are hidden (or abstract) behind
data. It’s when you translate data into a story, using storytelling to generate insight.
With these insights, you can make strategic choices for a company or an institution.
We can also define data science as a field concerned with the processes and systems for
extracting data in various forms and from various sources, whether the data is unstructured
or structured.
The definition and the name came up in the 1980s and 1990s, when some professors, IT
professionals, and scientists were looking into the statistics curriculum and thought it
would be better to call it data science; the term data analytics was derived later on.
But the biggest question, and the biggest source of confusion, is still: what is Data Science?
We can see data science as one or many attempts to work with data in order to find
answers to the questions being explored. Summarizing all of this, we can say that it’s much
more about data than about science. If you have data, whether well prepared or not, and you
have the curiosity to work with it, manipulating and exploring it according to your needs,
then the very exercise of analyzing that data to get some answers or to fulfill a need of
society is Data Science.
Data Science is relevant today because we have enormous amounts of data available on almost
any subject. We used to worry about the lack of data. Now we have tons of data. In
the past, we didn’t have well-defined algorithms; now we have algorithms. In the past,
software was too expensive for most people, so only industries
with big budgets could use it; now it is open source and freely available. In the past, we
didn’t even think about storing large amounts of data, because storage was also
very costly; now it is available for a fraction of the cost, and we can keep huge numbers of
data sets for very little. Internet connectivity, too, was uncommon and costly. So, the
tools to work with data, the variety of data, the ability to store and analyze data and, last
and most important, connectivity: it’s all cheap, it’s all available, it’s all ubiquitous, it’s here.
There’s never been a better time to be a data scientist than now.

We can say that Data Scientists are Business Analysts and Data Analysts, with a
difference. Though the initial training and basic requirements are similar for all these
disciplines, Data Scientists also require:
• Strong business acumen
• Strong communication skills
• The ability to explore Big Data
Whether an agricultural scientist wants to know the percentage increase in the yield of
wheat this year compared to last year’s (and the reasons behind it), a
financial company wants to classify its customers by their creditworthiness (before
granting loans), or a retail organization wants to reward its loyal
customers with extra points, all need data scientists to process large volumes of both
structured and unstructured data in order to make crucial business decisions.
In today’s dynamic and vast world, the main challenge that Data Scientists face is
not only to find solutions to existing business problems but, above all, to identify the
problems that are most relevant and crucial to the organization and its success.

The term “Data Scientist” came into existence in recognition of the fact that a Data
Scientist draws a huge amount of information from scientific fields and applications,
whether statistics, mathematics, or computer science. Data Scientists make use of
the latest technologies and tools to find solutions and reach conclusions that
are important for an organization’s growth and development. They present the
data in a much more useful form than the raw data available to them from
structured as well as unstructured sources.

As in any other scientific discipline, data scientists always need to ask and
answer the What, How, Who, and Why of the data available to them. They are required to
make a clearly defined plan and work towards achieving results within limited time,
effort, and money.

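import pickle

# Load the pickled object; only unpickle files that come from a trusted source.
with open("majors_df.pkl", "rb") as f:
    majors_df = pickle.load(f)

# Look up the "data scientist" entry (a column, if majors_df is a pandas DataFrame).
majors_df["data scientist"]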


• Fraud and risk detection: Over the years, financial organizations have learned to
analyze the probabilities of risk and default through customer profiling, past
expenditures, and other variables available through data.
• Healthcare: Data science makes it possible to manage and analyze very large, diverse
datasets in healthcare systems, drug development, medical image analysis, and more.
Recently, Data Science approaches were brought in to combat the COVID-19
pandemic. Data Scientists helped with digital contact tracing, diagnosis, risk assessment,
resource allocation, estimating epidemiological parameters, drug development, social
media analytics, etc.
• Internet search: All search engines, including Google, use data science algorithms to
deliver the best result for searched queries within seconds.
• Targeted advertising: Digital ads have a higher click-through rate (CTR) than
traditional ads because targeted advertising is based on a user’s past behavior, with the
help of data science algorithms.
• Recommendation systems: Internet giants as well as other businesses have eagerly
adopted recommendation engines to promote their products based on users’
previous search results and their interests.
• Advanced image, speech, or character recognition: Facial recognition algorithms on
Facebook, speech recognition products such as Siri, Cortana, and Alexa, and Google
Lens are all examples of data science applications in image, speech, and
character recognition.
• Gaming: Today, games use machine learning algorithms to improve or upgrade
themselves as players move up to higher levels. In motion gaming, the opponent
(computer) is able to analyze a player’s previous moves and shape its
game accordingly. This is all possible because of data science.
• Augmented reality (AR): Augmented reality promises an exciting future through Data
Science. A VR headset, for example, combines algorithms, data, and computing
knowledge to offer the best viewing experience.

• Supervised learning
It is used for labeled datasets. It analyzes the training data and generates a function that
can then be applied to other datasets.
• Unsupervised learning
It is used for raw, unlabeled datasets. Its main task is to turn raw data into structured data
by finding patterns in it. In today’s
world, there is a huge amount of raw data in every field. Even computers generate log
files, which are a form of raw data. Therefore it is a most important part of machine
learning.
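To make the contrast concrete, here is a minimal sketch in plain Python. The data values, the label names, and the helper functions `fit_centroids`, `predict`, and `two_means` are all invented for illustration. The supervised part learns from labels; the unsupervised part sees the same numbers with no labels and has to find the two groups itself.

```python
from statistics import mean

# Toy 1-D data (made-up values, for illustration only).
points = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]
labels = ["low", "low", "low", "high", "high", "high"]  # used only by the supervised part


def fit_centroids(xs, ys):
    """Supervised: learn one centroid (average value) per known label."""
    return {lab: mean(x for x, y in zip(xs, ys) if y == lab) for lab in set(ys)}


def predict(centroids, x):
    """Assign x to the label whose centroid is closest."""
    return min(centroids, key=lambda lab: abs(centroids[lab] - x))


def two_means(xs, iterations=10):
    """Unsupervised: split xs into two groups without using any labels.

    Assumes xs contains at least two distinct values.
    """
    c0, c1 = min(xs), max(xs)  # simple initial guesses for the two group centres
    for _ in range(iterations):
        g0 = [x for x in xs if abs(x - c0) <= abs(x - c1)]
        g1 = [x for x in xs if abs(x - c0) > abs(x - c1)]
        c0, c1 = mean(g0), mean(g1)
    return sorted([c0, c1])


centroids = fit_centroids(points, labels)
print(predict(centroids, 4.5))  # label for a new, unseen value
print(two_means(points))        # two group centres found without labels
```

The supervised function can name its answer ("high" or "low") because it was given names to learn from. The unsupervised one can only report that two groups exist and where they are centred.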

1. Linear Regression
It is among the best-known and most popular algorithms in machine learning and statistics.
This model assumes a linear relationship between the input and the output variable. It is
represented in the form of a linear equation that has a set of inputs and a predicted output.
Fitting the model means estimating the values of the coefficients used in this representation.

The linear regression model represents the relationship between the input variables (x) and
the output variable (y) of a dataset in terms of a line given by the equation,

y = b0 + b1x

• y is the dependent variable whose value we want to predict.
• x is the independent variable whose values are used for predicting the dependent
variable.
• b0 and b1 are constants, where b0 is the y-intercept and b1 is the slope.
The main aim of this method is to find the values of b0 and b1 that give the best-fit line,
that is, the line that lies nearest to most of the data points.
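As a rough sketch of how b0 and b1 can be estimated, the snippet below uses ordinary least squares. That is one common definition of "best fit", assumed here; the text above does not name a specific criterion. The function name `fit_line` and the sample data are made up for illustration.

```python
def fit_line(xs, ys):
    """Estimate b0 (intercept) and b1 (slope) of y = b0 + b1*x by ordinary least squares."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b1 = sxy / sxx
    # Intercept: the fitted line passes through the point (x_mean, y_mean).
    b0 = y_mean - b1 * x_mean
    return b0, b1


# Sample data lying exactly on y = 1 + 2x, so the fit should recover b0 = 1 and b1 = 2.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]
b0, b1 = fit_line(xs, ys)
print(b0, b1)  # → 1.0 2.0
```

With noisy real data the points will not lie exactly on a line, and the same formulas return the line that minimizes the sum of squared vertical distances to the points.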

2. Logistic Regression
Linear Regression is used for representing the relationship between
continuous values. In contrast, Logistic Regression works on discrete values.
Logistic regression finds its most common application in solving binary classification
problems, that is, when there are only two possibilities for an event: either the event will
occur or it will not occur (0 or 1).
Thus, in Logistic Regression, we convert the predicted values into values that lie in the
range of 0 to 1 by using a non-linear transform function called the logistic function.
The logistic function produces an S-shaped curve and is therefore also called the sigmoid
function, given by the equation,
σ(x) = 1 / (1 + e^(-x))
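The logistic (sigmoid) function is short enough to write out directly. The sketch below is illustrative only: the 0.5 decision threshold and the function names are assumptions, not something the text prescribes.

```python
import math


def sigmoid(x):
    """Logistic function: maps any real number into the range (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use an equivalent form that avoids overflow in math.exp.
    e = math.exp(x)
    return e / (1.0 + e)


def predict_class(x, threshold=0.5):
    """Turn the sigmoid output into a binary label (1 = event occurs, 0 = it does not)."""
    return 1 if sigmoid(x) >= threshold else 0


print(sigmoid(0))         # → 0.5
print(predict_class(3))   # → 1
print(predict_class(-3))  # → 0
```

In a full logistic regression model, x would itself be a linear combination of the input features, such as b0 + b1x from the previous section, with coefficients learned from training data.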

3. Decision Trees

Decision trees help in solving both classification and prediction problems. They make the
data easy to understand, which supports more accurate predictions. Each node of the decision
tree represents a feature or an attribute, each link represents a decision, and each leaf node
holds a class label, that is, the outcome.
The drawback of decision trees is that they suffer from the problem of overfitting.

The following Data Science algorithms are most commonly used for implementing
decision trees.

• CART (Classification and Regression Trees) algorithm
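The node, link, and leaf structure described above can be sketched with a small hand-built tree. This example does not learn the tree from data, which is what an algorithm such as CART would do. The features, thresholds, and labels are invented, loosely following the creditworthiness example mentioned earlier.

```python
# A hand-built decision tree. Every feature name, threshold, and label is hypothetical.
# Internal nodes test a feature; the "left"/"right" links are the decisions;
# leaf nodes hold the class label.
tree = {
    "feature": "income",
    "threshold": 50000,
    "left": {"label": "reject"},   # income <= 50000
    "right": {                     # income > 50000
        "feature": "existing_debt",
        "threshold": 10000,
        "left": {"label": "approve"},  # existing_debt <= 10000
        "right": {"label": "reject"},  # existing_debt > 10000
    },
}


def classify(node, sample):
    """Follow decisions from the root until a leaf is reached, then return its label."""
    while "label" not in node:
        go_left = sample[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["label"]


print(classify(tree, {"income": 72000, "existing_debt": 4000}))  # → approve
```

A learned tree has the same shape. The difference is that the algorithm chooses which feature and threshold to test at each node, and letting it grow too deep is what leads to the overfitting mentioned above.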

4. Naive Bayes
The Naive Bayes algorithm helps in building predictive models. We use this Data Science
algorithm when we want to calculate the probability of the occurrence of an event,
given prior knowledge that another event has already occurred.

The Naive Bayes algorithm works on the assumption that each feature is independent and
makes an individual contribution to the final prediction.
Bayes’ theorem, on which Naive Bayes is based, is represented by:

P(A|B) = P(B|A) P(A) / P(B)

Where A and B are two events.

• P(A|B) is the posterior probability, i.e., the probability of A given that B has
already occurred.
• P(B|A) is the likelihood, i.e., the probability of B given that A has already
occurred.
• P(A) is the class prior probability.
• P(B) is the predictor prior probability.
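A small worked example of the formula may help. The events and all the probabilities below are made up for illustration: A is "an email is spam" and B is "the email contains the word 'offer'".

```python
def posterior(p_b_given_a, p_a, p_b):
    """Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b


# Hypothetical numbers, chosen only to show the arithmetic.
p_a = 0.2               # P(A): prior probability that an email is spam
p_b_given_a = 0.6       # P(B|A): probability a spam email contains "offer"
p_b_given_not_a = 0.1   # probability a non-spam email contains "offer"

# P(B) by the law of total probability: spam and non-spam cases combined.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

print(round(posterior(p_b_given_a, p_a, p_b), 3))  # → 0.6
```

Seeing the word raises the probability of spam from the prior of 0.2 to a posterior of 0.6. A Naive Bayes classifier repeats this update for several features at once, multiplying their likelihoods under the independence assumption described above.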


Data science has become an essential part of any industry today. It is a method for
transforming business data into assets that help organizations improve revenue, reduce costs,
seize business opportunities, improve customer experience, and more. Data science is one of
the most debated topics in industry these days. Its popularity has grown over the years,
and companies have started implementing data science techniques to grow their business and
increase customer satisfaction. Data science is the domain of study that deals with vast
volumes of data using modern tools and techniques to find unseen patterns, derive
meaningful information, and make business decisions.

Advantages of Data Science :- In today’s world, data is being generated at an alarming
rate. Every second, lots of data is generated, be it by the users of Facebook or any other
social networking site, by the calls that one makes, or
by different organizations. Because of this huge amount of data, the field
of Data Science offers a number of advantages. Some of them are mentioned below :-
• Multiple Job Options :- Being in demand, it has given rise to a large number of
career opportunities in its various fields. Some of them are Data Scientist, Data
Analyst, Research Analyst, Business Analyst, Analytics Manager, and Big Data Engineer.
• Business benefits :- Data Science helps organizations know how and when
their products sell best, so that products are always delivered to the
right place at the right time. The organization makes faster and better decisions
to improve efficiency and earn higher profits.
• Highly Paid jobs & career opportunities :- Data Scientist continues to be called the
sexiest job, and the salaries for this position are also high. According to a Dice Salary
Survey, the average salary of a Data Scientist is $106,000 per year.
• Hiring benefits :- It has made it comparatively easier to sort data and look for the best
candidates for an organization. Big Data and data mining have made processing
and selection of CVs, aptitude tests, and games easier for recruitment teams.

Disadvantages of Data Science :- Everything that comes with a number of benefits
also has some drawbacks. So let’s have a look at some of the disadvantages of Data
Science :-
• Data Privacy :- Data is the core component that can increase the productivity and the
revenue of an industry by enabling game-changing business decisions. But the
information or insights obtained from the data, whether structured or unstructured, can
be misused against an organization, a committee, or a group of people, such as the
people of a country.
• Cost :- The tools used for data science and analytics can cost an
organization a lot, as some of the tools are complex and require people to
undergo training in order to use them. Also, it is very difficult to select the
right tools for the circumstances, because their selection depends on
proper knowledge of the tools as well as their accuracy in analyzing the
data and extracting information.

Top 9 tools used in Data Science :- Data scientists need
a clear understanding of the tools that are necessary for their work. Here we
provide a little insight into tools that can be used for data visualization,
statistical programming, algorithms, and databases. These tools will help speed up
your process, as you do not have to search anywhere else for what you need.
1. DataRobot :- It is a global automated Machine Learning platform with
capabilities in Data Science, Machine Learning, Statistical Modeling, Artificial
Intelligence, Augmented Analytics, Machine Learning Operations (MLOps), and Time
Series Modeling.
2. MLbase :- One of the best Data Science tools, it provides distributed and
statistical techniques that are key to transforming big data into actionable knowledge.
It provides functionality to end-users for a wide variety of standard machine learning
tasks such as classification, regression, collaborative filtering, and more general
exploratory data analysis techniques.
3. Apache Giraph :- Apache Giraph is an iterative
graph processing system built for high scalability. It was
derived from the Pregel model but comes with more features and
functionality than Pregel. This open-source system helps
data scientists utilize the underlying potential of structured datasets at a large scale.
4. Apache Spark :- This is another free tool that offers very fast cluster
computing. Today, a number of organizations are using
Spark for processing large datasets. This data science tool is capable of accessing
diverse data sources, which include HDFS, HBase, S3, and Cassandra.
5. Cascading :- It is specifically for data scientists who are building big data apps on
Apache Hadoop. It allows users to solve both complex and simple data problems
because it offers computation engines, data processing,
scheduling capabilities, and a systems integration framework.
6. TensorFlow :- This is an ML tool that is widely used for advanced Machine
Learning algorithms like Deep Learning. It is an open-source and ever-evolving
toolkit known for its performance and high computational abilities.
7. SAP HANA :- It is an effective tool from SAP that comes with the SAP HANA Predictive
Analysis Library (PAL).
8. MongoDB :- This is another data analysis tool that is quite popular since it is a
cross-platform, document-oriented database. It has a basic query and aggregation framework,
but to do more advanced analytics. It is a perfect choice to iterate ML training


1. Healthcare

Healthcare companies are using data science to build sophisticated medical instruments to
detect and cure diseases.

2. Gaming

Video and computer games are now being created with the help of data science and that has
taken the gaming experience to the next level.

3. Image Recognition

Identifying patterns in images and detecting objects in an image is one of the most popular
data science applications.

4. Recommendation Systems

Netflix and Amazon give movie and product recommendations based on what you like to
watch, purchase, or browse on their platforms.

5. Logistics

Data Science is used by logistics companies to optimize routes to ensure faster delivery of
products and increase operational efficiency.

6. Fraud Detection

Banking and financial institutions use data science and related algorithms to detect fraudulent
activities.

7. Internet Search

When we think of search, we immediately think of Google. Right? However, there are other
search engines, such as Yahoo, DuckDuckGo, Bing, AOL, Ask, and others, that employ data
science algorithms to offer the best results for our searched query in a matter of seconds.
Given that Google handles more than 20 petabytes of data per day, Google would not be the
'Google' we know today if data science did not exist.

8. Speech recognition

Speech recognition is dominated by data science techniques. We can see the excellent work
of these algorithms in our daily lives. Have you ever needed the help of a virtual speech
assistant like Google Assistant, Alexa, or Siri? Well, its speech recognition technology is
operating behind the scenes, attempting to interpret and evaluate your words and deliver
useful results. Image recognition can likewise be seen on social media platforms
such as Facebook, Instagram, and Twitter. When you upload a picture of yourself with
someone on your list, these applications will recognise them and tag them.

9. Targeted Advertising

If you thought Search was the most essential data science use, consider this: the whole digital
marketing spectrum. From display banners on various websites to digital billboards at
airports, data science algorithms are utilised to identify almost anything. This is why digital
advertisements have a far higher CTR (Click-Through Rate) than traditional marketing. They
can be customised based on a user's prior behaviour. That is why you may see adverts for
Data Science Training Programs while another person sees an advertisement for clothes in
the same region at the same time.

10. Airline Route Planning

Data science makes it easier to predict flight delays in the airline industry, which is
helping the industry grow. It also helps to determine whether to fly directly to the
destination or to make a stop in between; for example, a flight from Delhi to the United
States of America can either fly direct or stop in between and then arrive at the destination.


We hear a lot about the trends and applications of Data Science these days. Now, many
questions would have already popped up in your mind such as: ‘Why do we need Data
Science?’, ‘What is Data Science?’, ‘Is Data Science a good career?’, and ‘What are the
trends in Data Science in 2021?’

Let me answer these questions with a practical example. A few days back, I got a job offer
through LinkedIn. The company’s HR representative sent me an email about my
interview time. I overlooked the mail because I was quite busy. On the next day, a
notification popped up on my mobile phone saying, ‘You have an interview scheduled at 1.30
pm today with Infosys,’ along with the location of the company. Then, I opened Google
Maps on my phone to search for the location of the company. It showed me the best route
with less traffic to reach the destination. Isn’t it great that technology can
remind us of important day-to-day tasks?

Another wonderful real-life example of Data Science is Tesla’s self-driving cars. These cars
collect real-time data from the surroundings with the help of cameras, IoT sensors, and
ultrasonic sensors. After collecting the real-time data, they process the data, visualize it, and
use software algorithms to determine the most suitable actions to take while navigating
safely. In the coming years, these self-driving cars would revolutionize
the automobile industry.

From this, you can infer how fast human beings are moving toward automation in various
fields. This is only possible due to the evolution of Data Science. So, Data Science growth
depends on the amount of data and the creativity of Data Science enthusiasts.
Further in this blog, we will see more advancements in the applications of Data Science.
According to an IBM report, Data Science jobs are likely to grow by 30 percent. The
estimated number of job listings for Data Science in 2021 is 2,720,000. Also, according to the
US Bureau of Labor Statistics, about 11 million jobs will be created by 2026.

Every organization wants to maximize its profits. As data is the key factor in Data Science,
every industry has realized that it requires Data Scientists to play with data to optimize its
business profits. This is the reason for the popularity of Data Science jobs.

Also, in this digital world, there are diverse classes of businesses. Organizations managing
these diverse classes deal with zettabytes (a billion terabytes) of data. In the upcoming years,
this data will continue to grow enormously day by day, which will, in turn,
increase the demand for skilled individuals in Data Science in 2021.

All this ensures that the best jobs in today’s world are Data Scientist, Data Analyst, Data
Engineer, and Business Analyst.

In addition, data is opening up a lot of routes for Data Scientists in all major government and
private sectors across the world. So, now, we will look at the areas where jobs would be
created in the field of Data Science in 2021.

Data Science Job Trends in 2021

Data Science is crucial in healthcare to keep track of the patients’ health and help doctors to
understand the patterns of disease and to prevent it. The healthcare industry needs Data
Engineers who can assist in making automated systems for the analysis of complex data in
clinical applications. Improved patient care, faster and more accurate diagnoses, preventative
measures, more individualized treatment, and more informed decision-making are all possible
thanks to the application of data science in the health domain. Due to its high importance,
Data Science jobs are going to increase enormously in the future. By 2021, it is expected to
create 20,000 jobs in the field of healthcare.

Aviation and Airlines:

In the aviation and airlines industry, companies use data for setting their prices,
optimizing routes, and carrying out preemptive maintenance. Data Scientists are needed to
collect and analyze airline data such as route distance and altitudes, aircraft type and
weight, weather, etc. With a better grasp of how passengers behave, gained through Data
Science, it will be easier to enhance the services provided to them. This industry will create
more than 3,000 new jobs for Data Science in 2021.

Due to an increase in online transactions and Internet usage, fraudulent activities have also
increased. Organizations are adopting Data Science techniques to detect such fraudulent
activities and to prevent losses. Data Science provides a scientific method for detecting
hostile attacks on digital infrastructure. It also incorporates machine learning technologies
to understand the
patterns of data and to create effective algorithms to protect the data. Data Scientists help to
manage large amounts of data and derive the best solutions. This will boost the demand for
Data Scientists by opening up more than 5,000 jobs in 2021.



Data science education is well into its formative stages of development; it is
evolving into a self-supporting discipline and producing professionals with
distinct and complementary skills relative to professionals in the computer,
information, and statistical sciences. However, regardless of its potential
eventual disciplinary status, the evidence points to robust growth of data science
education that will indelibly shape the undergraduate students of the future. In
fact, fueled by growing student interest and industry demand, data science
education will likely become a staple of the undergraduate experience. There
will be an increase in the number of students majoring, minoring, earning
certificates, or just taking courses in data science as the value of data skills
becomes even more widely recognized. The adoption of a general education
requirement in data science for all undergraduates will endow future generations
of students with the basic understanding of data science that they need to
become responsible citizens. Continuing education programs such as data
science boot camps, career accelerators, summer schools, and incubators will
provide another stream of talent. This constitutes the emerging watershed of
data science education that feeds multiple streams of generalists and specialists
in society; citizens are empowered by their basic skills to examine, interpret, and
draw value from data.

Today, the nation is in the formative phase of data science education, where
educational organizations are pioneering their own programs, each with
different approaches to depth, breadth, and curricular emphasis (e.g., business,
computer science, engineering, information science, mathematics).
It is too early to expect consensus to emerge on certain best practices of data
science education. However, it is not too early to envision the possible forms
that such practices might take. Nor is it too early to make recommendations that
can help the data science education community develop strategic vision and
practices. The following is a summary of the findings and recommendations
discussed in the preceding four chapters of this report.

Finding 2.1: Data scientists today draw largely from extensions of the “analyst”
of years past trained in traditional disciplines. As data science becomes an
integral part of many industries and enriches research and development, there
will be an increased demand for more holistic and more nuanced data science

Finding 2.2: Data science programs that strive to meet the needs of their students
will likely evolve to emphasize certain skills and capabilities. This will result in
programs that prepare different types of data scientists.
National Academies of Sciences, Engineering, and Medicine. 2018. Data
Science for Undergraduates: Opportunities and Options. Washington, DC: The
National Academies Press.

