Case Study Data Science
In this use case, we will predict the occurrence of diabetes using the entire lifecycle we
discussed earlier. Let’s go through the various steps.
Step 1:
● First, we will collect the data based on the medical history of the patient as discussed
in Phase 1. You can refer to the sample data below.

Attributes:
Step 2:
● Once we have the data, we need to clean and prepare it for analysis.
● The data has many inconsistencies, such as missing values, blank columns, implausible
values and incorrect data formats, which need to be cleaned.
● Here, we have organized the data into a single table under different attributes –
making it look more structured.
● Let’s have a look at the sample data below.
The sample shows several of these inconsistencies:
1. In the column npreg, “one” is written in words, whereas it should be in numeric
form, i.e. 1.
2. In the column bp, one of the values is 6600, which is impossible (at least for humans)
because blood pressure cannot reach such a value.
3. The Income column is blank and is also not relevant to predicting diabetes.
It is therefore redundant and should be removed from the
table.
● So, we will clean and preprocess this data by removing the outliers, filling in the null
values and normalizing the data types. If you remember, this is our second phase,
data preprocessing.
● Finally, we get the clean data as shown below which can be used for analysis.
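The three fixes described in this step can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the rows are made-up values (not the sample data from this case study), the names WORD_TO_NUMBER, BP_MAX, to_number and clean are hypothetical, the ceiling of 300 for bp is an assumed plausibility limit, and filling gaps with the column median is one possible choice among several.

```python
from statistics import median

# Hypothetical raw rows; the values are illustrative only.
raw_rows = [
    {"npreg": "6", "bp": "72", "bmi": "33.6", "age": "50", "Income": ""},
    {"npreg": "one", "bp": "66", "bmi": "26.6", "age": "31", "Income": ""},
    {"npreg": "8", "bp": "6600", "bmi": "23.3", "age": "32", "Income": ""},
    {"npreg": "1", "bp": "", "bmi": "28.1", "age": "21", "Income": ""},
]

WORD_TO_NUMBER = {"zero": 0, "one": 1, "two": 2, "three": 3}  # extend as needed
BP_MAX = 300  # assumed plausibility ceiling for a blood pressure reading


def to_number(text):
    """Convert a raw cell to a float, or None if it is blank or unparseable."""
    text = text.strip().lower()
    if text in WORD_TO_NUMBER:
        return float(WORD_TO_NUMBER[text])
    try:
        return float(text)
    except ValueError:
        return None


def clean(rows):
    # 1. Drop the Income column and normalize every remaining cell to a number.
    cleaned = [
        {key: to_number(value) for key, value in row.items() if key != "Income"}
        for row in rows
    ]
    # 2. Treat impossible blood pressure readings as missing.
    for row in cleaned:
        if row["bp"] is not None and not 0 < row["bp"] <= BP_MAX:
            row["bp"] = None
    # 3. Fill missing values with the column median (assumes each column
    #    has at least one known value).
    for column in cleaned[0]:
        known = [row[column] for row in cleaned if row[column] is not None]
        fill = median(known)
        for row in cleaned:
            if row[column] is None:
                row[column] = fill
    return cleaned


cleaned_rows = clean(raw_rows)
for row in cleaned_rows:
    print(row)
```

Here the outlier is converted to a missing value first and then filled like any other gap. Dropping the whole row would be an equally valid choice when enough data is available.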
Step 3:
● First, we will load the data into the analytical sandbox and apply various statistical
functions to it. For example, in R the describe function (available in add-on packages
such as Hmisc) reports the number of missing values and unique values. We can also use
the built-in summary function, which gives statistics such as the mean, median, minimum
and maximum values (and hence the range).
● Then, we use visualization techniques such as histograms, line graphs and box plots to get a
fair idea of the distribution of the data.
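The same kind of exploration can be sketched in Python as well. The example below uses only the standard library and stands in for what describe and summary report in R. The column values are made up for illustration, and the helper names summarize and histogram are hypothetical.

```python
from statistics import mean, median

# Hypothetical column values, for illustration only.
columns = {
    "npreg": [6, 1, 8, 1, 0, 5],
    "bmi": [33.6, 26.6, 23.3, 28.1, None, 25.6],
}


def summarize(values):
    """Return counts of missing and unique values plus basic statistics."""
    present = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(present),
        "unique": len(set(present)),
        "mean": mean(present),
        "median": median(present),
        "min": min(present),
        "max": max(present),
        "range": max(present) - min(present),
    }


def histogram(values, bin_width):
    """Count values per bin; each key is the lower edge of its bin."""
    counts = {}
    for v in values:
        if v is None:
            continue
        lower = (v // bin_width) * bin_width
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))


for name, values in columns.items():
    print(name, summarize(values))

# A crude text histogram of bmi in bins of width 5.
for lower, count in histogram(columns["bmi"], 5).items():
    print(f"{lower:>5} | {'#' * count}")
```

In a real project a plotting library would normally draw the histograms and box plots. The text version here only shows the idea of binning values to see their distribution.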
Step 4:
Now, based on the insights derived from the previous step, a decision tree is a good fit for
this kind of problem. Let’s see why.
● Since we already have the major attributes for analysis, such as npreg and bmi, we
will use a supervised learning technique to build the model.
● We chose a decision tree in particular because it can take all attributes into
consideration together, both those that have a linear relationship and
those that have a non-linear relationship. In our case, npreg and age have a linear
relationship, whereas npreg and ped have a non-linear one.
● Decision tree models are also very robust, as we can use different combinations of
attributes to build several trees and then implement the one that
performs best.
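To make the idea of a decision tree more concrete, here is a minimal from-scratch sketch. It is not the implementation used in this case study: it assumes Gini impurity as the splitting criterion (one common choice), the training rows are made-up values, and the names gini, best_split, build_tree and predict are hypothetical. In practice you would normally rely on an established library instead.

```python
from collections import Counter

# Hypothetical training rows [npreg, bmi, age]; label 1 means diabetic.
# The values are made up for illustration only.
FEATURES = ["npreg", "bmi", "age"]
X = [
    [1, 22.0, 25], [2, 24.5, 30], [0, 34.0, 28], [3, 33.5, 45],
    [5, 35.0, 52], [4, 27.0, 60], [6, 38.0, 41], [2, 29.0, 35],
]
y = [0, 0, 0, 1, 1, 0, 1, 0]


def gini(labels):
    """Gini impurity of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def best_split(X, y):
    """Return (feature index, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, gini(y)
    for f in range(len(X[0])):
        values = sorted({row[f] for row in X})
        # Candidate thresholds are midpoints between neighbouring values.
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            left = [label for row, label in zip(X, y) if row[f] <= threshold]
            right = [label for row, label in zip(X, y) if row[f] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if score < best_score:
                best, best_score = (f, threshold), score
    return best


def build_tree(X, y, depth=0, max_depth=3):
    """Grow a tree recursively; leaves hold the majority class."""
    split = best_split(X, y) if depth < max_depth else None
    if split is None:
        return Counter(y).most_common(1)[0][0]
    f, threshold = split
    left = [(row, label) for row, label in zip(X, y) if row[f] <= threshold]
    right = [(row, label) for row, label in zip(X, y) if row[f] > threshold]
    return {
        "feature": f,
        "threshold": threshold,
        "left": build_tree([r for r, _ in left], [l for _, l in left],
                           depth + 1, max_depth),
        "right": build_tree([r for r, _ in right], [l for _, l in right],
                            depth + 1, max_depth),
    }


def predict(tree, row):
    """Walk from the root to a leaf and return the class stored there."""
    while isinstance(tree, dict):
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree


tree = build_tree(X, y)
print(tree)
print(predict(tree, [7, 36.0, 48]))
```

Each split tests one attribute against a threshold, and different branches can test different attributes. This is how a tree combines several attributes without assuming a linear relationship between them.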
If you want to learn more about the implementation of the decision tree, refer to the blog “How
To Create A Perfect Decision Tree”.
Step 5:
In this phase, we will run a small pilot project to check if our results are appropriate. We will
also look for performance constraints if any. If the results are inaccurate, we need to replan
and rebuild the model.
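One simple way to judge whether the pilot results are appropriate is to compare the model's predictions with known outcomes on held-out records. The sketch below is an assumption-laden illustration: the pilot rows are made up, stand_in_model is a placeholder rule rather than the trained model, and the 0.8 accuracy target is an arbitrary threshold chosen for the example.

```python
# Hypothetical held-out pilot rows [npreg, bmi, age] with known outcomes.
pilot_rows = [
    [2, 23.0, 27], [7, 36.5, 49], [4, 31.0, 55], [1, 34.0, 24], [5, 37.0, 44],
]
pilot_labels = [0, 1, 1, 0, 1]

ACCURACY_TARGET = 0.8  # assumed acceptance threshold for the pilot


def stand_in_model(row):
    """Placeholder for the trained model: predicts diabetes when bmi > 32."""
    return 1 if row[1] > 32 else 0


def evaluate(predict, rows, labels):
    """Return accuracy plus confusion-matrix counts for a binary classifier."""
    counts = {"tp": 0, "tn": 0, "fp": 0, "fn": 0}
    for row, actual in zip(rows, labels):
        predicted = predict(row)
        if predicted == actual:
            counts["tp" if actual == 1 else "tn"] += 1
        else:
            counts["fp" if predicted == 1 else "fn"] += 1
    accuracy = (counts["tp"] + counts["tn"]) / len(labels)
    return {"accuracy": accuracy, **counts}


report = evaluate(stand_in_model, pilot_rows, pilot_labels)
needs_rework = report["accuracy"] < ACCURACY_TARGET
print(report)
print("Replan and rebuild the model" if needs_rework else "Ready for deployment")
```

The false positive and false negative counts are worth reading alongside accuracy, since a missed diabetes case and a false alarm usually carry different costs.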
Step 6:
Once we have executed the project successfully, we will share the output for full deployment.
Being a Data Scientist is easier said than done. So, let’s see what you need to be a Data
Scientist. A Data Scientist requires skills from three major areas, as shown below.
As you can see in the above image, you need to acquire various hard skills and soft skills.
You need to be good at statistics and mathematics to analyze and visualize data. Needless
to say, Machine Learning forms the heart of Data Science and requires you to be good at it.
Also, you need to have a solid understanding of the domain you are working in to
understand the business problems clearly. Your task does not end here. You should be
capable of implementing various algorithms, which requires good coding skills. Finally, once
you have made certain key decisions, it is important for you to communicate them to the
stakeholders. So, good communication will definitely add brownie points to your skills.
I urge you to see this Data Science video tutorial that explains what Data Science is and all
that we have discussed in the blog. Go ahead, enjoy the video and tell me what you think.
What Is Data Science? Data Science Course – Data Science Tutorial For Beginners
This Data Science course video will take you through the need for data science, what
data science is, data science use cases for business, BI vs data science, data analytics
tools, and the data science lifecycle, along with a demo.
In the end, it won’t be wrong to say that the future belongs to Data Scientists. It is predicted
that by the end of the year 2018, there will be a need for around one million Data Scientists.
More and more data will provide opportunities to drive key business decisions. It will soon
change how we look at the world deluged with data around us. Therefore, a Data Scientist
should be highly skilled and motivated to solve the most complex problems. By incorporating
data science methods into its operations in the coming years, a business can predict its
growth, anticipate potential problems, and develop data-driven strategies to
achieve success. This is the best opportunity to kick off your career in the field of data
science by taking the Data Science Masters Program.