Practice Questions
QUESTIONS
Question 1.
Calculate the solid waste generated in the years 1995, 1996, and 2000.
The table is:
Year    Tons of Solid Waste Generated (in thousands)
1990 19,358
1991 19,484
1992 20,293
1993 21,499
1994 23,561
ANSWER: -
1995: 23965.3
1996: 25007.4
2000: 29175.8
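These values follow from a straight-line least-squares fit to the 1990-1994 data, extrapolated forward. A minimal sketch with NumPy; the 0-4 year coding and the variable names are my own:
import numpy as np

# Year codes 0-4 stand for 1990-1994; waste is in thousands of tons
years = np.array([0, 1, 2, 3, 4])
waste = np.array([19358, 19484, 20293, 21499, 23561])

# Fit a first-degree (straight-line) model: waste = slope * year + intercept
slope, intercept = np.polyfit(years, waste, 1)

# Extrapolate to 1995, 1996 and 2000 (year codes 5, 6 and 10)
for year, code in [(1995, 5), (1996, 6), (2000, 10)]:
    print(year, round(slope * code + intercept, 1))  # 23965.3, 25007.4, 29175.8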
Question 2.
The numbers of insured commercial banks y (in thousands) in the United States for the years 1987 to 1996 are shown in a table. Estimate the values for 2005 and 2010.
ANSWER: -
2005: 5.42018182
2006: 4.96145455
2007: 4.50272727
2008: 4.044
2009: 3.58527273
2010: 3.12654545
Question 3.
Calculate the average acreage per farm for the years 2000 and 2002.
The table is:
Year    Average Acreage per Farm
1910 139
1920 149
1930 157
1940 175
1950 216
1959 303
1969 390
1978 449
1987 462
1997 487
ANSWER: -
2000: 509.72564103
2002: 519.16153846
Question 4.
Calculate the height when the time is 1.000 seconds.
The table is:
Time (sec) Height (m)
0.0000 1.03754
0.1080 1.40205
0.2150 1.63806
0.3225 1.77412
0.4300 1.80392
0.5375 1.71522
0.6450 1.50942
0.7525 1.21410
0.8600 0.83173
ANSWER: -
1.000: 1.28564909
Question 5.
Calculate the stopping distance when the speed is 60 mph and 100 mph.
The table is:
Speed (mph)    Stopping Distance (ft)
10 15.1
20 39.9
30 75.2
40 120.5
50 175.9
ANSWER: -
60: 205.98
100: 366.86
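The same straight-line least-squares fit reproduces these figures (and, applied to the tables above, the answers to Questions 3 and 4 as well). A sketch using scikit-learn's LinearRegression, with variable names of my own choosing:
import numpy as np
from sklearn.linear_model import LinearRegression

# Speeds must be shaped (n_samples, n_features) for scikit-learn
speed = np.array([[10], [20], [30], [40], [50]])          # mph
distance = np.array([15.1, 39.9, 75.2, 120.5, 175.9])     # ft

# Fit the straight line and extrapolate to 60 mph and 100 mph
model = LinearRegression().fit(speed, distance)
print(model.predict([[60], [100]]))  # approximately [205.98 366.86]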
33,21 = 0.5708976
141,51 = 184.739
96.08,9.7,3.101 = 17.84577
Q2. Crimes
7,23,5,2.19 = 81.22555333
Ans:- output
Independent variables: area, age
Dependent variable: price
Ans:- output
model.predict(x_test): array([0], dtype=int64)
Ans:- output
model.score(x_test,y_test): 1.0
Ans:- output
model.predict(x_test): array(['F'], dtype=object)
RANDOM FOREST
The problem we will tackle is predicting the max temperature for tomorrow in our city using one
year of past weather data. I am using Seattle, WA but feel free to find data for your own city using
the NOAA Climate Data Online tool. We are going to act as if we don’t have access to any
weather forecasts (and besides, it’s more fun to make our own predictions rather than rely on
others). What we do have access to is one year of historical max temperatures, the temperatures
for the previous two days, and an estimate from a friend who is always claiming to know
everything about the weather. This is a supervised, regression machine learning problem. It’s
supervised because we have both the features (data for the city) and the targets (temperature) that
we want to predict. During training, we give the random forest both the features and targets and it
must learn how to map the data to a prediction. Moreover, this is a regression task because the
target value is continuous (as opposed to discrete classes in classification). That’s pretty much all
the background we need, so let’s start!
Roadmap
Before we jump right into programming, we should lay out a brief guide to keep us on track. The
following steps form the basis for any machine learning workflow once we have a problem and
model in mind:
1. State the question and determine the required data
2. Acquire the data in an accessible format
3. Identify and correct missing data points or anomalies as required
4. Prepare the data for the machine learning model
5. Establish a baseline model that we aim to exceed
6. Train the model on the training data
7. Make predictions on the test data
8. Compare predictions to the known test set targets and calculate performance metrics
9. If performance is not satisfactory, adjust the model, acquire more data, or try a different modeling technique
Step 1 is already checked off! We have our question: “can we predict the max temperature
tomorrow for our city?” and we know we have access to historical max temperatures for the past
year in Seattle, WA.
Data Acquisition
First, we need some data. To use a realistic example, I retrieved weather data for Seattle, WA
from 2016 using the NOAA Climate Data Online tool. Generally, about 80% of the time spent in
data analysis is cleaning and retrieving data, but this workload can be reduced by finding high-
quality data sources. The NOAA tool is surprisingly easy to use and temperature data can be
downloaded as clean csv files which can be parsed in languages such as Python or R. The
complete data file is available for download for those wanting to follow along.
The information is in the tidy data format with each row forming one observation, with the
variable values in the columns.
Among the features is friend: your friend's prediction, a random number between 20 below the average and 20 above
the average
If we look at the dimensions of the data, we notice there are only 348 rows, which doesn't
quite agree with the 366 days we know there were in 2016. Looking through the data from the
NOAA, I noticed several missing days, which is a great reminder that data collected in the real-
world will never be perfect. Missing data can impact an analysis as can incorrect data or outliers.
In this case, the missing data will not have a large effect, and the data quality is good because of
the source. We also can see there are nine columns which represent eight features and the one
target (‘actual’).
Data Summary
There are not any data points that immediately appear as anomalous and no zeros in any of the
measurement columns. Another method to verify the quality of the data is to make basic plots. Often it is easier to spot anomalies in a graph than in numbers. I have left out the actual code here, because plotting in Python is non-intuitive, but feel free to refer to the notebook for the complete
implementation (like any good data scientist, I pretty much copy and pasted the plotting code
from Stack Overflow).
Examining the quantitative statistics and the graphs, we can feel confident in the high quality of
our data. There are no clear outliers, and although there are a few missing points, they will not
detract from the analysis.
Data Preparation
Unfortunately, we aren’t quite at the point where you can just feed raw data into a model and have
it return an answer (although people are working on this)! We will need to do some minor
modification to put our data into machine-understandable terms. We will use the Python
library Pandas for our data manipulation, relying on the structure known as a dataframe, which is basically an Excel spreadsheet with rows and columns.
The exact steps for preparation of the data will depend on the model used and the data gathered,
but some amount of data manipulation will be required for any machine learning application.
One-Hot Encoding
The first step for us is known as one-hot encoding of the data. This process takes categorical
variables, such as days of the week, and converts them to a numerical representation without an
arbitrary ordering. Days of the week are intuitive to us because we use them all the time. You will
(hopefully) never find anyone who doesn’t know that ‘Mon’ refers to the first day of the
workweek, but machines do not have any intuitive knowledge. What computers know is numbers
and for machine learning we must accommodate them. We could simply map days of the week to
numbers 1–7, but this might lead to the algorithm placing more importance on Sunday because it
has a higher numerical value. Instead, we change the single column of weekdays into seven
columns of binary data. This is easiest to see with an example.
So, if a data point is a Wednesday, it will have a 1 in the Wednesday column and a 0 in all other
columns. This process can be done in pandas in a single line.
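A minimal sketch of that single line, assuming the data has already been read into a Pandas dataframe named features (as in the snippet shown later in this section):
# One-hot encode the categorical columns (e.g. the day-of-the-week column)
features = pd.get_dummies(features)
features.head(5)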
The shape of our data is now 349 x 15 and all of the columns are numbers, just how the algorithm
likes it!
Now, we need to separate the data into the features and targets. The target, also known as the
label, is the value we want to predict, in this case the actual max temperature and the features are
all the columns the model uses to make a prediction. We will also convert the Pandas dataframes
to Numpy arrays because that is the way the algorithm works. (I save the column headers, which
are the names of the features, to a list to use for later visualization).
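A sketch of this step, assuming the target column is named 'actual' as described above; the variable names are my own:
import numpy as np

# Labels are the values we want to predict
labels = np.array(features['actual'])

# Remove the labels from the features dataframe; axis 1 refers to columns
features = features.drop('actual', axis=1)

# Save the feature names for later visualization, then convert to a NumPy array
feature_list = list(features.columns)
features = np.array(features)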
Training and Testing Sets
There is one final step of data preparation: splitting data into training and testing sets. During
training, we let the model ‘see’ the answers, in this case the actual temperature, so it can learn
how to predict the temperature from the features. We expect there to be some relationship
between all the features and the target value, and the model’s job is to learn this relationship
during training. Then, when it comes time to evaluate the model, we ask it to make predictions on
a testing set where it only has access to the features (not the answers)! Because we do have the
actual answers for the test set, we can compare these predictions to the true value to judge how
accurate the model is. Generally, when training a model, we randomly split the data into training
and testing sets to get a representation of all data points (if we trained on the first nine months of
the year and then used the final three months for prediction, our algorithm would not perform well
because it has not seen any data from those last three months.) I am setting the random state to 42
which means the results will be the same each time I run the split for reproducible results.
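A sketch of the split with scikit-learn, continuing with the features and labels arrays from above; random_state=42 comes from the text, while the 25% test fraction is my assumption:
from sklearn.model_selection import train_test_split

# Randomly split the data, holding out 25% for testing; fixing the random state
# makes the split reproducible
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.25, random_state=42)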
It looks as if everything is in order! Just to recap, to get the data into a form acceptable for
machine learning we:
1. One-hot encoded categorical variables
2. Split the data into features and labels
3. Converted to arrays
4. Split the data into training and testing sets
Depending on the initial data set, there may be extra work involved such as removing
outliers, imputing missing values, or converting temporal variables into cyclical representations.
These steps may seem arbitrary at first, but once you get the basic workflow, it will be generally
the same for any machine learning problem. It’s all about taking human-readable data and putting
it into a form that can be understood by a machine learning model.
Establish Baseline
Before we can make and evaluate predictions, we need to establish a baseline, a sensible measure
that we hope to beat with our model. If our model cannot improve upon the baseline, then it will
be a failure and we should try a different model or admit that machine learning is not right for our
problem. The baseline prediction for our case can be the historical max temperature averages. In
other words, our baseline is the error we would get if we simply predicted the average max
temperature for all days.
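A sketch of the baseline calculation, assuming the historical averages sit in a column named 'average' in the one-hot-encoded features:
# The baseline prediction for each test day is its historical average max temperature
baseline_preds = test_features[:, feature_list.index('average')]

# Baseline error: how far the historical average is from the actual temperature
baseline_errors = abs(baseline_preds - test_labels)
print('Average baseline error:', round(np.mean(baseline_errors), 2), 'degrees.')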
We now have our goal! If we can’t beat an average error of 5 degrees, then we need to rethink our
approach.
Train Model
After all the work of data preparation, creating and training the model is pretty simple using
Scikit-learn. We import the random forest regression model from scikit-learn, instantiate the
model, and fit (scikit-learn’s name for training) the model on the training data. (Again setting the
random state for reproducible results). This entire process is only 3 lines in scikit-learn!
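A sketch of those three lines; the choice of 1000 trees is my assumption rather than a value given in the text:
from sklearn.ensemble import RandomForestRegressor

# Instantiate the model with 1000 decision trees and a fixed random state
rf = RandomForestRegressor(n_estimators=1000, random_state=42)

# Train (fit) the model on the training data
rf.fit(train_features, train_labels)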
Our model has now been trained to learn the relationships between the features and the targets.
The next step is figuring out how good the model is! To do this we make predictions on the test
features (the model is never allowed to see the test answers). We then compare the predictions to
the known answers. When performing regression, we need to make sure to use the absolute error
because we expect some of our answers to be low and some to be high. We are interested in how
far away our average prediction is from the actual value so we take the absolute value (as we also
did when establishing the baseline).
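A sketch of the prediction and error calculation described above:
# Use the forest to predict on the test features (the model never sees the test answers)
predictions = rf.predict(test_features)

# Mean absolute error between the predictions and the known answers
errors = abs(predictions - test_labels)
print('Mean Absolute Error:', round(np.mean(errors), 2), 'degrees.')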
To put our predictions in perspective, we can calculate an accuracy as 100% minus the mean absolute percentage error (MAPE).
Code:-
# Pandas is used for data manipulation
import pandas as pd

# Read in data and display the first 5 rows
features = pd.read_csv('temps.csv')
features.head(5)
Output:-
Our average estimate is off by 3.83 degrees. That is more than a 1 degree average improvement
over the baseline. Although this might not seem significant, it is nearly 25% better than the
baseline, which, depending on the field and the problem, could represent millions of dollars to a
company.
# Calculate mean absolute percentage error (MAPE)
mape = 100 * (errors / test_labels)

# Calculate and display accuracy
accuracy = 100 - np.mean(mape)
print('Accuracy:', round(accuracy, 2), '%.')
Output:-
That looks pretty good! Our model has learned how to predict the maximum temperature for the
next day in Seattle with 94% accuracy.
Q2. Petroleum
In this section we will study how random forests can be used to solve regression problems using
Scikit-Learn. In the next section, we will solve a classification problem via random forests.
Problem Definition
The problem here is to predict the gas consumption (in millions of gallons) in 48 of the US states
based on petrol tax (in cents), per capita income (dollars), paved highways (in miles) and the
proportion of the population with a driving license.
Solution
To solve this regression problem we will use the random forest algorithm via the Scikit-Learn
Python library. We will follow the traditional machine learning pipeline to solve this problem.
Follow these steps:
1. Import Libraries
2. Importing Dataset
The dataset for this problem is available at:
To get a high-level view of what the dataset looks like, execute the following command:
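(A minimal sketch; the file name petrol_consumption.csv is my assumption, since the actual download location is not reproduced above.)
import pandas as pd

# Load the gas consumption dataset (file name assumed) and look at the first rows
dataset = pd.read_csv('petrol_consumption.csv')
dataset.head()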
We can see that the values in our dataset are not very well scaled. We will scale them down
before training the algorithm.
3. Preparing the Data for Training
Finally, let's divide the data into training and testing sets:
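(A sketch, assuming the first four columns are the predictors and the last column is the gas consumption target; the 20% test size is also my assumption.)
from sklearn.model_selection import train_test_split

# Assume columns 0-3 are features and column 4 is the target (gas consumption)
X = dataset.iloc[:, 0:4].values
y = dataset.iloc[:, 4].values

# Hold out 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)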
4. Feature Scaling
We know our dataset does not yet contain scaled values; for instance, the Average_Income field has values in the range of thousands while Petrol_tax has values in the range of tens. Therefore, it would be
beneficial to scale our data (although, as mentioned earlier, this step isn't as important for the
random forests algorithm).
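One way to do the scaling described here is with scikit-learn's StandardScaler, fitted on the training data only so that no information leaks from the test set:
from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)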
Attribute Information:
Problem Definition
The task here is to predict whether a bank currency note is authentic or not based on four
attributes, i.e. the variance, skewness, kurtosis, and entropy of the wavelet-transformed image.
Solution
This is a binary classification problem and we will use a random forest classifier to solve this
problem. Steps followed to solve this problem will be similar to the steps performed for
regression.
1. Import Libraries
2. Importing Dataset
dataset.head()
As was the case with the regression dataset, values in this dataset are not very well scaled. The
dataset will be scaled before training the algorithm.
4. Feature Scaling
As before, feature scaling works the same way:
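(A sketch under the same naming assumptions as the regression example, i.e. the banknote data has already been split into X_train, X_test, y_train, y_test; the choice of 20 trees is mine.)
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

# Scale the features exactly as in the regression example
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)

# Train a random forest classifier on the scaled training data and evaluate it
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
print('Accuracy:', accuracy_score(y_test, y_pred))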
Attribute Information (Absenteeism at work dataset):
1. Individual identification (ID)
2. Reason for absence (ICD), stratified into 21 disease categories (I to XXI), and 7 categories without ICD: patient follow-up (22), medical consultation (23), blood donation (24), laboratory examination (25), unjustified absence (26), physiotherapy (27), dental consultation (28).
3. Month of absence
4. Day of the week (Monday (2), Tuesday (3), Wednesday (4), Thursday (5), Friday (6))
5. Seasons (summer (1), autumn (2), winter (3), spring (4))
6. Transportation expense
7. Distance from Residence to Work (kilometers)
8. Service time
9. Age
10. Hit target
11. Disciplinary failure (yes=1; no=0)
12. Education (high school (1), graduate (2), postgraduate (3), master and doctor (4))
13. Son (number of children)
14. Social drinker (yes=1; no=0)
15. Social smoker (yes=1; no=0)
16. Pet (number of pet)
17. Weight
18. Height
19. Body mass index
20. Absenteeism time in hours (target)
Plot: Reason for absence vs. hours of leave (original)
Plot: Reason for absence vs. hours of leave (with optimization)
SVM PRACTICE QUESTION
Q1. Handwritten Digits
This algorithm is normally a second stepping stone for those who have learned linear and logistic
regression. It is quite a popular algorithm, used mostly for classification problems. It creates high-accuracy models with little effort and minimal resources. Though it can be used for regression, its application is mostly found in classification scenarios.
Each data point in our dataset is plotted in an n-dimensional space, where n is the number of features. Classification is performed by finding the hyperplane that best separates the two classes.
There are many possible criteria for choosing the optimum hyperplane. Our objective is to find the plane with the maximum margin, i.e. the maximum distance between the data points of the two classes. The dimension of the hyperplane depends on the number of features: with only 2 input features the hyperplane is just a line, with 3 features it becomes a two-dimensional plane, and beyond 3 features it becomes hard to visualize.
The data points that lie closest to the hyperplane are called support vectors; they influence the position and orientation of the hyperplane. With the help of these vectors, we maximize the margin of the classifier. If we change or delete the support vectors, the position of the hyperplane will change.
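A minimal sketch of an SVM classifier for the handwritten digit task, using scikit-learn's bundled digits dataset as a stand-in for whatever data the exercise supplies; the kernel and hyperparameter values are my own illustrative choices:
from sklearn import datasets, svm
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load the 8x8 handwritten digit images and split into training and test sets
digits = datasets.load_digits()
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=0)

# Train a support vector classifier and evaluate it on the held-out digits
model = svm.SVC(kernel='rbf', gamma=0.001, C=10)
model.fit(X_train, y_train)
print('Accuracy:', accuracy_score(y_test, model.predict(X_test)))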