Cia 2

The document discusses predicting the age of abalone using machine learning algorithms. It first describes abalone and the importance of estimating their age. Currently, age is estimated by cutting open the shell and counting growth rings, which is time-consuming. The goal is to develop a model using physical measurements like length, diameter, and weight to classify abalone into age groups. Logistic regression models are created using the abalone dataset to predict a binary variable of long or short life based on attribute values. Multivariate logistic regression is also explored to include interactions between variables.

Uploaded by

Shivangi Gupta

CIA-2

MACHINE LEARNING ALGORITHMS

SUBMITTED BY

SHIVANGI GUPTA (20221026)

UNDER THE GUIDANCE OF

DR. DURGANSH SHARMA

INSTITUTE OF BUSINESS AND MANAGEMENT

CHRIST (DEEMED TO BE UNIVERSITY), DELHI NCR

BUSINESS UNDERSTANDING

Abalone is a very common type of shellfish. Its flesh is prized as a delicacy, and its shell is
frequently used in jewellery. This paper addresses the problem of estimating the age of abalone
from its physical properties. The subject is of interest because the alternative techniques for
estimating age are time-consuming. Depending on the species, abalone can live up to 50 years.
Environmental factors such as water flow and wave activity play a significant role in how
quickly they grow. Abalone from protected waters often grow more slowly than those from
exposed reef areas because of differences in food availability. Estimating the age of abalone is
challenging because size is determined not only by age but also by the availability of food.
Furthermore, abalone can form so-called 'stunted' populations, whose growth characteristics
differ substantially from those of other abalone populations. Most research on the dataset has
treated abalone age prediction as a classification problem, which entails assigning a label to
each case in the dataset. In this case, the label is the abalone's ring count, which is an actual
quantity rather than a category. As a result, a classifier has to distinguish between many
classes and performs poorly. The age of an abalone is positively correlated with its price.
However, determining the age is a time-consuming operation. As the abalone matures, rings form
in its inner shell, generally at a rate of one ring per year. Cutting the shell gives access to
the rings: a lab technician polishes and stains a shell sample and then counts the rings under
a microscope.

PROBLEM STATEMENT

Abalones are endangered marine snails found in cold coastal waters worldwide, chiefly off the
coasts of New Zealand, South Africa, Australia, western North America, and Japan. They are sea
snails (molluscs), also commonly called ear shells or sea ears. Because of the economic
importance of abalone age and the cumbersome process involved in determining it, much research
has been done on predicting abalone age from the physical measurements available in the
dataset.

DATA UNDERSTANDING

The abalone dataset is a collection of measurements of the physical features of individual
abalones. It contains 4177 instances. To demonstrate the algorithms in action, we use this
previously collected dataset. With this data, we can fit a number of regression models to
investigate how the independent variables affect our dependent variable, Rings. Knowing how
each factor influences an abalone's age can help oceanographers, jewellers, and businesses
better plan their production, distribution, and pricing strategies. Before modelling, it is
essential to understand what the data contains: the type (continuous numeric, discrete numeric,
or categorical) and meaning of each feature, and the number of instances and features in the
dataset.
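
As a minimal sketch of this first step, the R snippet below inspects the size of a data frame and the type of each column. The three rows and their values are made up for illustration (only the column names Sex, Length, and Rings follow the summary output in this report); with the real data, the same calls would be applied to the abalone data frame.

```r
# Hypothetical three-row stand-in for the abalone data (values are illustrative only).
toy <- data.frame(
  Sex    = c("M", "F", "I"),        # categorical
  Length = c(0.455, 0.530, 0.330),  # continuous numeric
  Rings  = c(15L, 9L, 7L),          # discrete numeric (count)
  stringsAsFactors = FALSE
)

# Number of instances (rows) and features (columns).
dims <- dim(toy)

# Storage class of every feature.
types <- sapply(toy, class)

print(dims)
print(types)
```

A character column such as Sex has to be treated as a factor (or converted to indicator variables) before it can enter a regression model.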

MODELLING

Logistic regression is a classification method used to estimate the probability of an event's
success or failure. It applies when the dependent variable is binary (true/false, yes/no), and
it uses the logit function as the link function for the binomial distribution. The models in
this section are fitted in R.

Model selection and assumptions

• The model selected for predicting the lifespan (ageing) of abalone was logistic regression.

• Objectives of the logistic regression:

o Classify abalone, based on the number of rings, on a binary scale of long life or short life.

o Test interactions between attributes.
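
To make the role of the logit link concrete, here is a small sketch in base R. The helper names logit and inv_logit are introduced only for this illustration; qlogis() and plogis() are the built-in equivalents.

```r
# Logit link: maps a probability p in (0, 1) to the log-odds log(p / (1 - p)).
logit <- function(p) log(p / (1 - p))

# Inverse logit: maps a log-odds value back to a probability.
inv_logit <- function(eta) 1 / (1 + exp(-eta))

p <- c(0.1, 0.5, 0.9)
eta <- logit(p)

# Base R provides the same pair as qlogis() and plogis().
print(all.equal(eta, qlogis(p)))
print(all.equal(inv_logit(eta), plogis(eta)))
print(logit(0.5))  # a probability of 0.5 corresponds to log-odds 0
```

The coefficients reported by glm() with family = binomial(link = "logit") are on this log-odds scale, which is why they are exponentiated before being read as odds ratios.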

SIMPLE BINARY LOGISTIC REGRESSION

> library(readxl)
> abalone <- read_excel("C:/Users/Shivangi Gupta/Desktop/abalone.xlsx")
> View(abalone)
> summary(abalone)
       S              Sex                Length            D
 Min.   :0.0000   Length:4177        Min.   :0.075   Min.   :0.0000
 1st Qu.:0.0000   Class :character   1st Qu.:0.450   1st Qu.:0.0000
 Median :1.0000   Mode  :character   Median :0.545   Median :1.0000
 Mean   :0.6342                      Mean   :0.524   Mean   :0.7328
 3rd Qu.:1.0000                      3rd Qu.:0.615   3rd Qu.:1.0000
 Max.   :1.0000                      Max.   :0.815   Max.   :1.0000

    Diameter          Height        Whole weight    Shucked weight
 Min.   :0.0550   Min.   :0.0000   Min.   :0.0020   Min.   :0.0010
 1st Qu.:0.3500   1st Qu.:0.1150   1st Qu.:0.4415   1st Qu.:0.1860
 Median :0.4250   Median :0.1400   Median :0.7995   Median :0.3360
 Mean   :0.4079   Mean   :0.1395   Mean   :0.8287   Mean   :0.3594
 3rd Qu.:0.4800   3rd Qu.:0.1650   3rd Qu.:1.1530   3rd Qu.:0.5020
 Max.   :0.6500   Max.   :1.1300   Max.   :2.8255   Max.   :1.4880

 Viscera weight    Shell weight        Rings              W
 Min.   :0.0005   Min.   :0.0015   Min.   : 1.000   Min.   :0.0000
 1st Qu.:0.0935   1st Qu.:0.1300   1st Qu.: 8.000   1st Qu.:1.0000
 Median :0.1710   Median :0.2340   Median : 9.000   Median :1.0000
 Mean   :0.1806   Mean   :0.2388   Mean   : 9.934   Mean   :0.8152
 3rd Qu.:0.2530   3rd Qu.:0.3290   3rd Qu.:11.000   3rd Qu.:1.0000
 Max.   :0.7600   Max.   :1.0050   Max.   :29.000   Max.   :1.0000

> table(abalone$W, abalone$S)

       0    1
  0  129  643
  1 1399 2006

> abalonelogmod1<-glm(W~S, family = binomial(link="logit"), data = abalone)
> summary(abalonelogmod1)

Call:
glm(formula = W ~ S, family = binomial(link = "logit"), data = abalone)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.2235   0.4200   0.4200   0.7457   0.7457

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  2.38370    0.09201   25.91   <2e-16 ***
S           -1.24595    0.10257  -12.15   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3998.4 on 4176 degrees of freedom
Residual deviance: 3820.7 on 4175 degrees of freedom
AIC: 3824.7

Number of Fisher Scoring iterations: 5

Because the individual p-values showed most of the explanatory variables to be insignificant, I
considered fitting another model with only the control variables "diameter" and "weight.s".

> exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1)))
Waiting for profiling to be done...
                           2.5 %    97.5 %
(Intercept) 10.8449612 9.0949468 13.049000
S            0.2876683 0.2344214  0.350545
> round(exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1))),3)
Waiting for profiling to be done...
                   2.5 % 97.5 %
(Intercept) 10.845 9.095 13.049
S            0.288 0.234  0.351
The odds ratio for S is 0.288 (95% CI 0.234 to 0.351): the odds of W = 1 for abalone with S = 1
are 0.288 times the odds for abalone with S = 0.
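
The odds ratio reported for S can be reproduced by hand from the W by S cross-tabulation above. The sketch below hard-codes the four counts from that table rather than reading the spreadsheet, so it runs without the data file; the variable names are chosen for this illustration only.

```r
# Counts copied from table(abalone$W, abalone$S) in this report (rows = W, columns = S).
tab <- matrix(c(129, 1399, 643, 2006), nrow = 2,
              dimnames = list(W = c("0", "1"), S = c("0", "1")))

# Odds of W = 1 within each level of S.
odds_s0 <- tab["1", "0"] / tab["0", "0"]
odds_s1 <- tab["1", "1"] / tab["0", "1"]

# Odds ratio for S, and the matching logit coefficients.
odds_ratio <- odds_s1 / odds_s0
print(round(c(odds_s0 = odds_s0, odds_ratio = odds_ratio), 3))
print(round(log(c(intercept = odds_s0, S = odds_ratio)), 4))
```

With a single binary predictor, the exponentiated intercept is the odds of W = 1 when S = 0, and the exponentiated slope is the ratio of the two odds, which is why the hand calculation agrees with the glm() output.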

UNIVARIATE BINARY LOGISTIC REGRESSION

> table(abalone$W, abalone$D)

       0    1
  0  771    1
  1  345 3060
> abalonelogmod2<-glm(abalone$W ~ abalone$D, family=binomial(link="logit"), data=abalone)
> summary(abalonelogmod2)

Call:
glm(formula = abalone$W ~ abalone$D, family = binomial(link = "logit"),
    data = abalone)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-4.0066   0.0256   0.0256   0.0256   1.5323

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.80414    0.06477 -12.415   <2e-16 ***
abalone$D    8.83031    1.00207   8.812   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3998.4 on 4176 degrees of freedom
Residual deviance: 1398.3 on 4175 degrees of freedom
AIC: 1402.3

Number of Fisher Scoring iterations: 10

> round(exp(cbind(coef(abalonelogmod2), confint(abalonelogmod2))),3)
Waiting for profiling to be done...
                        2.5 %     97.5 %
(Intercept)    0.447    0.394      0.508
abalone$D   6838.434 1542.648 120165.822
> x<-data.frame(abalone$S, abalone$D)

> table (abalone$W, x$abalone.S, abalone$W, x$abalone.D)

, , = 0, = 0

       0    1
  0  129  642
  1    0    0

, , = 1, = 0

       0    1
  0    0    0
  1   81  264

, , = 0, = 1

       0    1
  0    0    1
  1    0    0

, , = 1, = 1

       0    1
  0    0    0
  1 1318 1742
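
For the single-predictor model with D, the fitted coefficients can be turned back into probabilities with the inverse logit. The sketch below copies the two coefficients from summary(abalonelogmod2) and compares the resulting probabilities with the proportions in table(abalone$W, abalone$D); the variable names are chosen for this illustration only.

```r
# Coefficients copied from summary(abalonelogmod2) in this report.
b0 <- -0.80414   # intercept
b1 <-  8.83031   # coefficient of D

# Fitted probability of W = 1 at each level of D via the inverse logit.
p_d0 <- plogis(b0)
p_d1 <- plogis(b0 + b1)

# The same proportions taken directly from table(abalone$W, abalone$D).
obs_d0 <- 345 / (771 + 345)
obs_d1 <- 3060 / (1 + 3060)

print(round(c(p_d0 = p_d0, obs_d0 = obs_d0, p_d1 = p_d1, obs_d1 = obs_d1), 4))
```

Only one abalone has W = 0 together with D = 1. That nearly empty cell is why the odds ratio for D is so large and why its confidence interval is so wide.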

MULTIVARIATE BINARY LOGISTIC REGRESSION

> abalonelogmod3<-glm(abalone$W ~ abalone$S + abalone$D, family=binomial(link="logit"), data=abalone)
> round(exp(cbind(coef(abalonelogmod3), confint(abalonelogmod3))),3)
Waiting for profiling to be done...
                        2.5 %     97.5 %
(Intercept)    0.632    0.478      0.832
abalone$S      0.649    0.476      0.889
abalone$D   6329.174 1426.810 111238.284

> library(readxl)
> abalone <- read_excel("C:/Users/Shivangi Gupta/Desktop/abalone.xlsx")
> View(abalone)
> abalonelogmod1<-glm(W ~ Sex + Length + Diameter + Rings, family=binomial (link="logit"),
data=abalone)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> abalonelogmod1<-glm(W ~ H + S + D + R, family=binomial (link="logit"), data=abalone)
> summary(abalonelogmod1)

Call:
glm(formula = W ~ H + S + D + R, family = binomial(link = "logit"),
data = abalone)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.9568   0.0130   0.0146   0.0282   2.2501

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.2090     0.2217  -9.964  < 2e-16 ***
H             3.3347     0.1779  18.741  < 2e-16 ***
S            -0.2395     0.2222  -1.078    0.281
D             6.9416     1.0064   6.897 5.30e-12 ***
R             1.3141     0.3195   4.113 3.91e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 3998.45 on 4176 degrees of freedom
Residual deviance:  856.12 on 4172 degrees of freedom
AIC: 866.12

Number of Fisher Scoring iterations: 10

> exp(cbind(coef(abalonelogmod1), confint(abalonelogmod1)))
Waiting for profiling to be done...
                                2.5 %       97.5 %
(Intercept)    0.1098121   0.07030839 1.677201e-01
H             28.0696722  19.92973074 4.005609e+01
S              0.7870051   0.50951643 1.218486e+00
D           1034.3932455 229.83152199 1.825352e+04
R              3.7212957   2.01601457 7.058820e+00
> round(exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1))),3)
Waiting for profiling to be done...
                       2.5 %    97.5 %
(Intercept)    0.110   0.070     0.168
H             28.070  19.930    40.056
S              0.787   0.510     1.218
D           1034.393 229.832 18253.516
R              3.721   2.016     7.059

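All predictors whose z-test p-value is below 0.05 are treated as significant factors for the
response W: H, D, and R. S is not significant in this model (p = 0.281, and its odds-ratio
interval of 0.510 to 1.218 includes 1). Holding the other predictors fixed, the odds of W = 1
increase by a factor of 28.070 (95% CI 19.930 to 40.056) for a one-unit increase in H and by a
factor of 3.721 (95% CI 2.016 to 7.059) for a one-unit increase in R.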
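
As a rough cross-check of the odds ratios above, the sketch below recomputes them from the printed estimates and standard errors using Wald intervals. This is an approximation and not what the transcript does: confint() there reports profile-likelihood intervals, so the limits differ, most visibly for D. The object names are chosen for this illustration only.

```r
# Estimates and standard errors copied from summary(abalonelogmod1) in this report.
est <- c(H = 3.3347, S = -0.2395, D = 6.9416, R = 1.3141)
se  <- c(H = 0.1779, S =  0.2222, D = 1.0064, R = 0.3195)

# Odds ratios with approximate 95% Wald intervals: exp(estimate +/- z * SE).
z <- qnorm(0.975)
or_table <- cbind(OR    = exp(est),
                  lower = exp(est - z * se),
                  upper = exp(est + z * se))
print(round(or_table, 3))

# A predictor is significant at the 5% level when its interval excludes 1.
significant <- or_table[, "lower"] > 1 | or_table[, "upper"] < 1
print(significant)
```

The point estimate in the OR column is the factor by which the odds change; the interval limits describe the uncertainty around it and should not be quoted as the effect itself.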

LINEAR DISCRIMINANT ANALYSIS


> library(psych)

DATA PARTITION
> set.seed(555)
> ind <- sample(2, nrow(abalone), replace = TRUE,prob = c(0.6, 0.4))
> training <- abalone[ind==1,]
> testing <- abalone[ind==2,]
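
A note on the partition step: sample(2, n, replace = TRUE, prob = c(0.6, 0.4)) assigns each row to a group independently, so the split is only approximately 60/40. The sketch below shows that behaviour next to an exact split. The exact-split variant is an alternative offered for comparison, not the method used in this report, and the names train_idx and test_idx are hypothetical.

```r
set.seed(555)
n <- 4177  # number of rows in the abalone data, as reported above

# Approach used in the transcript: each row is assigned to group 1 or 2
# independently, so the 60/40 split holds only approximately.
ind <- sample(2, n, replace = TRUE, prob = c(0.6, 0.4))
print(table(ind) / n)

# Alternative (an assumption, not the author's method): draw exactly 60% of the
# row indices without replacement for training and keep the rest for testing.
train_idx <- sample.int(n, size = round(0.6 * n))
test_idx  <- setdiff(seq_len(n), train_idx)
print(c(train = length(train_idx), test = length(test_idx)))
```

Either way, fixing the seed with set.seed() makes the partition reproducible.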
> library(MASS)
> linear <- lda(W~., training)
Warning message:
In lda.default(x, grouping,....) : variables are collinear
> linear
Call:
lda(W ~ ., data = training)

Prior probabilities of groups:
        0         1
0.1787268 0.8212732

Group means:
          S      SexI      SexM    Length         D  Diameter         H     Height
0 0.8628319 0.7986726 0.1371681 0.3333407 0.0000000 0.2503319 0.1283186 0.08277655
1 0.5917188 0.2161772 0.4082812 0.5666827 0.8955224 0.4430380 0.9740010 0.15133606
  `Whole weight` `Shucked weight` `Viscera weight` `Shell weight`          R
0      0.1962909       0.08656305        0.0428219     0.05803982 0.03539823
1      0.9655279       0.42000289        0.2107434     0.27752792 0.41550313
      Rings
0  6.597345
1 10.630236

Coefficients of linear discriminants:
                         LD1
S                -0.02271797
SexI             -0.12657893
SexM              0.02271797
Length            6.21531058
D                 1.20696507
Diameter          3.67483137
H                 3.43807409
Height           -6.62207580
`Whole weight`   -1.82072215
`Shucked weight`  0.53309955
`Viscera weight`  0.01258240
`Shell weight`    1.81269316
R                -0.02220770
Rings             0.02356770

> attributes(linear)
$names
[1] "prior" "counts" "means" "scaling" "lev" "svd" "N" "call"
[9] "terms" "xlevels"

$class
[1] "lda"
HISTOGRAM
> p <- predict(linear, training)
> ldahist(data = p$x[,1], g = training$W)

> library(devtools)
In addition: Warning messages:
1: package ‘devtools’ was built under R version 4.0.5
2: package ‘usethis’ was built under R version 4.0.5
> library(klaR)
Warning message:
package ‘klaR’ was built under R version 4.0.5
> p1 <- predict(linear, training)$class
> tab <- table(Predicted = p1, Actual = training$W)
> tab
         Actual
Predicted    0    1
        0  394   48
        1   58 2029
> sum(diag(tab))/sum(tab)
[1] 0.9580862
> p2 <- predict(linear, testing)$class
> tab1 <- table(Predicted = p2, Actual = testing$Species)
Error in table(Predicted = p2, Actual = testing$Species) :
  all arguments must have the same length
In addition: Warning message:
Unknown or uninitialised column: `Species`.
> tab1 <- table(Predicted = p2, Actual = testing$W)
> tab1
         Actual
Predicted    0    1
        0  286   40
        1   34 1288
> sum(diag(tab1))/sum(tab1)
[1] 0.9550971

Thus, Linear Discriminant Analysis produced robust and interpretable classification results,
with an accuracy of about 95.8% on the training data and 95.5% on the testing data. It separates
the abalone into the two groups of the grouping variable W, a distinction that was not apparent
at first glance, using the continuous and indicator predictors in the dataset.

CONCLUSION

The dataset was examined with logistic regression, covering the basics of machine learning,
model construction, and workflow. The model accuracy was comparatively good. There does not
appear to be much difference between males and females, a claim supported by the small
variation in the between-sex means of each of the eight regressors.

In addition, because the accuracy exceeds the "No-Information Rate" (the theoretical accuracy
that would be attained if all observations were assigned "No" and then compared to the actual
data), the model can be considered a reasonably decent predictor of abalone sex. To summarize,
after accounting for all of the interfering factors, I was generally pleased with the findings
obtained by logistic regression. Given the abundance of other, more advanced, and generally more
effective classification algorithms, I strongly recommend applying them to this dataset, as
they are likely to yield higher accuracy.
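
The comparison between accuracy and the no-information rate can be made explicit. The report only tabulates predictions for the LDA model, so the sketch below uses the test-set confusion matrix tab1 from that section, hard-coded so that it runs without the data file. It takes the no-information rate to be the share of the most frequent actual class, which is a common convention and an assumption here rather than something stated in the report.

```r
# Test-set confusion matrix copied from tab1 in this report
# (rows = predicted class, columns = actual class).
tab1 <- matrix(c(286, 34, 40, 1288), nrow = 2,
               dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1")))

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(tab1)) / sum(tab1)

# No-information rate, taken here as the share of the most frequent actual class.
nir <- max(colSums(tab1)) / sum(tab1)

# Sensitivity and specificity, treating class "1" as the positive class.
sensitivity <- tab1["1", "1"] / sum(tab1[, "1"])
specificity <- tab1["0", "0"] / sum(tab1[, "0"])

print(round(c(accuracy = accuracy, nir = nir,
              sensitivity = sensitivity, specificity = specificity), 4))
```

Because about four in five abalone fall in class 1, accuracy alone overstates how well the minority class is handled; sensitivity and specificity show the two classes separately.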
