Cia 2
SUBMITTED BY
OF DR. DURGANSH
SHARMA
Abalone is a very common type of shellfish. Its flesh is prized as a delicacy, and its shell is
frequently used in jewellery. This paper addresses the problem of estimating the age of abalone
from its physical properties, which is of interest because the alternative techniques for
estimating age are time-consuming. Depending on the species, abalone can live up to 50 years.
Environmental factors such as water flow and wave activity play a significant role in how
quickly they grow. Those from protected waters often grow more slowly than those from exposed
reef areas because of differences in food availability. Estimating the age of abalone is
challenging because size is determined not only by age but also by the availability of food.
Furthermore, abalone can form so-called 'stunted' populations, whose growth characteristics
differ substantially from those of other abalone populations. Most research on the dataset has
treated abalone age prediction as a classification problem, which entails assigning a label to
each case in the dataset. In this case, the label is the abalone's ring count, which is a
numeric quantity with many distinct values. As a result, a classifier has to distinguish
between many classes and tends to perform poorly. The age of abalone is positively correlated
with its price. However, determining an abalone's age is a time-consuming operation. As the
abalone matures, rings form in its inner shell, generally at a rate of one ring per year. The
rings are accessed by cutting the shell: a lab technician polishes and stains a shell sample,
then examines it under a microscope and counts the rings.
PROBLEM STATEMENT
Abalones are endangered marine snails found in cold coastal waters worldwide, mainly off the
coasts of New Zealand, South Africa, Australia, western North America, and Japan. They are sea
snails, or molluscs, also commonly called ear shells or sea ears. Because of the economic
importance of the age of the abalone and the cumbersome process involved in determining it,
much research has been done on predicting abalone age from the physical measurements
available in the dataset.
DATA UNDERSTANDING
The abalone dataset is a collection of measurements of the physical features of individual
abalones. It contains 4177 observations. To demonstrate the algorithms in action, we use this
previously collected dataset. With this data, we can build a number of regression models to
investigate how the independent variables affect our dependent variable, Rings. Knowing how
each factor influences the abalone's age can help oceanographers, jewelers, and businesses
better examine their production, distribution, and pricing strategies. To understand the data,
one must first understand what it contains: the type (continuous numeric, discrete numeric, or
categorical) and meaning of each feature, and the number of instances and features in the
dataset.
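As a minimal sketch of this first inspection step, the snippet below classifies each column of a small made-up data frame by type and reports the number of instances and features. The data frame `toy`, its values, and the helper `feature_type` are assumptions introduced for illustration only; they are not part of the original analysis.

```r
# Toy stand-in for the abalone data; the values are made up for illustration.
toy <- data.frame(
  Sex    = c("M", "F", "I", "M"),
  Length = c(0.455, 0.350, 0.530, 0.440),
  Rings  = c(15, 7, 9, 10),
  stringsAsFactors = FALSE
)

# Hypothetical helper: label a column as categorical, discrete numeric,
# or continuous numeric (whole-valued numeric columns count as discrete).
feature_type <- function(x) {
  if (is.character(x) || is.factor(x)) {
    "categorical"
  } else if (is.numeric(x) && all(x == round(x), na.rm = TRUE)) {
    "discrete numeric"
  } else if (is.numeric(x)) {
    "continuous numeric"
  } else {
    "other"
  }
}

types <- vapply(toy, feature_type, character(1))
print(types)
print(dim(toy))  # number of instances, number of features
```

On the real data the same helper could be applied to the `abalone` data frame after it has been read in; note that a 0/1 indicator column would be reported as discrete numeric.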
MODELLING
• The model selected for predicting the lifespan/ageing of abalone was logistic regression,
to classify the number of rings on the abalone on an ordinal scale of long life or short life.
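The long-life/short-life coding can be sketched as below. The ring counts are made up, and the threshold `cutoff` is a hypothetical value chosen for illustration; the report does not state which cut point was used.

```r
# Made-up ring counts, for illustration only.
rings <- c(4, 7, 9, 10, 11, 15, 19)

# Hypothetical threshold: the report does not state the cut point used.
cutoff <- 10

# 1 = long life (rings at or above the cutoff), 0 = short life.
long_life <- as.integer(rings >= cutoff)

print(data.frame(rings, long_life))
print(table(long_life))
```

A 0/1 response of this kind is what `glm(..., family = binomial(link = "logit"))` expects.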
> library(readxl)
> abalone <- read_excel("C:/Users/Shivangi Gupta/Desktop/abalone.xlsx")
> View(abalone)
> summary(abalone)
       S              Sex                Length            D
 Min.   :0.0000   Length:4177        Min.   :0.075   Min.   :0.0000
 1st Qu.:0.0000   Class :character   1st Qu.:0.450   1st Qu.:0.0000
 Median :1.0000   Mode  :character   Median :0.545   Median :1.0000
 Mean   :0.6342                      Mean   :0.524   Mean   :0.7328
 3rd Qu.:1.0000                      3rd Qu.:0.615   3rd Qu.:1.0000
 Max.   :1.0000                      Max.   :0.815   Max.   :1.0000
Deviance Residuals:
Min 1Q Median 3Q Max
-2.2235 0.4200 0.4200 0.7457 0.7457
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.38370 0.09201 25.91 <2e-16 ***
S -1.24595 0.10257 -12.15 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Because the individual p-values showed most of the explanatory variables to be statistically
insignificant, I considered running another model with only the control variables "diameter"
and "weight.s".
> exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1)))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 10.8449612 9.0949468 13.049000
S 0.2876683 0.2344214 0.350545
> round(exp(cbind(coef(abalonelogmod1),confint(abalonelogmod1))),3)
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) 10.845 9.095 13.049
S 0.288 0.234 0.351
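The odds ratio obtained for S is 0.288 (95% CI 0.234 to 0.351): the odds of the outcome for
abalone with S = 1 are 0.288 times the odds for those with S = 0, which relates the gender of
the abalone to the weight of the shell.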
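The link between the coefficient table and the odds ratios can be checked by hand: an odds ratio is the exponential of the log-odds estimate. The sketch below uses the two estimates printed in the coefficient table above; the variable names are chosen here for illustration.

```r
# Estimates copied from the coefficient table above (log-odds scale).
est <- c("(Intercept)" = 2.38370, S = -1.24595)

# Exponentiating a logistic regression coefficient gives an odds ratio.
odds_ratio <- exp(est)
print(round(odds_ratio, 3))

# Percentage change in the odds when S goes from 0 to 1.
pct_change <- (odds_ratio[["S"]] - 1) * 100
print(round(pct_change, 1))
```

An odds ratio below 1 means the odds decrease: here they fall by roughly 71% when S goes from 0 to 1.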
Call:
glm(formula = abalone$W ~ abalone$D, family = binomial(link = "logit"),
data = abalone)
Deviance Residuals:
Min 1Q Median 3Q Max
-4.0066 0.0256 0.0256 0.0256 1.5323
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.80414 0.06477 -12.415 <2e-16 ***
abalone$D 8.83031 1.00207 8.812 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
2.5 % 97.5 %
(Intercept) 0.447 0.394 0.508
abalone$D 6838.434 1542.648 120165.822
> x<-data.frame(abalone$S, abalone$D)
, , = 0, = 1
0 1
0 0 1
1 0 0
, , = 1, = 1
0 1
0 0 0
1 1318 1742
> library(readxl)
> abalone <- read_excel("C:/Users/Shivangi Gupta/Desktop/abalone.xlsx")
> View(abalone)
> abalonelogmod1<-glm(W ~ Sex + Length + Diameter + Rings, family=binomial (link="logit"),
data=abalone)
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
> abalonelogmod1<-glm(W ~ H + S + D + R, family=binomial (link="logit"), data=abalone)
> summary(abalonelogmod1)
Call:
glm(formula = W ~ H + S + D + R, family = binomial(link = "logit"),
data = abalone)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.9568 0.0130 0.0146 0.0282 2.2501
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -2.2090 0.2217 -9.964 < 2e-16 ***
H 3.3347 0.1779 18.741 < 2e-16 ***
S -0.2395 0.2222 -1.078 0.281
D 6.9416 1.0064 6.897 5.30e-12 ***
R 1.3141 0.3195 4.113 3.91e-05 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
All predictors whose Pr(>|z|) is below 0.05 are considered major factors in the model: here
height (H), diameter (D) and rings (R). S, with a p-value of 0.281, is not significant. The
odds are reported to increase by factors of 40.056 and 7.059 with an increase in height and
rings respectively.
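The 0.05 screening rule can be reproduced from the printed estimates and standard errors alone. The sketch below recomputes the Wald z statistics and two-sided p-values; the data frame `coefs` is built by hand from the coefficient table above and is an illustration, not the fitted model object.

```r
# Estimates and standard errors copied from the coefficient table above.
coefs <- data.frame(
  term     = c("(Intercept)", "H", "S", "D", "R"),
  estimate = c(-2.2090, 3.3347, -0.2395, 6.9416, 1.3141),
  std_err  = c(0.2217, 0.1779, 0.2222, 1.0064, 0.3195)
)

# Wald z statistic and two-sided p-value for each term.
coefs$z <- coefs$estimate / coefs$std_err
coefs$p <- 2 * pnorm(-abs(coefs$z))

# Keep the predictors (not the intercept) that pass the 0.05 threshold.
alpha <- 0.05
significant <- coefs$term[coefs$p < alpha & coefs$term != "(Intercept)"]
print(significant)
```

With a fitted model available, the same table can be read from `coef(summary(model))` instead of being typed in.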
DATA PARTITION
> set.seed(555)
> ind <- sample(2, nrow(abalone), replace = TRUE,prob = c(0.6, 0.4))
> training <- abalone[ind==1,]
> testing <- abalone[ind==2,]
> library(MASS)
> linear <- lda(W~., training)
Warning message:
In lda.default(x, grouping,....) : variables are collinear
> linear
Call:
lda(W ~ ., data = training)
Group means:
S SexI SexM Length D Diameter H Height
0 0.8628319 0.7986726 0.1371681 0.3333407 0.0000000 0.2503319 0.1283186 0.08277655
1 0.5917188 0.2161772 0.4082812 0.5666827 0.8955224 0.4430380 0.9740010 0.15133606
`Whole weight` `Shucked weight` `Viscera weight` `Shell weight` R
0 0.1962909 0.08656305 0.0428219 0.05803982 0.03539823
1 0.9655279 0.42000289 0.2107434 0.27752792 0.41550313
Rings
0 6.597345
1 10.630236
> attributes(linear)
$names
[1] "prior" "counts" "means" "scaling" "lev" "svd" "N" "call"
[9] "terms" "xlevels"
$class
[1] "lda"
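The partition-then-fit workflow above can be sketched end to end on simulated data. Everything in the data frame `sim` is made up (two groups whose means loosely follow the group means printed above), so the resulting numbers say nothing about the real dataset; `lda()` and `predict()` come from the MASS package used in the report.

```r
library(MASS)  # provides lda()

set.seed(555)
n <- 200

# Simulated stand-in data: two groups that differ in mean length and rings.
grp <- rep(c(0, 1), each = n / 2)
sim <- data.frame(
  W      = factor(grp),
  Length = rnorm(n, mean = 0.35 + 0.2 * grp, sd = 0.05),
  Rings  = rnorm(n, mean = 6.5 + 4 * grp, sd = 1.5)
)

# Assign each row at random to set 1 (about 60%) or set 2 (about 40%).
ind <- sample(2, nrow(sim), replace = TRUE, prob = c(0.6, 0.4))
training <- sim[ind == 1, ]
testing  <- sim[ind == 2, ]

# Fit on the training rows, then evaluate on the held-out rows.
fit  <- lda(W ~ ., data = training)
pred <- predict(fit, testing)$class
tab  <- table(Predicted = pred, Actual = testing$W)
print(tab)
print(sum(diag(tab)) / sum(tab))  # held-out accuracy
```

Because `sample()` draws each row independently, the split is only approximately 60/40.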
HISTOGRAM
> p <- predict(linear, training)
> ldahist(data = p$x[,1], g = training$W)
> library(devtools)
In addition: Warning messages:
1: package ‘devtools’ was built under R version 4.0.5
2: package ‘usethis’ was built under R version 4.0.5
> library(klaR)
Warning message:
package ‘klaR’ was built under R version 4.0.5
> p1 <- predict(linear, training)$class
> tab <- table(Predicted = p1, Actual = training$W)
> tab
         Actual
Predicted    0    1
        0  394   48
        1   58 2029
> sum(diag(tab))/sum(tab)
[1] 0.9580862
> p2 <- predict(linear, testing)$class
> # testing has no Species column; the grouping variable in this model is W
> tab1 <- table(Predicted = p2, Actual = testing$W)
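The training accuracy reported above can be recomputed directly from the four counts in the printed confusion matrix. The sketch below rebuilds that matrix by hand; the per-class rates are an addition for illustration and were not reported in the original output.

```r
# Counts copied from the training confusion matrix printed above
# (matrix() fills column by column: Actual = 0 first, then Actual = 1).
tab <- matrix(
  c(394, 58, 48, 2029), nrow = 2,
  dimnames = list(Predicted = c("0", "1"), Actual = c("0", "1"))
)

# Overall accuracy: correctly classified cases over all cases.
accuracy <- sum(diag(tab)) / sum(tab)
print(accuracy)

# Share of each actual class that was classified correctly.
per_class <- diag(tab) / colSums(tab)
print(round(per_class, 3))
```

The overall accuracy matches the 0.9580862 shown in the console output, while the per-class rates show that class 0 is recovered less reliably than class 1.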
Thus, linear discriminant analysis produced interpretable classification results, with a
training accuracy of about 95.8%, classifying abalone on the basis of their gender, which was
not possible at first glance. The continuous independent variables help determine the
classifying variable, gender.
CONCLUSION
The dataset was examined with logistic regression, covering the basics of machine learning,
model construction, and workflow. The model accuracy was comparatively good. There appears to
be little difference between males and females, a claim that can be made with some confidence
given the small variation in the between-sex means for each of the eight regressors.
In addition, because the accuracy exceeds the "No-Information Rate" (the accuracy that would
be attained if every observation were assigned to the most frequent class and then compared
with the actual data), the model can be considered a reasonably decent predictor of abalone
sex.
To summarize, after accounting for all of the interfering factors, I was generally pleased
with the findings obtained by logistic regression. Given the abundance of other, more
advanced, and generally more effective classification algorithms, I strongly recommend trying
them on this dataset, as they are likely to give higher accuracy.
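As a rough illustration of the no-information-rate comparison, the sketch below uses the only confusion matrix printed in this report, the LDA training table, since no such table is shown for the logistic model. The class totals are derived from that table, and the comparison is an assumption-laden example rather than a result from the original analysis.

```r
# Actual class totals from the LDA training confusion matrix
# (column sums: 394 + 58 for class 0, 48 + 2029 for class 1).
actual_counts <- c("0" = 394 + 58, "1" = 48 + 2029)

# No-information rate: accuracy of always predicting the most frequent class.
nir <- max(actual_counts) / sum(actual_counts)
print(nir)

# Accuracy of the classifier on the same cases, from the diagonal counts.
accuracy <- (394 + 2029) / sum(actual_counts)
print(accuracy > nir)
```

A model is only informative if its accuracy clearly exceeds this baseline, which is about 0.82 for these counts.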