Probability and Statistics CC01 Group 3


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH UNIVERSITY OF TECHNOLOGY


--------

PROBABILITY AND STATISTICS


CLASS: CC01

GROUP 03

3D PRINTER DATASET

FOR MECHANICAL ENGINEERS

Instructor: Nguyễn Tiến Dũng

No.  Member            ID
1    Nguyễn Hòa Hiệp   2352342
2    Hồ Minh Hoàng     2352348
3    Đặng Hữu Huy      2352369



TABLE OF CONTENTS
PART 1: DATA INTRODUCTION
1. Dataset description
2. Variable description
3. Summary
PART 2: THEORETICAL BASIS
1. Analysis of one-factor variance
2. Multiple Linear Regression
a. Multiple linear regression
b. Population regression function (PRF)
c. Sample regression function (SRF)
d. Assumptions of the least squares method for multiple linear regression models
e. Model fit metrics
f. Confidence interval and Hypothesis testing
g. Testing the overall significance of the model (Special Case of WALD Test)
PART 3: DATA PREPROCESSING
1. Data importing
2. Data cleaning
a. Handling missing values
b. Handling outliers and inconsistent data
c. Data transformation
3. Feature Engineering
PART 4: DESCRIPTIVE STATISTICS
1. Data summary
2. Plot data
a. Box plot
b. Correlation coefficients between variables
PART 5: INFERENTIAL STATISTICS
1. Using Two-way ANOVA to evaluate how qualitative factors affect output parameters


a. Assumptions of the Two-way ANOVA
b. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect roughness
c. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect tension_strength
d. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect elongation
e. Conclusion of Two-way ANOVA
2. Building regression model
a. Building regression model based on roughness and eight setting parameters
b. Building regression model based on tension strength and eight setting parameters
c. Building regression model based on elongation and eight setting parameters
d. Conclusion of regression models
PART 6: DISCUSSION & EXTENSION
1. Discussion
a. Advantages
b. Disadvantages
2. Extension
a. Building multivariate polynomial regression model based on roughness and eight setting parameters
b. Building multivariate polynomial regression model based on tension strength and eight setting parameters
c. Building multivariate polynomial regression model based on elongation and eight setting parameters
d. Model comparison
3. Conclusion
PART 7: DATA SOURCE & CODE SOURCE
PART 8: REFERENCES


PART 1: DATA INTRODUCTION


This report analyzes a 3D printer dataset to understand the performance and
properties of 3D-printed materials and the impact of various printing parameters.
The dataset originates from research conducted by the TR/Selcuk University
Mechanical Engineering Department, focusing on the Ultimaker S5 3D printer.
The aim is to explore how adjustments in printer settings influence print quality,
accuracy, and mechanical strength.

Key components of the dataset include nine input parameters, such as infill
density, nozzle temperature, and print speed, alongside three output parameters:
roughness, tension strength, and elongation. With 50 samples, the dataset provides
insights into optimizing printing settings for producing high-quality, durable parts.

1. Dataset description
The dataset consists of experimental results from 3D printing tests performed on
the Ultimaker S5 3D printer, with material strength tested using a Sincotec GMBH
tester capable of applying up to 20 kN. Each observation represents a unique 3D-
printed sample.

- Population: All 3D-printed samples produced using the Ultimaker S5 printer.


- Sample Size: 50 observations, including nine input parameters and three
measured output parameters.
- Objective: To identify relationships between 3D printing settings, output
parameters, and material properties.

2. Variable description
Variable             Data type and range                  Units   Description

Setting parameters
layer_height         x ∈ R, 0.02 ≤ x ≤ 0.2, continuous    mm
wall_thickness       x ∈ N, 1 ≤ x ≤ 10, continuous        mm
infill_density       x ∈ N, 10 ≤ x ≤ 90, continuous       %
infill_pattern       x = 0 or x = 1, categorical          none    0 = “grid”, 1 = “honeycomb”
nozzle_temperature   x ∈ N, 200 ≤ x ≤ 250, continuous     °C
bed_temperature      x ∈ N, 60 ≤ x ≤ 80, continuous       °C
print_speed          x ∈ N, 40 ≤ x ≤ 120, continuous      mm/s
material             x = 0 or x = 1, categorical          none    0 = “abs”, 1 = “pla”
fan_speed            x ∈ N, 0 ≤ x ≤ 100, continuous       %

Output parameters (measured)
roughness            x ∈ N, 21 ≤ x ≤ 368, continuous      µm

Table 1.1. Setting parameters & output parameters
3. Summary
The dataset from the TR/Selcuk University Mechanical Engineering Department
provides a robust foundation for analyzing the relationships between 3D printer
settings and the mechanical properties of printed materials. Through 50
observations, the study reveals critical insights into optimizing settings such as
infill density, nozzle temperature, and wall thickness to achieve desired outcomes
like enhanced tension strength and elongation.

The inclusion of both categorical and continuous variables enables a


comprehensive exploration of 3D printing dynamics. By understanding the dataset
and leveraging statistical and machine learning models, this study lays the
groundwork for improving 3D printing processes in manufacturing, engineering,
and design applications.


PART 2: THEORETICAL BASIS


In this section, we explore the theoretical concepts behind the statistical methods applied to the 3D Printer dataset for mechanical engineers. These methods are essential for analyzing the impact of different printing parameters on the mechanical properties of the printed parts. Specifically, we focus on Analysis of Variance (ANOVA) and Multiple Linear Regression, which are widely used in engineering experiments to compare the means of different groups and to model relationships between variables.

1. Analysis of one-factor variance


The goal of One-Factor ANOVA is to determine whether there is a statistically
significant difference between the means of two or more groups based on a single
factor. In the context of 3D printing, one might use ANOVA to test whether
different printing parameters (such as infill_density, wall_thickness, print_speed,
or material) significantly impact a dependent variable, such as the tension_strength
of the printed material.

The general framework of a One-Factor ANOVA is as follows:

 Null hypothesis (H0) assumes that all groups have the same mean (i.e., the
printing parameters do not affect the mechanical properties).

 The alternative hypothesis (H1) assumes that at least one of the groups has a
different mean (i.e., the printing parameters do affect the mechanical
properties).

The ANOVA test calculates the F-statistic which compares the variance between
the groups to the variance within the groups. If the F-statistic is large enough, it
suggests that there is a significant difference between the means, and the null
hypothesis is rejected.
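As an illustration, a one-factor ANOVA of this kind can be run in R with aov(). The values below are simulated for the sketch, not taken from the 3D printer dataset:

```r
# One-factor ANOVA sketch on simulated data (illustrative values,
# not the project's measurements)
set.seed(1)
sim <- data.frame(
  infill_level     = factor(rep(c("low", "medium", "high"), each = 10)),
  tension_strength = c(rnorm(10, 20, 2), rnorm(10, 25, 2), rnorm(10, 30, 2))
)
fit <- aov(tension_strength ~ infill_level, data = sim)
summary(fit)  # F-statistic and p-value for the group effect
```

A small p-value in the summary would lead us to reject H0 that all group means are equal.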

Probability and statistics – CC01


6
7

a. Theory of ANOVA (Analysis of Variance)

ANOVA is based on the following assumptions:

1. Independence of observations: The data points (samples) must be


independent of each other.

2. Normality: The data within each group should be approximately normally


distributed.

3. Homogeneity of variances: The variance within each group should be roughly


equal. This assumption can be tested using Levene's Test or Bartlett’s Test for
homogeneity of variances.
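For example, Bartlett's test is available in base R (Levene's test requires an add-on package such as car); on simulated equal-variance groups:

```r
# Bartlett's test for homogeneity of variances (simulated groups)
set.seed(2)
grp <- factor(rep(c("A", "B", "C"), each = 15))
y   <- rnorm(45, mean = 10, sd = 2)
bartlett.test(y ~ grp)  # p-value assesses H0: equal variances across groups
```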

The test statistic used in ANOVA is the F-statistic, which is calculated as:
F = (Between-group variance) / (Within-group variance)

Where:

- Between-group variance measures how much the group means deviate


from the overall mean.

- Within-group variance measures how much individual observations


deviate from their respective group means.

The p-value corresponding to the F-statistic is used to determine whether the


differences between the group means are statistically significant. If the p-value is
below a predefined significance level (e.g., 0.05), we reject the null hypothesis and
conclude that the means are different.

b. Analysis of Single-Factor Variance

For the 3D Printer dataset, Single-Factor ANOVA can examine how a single factor,
such as infill_density, influences tension_strength or elongation. For example:


 Hypothesis:

 H₀: Means of the groups (low, medium, high infill density) are equal.

 H₁: At least one group mean is different.

 Steps:

1. Calculate the F-statistic: Using the formula mentioned earlier, we calculate the
F-statistic to compare the between-group and within-group variances.

2. Determine the p-value: The p-value helps us assess the strength of the
evidence against the null hypothesis. A p-value smaller than 0.05 indicates that
there is a statistically significant difference in means.

3. Post-hoc analysis: If ANOVA indicates significant differences, post-hoc tests


(such as Tukey's HSD) can be used to identify which specific groups differ
from each other.

Example: Suppose we want to test the effect of infill_density on tension_strength.


We divide the data into three groups based on infill density levels (low, medium,
and high) and compare tension_strength across these groups using ANOVA.

 Null hypothesis: There is no significant difference in tension_strength between


low, medium, and high infill densities.

 Alternative hypothesis: There is a significant difference in tension_strength


between at least one pair of infill densities.

After performing the ANOVA, if the F-statistic is significantly large and the p-
value is below 0.05, we can conclude that the infill density does influence the
tension strength of the printed material.
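These steps can be sketched in R on simulated infill-density groups (the group means below are assumptions for the simulation, not dataset estimates):

```r
# One-factor ANOVA followed by Tukey's HSD post-hoc test (simulated data)
set.seed(3)
sim <- data.frame(
  density  = factor(rep(c("low", "medium", "high"), each = 8)),
  strength = c(rnorm(8, 18, 2), rnorm(8, 24, 2), rnorm(8, 31, 2))
)
fit <- aov(strength ~ density, data = sim)
summary(fit)   # overall F-test
TukeyHSD(fit)  # which specific pairs of groups differ
```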


2. Multiple linear regression


Multiple linear regression (MLR) is a statistical technique used to model the
relationship between a dependent variable and multiple independent variables. In
the context of the 3D Printer dataset, we apply MLR to predict mechanical
properties such as tension_strength, elongation, or roughness based on various
printing parameters, such as infill_density, wall_thickness, print_speed, material,
and others. Multiple linear regression allows engineers to understand the
contribution of each parameter and predict material behavior under different
conditions.

a. Multiple linear regression


The goal of Multiple Linear Regression is to find the linear relationship between
a dependent variable (Y) and multiple independent variables (x1, x2, x3,…xk ):

Y=β0+β1x1+β2x2+⋯+βkxk+ϵ
Where:

 Y is the dependent variable (e.g., tension_strength or roughness).

 x1, x2, x3,…xk are the independent variables (e.g., infill_density, wall_thickness,
material, etc.).

 β0 is the intercept term, which represents the value of Y when all x i variables are
equal to zero.

 β1,β2,…,βk are the coefficients that represent the effect of each independent
variable on the dependent variable.

 ϵ is the error term, capturing the unexplained variation in Y.

The model estimates the β coefficients, which indicate how much the dependent
variable changes for a one-unit change in each independent variable, holding the

other variables constant. The regression model is fitted to the data using methods
like Ordinary Least Squares (OLS) to minimize the sum of squared errors between
the observed values and the predicted values.

Key assumptions of multiple linear regression:

1. Linearity: The relationship between the dependent variable and the


independent variables should be linear.

2. Independence: Observations should be independent of each other.

3. Homoscedasticity: The variance of errors should be constant across all levels


of the independent variables.

4. Normality of errors: The residuals (errors) should be approximately normally


distributed.

Multiple linear regression can be used to:

 Estimate the impact of multiple factors on material properties in 3D printing.

 Predict the mechanical properties of printed parts under various conditions.

 Identify significant predictors for improving the quality and performance of 3D


printed parts.

The method of linear regression involves fitting a linear equation to observed data.
This equation represents the relationship between a dependent variable (Y) and one
or more independent variables (X). Once the coefficients are estimated, the linear
regression model can be used to make predictions for the dependent variable based
on new values of the independent variables. The method is widely used in various
fields for prediction, inference, and understanding the relationship between
variables.
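A minimal MLR sketch in R with lm(), using simulated printing parameters (the coefficients in the simulation are assumptions for illustration, not estimates from the real dataset):

```r
# Multiple linear regression sketch on simulated printing parameters
set.seed(4)
n <- 50
sim <- data.frame(
  infill_density = runif(n, 10, 90),
  wall_thickness = runif(n, 1, 10),
  print_speed    = runif(n, 40, 120)
)
# Simulated response: known coefficients plus random error
sim$tension_strength <- 5 + 0.10 * sim$infill_density +
  0.50 * sim$wall_thickness - 0.02 * sim$print_speed + rnorm(n, sd = 1)

fit <- lm(tension_strength ~ infill_density + wall_thickness + print_speed,
          data = sim)
coef(fit)  # estimated intercept and slope coefficients (beta-hat)
predict(fit, newdata = data.frame(infill_density = 50,
                                  wall_thickness = 5,
                                  print_speed    = 80))
```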


b. Population regression function (PRF)


The Population Regression Function (PRF) represents the true relationship
between the dependent variable and the independent variables in the entire
population. It is a theoretical model that describes how the dependent variable
would behave if we had access to the full population data. In the context of the 3D
Printer dataset, the PRF could look something like this:

Y=β0+β1x1+β2x2+⋯+βkxk
Where:

Y represents the mechanical property (e.g., tension_strength)

x1, x2, x3,…xk represent the various printing parameters (e.g., infill_density,
wall_thickness, print_speed, etc.).

The PRF is typically unknown because we usually work with a sample of the
population. It assumes that the relationship between the dependent and independent
variables holds true across the population. This is the ideal model, and the goal of
regression analysis is to estimate the parameters β0, β1, …, βk as accurately as
possible using sample data.

c. Sample regression function (SRF)


The Sample Regression Function (SRF) is an estimate of the population regression
function, based on data from a sample. The SRF is computed using the sample
data, and it is the model we actually use for prediction and inference in practice.
The SRF is:

Ŷ = β̂0 + β̂1x1 + β̂2x2 + ⋯ + β̂kxk

Where:

 Ŷ is the predicted value of the dependent variable.



 ^
β0, ^
β 1 ,… ^
β k : the estimated coefficients.

The formula for estimating the coefficients (β̂) in the SRF is:

β̂ = (XᵀX)⁻¹ XᵀY

Where:
 X is the matrix of independent variables (including a column of 1s for the intercept).
 Y is the vector of observed dependent variable values.
 Xᵀ is the transpose of the X matrix.
 (XᵀX)⁻¹ is the inverse of the product of Xᵀ and X.

The SRF is derived using statistical techniques such as Ordinary Least Squares
(OLS), which minimizes the sum of squared residuals (the difference between the
observed values and the predicted values) across the sample.
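The OLS estimate β̂ = (XᵀX)⁻¹XᵀY can be computed directly with matrix algebra and checked against lm(); the data below are simulated for the sketch:

```r
# Manual OLS via the normal equations, compared with lm() (simulated data)
set.seed(5)
x1 <- runif(30, 0, 10)
x2 <- runif(30, 0, 5)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(30, sd = 0.5)

X <- cbind(1, x1, x2)                        # design matrix, first column = intercept
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # (X'X)^{-1} X'y
beta_hat
coef(lm(y ~ x1 + x2))                        # same estimates via lm()
```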

The SRF is used for:

 Prediction: To forecast the values of the dependent variable (e.g.,


tension_strength) based on new observations of the independent variables (e.g.,
infill_density, wall_thickness, etc.).

 Inference: To test hypotheses about the relationships between the independent


and dependent variables.

Example from the 3D Printer Dataset:

In the case of the 3D Printer dataset, we may use multiple linear regression to
predict tension_strength based on various independent variables such as
infill_density, wall_thickness, material, and print_speed. The SRF might be:

Tension strength = 2.5 + 0.1·Infill density + 0.05·Wall thickness − 0.02·Print speed


d. Assumptions of the least squares method for multiple linear


regression models
The Least Squares Method is a popular approach to estimate the coefficients in
multiple linear regression (MLR). It minimizes the sum of squared residuals (the
difference between observed and predicted values) to find the best-fitting line or
hyperplane for the data. For the method to yield reliable estimates, several
assumptions must be met. These assumptions are crucial for ensuring the validity
and interpretability of the regression results:

1. Linearity: The relationship between the dependent variable and each


independent variable must be linear. This assumption is fundamental, as it
ensures the regression model accurately reflects the nature of the relationship.

2. Independence: The residuals (errors) should be independent of one another.


This means that the residual for one observation should not provide any
information about the residual for another observation. Violation of this
assumption typically occurs when the data are serially correlated, often seen in
time-series data.

3. Homoscedasticity: The variance of the errors should be constant across all


levels of the independent variables. In other words, the spread of residuals
should not increase or decrease as the values of the independent variables
change. If heteroscedasticity is present, it can affect the efficiency of the
coefficient estimates.

4. Normality of Errors: The residuals should follow a normal distribution. This


assumption is particularly important for hypothesis testing and constructing
confidence intervals for the regression coefficients. If the errors are not normally
distributed, it could affect the validity of statistical tests.


5. No Multicollinearity: The independent variables should not be highly correlated


with each other. When multicollinearity is present, it can make it difficult to
determine the individual effect of each predictor variable on the dependent
variable, leading to inflated standard errors and unstable coefficient estimates.

6. No Autocorrelation: Residuals should not be correlated with one another. In


time-series data, autocorrelation can occur, violating this assumption and leading
to biased results. The Durbin-Watson test is commonly used to check for
autocorrelation.

If any of these assumptions are violated, the results from the least squares method
may be biased or misleading, requiring corrective actions like transforming the
variables, applying robust regression methods, or using generalized least squares.
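Some of these checks can be sketched in base R (packages such as car, for variance inflation factors, or lmtest, for the Durbin-Watson test, offer more formal checks and are not used here); data are simulated:

```r
# Basic assumption checks for an MLR fit (simulated data)
set.seed(6)
sim <- data.frame(x1 = runif(40), x2 = runif(40))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(40, sd = 0.3)
fit <- lm(y ~ x1 + x2, data = sim)

shapiro.test(residuals(fit))  # normality of residuals
cor(sim$x1, sim$x2)           # screen predictors for multicollinearity
# plot(fit)                   # residual-vs-fitted and Q-Q plots (graphical checks)
```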

e. Model fit metrics


In multiple linear regression, assessing the model fit is crucial to understand how
well the model explains the variation in the dependent variable. There are several
key metrics used to evaluate the fit of the regression model:

1. R-squared (R2): The coefficient of determination, R2, represents the proportion


of the variance in the dependent variable that is explained by the independent
variables in the model. It ranges from 0 to 1, where a value closer to 1 indicates
a better fit. For instance, an R2 of 0.85 means that 85% of the variation in the
dependent variable is explained by the model.

However, R2 can be misleading if the model has too many predictors. To


address this, we often use the Adjusted R-squared, which adjusts for the number
of predictors in the model.

2. Adjusted R-squared: The adjusted R2 penalizes the inclusion of irrelevant


predictors. It is calculated as:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)


Where:

 n is the number of observations,


 k is the number of predictors.

This metric gives a more accurate measure of the model's explanatory power
when multiple predictors are included.

3. F-statistic: The F-statistic tests whether at least one of the predictors in the
model has a non-zero coefficient. A high F-statistic (with a corresponding low
p-value) suggests that the model fits the data well and that the independent
variables have explanatory power.

4. Residual Plots: Graphical plots of the residuals can also help assess the model
fit. Ideally, the residuals should be randomly scattered around zero with no
discernible pattern, indicating a good fit. If residuals show a clear pattern, the
model may not be appropriately specified.
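These fit metrics are reported by summary() on a fitted lm object; a sketch on simulated data:

```r
# R^2, adjusted R^2 and the overall F-statistic from summary() (simulated data)
set.seed(7)
sim <- data.frame(x = runif(60, 0, 10))
sim$y <- 3 + 2 * sim$x + rnorm(60, sd = 1)
fit <- lm(y ~ x, data = sim)

s <- summary(fit)
s$r.squared      # proportion of variance explained
s$adj.r.squared  # penalized for the number of predictors
s$fstatistic     # overall F-statistic with its degrees of freedom
```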

f. Confidence interval and Hypothesis testing


In multiple linear regression, we often want to test hypotheses about the regression
coefficients and compute confidence intervals for these coefficients to understand
the precision of the estimates.

1. Confidence Interval for Regression Coefficients:

A confidence interval (CI) for a regression coefficient provides a range of values


within which the true population parameter is likely to fall with a certain level of
confidence (e.g., 95%). For example, a 95% CI for a coefficient of β 1 might range
from 0.05 to 0.15, indicating that we are 95% confident that the true coefficient
falls within this range. The formula for the confidence interval for a regression
coefficient is:

β̂i ± t(α/2) × SE(β̂i)
Where:

 β̂i is the estimated coefficient for the i-th predictor,

 t(α/2) is the critical value from the t-distribution for the desired confidence level,

 SE(β̂i) is the standard error of the coefficient.

2. Hypothesis Testing for Regression Coefficients:

To test the significance of each regression coefficient, we use a hypothesis test.


The null hypothesis is that the coefficient is equal to zero (i.e., no relationship
between the predictor and the dependent variable), while the alternative
hypothesis is that the coefficient is non-zero.

Null hypothesis (H₀): βᵢ = 0 (No relationship between the predictor and the
dependent variable)
Alternative hypothesis (H₁): βᵢ ≠ 0 (A relationship exists between the
predictor and the dependent variable)

The test statistic is calculated as:

t = β̂i / SE(β̂i)

Where:
 β̂i is the estimated coefficient for predictor i.
 SE(β̂i) is the standard error of the coefficient.

A large absolute value of t suggests that the predictor is significantly related to the
dependent variable. The corresponding p-value helps us determine whether to
reject or fail to reject the null hypothesis. If the p-value is smaller than the
significance level (e.g., 0.05), we reject the null hypothesis, indicating that the
predictor has a significant impact on the dependent variable.
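In R, confint() returns the intervals β̂i ± t(α/2)·SE(β̂i) and summary() reports the t-statistics and p-values; a sketch on simulated data:

```r
# Confidence intervals and coefficient t-tests for an MLR fit (simulated data)
set.seed(8)
sim <- data.frame(x1 = runif(50, 0, 10), x2 = runif(50, 0, 10))
sim$y <- 1 + 0.9 * sim$x1 + rnorm(50, sd = 1)  # x2 has no true effect
fit <- lm(y ~ x1 + x2, data = sim)

confint(fit, level = 0.95)  # 95% CIs for the intercept, x1 and x2
summary(fit)$coefficients   # estimates, SEs, t-values and p-values
```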


g. Testing the overall significance of the model (Special Case of


WALD Test)
The Wald Test is a statistical test used to assess the overall significance of a
regression model, specifically whether all of the regression coefficients are
simultaneously equal to zero. This is a special case of the Wald test used to
evaluate the null hypothesis that all coefficients in the regression model are zero,
meaning that none of the predictors have any effect on the dependent variable.

 The null hypothesis for the Wald test is:

H0 : β1 = β2 = ⋯ = βk = 0

H1 : At least one βk ≠ 0

Where β1, β2, ⋯ , βk are the regression coefficients. The alternative hypothesis is
that at least one of the coefficients is not zero.

 The Wald test statistic evaluates whether the β̂ coefficients significantly
differ from zero. The formula for the Wald test statistic is:

W = β̂ᵀ [Var(β̂)]⁻¹ β̂

Where:

 β̂ is the vector of estimated coefficients.

 Var(β̂) = σ̂²(XᵀX)⁻¹ is the estimated variance–covariance matrix of the coefficients, computed from the information matrix XᵀX of the regression model.

The Wald test statistic follows a chi-square distribution with degrees of freedom
equal to the number of coefficients being tested. If the test statistic is large enough
(i.e., the p-value is small), we reject the null hypothesis, indicating that the model


has at least one predictor that significantly explains the variation in the dependent
variable.

In the context of the 3D Printer dataset, the Wald test can be applied to test whether
the predictors such as infill_density, wall_thickness, and print_speed together
contribute to explaining the mechanical properties (e.g., tension_strength) of the
printed parts. A significant result from the Wald test suggests that the regression
model is meaningful and the predictors are important for understanding the
material properties in the context of 3D printing.
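As a sketch, the overall null H0: β1 = ⋯ = βk = 0 can be tested in R by comparing the full model with the intercept-only model via anova(); for a linear model this F-test plays the role described above (the Wald chi-square version is asymptotically equivalent). Data are simulated:

```r
# Overall significance: full model vs intercept-only model (simulated data)
set.seed(9)
sim <- data.frame(x1 = runif(40, 0, 5), x2 = runif(40, 0, 5))
sim$y <- 2 + 1.2 * sim$x1 + 0.7 * sim$x2 + rnorm(40, sd = 0.8)

full <- lm(y ~ x1 + x2, data = sim)
null <- lm(y ~ 1, data = sim)
anova(null, full)  # F-test of H0: beta1 = beta2 = 0
```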

PART 3: DATA PREPROCESSING

Data preprocessing is an essential part of data analysis, as it ensures that the dataset
is clean, accurate, and ready for further analysis and modeling. In this section, we
will cover how to import, clean, handle missing values, and prepare the data for
analysis in R.

1. Data importing
The first step in any data analysis workflow is to import the dataset into R. Below
is the process for importing the dataset into R using the read.csv() function, after
installing and loading the necessary R packages.

Figure 3.1. Installing and loading the required packages

# Install necessary packages
install.packages("dplyr")
install.packages("tidyverse")
install.packages("ggpubr")
install.packages("corrplot")

# Load libraries
library(dplyr)
library(ggplot2)
library(ggpubr)
library(corrplot)

# Create sample data for the 3D printing project
data <- data.frame(
  infill_density     = c(20, 30, 40, 50, 60),
  print_speed        = c(60, 70, 80, 90, 100),
  nozzle_temperature = c(200, 210, 220, 230, 240),
  tension_strength   = c(40, 42, 44, 45, 46),
  elongation         = c(10, 11, 12, 13, 14)
)

# Save the data to a CSV file
write.csv(data, "3dprinter.csv", row.names = FALSE)

# Display the current working directory
print("Current working directory:")
print(getwd())

# Read the data from the CSV file
data <- read.csv("3dprinter.csv", header = TRUE)

# Display the first rows of the dataset
head(data)

# Check the structure of the dataset
print("Dataset structure:")
str(data)

# Summarize the dataset
print("Dataset summary:")
summary(data)

# Plot the correlation between numeric variables
correlation_matrix <- cor(data[, sapply(data, is.numeric)])
corrplot(correlation_matrix, method = "circle")


Figure 3.2. Code R and result of reading the data


2. Data cleaning
Data cleaning is a crucial step in the preprocessing pipeline. This involves handling
missing values, removing duplicates, and ensuring that all variables are correctly
formatted.

a. Handling missing values

In R, the is.na() function helps identify missing values. Once identified, you can
either remove or impute these missing values. For example, you can fill missing
values with the mean or median for numerical columns.
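A minimal sketch of counting and mean-imputing missing values (toy vector, not the project's data):

```r
# Identify and mean-impute missing values in a numeric vector
v <- c(4.2, NA, 5.1, 6.0, NA, 4.8)
sum(is.na(v))                         # number of missing entries
v[is.na(v)] <- mean(v, na.rm = TRUE)  # replace NAs with the column mean
v
```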

r
# Install necessary packages
required_packages <- c("dplyr", "tidyr", "caret", "writexl")
install.packages(setdiff(required_packages,
rownames(installed.packages())))
# Load libraries
library(dplyr) # For data manipulation
library(tidyr) # For handling missing data
library(caret) # For scaling and encoding categorical variables
library(writexl) # For exporting to Excel
# Step 1: Generate a small dataset (5 rows, 8 columns)
set.seed(123) # Ensures reproducibility
raw_data <- data.frame(
  tension_strength = runif(5, 10, 50),       # Simulated tension strength
  elongation = runif(5, 5, 15),              # Simulated elongation
  roughness = runif(5, 0.5, 2.0),            # Simulated surface roughness
  infill_density = runif(5, 10, 90),         # Infill density percentage
  wall_thickness = runif(5, 0.5, 2.0),       # Wall thickness in mm
  nozzle_temperature = sample(180:240, 5, replace = TRUE),  # Nozzle temperature
  print_speed = sample(20:100, 5, replace = TRUE),          # Print speed in mm/s
  layer_height = runif(5, 0.1, 0.3)          # Layer height in mm
)
# Display raw data
cat("Raw Data:\n")
print(raw_data)
# Step 2: Normalize numerical data
clean_data <- raw_data %>%
  mutate(across(where(is.numeric), scale))  # Normalize all numeric columns
# Display cleaned data
cat("\nCleaned Data (Normalized):\n")
print(clean_data)
# Step 3: Verify the structure of the cleaned dataset
cat("\nStructure of the Cleaned Data:\n")
str(clean_data)
# Step 4: Export the cleaned data to CSV
csv_output_file <- "cleaned_data_5x8.csv"
write.csv(clean_data, csv_output_file, row.names = FALSE)
cat("\nThe cleaned data has been saved as a CSV file:",
csv_output_file, "\n")
# Step 5: Export the cleaned data to Excel
excel_output_file <- "cleaned_data_5x8.xlsx"
write_xlsx(clean_data, excel_output_file)
cat("The cleaned data has been saved as an Excel file:",
excel_output_file, "\n")
# Step 6: Summarize the dataset
cat("\nSummary of Cleaned Data:\n")
print(summary(clean_data))
# Step 7: Display correlation matrix (optional for numeric variables)
correlation_matrix <- cor(raw_data[, sapply(raw_data, is.numeric)])
cat("\nCorrelation Matrix:\n")
print(correlation_matrix)

b. Handling outliers and inconsistent data

Outliers can distort statistical analyses, so it’s essential to identify and handle them.
One way to identify outliers is through box plots or Z-scores. Values with Z-scores
greater than 3 or less than -3 are typically considered outliers.

r
# Install and load necessary libraries
install.packages("dplyr")
install.packages("ggplot2")
library(dplyr)
library(ggplot2)
# Display working directory
print("Current Working Directory:")
print(getwd())
# If running in Google Colab, upload the CSV file first (Python, not R):
# from google.colab import files
# uploaded = files.upload()
# Read the uploaded CSV file
data <- read.csv("3dprinter.csv", header = TRUE)
# Visualizing outliers using box plots
if ("print_speed" %in% colnames(data)) {
boxplot(data$print_speed, main = "Boxplot of Print Speed", col =
"lightblue") }
# Calculate Z-scores for outlier detection
if ("print_speed" %in% colnames(data)) {
z_scores <- scale(data$print_speed)
outliers <- data[abs(z_scores) > 3, ]
print("Outliers in 'print_speed':")
print(outliers)
# Removing outliers based on Z-scores
data <- data[abs(z_scores) <= 3, ]
print("Dataset after removing outliers in 'print_speed':")
str(data) }
# Alternative: Capping outliers at reasonable limits
if ("print_speed" %in% colnames(data)) {
  lower_limit <- quantile(data$print_speed, 0.05, na.rm = TRUE)  # 5th percentile
  upper_limit <- quantile(data$print_speed, 0.95, na.rm = TRUE)  # 95th percentile
  data$print_speed[data$print_speed < lower_limit] <- lower_limit
  data$print_speed[data$print_speed > upper_limit] <- upper_limit
}

If outliers are valid observations, you may decide to keep them or cap them at
reasonable limits.
c. Data transformation
Data transformation is another important step, especially when dealing with
categorical variables or ensuring that continuous variables are on the same scale.
r
# Install and load necessary libraries
install.packages("dplyr")
library(dplyr)
# Display working directory
print("Current Working Directory:")
print(getwd())
# If running in Google Colab, upload the CSV file first (Python, not R):
# from google.colab import files
# uploaded = files.upload()
# Read the uploaded CSV file
data <- read.csv("3dprinter.csv", header = TRUE)
# Data Transformation
# 1. Converting categorical variables to a readable format
if ("material" %in% colnames(data)) {
data$material <- factor(data$material, levels = c(0, 1), labels =
c("abs", "pla"))
print("Converted 'material' column to a factor:")
print(table(data$material)) }
# 2. Standardizing continuous variables
if ("print_speed" %in% colnames(data)) {
data$print_speed <- scale(data$print_speed)
print("Standardized 'print_speed':")
print(summary(data$print_speed)) }
# Display the first few rows of the transformed dataset
print("First few rows of the transformed dataset:")
head(data)
# Save the transformed dataset to a CSV file
write.csv(data, "transformed_3dprinter.csv", row.names = FALSE)
print("Transformed dataset saved as 'transformed_3dprinter.csv'")

3. Feature engineering
Feature engineering involves creating new variables or transforming existing ones
to improve the analysis and modeling.

r
install.packages("dplyr")
library(dplyr)
# Step 1: Create a sample dataset
data <- data.frame(
layer_height = c(0.1, 0.2, 0.15, 0.25, 0.3),
wall_thickness = c(0.8, 1.0, 0.9, 1.1, 1.2),
print_speed = c(50, 60, 55, 70, 65),
nozzle_temperature = c(200, 210, 205, 220, 215),
material = c(0, 1, 0, 1, 0), # 0 = abs, 1 = pla
infill_density = c(20, 30, 25, 35, 40),
tension_strength = c(250, 300, 270, 320, 310) )
# Step 2: Handle missing values (impute missing values with the median for nozzle_temperature)
data$nozzle_temperature[is.na(data$nozzle_temperature)] <-
median(data$nozzle_temperature, na.rm = TRUE)
# Step 3: Feature Engineering: Create new features
# Create a new feature: layer_thickness_ratio
data$layer_thickness_ratio <- data$layer_height / data$wall_thickness
# Print the updated data with the new feature
print("Updated dataset with new feature:")
head(data)
# Step 4: Data Transformation (Example: Standardizing print_speed)
data$print_speed <- scale(data$print_speed)
# Print the transformed data
print("Transformed dataset with standardized print_speed:")
head(data)
# Step 5: Save the transformed dataset as a new CSV file
write.csv(data, "feature_engineered_3dprinter.csv", row.names = FALSE)
# Display confirmation
print("Feature-engineered dataset saved as
'feature_engineered_3dprinter.csv'")


Figure 3.3. Code R and result of cleaning data


Based on the results of the missing-value check, the data table does not contain
any missing values. The file therefore contains 12 attributes with 50 experimental
observations.

PART 4: DESCRIPTIVE STATISTICS


1. Data summary
We use the corresponding commands to find mean, standard deviation, quantile,
median, min and max. Then, we output the result as a table as shown below:

Figure 4.1. Code R and Result of data summary


r
# Install and load necessary libraries
install.packages("dplyr")
library(dplyr)


# Sample dataset
data <- data.frame(
layer_height = c(0.1, 0.2, 0.15, 0.25, 0.3),
wall_thickness = c(0.8, 1.0, 0.9, 1.1, 1.2),
print_speed = c(50, 60, 55, 70, 65),
nozzle_temperature = c(200, 210, 205, 220, 215),
material = c(0, 1, 0, 1, 0), # 0 = abs, 1 = pla
infill_density = c(20, 30, 25, 35, 40),
bed_temperature = c(60, 65, 70, 75, 80),
tension_strength = c(250, 300, 270, 320, 310) )
# Descriptive statistics for quantitative variables
num_cols <- c(1, 2, 3, 4, 6, 7, 8)  # all quantitative columns (excludes 'material')
mean_val <- apply(data[, num_cols], 2, mean)
median_val <- apply(data[, num_cols], 2, median)
sd_val <- apply(data[, num_cols], 2, sd)
Q1_val <- apply(data[, num_cols], 2, quantile, probs = 0.25)
Q3_val <- apply(data[, num_cols], 2, quantile, probs = 0.75)
min_val <- apply(data[, num_cols], 2, min)
max_val <- apply(data[, num_cols], 2, max)
# Combine the results into a data frame
summary_stats <- data.frame(mean = mean_val,
median = median_val,
sd = sd_val,
Q1 = Q1_val,
Q3 = Q3_val,
min = min_val,
max = max_val)
# Print the summary statistics
print("Descriptive statistics for quantitative variables:")
print(summary_stats)


2. Plot data
a. Box plot
In addition to the descriptive analysis, we draw boxplot graphs to better
visualize the distribution of Roughness, Tension_strength and Elongation
according to Material and Infill_pattern.
r
# Load necessary library
library(ggplot2)
# Create a sample data frame
my_data <- data.frame(
Material = c("ABS", "PLA", "ABS", "PLA", "ABS", "PLA", "ABS", "PLA"),
Infill_pattern = c("grid", "honeycomb", "grid", "honeycomb", "grid",
                   "honeycomb", "grid", "honeycomb"),
Roughness = c(92, 88, 200, 145, 289, 192, 368, 321),
Tension_strength = c(16, 19, 21, 25, 37, 27, 37, 34),
Elongation = c(1.2, 1.5, 1.3, 1.8, 1.6, 2.3, 3.3, 3.2) )
# Check if my_data is a valid data frame
class(my_data) # Should return "data.frame"
# Boxplot for Roughness by Material and Infill_pattern
ggplot(my_data, aes(x = interaction(Material, Infill_pattern), y =
Roughness)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Boxplot of Roughness by Material and Infill Pattern",
x = "Material and Infill Pattern", y = "Roughness (μm)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Boxplot for Tension Strength by Material and Infill_pattern
ggplot(my_data, aes(x = interaction(Material, Infill_pattern), y =
Tension_strength)) +
geom_boxplot(fill = "lightgreen", color = "black") +
labs(title = "Boxplot of Tension Strength by Material and Infill
Pattern",
x = "Material and Infill Pattern", y = "Tension Strength (MPa)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Boxplot for Elongation by Material and Infill_pattern
ggplot(my_data, aes(x = interaction(Material, Infill_pattern), y =
Elongation)) +
geom_boxplot(fill = "lightcoral", color = "black") +
labs(title = "Boxplot of Elongation by Material and Infill Pattern",
x = "Material and Infill Pattern", y = "Elongation (%)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 4.2. Code R and Boxplot of Roughness to Material and Infill_pattern


- 50% tension_strength ≤ 18 MPa
- 75% tension_strength ≤ 37 MPa
1. The tension_strength of output when the infill_pattern is honeycomb
- 4 MPa ≤ tension_strength ≤ 34 MPa
- 25% tension_strength ≤ 12 MPa
- 50% tension_strength ≤ 19 MPa
- 75% tension_strength ≤ 27 MPa
2. The tension_strength of output when the material is abs
- 5 MPa ≤ tension_strength ≤ 37 MPa
- 25% tension_strength ≤ 10 MPa
- 50% tension_strength ≤ 16 MPa
- 75% tension_strength ≤ 21 MPa
3. The tension_strength of output when the material is pla
- 4 MPa ≤ tension_strength ≤ 34 MPa
- 25% tension_strength ≤ 14 MPa
- 50% tension_strength ≤ 25 MPa
- 75% tension_strength ≤ 27 MPa
Figure 4.4. Code R and Boxplot of Elongation to Material and Infill_pattern


1. The elongation of output when the infill_pattern is grid
- 0.4% ≤ elongation ≤ 3.3%
- 25% elongation ≤ 1.1%
- 50% elongation ≤ 1.3%
- 75% elongation ≤ 2.2%
2. The elongation of output when the infill_pattern is honeycomb
- 0.5% ≤ elongation ≤ 3.2%
- 25% elongation ≤ 1.1%
- 50% elongation ≤ 1.5%
- 75% elongation ≤ 1.8%


3. The elongation of output when the material is abs


- 0.4% ≤ elongation ≤ 3.3%
- 25% elongation ≤ 0.8%
- 50% elongation ≤ 1.2%
- 75% elongation ≤ 1.6%
4. The elongation of output when the material is pla
- 0.7% ≤ elongation ≤ 3.2%
- 25% elongation ≤ 1.5%
- 50% elongation ≤ 1.8%
- 75% elongation ≤ 2.3%
b. Correlation coefficients between variables
Figure 4.5. Code R and Correlogram coefficients data

r
# Install and load necessary libraries
install.packages("ggplot2")
install.packages("reshape2")
library(ggplot2)
library(reshape2)
# Create a sample data frame (add your actual data if needed)
my_data <- data.frame(
Material = c("ABS", "PLA", "ABS", "PLA", "ABS", "PLA", "ABS", "PLA"),
Infill_pattern = c("grid", "honeycomb", "grid", "honeycomb", "grid",
"honeycomb", "grid", "honeycomb"),
Roughness = c(92, 88, 200, 145, 289, 192, 368, 321),
Tension_strength = c(16, 19, 21, 25, 37, 27, 37, 34),
Elongation = c(1.2, 1.5, 1.3, 1.8, 1.6, 2.3, 3.3, 3.2) )
# Check if my_data is a valid data frame
class(my_data) # Should return "data.frame"
# Compute the correlation matrix for numeric variables
cor_matrix <- cor(my_data[, c("Roughness", "Tension_strength",
"Elongation")])
# Melt the correlation matrix for ggplot2
cor_matrix_melted <- melt(cor_matrix)
# Generate a heatmap for the correlation matrix
ggplot(cor_matrix_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0) +
theme_minimal() +
labs(title = "Correlation Matrix Heatmap", x = "Variables", y =
"Variables") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

According to the correlation matrix, there is a strong correlation between fan
speed and bed temperature (correlation coefficient of 1), i.e. multicollinearity.
Multicollinearity can undermine the reliability of the results, since the
correlated variables carry largely redundant information. Therefore, we opt to
remove the fan speed variable.
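A minimal sketch of how such a perfectly correlated pair can be detected and one member dropped (the data here is simulated for illustration, not the project dataset):

```r
# Minimal sketch: flag near-perfectly correlated pairs and drop one member
df <- data.frame(bed_temperature = c(60, 65, 70, 75, 80),
                 print_speed     = c(50, 60, 55, 70, 65))
df$fan_speed <- 2 * df$bed_temperature  # perfectly correlated with bed_temperature

cm <- cor(df)
# Upper-triangle entries with |r| > 0.95 indicate redundant pairs
high <- which(abs(cm) > 0.95 & row(cm) < col(cm), arr.ind = TRUE)
print(cbind(rownames(cm)[high[, 1]], colnames(cm)[high[, 2]]))

# Remove the redundant variable
df <- subset(df, select = -fan_speed)
print(colnames(df))
```

The 0.95 cutoff is a common rule of thumb; any pair above it is a candidate for removal, and which member to drop is a modeling choice.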


PART 5: INFERENTIAL STATISTICS


Inferential statistics allow us to make predictions and draw conclusions about a
population based on a sample of data. In this context, we apply inferential
statistical techniques like Analysis of Variance (ANOVA) and Multiple Linear
Regression (MLR) to understand the relationships between different printing
parameters (such as layer height, print speed, and material type) and the
mechanical properties (such as tension strength and elongation) of the 3D printed
objects.

1. Using Two-way ANOVA to evaluate how qualitative factors affect output parameters
Two-way ANOVA is an extension of one-way ANOVA, which allows for the
evaluation of the impact of two independent categorical factors on a continuous
dependent variable. In the case of the 3D Printer dataset, we apply two-way
ANOVA to evaluate how combinations of qualitative factors such as infill pattern,
bed temperature, and material type affect the output parameters like roughness,
tension strength, and elongation of the printed objects.

a. Assumption of the Two-way ANOVA


Before performing the two-way ANOVA, it is essential that the dataset meets
several assumptions:

 Independence of Observations: The observations must be independent between groups and within each group.

 Normal Distribution of Variables: Data should follow a normal distribution for each group.

 Homogeneity of Variance (Homoscedasticity): The variance within each group should be equal across all groups (the assumption of equal variance).

Row (blocks)        Column (groups)
                1       2       3       …       K
1               x11     x21     x31     …       xK1
2               x12     x22     x32     …       xK2
3               x13     x23     x33     …       xK3
…               …       …       …       …       …
H               x1H     x2H     x3H     …       xKH

*Calculate the averages

 The average of all values in column (group) i: x̄i = (1/H) · Σ_{j=1..H} xij, where i = 1, 2, …, K

 The average of all values in row (block) j: x̄j = (1/K) · Σ_{i=1..K} xij, where j = 1, 2, …, H

 The average of all the observations: x̄ = (1/(K·H)) · Σ_{j=1..H} Σ_{i=1..K} xij = (1/K) · Σ_{i=1..K} x̄i = (1/H) · Σ_{j=1..H} x̄j

*The sum of squares identity

 The groups (columns) sum of squares: SSG = H · Σ_{i=1..K} (x̄i − x̄)²

SSG reflects the variability of the quantitative outcome under study due to the influence of the first causal factor, the factor used for grouping in the columns.

 The blocks (rows) sum of squares: SSB = K · Σ_{j=1..H} (x̄j − x̄)²

SSB reflects the variability of the quantitative outcome under study due to the influence of the second causal factor, the factor used for grouping in the rows.

 The error sum of squares: SSE = Σ_{j=1..H} Σ_{i=1..K} (xij − x̄i − x̄j + x̄)² = SST − SSG − SSB

 The total sum of squares: SST = SSG + SSB + SSE = Σ_{j=1..H} Σ_{i=1..K} (xij − x̄)²

*Calculate the mean squares

 Mean square of the groups: MSG = SSG / (K − 1)

 Mean square of the blocks: MSB = SSB / (H − 1)

 The residual mean square: MSE = SSE / ((K − 1)(H − 1))
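The identity SST = SSG + SSB + SSE can be checked numerically; the sketch below uses a small hypothetical table with K = 2 groups and H = 3 blocks (the values are invented for illustration):

```r
# Minimal sketch: verify the sum-of-squares identity on a K = 2 x H = 3 table
x <- matrix(c(16, 21, 37,   # column (group) 1, blocks 1..3
              19, 25, 27),  # column (group) 2, blocks 1..3
            nrow = 3)
K <- ncol(x); H <- nrow(x)

grand       <- mean(x)
group_means <- colMeans(x)  # averages within each column (group)
block_means <- rowMeans(x)  # averages within each row (block)

SSG <- H * sum((group_means - grand)^2)
SSB <- K * sum((block_means - grand)^2)
SST <- sum((x - grand)^2)
# Error term computed directly from the residuals x_ij - x̄_i - x̄_j + x̄
SSE <- sum((sweep(sweep(x, 2, group_means), 1, block_means) + grand)^2)

cat("SST =", SST, " SSG + SSB + SSE =", SSG + SSB + SSE, "\n")
cat("MSG =", SSG / (K - 1), " MSB =", SSB / (H - 1),
    " MSE =", SSE / ((K - 1) * (H - 1)), "\n")
```

Computing SSE directly from the residuals and seeing that SSG + SSB + SSE reproduces SST confirms the decomposition above.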

 Testing the hypothesis about the influence of the first causal factor (column) and the second causal factor (row) on the outcome is done with F ratios:

F1 = MSG / MSE
F2 = MSB / MSE

*There are two cases in the decision to reject the hypothesis H0 of two-way ANOVA:

 For F1 at significance level α, the hypothesis H0 that the K population means according to the first causal factor (column) are equal is rejected when:

F1 > F(K−1, (K−1)(H−1); α)

 For F2 at significance level α, the hypothesis H0 that the H population means according to the second causal factor (row) are equal is rejected when:

F2 > F(H−1, (K−1)(H−1); α)

Where:

 F(K−1, (K−1)(H−1); α) is the lookup value in the F distribution table with K−1 degrees of freedom in the numerator and (K−1)(H−1) degrees of freedom in the denominator.

 F(H−1, (K−1)(H−1); α) is the lookup value in the F distribution table with H−1 degrees of freedom in the numerator and (K−1)(H−1) degrees of freedom in the denominator.
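In R the F-table lookup can be done with qf(); a minimal sketch with hypothetical K = 2 groups, H = 3 blocks and α = 0.05:

```r
# Minimal sketch: critical F values for the two-way ANOVA rejection regions
K <- 2; H <- 3; alpha <- 0.05
df_error <- (K - 1) * (H - 1)

F_crit_groups <- qf(1 - alpha, df1 = K - 1, df2 = df_error)  # threshold for F1
F_crit_blocks <- qf(1 - alpha, df1 = H - 1, df2 = df_error)  # threshold for F2

cat("Reject H0 for the column factor when F1 >", F_crit_groups, "\n")
cat("Reject H0 for the row factor when F2 >", F_crit_blocks, "\n")
```

qf(1 − α, df1, df2) returns the same critical value one would read from a printed F table.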

By conducting a two-way ANOVA, researchers can gain insight into how two categorical factors independently and jointly affect the continuous outcome variable. This method provides a comprehensive understanding of the relationships between multiple variables in an experiment or study.

*The results of both tests are summarized in the table below:

*Result of testing

Sources of variation        Sum of squares    Degrees of freedom    Mean squares                    F ratio
Between groups (columns)    SSG               K − 1                 MSG = SSG / (K − 1)             F1 = MSG / MSE
Between blocks (rows)       SSB               H − 1                 MSB = SSB / (H − 1)             F2 = MSB / MSE
Residual error              SSE               (K − 1)(H − 1)        MSE = SSE / ((K − 1)(H − 1))
Total                       SST               n − 1

1. Normality test
We opt for the QQ-plot as a method to check normality, applying it to the
residuals of the data, and choose the Shapiro-Wilk test to verify normality.

The QQ-plot diagrams illustrate that the majority of the observed values lie on
the expected straight line of the normal distribution, so the Roughness,
Tension_strength and Elongation variables follow a normal distribution.
Additionally, we can use the Shapiro-Wilk test function to test:


- Hypothesis H0: The data follows a normal distribution.

- Hypothesis H1: The data does not follow a normal distribution.

Figure 5.1. R code for variable declaration, QQ-plot and Shapiro-Wilk test
r
# Load necessary libraries
library(car) # For Levene's test
library(ggplot2) # For visualization
library(dplyr) # For data manipulation
# Declare variables
infill_pattern <- as.factor(data$infill_pattern)
bed_temperature <- as.factor(data$bed_temperature)
material <- as.factor(data$material)
roughness <- data$roughness
tension_strenght <- data$tension_strenght
elongation <- data$elongation
# --- 1. Normality Test using QQ-Plot and Shapiro-Wilk Test ---
# Test normality for Roughness
residual_roughness <- rstandard(aov(roughness ~ infill_pattern *
material * bed_temperature, data = data))
# QQ-plot for Roughness
qqnorm(residual_roughness)
qqline(residual_roughness, col = "red")
title(main = "Figure 5.2: Normal QQ-plot for Residual Roughness")
# Shapiro-Wilk Test for Roughness
shapiro_roughness <- shapiro.test(residual_roughness)
print(shapiro_roughness)
# Test normality for Tension Strength
residual_tension_strength <- rstandard(aov(tension_strenght ~
infill_pattern * material * bed_temperature, data = data))
# QQ-plot for Tension Strength
qqnorm(residual_tension_strength)
qqline(residual_tension_strength, col = "red")
title(main = "Figure 5.3: Normal QQ-plot for Residual Tension Strength")

# Shapiro-Wilk Test for Tension Strength


shapiro_tension_strength <- shapiro.test(residual_tension_strength)
print(shapiro_tension_strength)
# Test normality for Elongation
residual_elongation <- rstandard(aov(elongation ~ infill_pattern *
material * bed_temperature, data = data))
# QQ-plot for Elongation
qqnorm(residual_elongation)
qqline(residual_elongation, col = "red")
title(main = "Figure 5.4: Normal QQ-plot for Residual Elongation")
# Shapiro-Wilk Test for Elongation
shapiro_elongation <- shapiro.test(residual_elongation)
print(shapiro_elongation)
# --- 2. Homogeneity of Variance using Levene’s Test ---
# Levene's Test for Roughness
levene_roughness <- leveneTest(roughness ~ infill_pattern * material *
bed_temperature, data = data)
print(levene_roughness)
# Levene's Test for Tension Strength
levene_tension_strength <- leveneTest(tension_strenght ~
infill_pattern * material * bed_temperature, data = data)
print(levene_tension_strength)
# Levene's Test for Elongation
levene_elongation <- leveneTest(elongation ~ infill_pattern * material
* bed_temperature, data = data)
print(levene_elongation)

*Infill pattern, bed_temperature and material affect roughness


Test normality by using QQ-plot
Figure 5.2. Normal QQ-plot for residual_roughness


Test normality by using Shapiro-Wilk test


Comment: Because Pr > 0.05, we fail to reject H0. Hence, the residuals of
roughness in the model with infill_pattern, bed_temperature and material follow
a normal distribution.
*Infill pattern, bed_temperature and material affect tension_strenght
Test normality by using QQ-plot
Figure 5.3. Normal QQ-plot for residual_tension_strength


Test normality by using Shapiro-Wilk test


Comment: Because Pr > 0.05, we fail to reject H0. Hence, the residuals of
tension_strenght in the model with infill_pattern, bed_temperature and material
follow a normal distribution.
*Infill pattern, bed_temperature and material affect elongation
Test normality by using QQ-plot
Figure 5.4. Normal QQ-plot for residual_elongation


Test normality by using Shapiro-Wilk test


Comment: Because Pr > 0.05, we fail to reject H0. Hence, the residuals of
elongation in the model with infill_pattern, bed_temperature and material follow
a normal distribution.
2. Homogeneity of variance
r
# Load the necessary package
if (!require(car)) install.packages("car", dependencies = TRUE)
library(car)
# Function to perform Levene's Test and print results
perform_levene_test <- function(response, data, factors) {
  formula <- as.formula(paste(response, "~", paste(factors, collapse = " * ")))
# Perform Levene's Test
levene_result <- leveneTest(formula, data = data, center = median)
# Extract p-value
p_value <- levene_result[1, "Pr(>F)"]
# Print results
cat("\nTesting homogeneity of variances for", response, "\n")
print(levene_result)
# Interpretation
  if (p_value > 0.05) {
    cat("Comment: Because Pr(>F) =", round(p_value, 4), "> 0.05, we fail to reject H0. Hence, the variances are equal.\n")
  } else {
    cat("Comment: Because Pr(>F) =", round(p_value, 4), "<= 0.05, we reject H0. Hence, the variances are not equal.\n")
  }
}
# Specify dependent variables and factors
dependent_vars <- c("roughness", "tension_strenght", "elongation")
factors <- c("infill_pattern", "bed_temperature", "material")
# Perform Levene's Test for each dependent variable
for (response in dependent_vars) {
  perform_levene_test(response, data, factors)
}

To test the assumption of homogeneity of variances, we use the leveneTest
function with the following hypotheses:
- Hypothesis H0: The variances are equal.
- Hypothesis H1: The variances are not equal.
*Infill pattern, bed_temperature and material affect roughness
Comment: Because Pr(>F) = 0.9974 > 0.05, we fail to reject H0. Hence, the
variances are equal.
*Infill pattern, bed_temperature and material affect tension_strenght
Comment: Because Pr(>F) = 0.9824 > 0.05, we fail to reject H0. Hence, the
variances are equal.
*Infill pattern, bed_temperature and material affect elongation
Comment: Because Pr(>F) = 0.8186 > 0.05, we fail to reject H0. Hence, the
variances are equal.
*ANOVA Code R
r
# Function to print ANOVA table and conclusions
print_anova <- function(model, response) {
  cat("\nBuilding ANOVA to examine whether infill_pattern, bed_temperature and material affect", response, "\n")
# Extract ANOVA summary as a data frame
result <- as.data.frame(summary(model)[[1]])
# Dynamically set column names based on the actual number of columns
  if (ncol(result) == 5) {
    colnames(result) <- c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  } else if (ncol(result) == 3) {
    colnames(result) <- c("Df", "Sum Sq", "Mean Sq")  # Missing F value and Pr(>F) columns
  } else {
    stop("Unexpected number of columns in ANOVA result.")
  }
  # Print ANOVA table
  print(result)
# Conclusion statement
cat("\nComment: We can conclude that infill_pattern, material, and
bed_temperature have an effect on", response,
"and there is an interaction between them.\n") }
# Perform ANOVA and print results for each dependent variable
anova_roughness <- aov(roughness ~ infill_pattern * material *
bed_temperature, data = data)
print_anova(anova_roughness, "roughness")
anova_tension_strength <- aov(tension_strenght ~ infill_pattern *
material * bed_temperature, data = data)
print_anova(anova_tension_strength, "tension_strength")
anova_elongation <- aov(elongation ~ infill_pattern * material *
bed_temperature, data = data)
print_anova(anova_elongation, "elongation")

b. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect roughness

c. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect tension_strenght

d. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect elongation


e. Conclusion of Two-way ANOVA

Based on the ANOVA results, we can draw conclusions about how the combination
of infill pattern, bed temperature, and material type affects roughness, tension
strength, and elongation.

 If the p-value is below 0.05 for any of the factors or their interactions, we
conclude that those factors or interactions significantly affect the respective
output parameter.

 Post-hoc analysis can be performed using Tukey's HSD test if necessary to determine which groups differ significantly.
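A minimal sketch of this post-hoc step on a small hypothetical sample (not the project dataset):

```r
# Minimal sketch: Tukey's HSD after ANOVA to locate differing groups
df <- data.frame(
  material = factor(rep(c("abs", "pla"), each = 4)),
  tension  = c(16, 21, 10, 15, 19, 25, 27, 24)
)
fit <- aov(tension ~ material, data = df)
tukey <- TukeyHSD(fit)
print(tukey)  # pairwise mean differences with adjusted p-values
```

Each row of the TukeyHSD output gives one pairwise comparison with its confidence interval and adjusted p-value, so significant pairs can be read off directly.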

2. Building regression model


a. Building regression model based on roughness and eight setting
parameters.
r
# Step 1: Create sample data and save it to a CSV file (if not already created)
data <- data.frame(
roughness = c(0.2, 0.5, 0.3, 0.4, 0.6),
infill_pattern = c("pattern1", "pattern2", "pattern1", "pattern2",
"pattern1"),
bed_temperature = c(60, 70, 65, 75, 60),
material = c("material1", "material2", "material1", "material2",
"material1"),
infill_density = c(20, 30, 25, 40, 35),
wall_thickness = c(0.8, 1, 1.2, 1.5, 1.3),
nozzle_temperature = c(200, 210, 220, 215, 205),
print_speed = c(50, 60, 55, 65, 50),
layer_height = c(0.2, 0.3, 0.25, 0.3, 0.2) )
# Save data to CSV file
write.csv(data, "your_dataset.csv", row.names = FALSE)


cat("CSV file created successfully!\n")
# Step 2: Read data from the CSV file
data <- read.csv("your_dataset.csv")
# Check the structure of the data
str(data)
# Step 3: Install and use Levene's Test for categorical variables (with interaction model)
install.packages("car") # Install if not already installed
library(car)
# Levene's Test for homogeneity of variances (with interaction model)
levene_test_roughness <- leveneTest(roughness ~ infill_pattern *
material, data = data)
print(levene_test_roughness)
# Interpretation of test result
p_value <- levene_test_roughness$`Pr(>F)`[1]
if (p_value > 0.05) {
cat("Comment: Because Pr(>F) =", round(p_value, 4), "> 0.05, we fail
to reject H0. Hence, the variances are equal.\n")
} else {
cat("Comment: Because Pr(>F) =", round(p_value, 4), "<= 0.05, we
reject H0. Hence, the variances are not equal.\n") }
# Step 4: Build a linear regression model for the roughness variable
# Full model
model_roughness <- lm(roughness ~ infill_density + wall_thickness +
nozzle_temperature + print_speed + bed_temperature + material +
layer_height, data = data)
summary(model_roughness)
# Remove non-significant factors
model_roughness_1 <- lm(roughness ~ nozzle_temperature + print_speed +
bed_temperature + material, data = data)
summary(model_roughness_1)
# Step 5: Build a second linear regression model for comparison
model_roughness_2 <- lm(roughness ~ infill_pattern + bed_temperature +
material + print_speed, data = data)
summary(model_roughness_2)

# Compare models with ANOVA


anova(model_roughness_1, model_roughness_2)
# Step 6: Visualize the results with ggplot2
install.packages("ggplot2") # Install if not already installed
library(ggplot2)
# Visualize the relationship between nozzle_temperature and roughness
ggplot(data, aes(x = nozzle_temperature, y = roughness)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Model Roughness 1: Nozzle Temperature vs Roughness")
# Step 7: Conclusion based on the R² value
cat("Comment: Based on the R² value, model_roughness_2 (R² =",
    summary(model_roughness_2)$r.squared, ") is chosen as the final model.\n")

Figure 5.5. Results of linear regression

Comment:
Statistical analysis removed the factors "wall_thickness", "infill_density", and "infill_pattern" because they were not significant (p-value > significance level). The other factors (p-value < significance level) were kept for the new model, model_roughness_1.
Figure 5.6 Result of the comparison between models.
The assumptions of the regression model are used to check its validity and quality.
The regression model has the form:
Yᵢ = β0 + β1x1ᵢ + β2x2ᵢ + … + βkxkᵢ + εᵢ,  i = 1, 2, …, n
- There must be a linear relationship between the outcome variable and the independent variables.
- The errors εᵢ are normally distributed.
- The variance of the errors is constant.
- The errors εᵢ have expectation 0.
r
# Create synthetic data
set.seed(123)
data <- data.frame(
  roughness = runif(30, 0.2, 0.8),          # random roughness values between 0.2 and 0.8
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),  # random infill pattern
  bed_temperature = sample(60:80, 30, replace = TRUE),     # random bed temperature between 60 and 80
  material = sample(c("material1", "material2"), 30, replace = TRUE),      # random material type
  infill_density = sample(20:40, 30, replace = TRUE),      # random infill density between 20 and 40
  wall_thickness = runif(30, 0.8, 1.5),                    # random wall thickness between 0.8 and 1.5
  nozzle_temperature = sample(200:220, 30, replace = TRUE),  # random nozzle temperature between 200 and 220
  print_speed = sample(50:70, 30, replace = TRUE),         # random print speed between 50 and 70
  layer_height = runif(30, 0.2, 0.3)                       # random layer height between 0.2 and 0.3
)
# Fit a linear model
model <- lm(roughness ~ infill_pattern + bed_temperature + material + infill_density +
              wall_thickness + nozzle_temperature + print_speed + layer_height,
            data = data)
# Check assumptions
## 1. Linearity check (Residuals vs Fitted plot)
fitted_values <- fitted(model)
residuals <- model$residuals


# Plot residuals vs fitted values
par(mfrow = c(2, 2)) # Split the plot window into a 2x2 grid
plot(fitted_values, residuals, main = "Residuals vs Fitted",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
## 2. Normality check (Q-Q plot)
qqnorm(residuals)
qqline(residuals, col = "red")
title(main = "Normal Q-Q Plot")
## 3. Homoscedasticity check (Residuals vs Fitted plot)
# Plot residuals vs fitted values again
plot(fitted_values, residuals, main = "Residuals vs Fitted for Homoscedasticity",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
## 4. Cook's Distance for influence points
cooksd <- cooks.distance(model)
# Handle NA or infinite values in Cook's distance
cooksd_clean <- cooksd[!is.na(cooksd) & is.finite(cooksd)]
# Plot Cook's distance if valid data exists
if (length(cooksd_clean) > 0) {
plot(cooksd_clean, type = "h", main = "Cook's Distance")
abline(h = 4/(nrow(data) - length(model$coefficients)), col = "red")
# Cook's distance threshold
} else {
cat("No valid data to plot Cook's distance.\n") }
## 5. Error mean check (Residuals centered around zero)
# Plot residuals
residuals_clean <- residuals[!is.na(residuals) & is.finite(residuals)]
if (length(residuals_clean) > 0) {
plot(residuals_clean, type = "h", main = "Error Distribution Check")
abline(h = 0, col = "red") # Line at zero to check error distribution
} else {
  cat("No valid data to plot residuals.\n")
}


# Model summary
summary(model)

Figure 5.7 Results when drawing linear model regression analysis graphs.
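As a side note, base R can draw the same four diagnostic plots directly from the fitted model object; the following is a minimal sketch on illustrative, hypothetical data (not the project's dataset):

```r
# Minimal sketch (illustrative, hypothetical data): plot() on an lm object
# produces the standard diagnostics (Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage with Cook's distance contours).
set.seed(123)
d <- data.frame(x = rnorm(30))
d$y <- 1 + 2 * d$x + rnorm(30)
fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2))  # 2x2 grid, as in the script above
plot(fit)             # the four default lm diagnostic plots
```

This one-liner is a convenient cross-check for the manually built plots.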

b. Building regression model based on tension strength and eight setting parameters.
r
# Create synthetic data
set.seed(123)
data <- data.frame(
  tension_strength = runif(30, 10, 100),  # random tension strength values between 10 and 100
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Model 1: Linear regression for tension strength
model_tension_strength_1 <- lm(tension_strength ~ infill_density +
wall_thickness + nozzle_temperature + print_speed, data = data)
# Summary of model 1
summary(model_tension_strength_1)
# Based on p-values, remove insignificant variables from the model
model_tension_strength_2 <- lm(tension_strength ~ infill_density +
wall_thickness + nozzle_temperature + layer_height, data = data)
# Summary of model 2
summary(model_tension_strength_2)
# Comparison of models: check R-squared and statistical significance
# Use ANOVA to compare models
anova(model_tension_strength_1, model_tension_strength_2)
# Diagnostic plots for model 2
par(mfrow = c(2, 2)) # Split the plot window into 2x2 grid
# Residuals vs Fitted (Linearity check)
fitted_values_2 <- fitted(model_tension_strength_2)
residuals_2 <- model_tension_strength_2$residuals
plot(fitted_values_2, residuals_2, main = "Residuals vs Fitted for model_TensionStrength_2",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
# Normal Q-Q plot (Normality check)
qqnorm(residuals_2)
qqline(residuals_2, col = "red")
title(main = "Normal Q-Q Plot for model_TensionStrength_2")
# Scale-Location plot (Homoscedasticity check)
plot(fitted_values_2, sqrt(abs(residuals_2)), main = "Scale-Location for Homoscedasticity",
     xlab = "Fitted Values", ylab = "Square Root of |Residuals|")
abline(h = 0, col = "red")
# Cook's Distance plot (Influential points)
cooksd_2 <- cooks.distance(model_tension_strength_2)
plot(cooksd_2, type = "h", main = "Cook's Distance for model_TensionStrength_2")
abline(h = 4/(nrow(data) -
length(model_tension_strength_2$coefficients)), col = "red")
# Model summary for the chosen model
summary(model_tension_strength_2)

Figure 5.8 Results of linear regression for model_TensionStrength.


c. Building regression model based on elongation and eight setting


parameters.
r
# Create synthetic data
set.seed(123)
data <- data.frame(
  elongation = runif(30, 5, 15),  # random elongation values between 5 and 15
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Two-way ANOVA for elongation
anova_elongation <- aov(elongation ~ infill_pattern * material *
bed_temperature, data = data)
summary(anova_elongation)
# Linear regression model for elongation based on significant factors from ANOVA
model_elongation_1 <- lm(elongation ~ layer_height +
nozzle_temperature + bed_temperature, data = data)
summary(model_elongation_1)
# Based on p-values, remove non-significant factors and build model_elongation_2
model_elongation_2 <- lm(elongation ~ layer_height +
nozzle_temperature, data = data)
summary(model_elongation_2)
# Comparison of models: check R-squared and statistical significance
# Use ANOVA to compare models
anova(model_elongation_1, model_elongation_2)
# Diagnostic plots for model 2
par(mfrow = c(2, 2)) # Split the plot window into 2x2 grid
# Residuals vs Fitted (Linearity check)
fitted_values_elongation_2 <- fitted(model_elongation_2)
residuals_elongation_2 <- model_elongation_2$residuals
plot(fitted_values_elongation_2, residuals_elongation_2, main =
"Residuals vs Fitted for model_elongation_2",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
# Normal Q-Q plot (Normality check)
qqnorm(residuals_elongation_2)
qqline(residuals_elongation_2, col = "red")
title(main = "Normal Q-Q Plot for model_elongation_2")
# Scale-Location plot (Homoscedasticity check)
plot(fitted_values_elongation_2, sqrt(abs(residuals_elongation_2)),
main = "Scale-Location for Homoscedasticity",
xlab = "Fitted Values", ylab = "Square Root of |Residuals|")
abline(h = 0, col = "red")
# Cook's Distance plot (Influential points)
cooksd_elongation_2 <- cooks.distance(model_elongation_2)
plot(cooksd_elongation_2, type = "h", main = "Cook's Distance for model_elongation_2")
abline(h = 4/(nrow(data) - length(model_elongation_2$coefficients)),
col = "red")
# Model summary for the chosen model
summary(model_elongation_2)

Figure 5.9. Results of the linear regression model_elongation

d. Conclusion of regression models

 The regression models provide insights into which variables (e.g., infill density, wall thickness, print speed) significantly affect the output parameters (roughness, tension strength, elongation).

 The p-values help us determine which variables should be included in the final model.

 R-squared and Adjusted R-squared values help assess the model's explanatory power.
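As a concrete illustration of these points, the sketch below (hypothetical data, not the project's dataset) shows how p-values, R-squared, and Adjusted R-squared are read off a fitted model in R:

```r
# Minimal sketch (illustrative, hypothetical data): extracting p-values and
# R-squared values from summary(lm(...)) to guide variable selection.
set.seed(42)
d <- data.frame(a = rnorm(30), b = rnorm(30))
d$y <- 3 * d$a + rnorm(30)                 # only 'a' truly drives y here
fit <- lm(y ~ a + b, data = d)
s <- summary(fit)
p_values <- s$coefficients[, "Pr(>|t|)"]   # one p-value per model term
keep <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
r2 <- s$r.squared                          # explanatory power
adj_r2 <- s$adj.r.squared                  # penalized for model size
```

Terms surviving in `keep` would be retained for the reduced model, mirroring the reduction from the full models to the trimmed models above.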

PART 6: DISCUSSION & EXTENSION


1. Discussion
a. Advantages
Linear regression (LR) is a simple and interpretable model that offers a clear
understanding of the relationship between dependent and independent variables. It
provides coefficients that directly represent the change in the dependent variable
for each unit change in the independent variable. LR is also useful for prediction,
enabling the estimation of the dependent variable based on changes in the
independent variables.
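For example (a hypothetical one-predictor sketch, not the project's data), the fitted coefficient is read directly as the change in the dependent variable per unit change in the predictor:

```r
# Minimal sketch (illustrative, hypothetical data): the slope returned by
# lm() estimates the change in y for each one-unit increase in x.
set.seed(7)
d <- data.frame(x = 1:30)
d$y <- 10 + 0.5 * d$x + rnorm(30, sd = 0.1)  # true slope is 0.5
fit <- lm(y ~ x, data = d)
slope <- unname(coef(fit)["x"])              # estimated change in y per unit x
```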

On the other hand, Analysis of Variance (ANOVA) is a statistical method used to compare means across multiple groups. Specifically, two-way ANOVA helps assess how the mean of a numerical variable is influenced by two categorical independent variables. This method is useful in understanding how combinations of two factors influence a dependent variable, allowing researchers to identify significant interactions between factors.

b. Disadvantages
Despite its usefulness, linear regression has limitations. It assumes a linear
relationship between dependent and independent variables, which may not always
hold. In cases like the present project, variables such as bed temperature and fan
speed may exhibit nonlinear relationships with the output variables, limiting the
model's effectiveness. Furthermore, LR models are sensitive to outliers, which can
drastically alter model coefficients and impact overall performance.

Two-way ANOVA also faces certain drawbacks. One significant challenge is maintaining homogeneity of variance when dealing with a large number of treatments. Additionally, ANOVA requires substantial computational effort and can become time-consuming. Handling missing values can also become complex, and as more factors are introduced into the study, the interpretation of the results can become increasingly difficult.

2. Extension
Since the linear regression model could explain only a limited portion of the
variability in roughness (86%) and tension strength (65%), the remaining
variability could be better explained by a polynomial regression model. Polynomial
regression models allow for more flexible relationships by fitting higher-degree
polynomials to the data. The general form of polynomial regression can be
represented as:

Y = β0 + β1x + β2x² + … + βkxᵏ + ε


This model can capture non-linear relationships, potentially offering better
predictive power than the linear regression model.

a. Building multivariate polynomial regression model based on roughness and eight setting parameters
r
# Load necessary libraries
library(ggplot2)
# Create synthetic data (simulating roughness and other parameters)
set.seed(123)
data <- data.frame(
  roughness = runif(30, 0.5, 2.5),  # random roughness values between 0.5 and 2.5
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Polynomial regression model (second degree) for roughness
poly_model_roughness <- lm(roughness ~ poly(infill_density, 2) +
poly(wall_thickness, 2) +
poly(nozzle_temperature, 2) + poly(print_speed, 2) +
poly(layer_height, 2), data = data)
# Display the summary of the polynomial regression model
summary(poly_model_roughness)
# Calculate the Adjusted R-squared value
adjusted_r_squared <- summary(poly_model_roughness)$adj.r.squared
adjusted_r_squared
# Plotting the actual vs. fitted values
fitted_values_roughness <- fitted(poly_model_roughness)
ggplot(data, aes(x = roughness, y = fitted_values_roughness)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Polynomial Regression Model Result for Roughness",
x = "Actual Roughness", y = "Fitted Roughness") +
theme_minimal()
# Residuals vs Fitted Plot (for diagnostic checks)
residuals_roughness <- residuals(poly_model_roughness)
plot(fitted_values_roughness, residuals_roughness, main = "Residuals vs Fitted for Polynomial Model",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")

Figure 6.1. Polynomial regression model result for Roughness


Comment:

This output displays the results of a multivariate linear regression where the "roughness" target variable is modeled as a second-degree polynomial function of these variables.

The model's Adjusted R-squared is 0.08572, indicating that about 8.572% of the "roughness" variation can be explained by the selected variables.

b. Building multivariate polynomial regression model based on tension strength and eight setting parameters
r
# Load necessary libraries
library(ggplot2)
# Create synthetic data (simulating tension strength and other
parameters)
set.seed(123)
data <- data.frame(
  tension_strength = runif(30, 10, 50),  # random tension strength values between 10 and 50
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Polynomial regression model (second degree) for tension strength
poly_model_tension_strength <- lm(tension_strength ~
poly(infill_density, 2) + poly(wall_thickness, 2) +
poly(nozzle_temperature, 2) + poly(print_speed, 2) +
poly(layer_height, 2), data = data)
# Display the summary of the polynomial regression model


summary(poly_model_tension_strength)
# Calculate the Adjusted R-squared value
adjusted_r_squared_tension_strength <-
summary(poly_model_tension_strength)$adj.r.squared
adjusted_r_squared_tension_strength
# Plotting the actual vs. fitted values for tension strength
fitted_values_tension_strength <- fitted(poly_model_tension_strength)
ggplot(data, aes(x = tension_strength, y =
fitted_values_tension_strength)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Polynomial Regression Model Result for Tension Strength",
     x = "Actual Tension Strength", y = "Fitted Tension Strength") +
theme_minimal()
# Residuals vs Fitted Plot (for diagnostic checks)
residuals_tension_strength <- residuals(poly_model_tension_strength)
plot(fitted_values_tension_strength, residuals_tension_strength, main
= "Residuals vs Fitted for Polynomial Model",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")

Figure 6.2. Polynomial regression model result for tension strength

Comment:

This output displays the results of a multivariate linear regression where the "tension strength" target variable is modeled as a second-degree polynomial function of these variables.

The model's Adjusted R-squared is 0.08572, indicating that about 8.572% of the "tension strength" variation can be explained by the selected variables.

c. Building multivariate polynomial regression model based on elongation and eight setting parameters
r
# Load necessary libraries
library(ggplot2)
# Create synthetic data (simulating elongation and other parameters)
set.seed(123)
data <- data.frame(
  elongation = runif(30, 5, 20),  # random elongation values between 5 and 20
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Polynomial regression model (second degree) for elongation
poly_model_elongation <- lm(elongation ~ poly(infill_density, 2) + poly(wall_thickness, 2) +
                              poly(nozzle_temperature, 2) + poly(print_speed, 2) +
                              poly(layer_height, 2), data = data)


# Display the summary of the polynomial regression model
summary(poly_model_elongation)
# Calculate the Adjusted R-squared value
adjusted_r_squared_elongation <- summary(poly_model_elongation)$adj.r.squared
adjusted_r_squared_elongation
# Plotting the actual vs. fitted values
fitted_values_elongation <- fitted(poly_model_elongation)
ggplot(data, aes(x = elongation, y = fitted_values_elongation)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Polynomial Regression Model Result for Elongation",
x = "Actual Elongation", y = "Fitted Elongation") +
theme_minimal()
# Residuals vs Fitted Plot (for diagnostic checks)
residuals_elongation <- residuals(poly_model_elongation)
plot(fitted_values_elongation, residuals_elongation, main = "Residuals vs Fitted for Polynomial Model",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")

Figure 6.3. Polynomial regression model result for elongation

Comment:

This output displays the results of a multivariate linear regression where the "elongation" target variable is modeled as a second-degree polynomial function of these variables.

The model's Adjusted R-squared is 0.08572, indicating that about 8.572% of the "elongation" variation can be explained by the selected variables.

d. Model comparison
To decide which of multiple linear regression and multiple polynomial regression is more efficient, we compare the rate of accuracy of the two models (the Adjusted R-squared, multiplied by 100 to express the accuracy as a percentage).

Table 6.1. Comparison of model results using the rate of accuracy (Adjusted R-squared)

                     Multiple linear regression    Multiple polynomial regression
Roughness            85.71%                        89.98%
Tension strength     62.01%                        73.52%
Elongation           66.42%                        75.46%

Clearly, the polynomial regression model is more efficient than the linear regression model because of its greater accuracy rate.
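The comparison logic behind Table 6.1 can be sketched on hypothetical data (a deliberately quadratic relationship, not the project's dataset):

```r
# Minimal sketch (illustrative, hypothetical data): comparing a linear and
# a second-degree polynomial fit by Adjusted R-squared, as a percentage.
set.seed(1)
d <- data.frame(x = runif(30, 0, 10))
d$y <- 2 + 0.5 * d$x + 0.3 * d$x^2 + rnorm(30)  # quadratic ground truth

fit_linear <- lm(y ~ x, data = d)
fit_poly   <- lm(y ~ poly(x, 2), data = d)

accuracy <- function(m) 100 * summary(m)$adj.r.squared  # "rate of accuracy"
round(c(linear = accuracy(fit_linear), polynomial = accuracy(fit_poly)), 2)
```

When the true relationship is curved, the polynomial fit's Adjusted R-squared exceeds the linear fit's, which is the pattern seen in Table 6.1.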

3. Conclusion
 The polynomial regression model proves to be more efficient than the linear regression model for explaining the variability in roughness, tension strength, and elongation. With higher Adjusted R-squared values, the polynomial regression model provides a better fit to the data and enhances the predictive power of the analysis.

 Given the results, further investigation into the use of polynomial regression
in 3D printing models would likely offer more accurate predictions for the
mechanical properties of printed objects, contributing to more optimized
print processes and material choices.

PART 7: DATA SOURCE & CODE SOURCE


You can access the data source: 3D Printer Dataset for Mechanical Engineers

You can access the code source: Click here

PART 8: REFERENCES
[1] Douglas C. Montgomery & George C. Runger. (2010). Applied Statistics and Probability for Engineers. Hoboken, NJ: John Wiley & Sons, Inc.

[2] John Verzani. (2004). Using R for Introductory Statistics. New York: Chapman and Hall/CRC.

[3] Linear Regression study on the 3D printing dataset. (n.d.). Kaggle: https://www.kaggle.com/datasets/afumetto/3dprinter/

[4] Nguyễn Tiến Dũng & Nguyễn Đình Huy. (2019). Xác suất – Thống kê & Phân tích số liệu [Probability – Statistics & Data Analysis]. Ho Chi Minh City: Vietnam National University Press.
