Probability and Statistics CC01 Group 3


VIETNAM NATIONAL UNIVERSITY HO CHI MINH CITY

HO CHI MINH UNIVERSITY OF TECHNOLOGY


--------

PROBABILITY AND STATISTICS


CLASS: CC01

GROUP 03

3D PRINTER DATASET

FOR MECHANICAL ENGINEERS

Instructor: Nguyễn Tiến Dũng

No.  Member            ID
1    Nguyễn Hòa Hiệp   2352342
2    Hồ Minh Hoàng     2352348
3    Đặng Hữu Huy      2352369



TABLE OF CONTENTS
PART 1: DATA INTRODUCTION
1. Dataset description
2. Variable description
3. Summary
PART 2: THEORETICAL BASIS
1. Analysis of one-factor variance
2. Multiple Linear Regression
a. Multiple linear regression
b. Population regression function (PRF)
c. Sample regression function (SRF)
d. Assumptions of the least squares method for multiple linear regression models
e. Model fit metrics
f. Confidence interval and Hypothesis testing
g. Testing the overall significance of the model (Special Case of WALD Test)
PART 3: DATA PREPROCESSING
1. Data importing
2. Data cleaning
a. Handling missing values
b. Handling outliers and inconsistent data
c. Data transformation
3. Feature Engineering
PART 4: DESCRIPTIVE STATISTICS
1. Data summary
2. Plot data
a. Box plot
b. Correlation coefficients between variables
PART 5: INFERENTIAL STATISTICS
1. Using Two-way ANOVA to evaluate how qualitative factors affect output parameters


a. Assumptions of the Two-way ANOVA
b. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect roughness
c. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect tension_strength
d. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect elongation
e. Conclusion of Two-way ANOVA
2. Building regression model
a. Building regression model based on roughness and eight setting parameters
b. Building regression model based on tension strength and eight setting parameters
c. Building regression model based on elongation and eight setting parameters
d. Conclusion of regression models
PART 6: DISCUSSION & EXTENSION
1. Discussion
a. Advantages
b. Disadvantages
2. Extension
a. Building multivariate polynomial regression model based on roughness and eight setting parameters
b. Building multivariate polynomial regression model based on tension strength and eight setting parameters
c. Building multivariate polynomial regression model based on elongation and eight setting parameters
d. Model comparison
3. Conclusion
PART 7: DATA SOURCE & CODE SOURCE
PART 8: REFERENCES


PART 1: DATA INTRODUCTION


This report analyzes a 3D printer dataset to understand the performance and
properties of 3D-printed materials and the impact of various printing parameters.
The dataset originates from research conducted by the TR/Selcuk University
Mechanical Engineering Department, focusing on the Ultimaker S5 3D printer.
The aim is to explore how adjustments in printer settings influence print quality,
accuracy, and mechanical strength.

Key components of the dataset include nine input parameters, such as infill
density, nozzle temperature, and print speed, alongside three output parameters:
roughness, tension strength, and elongation. With 50 samples, the dataset provides
insights into optimizing printing settings for producing high-quality, durable parts.

1. Dataset description
The dataset consists of experimental results from 3D printing tests performed on
the Ultimaker S5 3D printer, with material strength tested using a Sincotec GMBH
tester capable of applying up to 20 kN. Each observation represents a unique 3D-
printed sample.

- Population: All 3D-printed samples produced using the Ultimaker S5 printer.


- Sample Size: 50 observations, including nine input parameters and three
measured output parameters.
- Objective: To identify relationships between 3D printing settings, output
parameters, and material properties.

2. Variable description
Variable             Data type and range                  Units   Description

Setting parameters
layer_height         x ∈ R, 0.02 ≤ x ≤ 0.2, continuous    mm
wall_thickness       x ∈ N, 1 ≤ x ≤ 10, continuous        mm
infill_density       x ∈ N, 10 ≤ x ≤ 90, continuous       %
infill_pattern       x = 0 or x = 1, categorical          none    0 = “grid”, 1 = “honeycomb”
nozzle_temperature   x ∈ N, 200 ≤ x ≤ 250, continuous     °C
bed_temperature      x ∈ N, 60 ≤ x ≤ 80, continuous       °C
print_speed          x ∈ N, 40 ≤ x ≤ 120, continuous      mm/s
material             x = 0 or x = 1, categorical          none    0 = “abs”, 1 = “pla”
fan_speed            x ∈ N, 0 ≤ x ≤ 100, continuous       %

Output parameters (measured)
roughness            x ∈ N, 21 ≤ x ≤ 368, continuous      µm

Table 1.1. Setting parameters & output parameters
3. Summary
The dataset from the TR/Selcuk University Mechanical Engineering Department
provides a robust foundation for analyzing the relationships between 3D printer
settings and the mechanical properties of printed materials. Through 50
observations, the study reveals critical insights into optimizing settings such as
infill density, nozzle temperature, and wall thickness to achieve desired outcomes
like enhanced tension strength and elongation.

The inclusion of both categorical and continuous variables enables a


comprehensive exploration of 3D printing dynamics. By understanding the dataset
and leveraging statistical and machine learning models, this study lays the
groundwork for improving 3D printing processes in manufacturing, engineering,
and design applications.


PART 2: THEORETICAL BASIS


In this section, we explore the theoretical concepts behind the statistical methods applied to the 3D Printer dataset for mechanical engineers. These methods are essential for analyzing the impact of different printing parameters on the mechanical properties of the printed parts. Specifically, we focus on Analysis of Variance (ANOVA) and Multiple Linear Regression, which are widely used in engineering experiments to compare the means of different groups and to model relationships between variables.

1. Analysis of one-factor variance


The goal of One-Factor ANOVA is to determine whether there is a statistically
significant difference between the means of two or more groups based on a single
factor. In the context of 3D printing, one might use ANOVA to test whether
different printing parameters (such as infill_density, wall_thickness, print_speed,
or material) significantly impact a dependent variable, such as the tension_strength
of the printed material.

The general framework of a One-Factor ANOVA is as follows:

 Null hypothesis (H0) assumes that all groups have the same mean (i.e., the
printing parameters do not affect the mechanical properties).

 The alternative hypothesis (H1) assumes that at least one of the groups has a
different mean (i.e., the printing parameters do affect the mechanical
properties).

The ANOVA test calculates the F-statistic which compares the variance between
the groups to the variance within the groups. If the F-statistic is large enough, it
suggests that there is a significant difference between the means, and the null
hypothesis is rejected.
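As an illustration, a one-factor ANOVA of this kind can be run in R with aov(). The values below are simulated for the sketch, not taken from the 3D printer dataset:

```r
# One-factor ANOVA sketch on simulated data (illustrative values,
# not the project's measurements)
set.seed(1)
sim <- data.frame(
  infill_level     = factor(rep(c("low", "medium", "high"), each = 10)),
  tension_strength = c(rnorm(10, 20, 2), rnorm(10, 25, 2), rnorm(10, 30, 2))
)
fit <- aov(tension_strength ~ infill_level, data = sim)
summary(fit)  # F-statistic and p-value for the group effect
```

A small p-value in the summary would lead us to reject H0 that all group means are equal.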

Probability and statistics – CC01


6
7

a. Theory of ANOVA (Analysis of Variance)

ANOVA is based on the following assumptions:

1. Independence of observations: The data points (samples) must be


independent of each other.

2. Normality: The data within each group should be approximately normally


distributed.

3. Homogeneity of variances: The variance within each group should be roughly


equal. This assumption can be tested using Levene's Test or Bartlett’s Test for
homogeneity of variances.
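For example, Bartlett's test is available in base R (Levene's test requires an add-on package such as car); on simulated equal-variance groups:

```r
# Bartlett's test for homogeneity of variances (simulated groups)
set.seed(2)
grp <- factor(rep(c("A", "B", "C"), each = 15))
y   <- rnorm(45, mean = 10, sd = 2)
bartlett.test(y ~ grp)  # p-value assesses H0: equal variances across groups
```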

The test statistic used in ANOVA is the F-statistic, which is calculated as:
F = (Between-group variance) / (Within-group variance)

Where:

- Between-group variance measures how much the group means deviate


from the overall mean.

- Within-group variance measures how much individual observations


deviate from their respective group means.

The p-value corresponding to the F-statistic is used to determine whether the


differences between the group means are statistically significant. If the p-value is
below a predefined significance level (e.g., 0.05), we reject the null hypothesis and
conclude that the means are different.

b. Analysis of Single-Factor Variance

For the 3D Printer dataset, Single-Factor ANOVA can examine how a single factor,
such as infill_density, influences tension_strength or elongation. For example:


 Hypothesis:

 H₀: Means of the groups (low, medium, high infill density) are equal.

 H₁: At least one group mean is different.

 Steps:

1. Calculate the F-statistic: Using the formula mentioned earlier, we calculate the
F-statistic to compare the between-group and within-group variances.

2. Determine the p-value: The p-value helps us assess the strength of the
evidence against the null hypothesis. A p-value smaller than 0.05 indicates that
there is a statistically significant difference in means.

3. Post-hoc analysis: If ANOVA indicates significant differences, post-hoc tests


(such as Tukey's HSD) can be used to identify which specific groups differ
from each other.

Example: Suppose we want to test the effect of infill_density on tension_strength.


We divide the data into three groups based on infill density levels (low, medium,
and high) and compare tension_strength across these groups using ANOVA.

 Null hypothesis: There is no significant difference in tension_strength between


low, medium, and high infill densities.

 Alternative hypothesis: There is a significant difference in tension_strength


between at least one pair of infill densities.

After performing the ANOVA, if the F-statistic is significantly large and the p-
value is below 0.05, we can conclude that the infill density does influence the
tension strength of the printed material.
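These steps can be sketched in R on simulated infill-density groups (the group means below are assumptions for the simulation, not dataset estimates):

```r
# One-factor ANOVA followed by Tukey's HSD post-hoc test (simulated data)
set.seed(3)
sim <- data.frame(
  density  = factor(rep(c("low", "medium", "high"), each = 8)),
  strength = c(rnorm(8, 18, 2), rnorm(8, 24, 2), rnorm(8, 31, 2))
)
fit <- aov(strength ~ density, data = sim)
summary(fit)   # overall F-test
TukeyHSD(fit)  # which specific pairs of groups differ
```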


2. Multiple linear regression


Multiple linear regression (MLR) is a statistical technique used to model the
relationship between a dependent variable and multiple independent variables. In
the context of the 3D Printer dataset, we apply MLR to predict mechanical
properties such as tension_strength, elongation, or roughness based on various
printing parameters, such as infill_density, wall_thickness, print_speed, material,
and others. Multiple linear regression allows engineers to understand the
contribution of each parameter and predict material behavior under different
conditions.

a. Multiple linear regression


The goal of Multiple Linear Regression is to find the linear relationship between
a dependent variable (Y) and multiple independent variables (x1, x2, x3,…xk ):

Y=β0+β1x1+β2x2+⋯+βkxk+ϵ
Where:

 Y is the dependent variable (e.g., tension_strength or roughness).

 x1, x2, x3,…xk are the independent variables (e.g., infill_density, wall_thickness,
material, etc.).

 β0 is the intercept term, which represents the value of Y when all x i variables are
equal to zero.

 β1,β2,…,βk are the coefficients that represent the effect of each independent
variable on the dependent variable.

 ϵ is the error term, capturing the unexplained variation in Y.

The model estimates the β coefficients, which indicate how much the dependent
variable changes for a one-unit change in each independent variable, holding the

other variables constant. The regression model is fitted to the data using methods
like Ordinary Least Squares (OLS) to minimize the sum of squared errors between
the observed values and the predicted values.

Key assumptions of multiple linear regression:

1. Linearity: The relationship between the dependent variable and the


independent variables should be linear.

2. Independence: Observations should be independent of each other.

3. Homoscedasticity: The variance of errors should be constant across all levels


of the independent variables.

4. Normality of errors: The residuals (errors) should be approximately normally


distributed.

Multiple linear regression can be used to:

 Estimate the impact of multiple factors on material properties in 3D printing.

 Predict the mechanical properties of printed parts under various conditions.

 Identify significant predictors for improving the quality and performance of 3D


printed parts.

The method of linear regression involves fitting a linear equation to observed data.
This equation represents the relationship between a dependent variable (Y) and one
or more independent variables (X). Once the coefficients are estimated, the linear
regression model can be used to make predictions for the dependent variable based
on new values of the independent variables. The method is widely used in various
fields for prediction, inference, and understanding the relationship between
variables.
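A minimal MLR sketch in R with lm(), using simulated printing parameters (the coefficients in the simulation are assumptions for illustration, not estimates from the real dataset):

```r
# Multiple linear regression sketch on simulated printing parameters
set.seed(4)
n <- 50
sim <- data.frame(
  infill_density = runif(n, 10, 90),
  wall_thickness = runif(n, 1, 10),
  print_speed    = runif(n, 40, 120)
)
# Simulated response: known coefficients plus random error
sim$tension_strength <- 5 + 0.10 * sim$infill_density +
  0.50 * sim$wall_thickness - 0.02 * sim$print_speed + rnorm(n, sd = 1)

fit <- lm(tension_strength ~ infill_density + wall_thickness + print_speed,
          data = sim)
coef(fit)  # estimated intercept and slope coefficients (beta-hat)
predict(fit, newdata = data.frame(infill_density = 50,
                                  wall_thickness = 5,
                                  print_speed    = 80))
```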


b. Population regression function (PRF)


The Population Regression Function (PRF) represents the true relationship
between the dependent variable and the independent variables in the entire
population. It is a theoretical model that describes how the dependent variable
would behave if we had access to the full population data. In the context of the 3D
Printer dataset, the PRF could look something like this:

Y=β0+β1x1+β2x2+⋯+βkxk
Where:

Y represents the mechanical property (e.g., tension_strength)

x1, x2, x3,…xk represent the various printing parameters (e.g., infill_density,
wall_thickness, print_speed, etc.).

The PRF is typically unknown because we usually work with a sample of the
population. It assumes that the relationship between the dependent and independent
variables holds true across the population. This is the ideal model, and the goal of
regression analysis is to estimate the parameters β0, β1, …, βk as accurately as
possible using sample data.

c. Sample regression function (SRF)


The Sample Regression Function (SRF) is an estimate of the population regression
function, based on data from a sample. The SRF is computed using the sample
data, and it is the model we actually use for prediction and inference in practice.
The SRF is:

Ŷ = β̂0 + β̂1x1 + β̂2x2 + ⋯ + β̂kxk

Where:

 Ŷ is the predicted value of the dependent variable.



 ^
β0, ^
β 1 ,… ^
β k : the estimated coefficients.

The formula for estimating the coefficients (β̂) in the SRF is:

β̂ = (XᵀX)⁻¹ XᵀY

Where:
 X is the matrix of independent variables (including a column of 1s for the intercept).
 Y is the vector of observed dependent variable values.
 Xᵀ is the transpose of the X matrix.
 (XᵀX)⁻¹ is the inverse of the product of Xᵀ and X.

The SRF is derived using statistical techniques such as Ordinary Least Squares
(OLS), which minimizes the sum of squared residuals (the difference between the
observed values and the predicted values) across the sample.
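The OLS estimate β̂ = (XᵀX)⁻¹XᵀY can be computed directly with matrix algebra and checked against lm(); the data below are simulated for the sketch:

```r
# Manual OLS via the normal equations, compared with lm() (simulated data)
set.seed(5)
x1 <- runif(30, 0, 10)
x2 <- runif(30, 0, 5)
y  <- 2 + 1.5 * x1 - 0.8 * x2 + rnorm(30, sd = 0.5)

X <- cbind(1, x1, x2)                        # design matrix, first column = intercept
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y # (X'X)^{-1} X'y
beta_hat
coef(lm(y ~ x1 + x2))                        # same estimates via lm()
```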

The SRF is used for:

 Prediction: To forecast the values of the dependent variable (e.g.,


tension_strength) based on new observations of the independent variables (e.g.,
infill_density, wall_thickness, etc.).

 Inference: To test hypotheses about the relationships between the independent


and dependent variables.

Example from the 3D Printer Dataset:

In the case of the 3D Printer dataset, we may use multiple linear regression to
predict tension_strength based on various independent variables such as
infill_density, wall_thickness, material, and print_speed. The SRF might be:

Tension strength = 2.5 + 0.1·Infill density + 0.05·Wall thickness − 0.02·Print speed


d. Assumptions of the least squares method for multiple linear


regression models
The Least Squares Method is a popular approach to estimate the coefficients in
multiple linear regression (MLR). It minimizes the sum of squared residuals (the
difference between observed and predicted values) to find the best-fitting line or
hyperplane for the data. For the method to yield reliable estimates, several
assumptions must be met. These assumptions are crucial for ensuring the validity
and interpretability of the regression results:

1. Linearity: The relationship between the dependent variable and each


independent variable must be linear. This assumption is fundamental, as it
ensures the regression model accurately reflects the nature of the relationship.

2. Independence: The residuals (errors) should be independent of one another.


This means that the residual for one observation should not provide any
information about the residual for another observation. Violation of this
assumption typically occurs when the data are serially correlated, often seen in
time-series data.

3. Homoscedasticity: The variance of the errors should be constant across all


levels of the independent variables. In other words, the spread of residuals
should not increase or decrease as the values of the independent variables
change. If heteroscedasticity is present, it can affect the efficiency of the
coefficient estimates.

4. Normality of Errors: The residuals should follow a normal distribution. This


assumption is particularly important for hypothesis testing and constructing
confidence intervals for the regression coefficients. If the errors are not normally
distributed, it could affect the validity of statistical tests.


5. No Multicollinearity: The independent variables should not be highly correlated


with each other. When multicollinearity is present, it can make it difficult to
determine the individual effect of each predictor variable on the dependent
variable, leading to inflated standard errors and unstable coefficient estimates.

6. No Autocorrelation: Residuals should not be correlated with one another. In


time-series data, autocorrelation can occur, violating this assumption and leading
to biased results. The Durbin-Watson test is commonly used to check for
autocorrelation.

If any of these assumptions are violated, the results from the least squares method
may be biased or misleading, requiring corrective actions like transforming the
variables, applying robust regression methods, or using generalized least squares.
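Some of these checks can be sketched in base R (packages such as car, for variance inflation factors, or lmtest, for the Durbin-Watson test, offer more formal checks and are not used here); data are simulated:

```r
# Basic assumption checks for an MLR fit (simulated data)
set.seed(6)
sim <- data.frame(x1 = runif(40), x2 = runif(40))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(40, sd = 0.3)
fit <- lm(y ~ x1 + x2, data = sim)

shapiro.test(residuals(fit))  # normality of residuals
cor(sim$x1, sim$x2)           # screen predictors for multicollinearity
# plot(fit)                   # residual-vs-fitted and Q-Q plots (graphical checks)
```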

e. Model fit metrics


In multiple linear regression, assessing the model fit is crucial to understand how
well the model explains the variation in the dependent variable. There are several
key metrics used to evaluate the fit of the regression model:

1. R-squared (R2): The coefficient of determination, R2, represents the proportion


of the variance in the dependent variable that is explained by the independent
variables in the model. It ranges from 0 to 1, where a value closer to 1 indicates
a better fit. For instance, an R2 of 0.85 means that 85% of the variation in the
dependent variable is explained by the model.

However, R2 can be misleading if the model has too many predictors. To


address this, we often use the Adjusted R-squared, which adjusts for the number
of predictors in the model.

2. Adjusted R-squared: The adjusted R2 penalizes the inclusion of irrelevant


predictors. It is calculated as:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)


Where:

 n is the number of observations,


 k is the number of predictors.

This metric gives a more accurate measure of the model's explanatory power
when multiple predictors are included.

3. F-statistic: The F-statistic tests whether at least one of the predictors in the
model has a non-zero coefficient. A high F-statistic (with a corresponding low
p-value) suggests that the model fits the data well and that the independent
variables have explanatory power.

4. Residual Plots: Graphical plots of the residuals can also help assess the model
fit. Ideally, the residuals should be randomly scattered around zero with no
discernible pattern, indicating a good fit. If residuals show a clear pattern, the
model may not be appropriately specified.
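These fit metrics are reported by summary() on a fitted lm object; a sketch on simulated data:

```r
# R^2, adjusted R^2 and the overall F-statistic from summary() (simulated data)
set.seed(7)
sim <- data.frame(x = runif(60, 0, 10))
sim$y <- 3 + 2 * sim$x + rnorm(60, sd = 1)
fit <- lm(y ~ x, data = sim)

s <- summary(fit)
s$r.squared      # proportion of variance explained
s$adj.r.squared  # penalized for the number of predictors
s$fstatistic     # overall F-statistic with its degrees of freedom
```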

f. Confidence interval and Hypothesis testing


In multiple linear regression, we often want to test hypotheses about the regression
coefficients and compute confidence intervals for these coefficients to understand
the precision of the estimates.

1. Confidence Interval for Regression Coefficients:

A confidence interval (CI) for a regression coefficient provides a range of values


within which the true population parameter is likely to fall with a certain level of
confidence (e.g., 95%). For example, a 95% CI for a coefficient of β 1 might range
from 0.05 to 0.15, indicating that we are 95% confident that the true coefficient
falls within this range. The formula for the confidence interval for a regression
coefficient is:

β̂i ± t(α/2) × SE(β̂i)
Where:

 β̂i is the estimated coefficient for the i-th predictor,

 t(α/2) is the critical value from the t-distribution for the desired confidence level,

 SE(β̂i) is the standard error of the coefficient.

2. Hypothesis Testing for Regression Coefficients:

To test the significance of each regression coefficient, we use a hypothesis test.


The null hypothesis is that the coefficient is equal to zero (i.e., no relationship
between the predictor and the dependent variable), while the alternative
hypothesis is that the coefficient is non-zero.

Null hypothesis (H₀): βᵢ = 0 (No relationship between the predictor and the
dependent variable)
Alternative hypothesis (H₁): βᵢ ≠ 0 (A relationship exists between the
predictor and the dependent variable)

The test statistic is calculated as:

t = β̂i / SE(β̂i)

Where:
 β̂i is the estimated coefficient for predictor i.
 SE(β̂i) is the standard error of the coefficient.

A large absolute value of t suggests that the predictor is significantly related to the
dependent variable. The corresponding p-value helps us determine whether to
reject or fail to reject the null hypothesis. If the p-value is smaller than the
significance level (e.g., 0.05), we reject the null hypothesis, indicating that the
predictor has a significant impact on the dependent variable.
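In R, confint() returns the intervals β̂i ± t(α/2)·SE(β̂i) and summary() reports the t-statistics and p-values; a sketch on simulated data:

```r
# Confidence intervals and coefficient t-tests for an MLR fit (simulated data)
set.seed(8)
sim <- data.frame(x1 = runif(50, 0, 10), x2 = runif(50, 0, 10))
sim$y <- 1 + 0.9 * sim$x1 + rnorm(50, sd = 1)  # x2 has no true effect
fit <- lm(y ~ x1 + x2, data = sim)

confint(fit, level = 0.95)  # 95% CIs for the intercept, x1 and x2
summary(fit)$coefficients   # estimates, SEs, t-values and p-values
```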


g. Testing the overall significance of the model (Special Case of


WALD Test)
The Wald Test is a statistical test used to assess the overall significance of a
regression model, specifically whether all of the regression coefficients are
simultaneously equal to zero. This is a special case of the Wald test used to
evaluate the null hypothesis that all coefficients in the regression model are zero,
meaning that none of the predictors have any effect on the dependent variable.

 The null hypothesis for the Wald test is:

H0 : β1 = β2 = ⋯ = βk = 0

H1 : At least one βk ≠ 0

Where β1, β2, ⋯ , βk are the regression coefficients. The alternative hypothesis is
that at least one of the coefficients is not zero.

 The Wald test statistic evaluates whether the β̂ coefficients significantly
differ from zero. The formula for the Wald test statistic is:

W = β̂ᵀ [Var(β̂)]⁻¹ β̂

Where:

 β̂ is the vector of estimated coefficients.

 Var(β̂) = σ̂²(XᵀX)⁻¹ is the estimated variance–covariance matrix of the coefficients, computed from the information matrix XᵀX of the regression model.

The Wald test statistic follows a chi-square distribution with degrees of freedom
equal to the number of coefficients being tested. If the test statistic is large enough
(i.e., the p-value is small), we reject the null hypothesis, indicating that the model


has at least one predictor that significantly explains the variation in the dependent
variable.

In the context of the 3D Printer dataset, the Wald test can be applied to test whether
the predictors such as infill_density, wall_thickness, and print_speed together
contribute to explaining the mechanical properties (e.g., tension_strength) of the
printed parts. A significant result from the Wald test suggests that the regression
model is meaningful and the predictors are important for understanding the
material properties in the context of 3D printing.
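As a sketch, the overall null H0: β1 = ⋯ = βk = 0 can be tested in R by comparing the full model with the intercept-only model via anova(); for a linear model this F-test plays the role described above (the Wald chi-square version is asymptotically equivalent). Data are simulated:

```r
# Overall significance: full model vs intercept-only model (simulated data)
set.seed(9)
sim <- data.frame(x1 = runif(40, 0, 5), x2 = runif(40, 0, 5))
sim$y <- 2 + 1.2 * sim$x1 + 0.7 * sim$x2 + rnorm(40, sd = 0.8)

full <- lm(y ~ x1 + x2, data = sim)
null <- lm(y ~ 1, data = sim)
anova(null, full)  # F-test of H0: beta1 = beta2 = 0
```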

PART 3: DATA PREPROCESSING

Data preprocessing is an essential part of data analysis, as it ensures that the dataset
is clean, accurate, and ready for further analysis and modeling. In this section, we
will cover how to import, clean, handle missing values, and prepare the data for
analysis in R.

1. Data importing
The first step in any data analysis workflow is to import the dataset into R. Below
is the process for importing the dataset into R using the read.csv() function, after
installing and loading the necessary R packages.

Figure 3.1. Installing and loading the required packages

# Install necessary packages
install.packages("dplyr")
install.packages("tidyverse")
install.packages("ggpubr")
install.packages("corrplot")

# Load libraries
library(dplyr)
library(ggplot2)
library(ggpubr)
library(corrplot)

# Create sample data for the 3D printing project
data <- data.frame(
  infill_density     = c(20, 30, 40, 50, 60),
  print_speed        = c(60, 70, 80, 90, 100),
  nozzle_temperature = c(200, 210, 220, 230, 240),
  tension_strength   = c(40, 42, 44, 45, 46),
  elongation         = c(10, 11, 12, 13, 14)
)

# Save the data to a CSV file
write.csv(data, "3dprinter.csv", row.names = FALSE)

# Display the current working directory
print("Current working directory:")
print(getwd())

# Read the data from the CSV file
data <- read.csv("3dprinter.csv", header = TRUE)

# Display the first rows of the dataset
head(data)

# Check the structure of the dataset
print("Dataset structure:")
str(data)

# Summarize the dataset
print("Dataset summary:")
summary(data)

# Plot the correlation between numeric variables
correlation_matrix <- cor(data[, sapply(data, is.numeric)])
corrplot(correlation_matrix, method = "circle")


Figure 3.2. Code R and result of reading the data


2. Data cleaning
Data cleaning is a crucial step in the preprocessing pipeline. This involves handling
missing values, removing duplicates, and ensuring that all variables are correctly
formatted.

a. Handling missing values

In R, the is.na() function helps identify missing values. Once identified, you can
either remove or impute these missing values. For example, you can fill missing
values with the mean or median for numerical columns.
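A minimal sketch of counting and mean-imputing missing values (toy vector, not the project's data):

```r
# Identify and mean-impute missing values in a numeric vector
v <- c(4.2, NA, 5.1, 6.0, NA, 4.8)
sum(is.na(v))                         # number of missing entries
v[is.na(v)] <- mean(v, na.rm = TRUE)  # replace NAs with the column mean
v
```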

r
# Install necessary packages
required_packages <- c("dplyr", "tidyr", "caret", "writexl")
install.packages(setdiff(required_packages,
rownames(installed.packages())))
# Load libraries
library(dplyr) # For data manipulation
library(tidyr) # For handling missing data
library(caret) # For scaling and encoding categorical variables
library(writexl) # For exporting to Excel
# Step 1: Generate a small dataset (5 rows, 8 columns)
set.seed(123) # Ensures reproducibility
raw_data <- data.frame(
  tension_strength = runif(5, 10, 50),       # Simulated tension strength
  elongation = runif(5, 5, 15),              # Simulated elongation
  roughness = runif(5, 0.5, 2.0),            # Simulated surface roughness
  infill_density = runif(5, 10, 90),         # Infill density percentage
  wall_thickness = runif(5, 0.5, 2.0),       # Wall thickness in mm
  nozzle_temperature = sample(180:240, 5, replace = TRUE),  # Nozzle temperature
  print_speed = sample(20:100, 5, replace = TRUE),          # Print speed in mm/s
  layer_height = runif(5, 0.1, 0.3)          # Layer height in mm
)
# Display raw data
cat("Raw Data:\n")
print(raw_data)
# Step 2: Normalize numerical data
clean_data <- raw_data %>%
  mutate(across(where(is.numeric), scale))  # Normalize all numeric columns
# Display cleaned data
cat("\nCleaned Data (Normalized):\n")
print(clean_data)
# Step 3: Verify the structure of the cleaned dataset
cat("\nStructure of the Cleaned Data:\n")
str(clean_data)
# Step 4: Export the cleaned data to CSV
csv_output_file <- "cleaned_data_5x8.csv"
write.csv(clean_data, csv_output_file, row.names = FALSE)
cat("\nThe cleaned data has been saved as a CSV file:",
csv_output_file, "\n")
# Step 5: Export the cleaned data to Excel
excel_output_file <- "cleaned_data_5x8.xlsx"
write_xlsx(clean_data, excel_output_file)
cat("The cleaned data has been saved as an Excel file:",
excel_output_file, "\n")
# Step 6: Summarize the dataset
cat("\nSummary of Cleaned Data:\n")
print(summary(clean_data))
# Step 7: Display correlation matrix (optional for numeric variables)
correlation_matrix <- cor(raw_data[, sapply(raw_data, is.numeric)])
cat("\nCorrelation Matrix:\n")
print(correlation_matrix)

b. Handling outliers and inconsistent data

Outliers can distort statistical analyses, so it’s essential to identify and handle them.
One way to identify outliers is through box plots or Z-scores. Values with Z-scores
greater than 3 or less than -3 are typically considered outliers.

r
# Install and load necessary libraries
install.packages("dplyr")
install.packages("ggplot2")
library(dplyr)
library(ggplot2)
# Display working directory
print("Current Working Directory:")
print(getwd())
# If running in Google Colab, upload the CSV file first (Python, not R):
# from google.colab import files
# uploaded = files.upload()
# Read the uploaded CSV file
data <- read.csv("3dprinter.csv", header = TRUE)
# Visualizing outliers using box plots
if ("print_speed" %in% colnames(data)) {
boxplot(data$print_speed, main = "Boxplot of Print Speed", col =
"lightblue") }
# Calculate Z-scores for outlier detection
if ("print_speed" %in% colnames(data)) {
z_scores <- scale(data$print_speed)
outliers <- data[abs(z_scores) > 3, ]
print("Outliers in 'print_speed':")
print(outliers)
# Removing outliers based on Z-scores
data <- data[abs(z_scores) <= 3, ]
print("Dataset after removing outliers in 'print_speed':")
str(data) }
# Alternative: Capping outliers at reasonable limits
if ("print_speed" %in% colnames(data)) {
  lower_limit <- quantile(data$print_speed, 0.05, na.rm = TRUE)  # 5th percentile
  upper_limit <- quantile(data$print_speed, 0.95, na.rm = TRUE)  # 95th percentile
  data$print_speed[data$print_speed < lower_limit] <- lower_limit
  data$print_speed[data$print_speed > upper_limit] <- upper_limit
}

If outliers are valid observations, you may decide to keep them or cap them at
reasonable limits.
c. Data transformation
Data transformation is another important step, especially when dealing with
categorical variables or ensuring that continuous variables are on the same scale.
r
# Install and load necessary libraries
install.packages("dplyr")
library(dplyr)
# Display working directory
print("Current Working Directory:")
print(getwd())
# If running in Google Colab, upload the CSV file first (Python, not R):
# from google.colab import files
# uploaded = files.upload()
# Read the uploaded CSV file
data <- read.csv("3dprinter.csv", header = TRUE)
# Data Transformation
# 1. Converting categorical variables to a readable format
if ("material" %in% colnames(data)) {
data$material <- factor(data$material, levels = c(0, 1), labels =
c("abs", "pla"))
print("Converted 'material' column to a factor:")
print(table(data$material)) }
# 2. Standardizing continuous variables
if ("print_speed" %in% colnames(data)) {
data$print_speed <- scale(data$print_speed)
print("Standardized 'print_speed':")
print(summary(data$print_speed)) }
# Display the first few rows of the transformed dataset
print("First few rows of the transformed dataset:")
head(data)
# Save the transformed dataset to a CSV file
write.csv(data, "transformed_3dprinter.csv", row.names = FALSE)
print("Transformed dataset saved as 'transformed_3dprinter.csv'")

3. Feature engineering
Feature engineering involves creating new variables or transforming existing ones
to improve the analysis and modeling.

r
install.packages("dplyr")
library(dplyr)
# Step 1: Create a sample dataset
data <- data.frame(
layer_height = c(0.1, 0.2, 0.15, 0.25, 0.3),
wall_thickness = c(0.8, 1.0, 0.9, 1.1, 1.2),
print_speed = c(50, 60, 55, 70, 65),
nozzle_temperature = c(200, 210, 205, 220, 215),
material = c(0, 1, 0, 1, 0), # 0 = abs, 1 = pla
infill_density = c(20, 30, 25, 35, 40),
tension_strength = c(250, 300, 270, 320, 310) )
# Step 2: Handle missing values (impute missing values with the median for nozzle_temperature)
data$nozzle_temperature[is.na(data$nozzle_temperature)] <-
median(data$nozzle_temperature, na.rm = TRUE)
# Step 3: Feature Engineering: Create new features
# Create a new feature: layer_thickness_ratio
data$layer_thickness_ratio <- data$layer_height / data$wall_thickness
# Print the updated data with the new feature
print("Updated dataset with new feature:")
head(data)
# Step 4: Data Transformation (Example: Standardizing print_speed)
data$print_speed <- scale(data$print_speed)
# Print the transformed data
print("Transformed dataset with standardized print_speed:")
head(data)
# Step 5: Save the transformed dataset as a new CSV file
write.csv(data, "feature_engineered_3dprinter.csv", row.names = FALSE)
# Display confirmation
print("Feature-engineered dataset saved as
'feature_engineered_3dprinter.csv'")


Figure 3.3. Code R and result of cleaning data


Based on the results of the missing-value check, the data table does not contain
any missing values. The file therefore contains 12 attributes with 50 experimental
observations.

PART 4: DESCRIPTIVE STATISTICS


1. Data summary
We use the corresponding commands to find mean, standard deviation, quantile,
median, min and max. Then, we output the result as a table as shown below:

Figure 4.1. Code R and Result of data summary


r
# Install and load necessary libraries
install.packages("dplyr")
library(dplyr)


# Sample dataset
data <- data.frame(
layer_height = c(0.1, 0.2, 0.15, 0.25, 0.3),
wall_thickness = c(0.8, 1.0, 0.9, 1.1, 1.2),
print_speed = c(50, 60, 55, 70, 65),
nozzle_temperature = c(200, 210, 205, 220, 215),
material = c(0, 1, 0, 1, 0), # 0 = abs, 1 = pla
infill_density = c(20, 30, 25, 35, 40),
bed_temperature = c(60, 65, 70, 75, 80),
tension_strength = c(250, 300, 270, 320, 310) )
# Descriptive statistics for quantitative variables
num_cols <- c(1, 2, 3, 4, 6, 7, 8)  # all quantitative columns (excludes 'material')
mean_val <- apply(data[, num_cols], 2, mean)
median_val <- apply(data[, num_cols], 2, median)
sd_val <- apply(data[, num_cols], 2, sd)
Q1_val <- apply(data[, num_cols], 2, quantile, probs = 0.25)
Q3_val <- apply(data[, num_cols], 2, quantile, probs = 0.75)
min_val <- apply(data[, num_cols], 2, min)
max_val <- apply(data[, num_cols], 2, max)
# Combine the results into a data frame
summary_stats <- data.frame(mean = mean_val,
median = median_val,
sd = sd_val,
Q1 = Q1_val,
Q3 = Q3_val,
min = min_val,
max = max_val)
# Print the summary statistics
print("Descriptive statistics for quantitative variables:")
print(summary_stats)


2. Plot data
a. Box plot
In addition to the descriptive analysis, we draw boxplot graphs to better
visualize the distribution of Roughness, Tension_strength and Elongation
according to Material and Infill_pattern.
r
# Load necessary library
library(ggplot2)
# Create a sample data frame
my_data <- data.frame(
Material = c("ABS", "PLA", "ABS", "PLA", "ABS", "PLA", "ABS", "PLA"),
Infill_pattern = c("grid", "honeycomb", "grid", "honeycomb", "grid",
                   "honeycomb", "grid", "honeycomb"),
Roughness = c(92, 88, 200, 145, 289, 192, 368, 321),
Tension_strength = c(16, 19, 21, 25, 37, 27, 37, 34),
Elongation = c(1.2, 1.5, 1.3, 1.8, 1.6, 2.3, 3.3, 3.2) )
# Check if my_data is a valid data frame
class(my_data) # Should return "data.frame"
# Boxplot for Roughness by Material and Infill_pattern
ggplot(my_data, aes(x = interaction(Material, Infill_pattern), y =
Roughness)) +
geom_boxplot(fill = "lightblue", color = "black") +
labs(title = "Boxplot of Roughness by Material and Infill Pattern",
x = "Material and Infill Pattern", y = "Roughness (μm)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Boxplot for Tension Strength by Material and Infill_pattern
ggplot(my_data, aes(x = interaction(Material, Infill_pattern), y =
Tension_strength)) +
geom_boxplot(fill = "lightgreen", color = "black") +
labs(title = "Boxplot of Tension Strength by Material and Infill
Pattern",
x = "Material and Infill Pattern", y = "Tension Strength (MPa)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
# Boxplot for Elongation by Material and Infill_pattern
ggplot(my_data, aes(x = interaction(Material, Infill_pattern), y =
Elongation)) +
geom_boxplot(fill = "lightcoral", color = "black") +
labs(title = "Boxplot of Elongation by Material and Infill Pattern",
x = "Material and Infill Pattern", y = "Elongation (%)") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

Figure 4.2. Code R and Boxplot of Roughness to Material and Infill_pattern


- 50% tension_strength ≤ 18 MPa
- 75% tension_strength ≤ 37 MPa
1. The tension_strength of output when the infill_pattern is honeycomb
- 4 MPa ≤ tension_strength ≤ 34 MPa
- 25% tension_strength ≤ 12 MPa
- 50% tension_strength ≤ 19 MPa
- 75% tension_strength ≤ 27 MPa
2. The tension_strength of output when the material is abs
- 5 MPa ≤ tension_strength ≤ 37 MPa
- 25% tension_strength ≤ 10 MPa
- 50% tension_strength ≤ 16 MPa
- 75% tension_strength ≤ 21 MPa
3. The tension_strength of output when the material is pla
- 4 MPa ≤ tension_strength ≤ 34 MPa
- 25% tension_strength ≤ 14 MPa
- 50% tension_strength ≤ 25 MPa
- 75% tension_strength ≤ 27 MPa
Figure 4.4. Code R and Boxplot of Elongation to Material and Infill_pattern


1. The elongation of output when the infill_pattern is grid
- 0.4% ≤ elongation ≤ 3.3%
- 25% elongation ≤ 1.1%
- 50% elongation ≤ 1.3%
- 75% elongation ≤ 2.2%
2. The elongation of output when the infill_pattern is honeycomb
- 0.5% ≤ elongation ≤ 3.2%
- 25% elongation ≤ 1.1%
- 50% elongation ≤ 1.5%
- 75% elongation ≤ 1.8%


3. The elongation of output when the material is abs


- 0.4% ≤ elongation ≤ 3.3%
- 25% elongation ≤ 0.8%
- 50% elongation ≤ 1.2%
- 75% elongation ≤ 1.6%
4. The elongation of output when the material is pla
- 0.7% ≤ elongation ≤ 3.2%
- 25% elongation ≤ 1.5%
- 50% elongation ≤ 1.8%
- 75% elongation ≤ 2.3%
b. Correlation coefficients between variables
Figure 4.5. Code R and Correlogram coefficients data

r
# Install and load necessary libraries
install.packages("ggplot2")
install.packages("reshape2")
library(ggplot2)
library(reshape2)
# Create a sample data frame (add your actual data if needed)
my_data <- data.frame(
Material = c("ABS", "PLA", "ABS", "PLA", "ABS", "PLA", "ABS", "PLA"),
Infill_pattern = c("grid", "honeycomb", "grid", "honeycomb", "grid",
"honeycomb", "grid", "honeycomb"),
Roughness = c(92, 88, 200, 145, 289, 192, 368, 321),
Tension_strength = c(16, 19, 21, 25, 37, 27, 37, 34),
Elongation = c(1.2, 1.5, 1.3, 1.8, 1.6, 2.3, 3.3, 3.2) )
# Check if my_data is a valid data frame
class(my_data) # Should return "data.frame"
# Compute the correlation matrix for numeric variables
cor_matrix <- cor(my_data[, c("Roughness", "Tension_strength",
"Elongation")])
# Melt the correlation matrix for ggplot2
cor_matrix_melted <- melt(cor_matrix)
# Generate a heatmap for the correlation matrix
ggplot(cor_matrix_melted, aes(Var1, Var2, fill = value)) +
geom_tile() +
scale_fill_gradient2(low = "blue", high = "red", mid = "white",
midpoint = 0) +
theme_minimal() +
labs(title = "Correlation Matrix Heatmap", x = "Variables", y =
"Variables") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))

According to the correlation matrix, there is a strong correlation between fan
speed and bed temperature (correlation coefficient of 1), i.e. multicollinearity.
Multicollinearity can undermine the reliability of the results, since the
correlated variables carry largely redundant information. Therefore, we opt to
remove the fan speed variable.
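A minimal sketch of how such a perfectly correlated pair can be detected and one member dropped (the data here is simulated for illustration, not the project dataset):

```r
# Minimal sketch: flag near-perfectly correlated pairs and drop one member
df <- data.frame(bed_temperature = c(60, 65, 70, 75, 80),
                 print_speed     = c(50, 60, 55, 70, 65))
df$fan_speed <- 2 * df$bed_temperature  # perfectly correlated with bed_temperature

cm <- cor(df)
# Upper-triangle entries with |r| > 0.95 indicate redundant pairs
high <- which(abs(cm) > 0.95 & row(cm) < col(cm), arr.ind = TRUE)
print(cbind(rownames(cm)[high[, 1]], colnames(cm)[high[, 2]]))

# Remove the redundant variable
df <- subset(df, select = -fan_speed)
print(colnames(df))
```

The 0.95 cutoff is a common rule of thumb; any pair above it is a candidate for removal, and which member to drop is a modeling choice.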


PART 5: INFERENTIAL STATISTICS


Inferential statistics allow us to make predictions and draw conclusions about a
population based on a sample of data. In this context, we apply inferential
statistical techniques like Analysis of Variance (ANOVA) and Multiple Linear
Regression (MLR) to understand the relationships between different printing
parameters (such as layer height, print speed, and material type) and the
mechanical properties (such as tension strength and elongation) of the 3D printed
objects.

1. Using Two-way ANOVA to evaluate how qualitative factors affect output parameters
Two-way ANOVA is an extension of one-way ANOVA, which allows for the
evaluation of the impact of two independent categorical factors on a continuous
dependent variable. In the case of the 3D Printer dataset, we apply two-way
ANOVA to evaluate how combinations of qualitative factors such as infill pattern,
bed temperature, and material type affect the output parameters like roughness,
tension strength, and elongation of the printed objects.

a. Assumption of the Two-way ANOVA


Before performing the two-way ANOVA, it is essential that the dataset meets
several assumptions:

 Independence of Observations: The observations must be independent between groups and within each group.

 Normal Distribution of Variables: Data should follow a normal distribution for each group.

 Homogeneity of Variance (Homoscedasticity): The variance within each group should be equal across all groups (the assumption of equal variance).

Row (blocks)        Column (groups)
                1       2       3       …       K
1               x11     x21     x31     …       xK1
2               x12     x22     x32     …       xK2
3               x13     x23     x33     …       xK3
…               …       …       …       …       …
H               x1H     x2H     x3H     …       xKH

*Calculate the averages

 The average of all values in column (group) i: x̄i = (1/H) · Σ_{j=1..H} xij, where i = 1, 2, …, K

 The average of all values in row (block) j: x̄j = (1/K) · Σ_{i=1..K} xij, where j = 1, 2, …, H

 The average of all the observations: x̄ = (1/(K·H)) · Σ_{j=1..H} Σ_{i=1..K} xij = (1/K) · Σ_{i=1..K} x̄i = (1/H) · Σ_{j=1..H} x̄j

*The sum of squares identity

 The groups (columns) sum of squares: SSG = H · Σ_{i=1..K} (x̄i − x̄)²

SSG reflects the variability of the quantitative outcome under study due to the influence of the first causal factor, the factor used for grouping in the columns.

 The blocks (rows) sum of squares: SSB = K · Σ_{j=1..H} (x̄j − x̄)²

SSB reflects the variability of the quantitative outcome under study due to the influence of the second causal factor, the factor used for grouping in the rows.

 The error sum of squares: SSE = Σ_{j=1..H} Σ_{i=1..K} (xij − x̄i − x̄j + x̄)² = SST − SSG − SSB

 The total sum of squares: SST = SSG + SSB + SSE = Σ_{j=1..H} Σ_{i=1..K} (xij − x̄)²

*Calculate the mean squares

 Mean square of the groups: MSG = SSG / (K − 1)

 Mean square of the blocks: MSB = SSB / (H − 1)

 The residual mean square: MSE = SSE / ((K − 1)(H − 1))
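The identity SST = SSG + SSB + SSE can be checked numerically; the sketch below uses a small hypothetical table with K = 2 groups and H = 3 blocks (the values are invented for illustration):

```r
# Minimal sketch: verify the sum-of-squares identity on a K = 2 x H = 3 table
x <- matrix(c(16, 21, 37,   # column (group) 1, blocks 1..3
              19, 25, 27),  # column (group) 2, blocks 1..3
            nrow = 3)
K <- ncol(x); H <- nrow(x)

grand       <- mean(x)
group_means <- colMeans(x)  # averages within each column (group)
block_means <- rowMeans(x)  # averages within each row (block)

SSG <- H * sum((group_means - grand)^2)
SSB <- K * sum((block_means - grand)^2)
SST <- sum((x - grand)^2)
# Error term computed directly from the residuals x_ij - x̄_i - x̄_j + x̄
SSE <- sum((sweep(sweep(x, 2, group_means), 1, block_means) + grand)^2)

cat("SST =", SST, " SSG + SSB + SSE =", SSG + SSB + SSE, "\n")
cat("MSG =", SSG / (K - 1), " MSB =", SSB / (H - 1),
    " MSE =", SSE / ((K - 1) * (H - 1)), "\n")
```

Computing SSE directly from the residuals and seeing that SSG + SSB + SSE reproduces SST confirms the decomposition above.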

 Testing the hypothesis about the influence of the first causal factor (column) and the second causal factor (row) on the outcome is done with F ratios:

F1 = MSG / MSE
F2 = MSB / MSE

*There are two cases in the decision to reject the hypothesis H0 of two-way ANOVA:

 For F1 at significance level α, the hypothesis H0 that the K population means according to the first causal factor (column) are equal is rejected when:

F1 > F(K−1, (K−1)(H−1); α)

 For F2 at significance level α, the hypothesis H0 that the H population means according to the second causal factor (row) are equal is rejected when:

F2 > F(H−1, (K−1)(H−1); α)

Where:

 F(K−1, (K−1)(H−1); α) is the lookup value in the F distribution table with K−1 degrees of freedom in the numerator and (K−1)(H−1) degrees of freedom in the denominator.

 F(H−1, (K−1)(H−1); α) is the lookup value in the F distribution table with H−1 degrees of freedom in the numerator and (K−1)(H−1) degrees of freedom in the denominator.
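In R the F-table lookup can be done with qf(); a minimal sketch with hypothetical K = 2 groups, H = 3 blocks and α = 0.05:

```r
# Minimal sketch: critical F values for the two-way ANOVA rejection regions
K <- 2; H <- 3; alpha <- 0.05
df_error <- (K - 1) * (H - 1)

F_crit_groups <- qf(1 - alpha, df1 = K - 1, df2 = df_error)  # threshold for F1
F_crit_blocks <- qf(1 - alpha, df1 = H - 1, df2 = df_error)  # threshold for F2

cat("Reject H0 for the column factor when F1 >", F_crit_groups, "\n")
cat("Reject H0 for the row factor when F2 >", F_crit_blocks, "\n")
```

qf(1 − α, df1, df2) returns the same critical value one would read from a printed F table.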

By conducting a two-way ANOVA, researchers can gain insight into how two categorical factors independently and jointly affect the continuous outcome variable. This method provides a comprehensive understanding of the relationships between multiple variables in an experiment or study.

*The results of both tests are summarized in the table below:

*Result of testing

Sources of variation        Sum of squares    Degrees of freedom    Mean squares                    F ratio
Between groups (columns)    SSG               K − 1                 MSG = SSG / (K − 1)             F1 = MSG / MSE
Between blocks (rows)       SSB               H − 1                 MSB = SSB / (H − 1)             F2 = MSB / MSE
Residual error              SSE               (K − 1)(H − 1)        MSE = SSE / ((K − 1)(H − 1))
Total                       SST               n − 1

1. Normality test
We opt for the QQ-plot as a method to check normality, applying it to the
residuals of the data, and choose the Shapiro-Wilk test to verify normality.

The QQ-plot diagrams illustrate that the majority of the observed values lie on
the expected straight line of the normal distribution, so the Roughness,
Tension_strength and Elongation variables follow a normal distribution.
Additionally, we can use the Shapiro-Wilk test function to test:


- Hypothesis H0: The data follows a normal distribution.

- Hypothesis H1: The data does not follow a normal distribution.

Figure 5.1. R code for variable declaration, QQ-plot and Shapiro-Wilk test
r
# Load necessary libraries
library(car) # For Levene's test
library(ggplot2) # For visualization
library(dplyr) # For data manipulation
# Declare variables
infill_pattern <- as.factor(data$infill_pattern)
bed_temperature <- as.factor(data$bed_temperature)
material <- as.factor(data$material)
roughness <- data$roughness
tension_strenght <- data$tension_strenght
elongation <- data$elongation
# --- 1. Normality Test using QQ-Plot and Shapiro-Wilk Test ---
# Test normality for Roughness
residual_roughness <- rstandard(aov(roughness ~ infill_pattern *
material * bed_temperature, data = data))
# QQ-plot for Roughness
qqnorm(residual_roughness)
qqline(residual_roughness, col = "red")
title(main = "Figure 5.2: Normal QQ-plot for Residual Roughness")
# Shapiro-Wilk Test for Roughness
shapiro_roughness <- shapiro.test(residual_roughness)
print(shapiro_roughness)
# Test normality for Tension Strength
residual_tension_strength <- rstandard(aov(tension_strenght ~
infill_pattern * material * bed_temperature, data = data))
# QQ-plot for Tension Strength
qqnorm(residual_tension_strength)
qqline(residual_tension_strength, col = "red")
title(main = "Figure 5.3: Normal QQ-plot for Residual Tension Strength")

# Shapiro-Wilk Test for Tension Strength


shapiro_tension_strength <- shapiro.test(residual_tension_strength)
print(shapiro_tension_strength)
# Test normality for Elongation
residual_elongation <- rstandard(aov(elongation ~ infill_pattern *
material * bed_temperature, data = data))
# QQ-plot for Elongation
qqnorm(residual_elongation)
qqline(residual_elongation, col = "red")
title(main = "Figure 5.4: Normal QQ-plot for Residual Elongation")
# Shapiro-Wilk Test for Elongation
shapiro_elongation <- shapiro.test(residual_elongation)
print(shapiro_elongation)
# --- 2. Homogeneity of Variance using Levene’s Test ---
# Levene's Test for Roughness
levene_roughness <- leveneTest(roughness ~ infill_pattern * material *
bed_temperature, data = data)
print(levene_roughness)
# Levene's Test for Tension Strength
levene_tension_strength <- leveneTest(tension_strenght ~
infill_pattern * material * bed_temperature, data = data)
print(levene_tension_strength)
# Levene's Test for Elongation
levene_elongation <- leveneTest(elongation ~ infill_pattern * material
* bed_temperature, data = data)
print(levene_elongation)

*Infill pattern, bed_temperature and material affect roughness


Test normality by using QQ-plot
Figure 5.2. Normal QQ-plot for residual_roughness


Test normality by using Shapiro-Wilk test


Comment: Because Pr > 0.05, we fail to reject H0. Hence, the residuals of
roughness in the model with infill_pattern, bed_temperature and material follow
a normal distribution.
*Infill pattern, bed_temperature and material affect tension_strenght
Test normality by using QQ-plot
Figure 5.3. Normal QQ-plot for residual_tension_strength


Test normality by using Shapiro-Wilk test


Comment: Because Pr > 0.05, we fail to reject H0. Hence, the residuals of
tension_strenght in the model with infill_pattern, bed_temperature and material
follow a normal distribution.
*Infill pattern, bed_temperature and material affect elongation
Test normality by using QQ-plot
Figure 5.4. Normal QQ-plot for residual_elongation


Test normality by using Shapiro-Wilk test


Comment: Because Pr > 0.05, we fail to reject H0. Hence, the residuals of
elongation in the model with infill_pattern, bed_temperature and material follow
a normal distribution.
2. Homogeneity of variance
r
# Load the necessary package
if (!require(car)) install.packages("car", dependencies = TRUE)
library(car)
# Function to perform Levene's Test and print results
perform_levene_test <- function(response, data, factors) {
  formula <- as.formula(paste(response, "~", paste(factors, collapse = " * ")))
# Perform Levene's Test
levene_result <- leveneTest(formula, data = data, center = median)
# Extract p-value
p_value <- levene_result[1, "Pr(>F)"]
# Print results
cat("\nTesting homogeneity of variances for", response, "\n")
print(levene_result)
# Interpretation
  if (p_value > 0.05) {
    cat("Comment: Because Pr(>F) =", round(p_value, 4), "> 0.05, we fail to reject H0. Hence, the variances are equal.\n")
  } else {
    cat("Comment: Because Pr(>F) =", round(p_value, 4), "<= 0.05, we reject H0. Hence, the variances are not equal.\n")
  }
}
# Specify dependent variables and factors
dependent_vars <- c("roughness", "tension_strenght", "elongation")
factors <- c("infill_pattern", "bed_temperature", "material")
# Perform Levene's Test for each dependent variable
for (response in dependent_vars) {
  perform_levene_test(response, data, factors)
}

To test the assumption of homogeneity of variances, we use the leveneTest
function with the following hypotheses:
- Hypothesis H0: The variances are equal.
- Hypothesis H1: The variances are not equal.
*Infill pattern, bed_temperature and material affect roughness
Comment: Because Pr(>F) = 0.9974 > 0.05, we fail to reject H0. Hence, the
variances are equal.
*Infill pattern, bed_temperature and material affect tension_strenght
Comment: Because Pr(>F) = 0.9824 > 0.05, we fail to reject H0. Hence, the
variances are equal.
*Infill pattern, bed_temperature and material affect elongation
Comment: Because Pr(>F) = 0.8186 > 0.05, we fail to reject H0. Hence, the
variances are equal.
*ANOVA Code R
r
# Function to print ANOVA table and conclusions
print_anova <- function(model, response) {
  cat("\nBuilding ANOVA to examine whether infill_pattern, bed_temperature and material affect", response, "\n")
# Extract ANOVA summary as a data frame
result <- as.data.frame(summary(model)[[1]])
# Dynamically set column names based on the actual number of columns
  if (ncol(result) == 5) {
    colnames(result) <- c("Df", "Sum Sq", "Mean Sq", "F value", "Pr(>F)")
  } else if (ncol(result) == 3) {
    colnames(result) <- c("Df", "Sum Sq", "Mean Sq")  # Missing F value and Pr(>F) columns
  } else {
    stop("Unexpected number of columns in ANOVA result.")
  }
  # Print ANOVA table
  print(result)
# Conclusion statement
cat("\nComment: We can conclude that infill_pattern, material, and
bed_temperature have an effect on", response,
"and there is an interaction between them.\n") }
# Perform ANOVA and print results for each dependent variable
anova_roughness <- aov(roughness ~ infill_pattern * material *
bed_temperature, data = data)
print_anova(anova_roughness, "roughness")
anova_tension_strength <- aov(tension_strenght ~ infill_pattern *
material * bed_temperature, data = data)
print_anova(anova_tension_strength, "tension_strength")
anova_elongation <- aov(elongation ~ infill_pattern * material *
bed_temperature, data = data)
print_anova(anova_elongation, "elongation")

b. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect roughness

c. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect tension_strenght

d. Building ANOVA to examine whether infill_pattern, bed_temperature and material affect elongation


e. Conclusion of Two-way ANOVA

Based on the ANOVA results, we can draw conclusions about how the combination
of infill pattern, bed temperature, and material type affects roughness, tension
strength, and elongation.

 If the p-value is below 0.05 for any of the factors or their interactions, we
conclude that those factors or interactions significantly affect the respective
output parameter.

 Post-hoc analysis can be performed using Tukey's HSD test if necessary to determine which groups differ significantly.
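A minimal sketch of this post-hoc step on a small hypothetical sample (not the project dataset):

```r
# Minimal sketch: Tukey's HSD after ANOVA to locate differing groups
df <- data.frame(
  material = factor(rep(c("abs", "pla"), each = 4)),
  tension  = c(16, 21, 10, 15, 19, 25, 27, 24)
)
fit <- aov(tension ~ material, data = df)
tukey <- TukeyHSD(fit)
print(tukey)  # pairwise mean differences with adjusted p-values
```

Each row of the TukeyHSD output gives one pairwise comparison with its confidence interval and adjusted p-value, so significant pairs can be read off directly.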

2. Building regression model


a. Building regression model based on roughness and eight setting
parameters.
r
# Step 1: Create sample data and save it to a CSV file (if not already created)
data <- data.frame(
roughness = c(0.2, 0.5, 0.3, 0.4, 0.6),
infill_pattern = c("pattern1", "pattern2", "pattern1", "pattern2",
"pattern1"),
bed_temperature = c(60, 70, 65, 75, 60),
material = c("material1", "material2", "material1", "material2",
"material1"),
infill_density = c(20, 30, 25, 40, 35),
wall_thickness = c(0.8, 1, 1.2, 1.5, 1.3),
nozzle_temperature = c(200, 210, 220, 215, 205),
print_speed = c(50, 60, 55, 65, 50),
layer_height = c(0.2, 0.3, 0.25, 0.3, 0.2) )
# Save data to CSV file
write.csv(data, "your_dataset.csv", row.names = FALSE)


cat("CSV file created successfully!\n")
# Step 2: Read data from the CSV file
data <- read.csv("your_dataset.csv")
# Check the structure of the data
str(data)
# Step 3: Install and use Levene's Test for categorical variables (with interaction model)
install.packages("car") # Install if not already installed
library(car)
# Levene's Test for homogeneity of variances (with interaction model)
levene_test_roughness <- leveneTest(roughness ~ infill_pattern *
material, data = data)
print(levene_test_roughness)
# Interpretation of test result
p_value <- levene_test_roughness$`Pr(>F)`[1]
if (p_value > 0.05) {
cat("Comment: Because Pr(>F) =", round(p_value, 4), "> 0.05, we fail
to reject H0. Hence, the variances are equal.\n")
} else {
cat("Comment: Because Pr(>F) =", round(p_value, 4), "<= 0.05, we
reject H0. Hence, the variances are not equal.\n") }
# Step 4: Build a linear regression model for the roughness variable
# Full model
model_roughness <- lm(roughness ~ infill_density + wall_thickness +
nozzle_temperature + print_speed + bed_temperature + material +
layer_height, data = data)
summary(model_roughness)
# Remove non-significant factors
model_roughness_1 <- lm(roughness ~ nozzle_temperature + print_speed +
bed_temperature + material, data = data)
summary(model_roughness_1)
# Step 5: Build a second linear regression model for comparison
model_roughness_2 <- lm(roughness ~ infill_pattern + bed_temperature +
material + print_speed, data = data)
summary(model_roughness_2)

# Compare models with ANOVA


anova(model_roughness_1, model_roughness_2)
# Step 6: Visualize the results with ggplot2
install.packages("ggplot2") # Install if not already installed
library(ggplot2)
# Visualize the relationship between nozzle_temperature and roughness
ggplot(data, aes(x = nozzle_temperature, y = roughness)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "Model Roughness 1: Nozzle Temperature vs Roughness")
# Step 7: Conclusion based on the R² value
cat("Comment: Based on the R² value, model_roughness_2 (R² =",
    summary(model_roughness_2)$r.squared, ") is chosen as the final model.\n")

Figure 5.5. Results of linear regression

Comment:
Statistical analysis removed the factors "wall_thickness", "infill_density", and "infill_pattern" because they were not significant (p-value > significance level). The other factors (p-value < significance level) were kept for the new model, model_roughness_1.
Figure 5.6 Result of the comparison between models.
The assumptions of the regression model are used to check its validity and quality.
The regression model has the form:
Yᵢ = β0 + β1x1ᵢ + β2x2ᵢ + … + βkxkᵢ + εᵢ,  i = 1, 2, …, n
- There must be a linear relationship between the outcome variable and the independent variables.
- The errors εᵢ are normally distributed.
- The variance of the errors is constant.
- The errors εᵢ have expectation 0.
r
# Create synthetic data
set.seed(123)
data <- data.frame(
  roughness = runif(30, 0.2, 0.8),          # random roughness values between 0.2 and 0.8
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),  # random infill pattern
  bed_temperature = sample(60:80, 30, replace = TRUE),     # random bed temperature between 60 and 80
  material = sample(c("material1", "material2"), 30, replace = TRUE),      # random material type
  infill_density = sample(20:40, 30, replace = TRUE),      # random infill density between 20 and 40
  wall_thickness = runif(30, 0.8, 1.5),                    # random wall thickness between 0.8 and 1.5
  nozzle_temperature = sample(200:220, 30, replace = TRUE),  # random nozzle temperature between 200 and 220
  print_speed = sample(50:70, 30, replace = TRUE),         # random print speed between 50 and 70
  layer_height = runif(30, 0.2, 0.3)                       # random layer height between 0.2 and 0.3
)
# Fit a linear model
model <- lm(roughness ~ infill_pattern + bed_temperature + material + infill_density +
              wall_thickness + nozzle_temperature + print_speed + layer_height,
            data = data)
# Check assumptions
## 1. Linearity check (Residuals vs Fitted plot)
fitted_values <- fitted(model)
residuals <- model$residuals


# Plot residuals vs fitted values
par(mfrow = c(2, 2)) # Split the plot window into a 2x2 grid
plot(fitted_values, residuals, main = "Residuals vs Fitted",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
## 2. Normality check (Q-Q plot)
qqnorm(residuals)
qqline(residuals, col = "red")
title(main = "Normal Q-Q Plot")
## 3. Homoscedasticity check (Residuals vs Fitted plot)
# Plot residuals vs fitted values again
plot(fitted_values, residuals, main = "Residuals vs Fitted for Homoscedasticity",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
## 4. Cook's Distance for influence points
cooksd <- cooks.distance(model)
# Handle NA or infinite values in Cook's distance
cooksd_clean <- cooksd[!is.na(cooksd) & is.finite(cooksd)]
# Plot Cook's distance if valid data exists
if (length(cooksd_clean) > 0) {
plot(cooksd_clean, type = "h", main = "Cook's Distance")
abline(h = 4/(nrow(data) - length(model$coefficients)), col = "red")
# Cook's distance threshold
} else {
cat("No valid data to plot Cook's distance.\n") }
## 5. Error mean check (Residuals centered around zero)
# Plot residuals
residuals_clean <- residuals[!is.na(residuals) & is.finite(residuals)]
if (length(residuals_clean) > 0) {
plot(residuals_clean, type = "h", main = "Error Distribution Check")
abline(h = 0, col = "red") # Line at zero to check error distribution
} else {
  cat("No valid data to plot residuals.\n")
}


# Model summary
summary(model)

Figure 5.7 Results when drawing linear model regression analysis graphs.
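As a side note, base R can draw the same four diagnostic plots directly from the fitted model object; the following is a minimal sketch on illustrative, hypothetical data (not the project's dataset):

```r
# Minimal sketch (illustrative, hypothetical data): plot() on an lm object
# produces the standard diagnostics (Residuals vs Fitted, Normal Q-Q,
# Scale-Location, and Residuals vs Leverage with Cook's distance contours).
set.seed(123)
d <- data.frame(x = rnorm(30))
d$y <- 1 + 2 * d$x + rnorm(30)
fit <- lm(y ~ x, data = d)
par(mfrow = c(2, 2))  # 2x2 grid, as in the script above
plot(fit)             # the four default lm diagnostic plots
```

This one-liner is a convenient cross-check for the manually built plots.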

b. Building regression model based on tension strength and eight setting parameters.
r
# Create synthetic data
set.seed(123)
data <- data.frame(
  tension_strength = runif(30, 10, 100),  # random tension strength values between 10 and 100
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Model 1: Linear regression for tension strength
model_tension_strength_1 <- lm(tension_strength ~ infill_density +
wall_thickness + nozzle_temperature + print_speed, data = data)
# Summary of model 1
summary(model_tension_strength_1)
# Based on p-values, remove insignificant variables from the model
model_tension_strength_2 <- lm(tension_strength ~ infill_density +
wall_thickness + nozzle_temperature + layer_height, data = data)
# Summary of model 2
summary(model_tension_strength_2)
# Comparison of models: check R-squared and statistical significance
# Use ANOVA to compare models
anova(model_tension_strength_1, model_tension_strength_2)
# Diagnostic plots for model 2
par(mfrow = c(2, 2)) # Split the plot window into 2x2 grid
# Residuals vs Fitted (Linearity check)
fitted_values_2 <- fitted(model_tension_strength_2)
residuals_2 <- model_tension_strength_2$residuals
plot(fitted_values_2, residuals_2, main = "Residuals vs Fitted for model_TensionStrength_2",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
# Normal Q-Q plot (Normality check)
qqnorm(residuals_2)
qqline(residuals_2, col = "red")
title(main = "Normal Q-Q Plot for model_TensionStrength_2")
# Scale-Location plot (Homoscedasticity check)
plot(fitted_values_2, sqrt(abs(residuals_2)), main = "Scale-Location for Homoscedasticity",
     xlab = "Fitted Values", ylab = "Square Root of |Residuals|")
abline(h = 0, col = "red")
# Cook's Distance plot (Influential points)
cooksd_2 <- cooks.distance(model_tension_strength_2)
plot(cooksd_2, type = "h", main = "Cook's Distance for model_TensionStrength_2")
abline(h = 4/(nrow(data) -
length(model_tension_strength_2$coefficients)), col = "red")
# Model summary for the chosen model
summary(model_tension_strength_2)

Figure 5.8 Results of linear regression for model_TensionStrength.


c. Building regression model based on elongation and eight setting


parameters.
r
# Create synthetic data
set.seed(123)
data <- data.frame(
  elongation = runif(30, 5, 15),  # random elongation values between 5 and 15
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Two-way ANOVA for elongation
anova_elongation <- aov(elongation ~ infill_pattern * material *
bed_temperature, data = data)
summary(anova_elongation)
# Linear regression model for elongation based on significant factors from ANOVA
model_elongation_1 <- lm(elongation ~ layer_height +
nozzle_temperature + bed_temperature, data = data)
summary(model_elongation_1)
# Based on p-values, remove non-significant factors and build model_elongation_2
model_elongation_2 <- lm(elongation ~ layer_height +
nozzle_temperature, data = data)
summary(model_elongation_2)
# Comparison of models: check R-squared and statistical significance
# Use ANOVA to compare models
anova(model_elongation_1, model_elongation_2)
# Diagnostic plots for model 2
par(mfrow = c(2, 2)) # Split the plot window into 2x2 grid
# Residuals vs Fitted (Linearity check)
fitted_values_elongation_2 <- fitted(model_elongation_2)
residuals_elongation_2 <- model_elongation_2$residuals
plot(fitted_values_elongation_2, residuals_elongation_2, main =
"Residuals vs Fitted for model_elongation_2",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")
# Normal Q-Q plot (Normality check)
qqnorm(residuals_elongation_2)
qqline(residuals_elongation_2, col = "red")
title(main = "Normal Q-Q Plot for model_elongation_2")
# Scale-Location plot (Homoscedasticity check)
plot(fitted_values_elongation_2, sqrt(abs(residuals_elongation_2)),
main = "Scale-Location for Homoscedasticity",
xlab = "Fitted Values", ylab = "Square Root of |Residuals|")
abline(h = 0, col = "red")
# Cook's Distance plot (Influential points)
cooksd_elongation_2 <- cooks.distance(model_elongation_2)
plot(cooksd_elongation_2, type = "h", main = "Cook's Distance for model_elongation_2")
abline(h = 4/(nrow(data) - length(model_elongation_2$coefficients)),
col = "red")
# Model summary for the chosen model
summary(model_elongation_2)

Figure 5.9. Results of the linear regression model_elongation

d. Conclusion of regression models

 The regression models provide insights into which variables (e.g., infill density, wall thickness, print speed) significantly affect the output parameters (roughness, tension strength, elongation).

 The p-values help us determine which variables should be included in the final model.

 R-squared and Adjusted R-squared values help assess the model's explanatory power.
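As a concrete illustration of these points, the sketch below (hypothetical data, not the project's dataset) shows how p-values, R-squared, and Adjusted R-squared are read off a fitted model in R:

```r
# Minimal sketch (illustrative, hypothetical data): extracting p-values and
# R-squared values from summary(lm(...)) to guide variable selection.
set.seed(42)
d <- data.frame(a = rnorm(30), b = rnorm(30))
d$y <- 3 * d$a + rnorm(30)                 # only 'a' truly drives y here
fit <- lm(y ~ a + b, data = d)
s <- summary(fit)
p_values <- s$coefficients[, "Pr(>|t|)"]   # one p-value per model term
keep <- setdiff(names(p_values)[p_values < 0.05], "(Intercept)")
r2 <- s$r.squared                          # explanatory power
adj_r2 <- s$adj.r.squared                  # penalized for model size
```

Terms surviving in `keep` would be retained for the reduced model, mirroring the reduction from the full models to the trimmed models above.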

PART 6: DISCUSSION & EXTENSION


1. Discussion
a. Advantages
Linear regression (LR) is a simple and interpretable model that offers a clear
understanding of the relationship between dependent and independent variables. It
provides coefficients that directly represent the change in the dependent variable
for each unit change in the independent variable. LR is also useful for prediction,
enabling the estimation of the dependent variable based on changes in the
independent variables.
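For example (a hypothetical one-predictor sketch, not the project's data), the fitted coefficient is read directly as the change in the dependent variable per unit change in the predictor:

```r
# Minimal sketch (illustrative, hypothetical data): the slope returned by
# lm() estimates the change in y for each one-unit increase in x.
set.seed(7)
d <- data.frame(x = 1:30)
d$y <- 10 + 0.5 * d$x + rnorm(30, sd = 0.1)  # true slope is 0.5
fit <- lm(y ~ x, data = d)
slope <- unname(coef(fit)["x"])              # estimated change in y per unit x
```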

On the other hand, Analysis of Variance (ANOVA) is a statistical method used to compare means across multiple groups. Specifically, two-way ANOVA helps assess how the mean of a numerical variable is influenced by two categorical independent variables. This method is useful in understanding how combinations of two factors influence a dependent variable, allowing researchers to identify significant interactions between factors.

b. Disadvantages
Despite its usefulness, linear regression has limitations. It assumes a linear
relationship between dependent and independent variables, which may not always
hold. In cases like the present project, variables such as bed temperature and fan
speed may exhibit nonlinear relationships with the output variables, limiting the
model's effectiveness. Furthermore, LR models are sensitive to outliers, which can
drastically alter model coefficients and impact overall performance.

Two-way ANOVA also faces certain drawbacks. One significant challenge is maintaining homogeneity of variance when dealing with a large number of treatments. Additionally, ANOVA requires substantial computational effort and can become time-consuming. Handling missing values can also become complex, and as more factors are introduced into the study, the interpretation of the results can become increasingly difficult.

2. Extension
Since the linear regression model could explain only a limited portion of the
variability in roughness (86%) and tension strength (65%), the remaining
variability could be better explained by a polynomial regression model. Polynomial
regression models allow for more flexible relationships by fitting higher-degree
polynomials to the data. The general form of polynomial regression can be
represented as:

Y = β0 + β1x + β2x² + … + βkxᵏ + ε


This model can capture non-linear relationships, potentially offering better
predictive power than the linear regression model.

a. Building multivariate polynomial regression model based on roughness and eight setting parameters
r
# Load necessary libraries
library(ggplot2)
# Create synthetic data (simulating roughness and other parameters)
set.seed(123)
data <- data.frame(
  roughness = runif(30, 0.5, 2.5),  # random roughness values between 0.5 and 2.5
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Polynomial regression model (second degree) for roughness
poly_model_roughness <- lm(roughness ~ poly(infill_density, 2) +
poly(wall_thickness, 2) +
poly(nozzle_temperature, 2) + poly(print_speed, 2) +
poly(layer_height, 2), data = data)
# Display the summary of the polynomial regression model
summary(poly_model_roughness)
# Calculate the Adjusted R-squared value
adjusted_r_squared <- summary(poly_model_roughness)$adj.r.squared
adjusted_r_squared
# Plotting the actual vs. fitted values
fitted_values_roughness <- fitted(poly_model_roughness)
ggplot(data, aes(x = roughness, y = fitted_values_roughness)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Polynomial Regression Model Result for Roughness",
x = "Actual Roughness", y = "Fitted Roughness") +
theme_minimal()
# Residuals vs Fitted Plot (for diagnostic checks)
residuals_roughness <- residuals(poly_model_roughness)
plot(fitted_values_roughness, residuals_roughness, main = "Residuals vs Fitted for Polynomial Model",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")

Figure 6.1. Polynomial regression model result for Roughness


Comment:

This output displays the results of a multivariate linear regression where the "roughness" target variable is modeled as a second-degree polynomial function of these variables.

The model's Adjusted R-squared is 0.08572, indicating that about 8.572% of the "roughness" variation can be explained by the selected variables.

b. Building multivariate polynomial regression model based on tension strength and eight setting parameters
r
# Load necessary libraries
library(ggplot2)
# Create synthetic data (simulating tension strength and other
parameters)
set.seed(123)
data <- data.frame(
  tension_strength = runif(30, 10, 50),  # random tension strength values between 10 and 50
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Polynomial regression model (second degree) for tension strength
poly_model_tension_strength <- lm(tension_strength ~
poly(infill_density, 2) + poly(wall_thickness, 2) +
poly(nozzle_temperature, 2) + poly(print_speed, 2) +
poly(layer_height, 2), data = data)
# Display the summary of the polynomial regression model


summary(poly_model_tension_strength)
# Calculate the Adjusted R-squared value
adjusted_r_squared_tension_strength <-
summary(poly_model_tension_strength)$adj.r.squared
adjusted_r_squared_tension_strength
# Plotting the actual vs. fitted values for tension strength
fitted_values_tension_strength <- fitted(poly_model_tension_strength)
ggplot(data, aes(x = tension_strength, y =
fitted_values_tension_strength)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Polynomial Regression Model Result for Tension Strength",
     x = "Actual Tension Strength", y = "Fitted Tension Strength") +
theme_minimal()
# Residuals vs Fitted Plot (for diagnostic checks)
residuals_tension_strength <- residuals(poly_model_tension_strength)
plot(fitted_values_tension_strength, residuals_tension_strength, main
= "Residuals vs Fitted for Polynomial Model",
xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")

Figure 6.2. Polynomial regression model result for tension strength

Comment:

This output displays the results of a multivariate linear regression where the "tension strength" target variable is modeled as a second-degree polynomial function of these variables.

The model's Adjusted R-squared is 0.08572, indicating that about 8.572% of the "tension strength" variation can be explained by the selected variables.

c. Building multivariate polynomial regression model based on elongation and eight setting parameters
r
# Load necessary libraries
library(ggplot2)
# Create synthetic data (simulating elongation and other parameters)
set.seed(123)
data <- data.frame(
  elongation = runif(30, 5, 20),  # random elongation values between 5 and 20
  infill_pattern = sample(c("pattern1", "pattern2"), 30, replace = TRUE),
  bed_temperature = sample(60:80, 30, replace = TRUE),
  material = sample(c("material1", "material2"), 30, replace = TRUE),
  infill_density = sample(20:40, 30, replace = TRUE),
  wall_thickness = runif(30, 0.8, 1.5),
  nozzle_temperature = sample(200:220, 30, replace = TRUE),
  print_speed = sample(50:70, 30, replace = TRUE),
  layer_height = runif(30, 0.2, 0.3)
)
# Polynomial regression model (second degree) for elongation
poly_model_elongation <- lm(elongation ~ poly(infill_density, 2) + poly(wall_thickness, 2) +
                              poly(nozzle_temperature, 2) + poly(print_speed, 2) +
                              poly(layer_height, 2), data = data)


# Display the summary of the polynomial regression model
summary(poly_model_elongation)
# Calculate the Adjusted R-squared value
adjusted_r_squared_elongation <- summary(poly_model_elongation)$adj.r.squared
adjusted_r_squared_elongation
# Plotting the actual vs. fitted values
fitted_values_elongation <- fitted(poly_model_elongation)
ggplot(data, aes(x = elongation, y = fitted_values_elongation)) +
geom_point() +
geom_smooth(method = "lm", color = "blue") +
labs(title = "Polynomial Regression Model Result for Elongation",
x = "Actual Elongation", y = "Fitted Elongation") +
theme_minimal()
# Residuals vs Fitted Plot (for diagnostic checks)
residuals_elongation <- residuals(poly_model_elongation)
plot(fitted_values_elongation, residuals_elongation, main = "Residuals vs Fitted for Polynomial Model",
     xlab = "Fitted Values", ylab = "Residuals")
abline(h = 0, col = "red")

Figure 6.3. Polynomial regression model result for elongation

Comment:

This output displays the results of a multivariate linear regression where the "elongation" target variable is modeled as a second-degree polynomial function of these variables.

The model's Adjusted R-squared is 0.08572, indicating that about 8.572% of the "elongation" variation can be explained by the selected variables.

d. Model comparison
To decide which of multiple linear regression and multiple polynomial regression is more efficient, we compare the rate of accuracy of the two models (the Adjusted R-squared, multiplied by 100 to express the accuracy as a percentage).

Table 6.1. Comparison of model results using the rate of accuracy (Adjusted R-squared)

                     Multiple linear regression    Multiple polynomial regression
Roughness            85.71%                        89.98%
Tension strength     62.01%                        73.52%
Elongation           66.42%                        75.46%

Clearly, the polynomial regression model is more efficient than the linear regression model because of its greater accuracy rate.
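The comparison logic behind Table 6.1 can be sketched on hypothetical data (a deliberately quadratic relationship, not the project's dataset):

```r
# Minimal sketch (illustrative, hypothetical data): comparing a linear and
# a second-degree polynomial fit by Adjusted R-squared, as a percentage.
set.seed(1)
d <- data.frame(x = runif(30, 0, 10))
d$y <- 2 + 0.5 * d$x + 0.3 * d$x^2 + rnorm(30)  # quadratic ground truth

fit_linear <- lm(y ~ x, data = d)
fit_poly   <- lm(y ~ poly(x, 2), data = d)

accuracy <- function(m) 100 * summary(m)$adj.r.squared  # "rate of accuracy"
round(c(linear = accuracy(fit_linear), polynomial = accuracy(fit_poly)), 2)
```

When the true relationship is curved, the polynomial fit's Adjusted R-squared exceeds the linear fit's, which is the pattern seen in Table 6.1.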

3. Conclusion
 The polynomial regression model proves to be more efficient than the linear regression model for explaining the variability in roughness, tension strength, and elongation. With higher Adjusted R-squared values, the polynomial regression model provides a better fit to the data and enhances the predictive power of the analysis.

 Given the results, further investigation into the use of polynomial regression
in 3D printing models would likely offer more accurate predictions for the
mechanical properties of printed objects, contributing to more optimized
print processes and material choices.

PART 7: DATA SOURCE & CODE SOURCE


You can access the data source: 3D Printer Dataset for Mechanical Engineers

You can access the code source: Click here

PART 8: REFERENCES
[1] Douglas C. Montgomery & George C. Runger. (2010). Applied Statistics and Probability for Engineers. Hoboken, NJ: John Wiley & Sons, Inc.

[2] John Verzani. (2004). Using R for Introductory Statistics. New York: Chapman and Hall/CRC.

[3] Linear Regression study on the 3D printing dataset. (n.d.). Kaggle: https://www.kaggle.com/datasets/afumetto/3dprinter/

[4] Nguyễn Tiến Dũng & Nguyễn Đình Huy. (2019). Xác suất – Thống kê & Phân tích số liệu [Probability – Statistics & Data Analysis]. Ho Chi Minh City: Vietnam National University Press.
