This section answers the research questions following a statistical analysis of the data. The DATAtab statistical calculator was used for data analysis. Box plots were chosen for data visualisation, as they provide a clear representation of the descriptive statistics, as well as the outliers, indicated by individual points that fall outside the box and whiskers. Correlation analysis was performed using Jupyter Notebook.
7.1. The Short-Term Effects of the Blue Yeti GBL and Answer for Research Question RQ1 (a)
The participants in the experimental group were required to complete two multiple-choice tests consisting of 15 questions each, administered immediately before and after the gameplay session. These tests aimed to assess the immediate impact of the game by evaluating the number of correct answers. The tests were marked by two experienced teachers.
All 31 participants completed both tests; their descriptive statistics are presented in Table 2.
The mean score increased from the pre-test to the post-test, suggesting a performance improvement. The median also increased, which indicates that the central tendency of the scores moved upwards and reinforces the result for the mean. The mode was 5 in both tests; in the post-test, however, this most frequent score does not represent the distribution well, given the higher mean. The standard deviation increased in the post-test, suggesting that the scores varied more after the intervention than before. Among the 31 participants, 18 students improved in the post-test, with a median improvement of three points and a maximum increase of eight points. Two participants obtained identical scores in both tests. The remaining 11 students scored one or two points lower in the post-test. This variation may be attributed to factors such as individual learning styles or fatigue. The post-test scores ranged from 3 to 15, whereas the pre-test scores ranged from 1 to 10, which further supports the trend of improvement.
The mean ± standard deviation interval contains approximately 68% of the data points, provided that the data follow a normal distribution. In the pre-test, this typical score range was 2.91 to 7.55, so while some participants had lower scores, most scores clustered around the mean of 5.23. In the post-test, the typical score range was 4.37 to 10.27. The upward shift of this interval shows that many participants achieved higher scores after the intervention.
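The typical score ranges quoted above can be reproduced with a short calculation. The following is a minimal Python sketch: the function name `one_sd_interval` is illustrative, and the standard deviations (2.32 and 2.95) and the post-test mean (7.32) are inferred from the quoted intervals rather than read from Table 2.

```python
def one_sd_interval(mean, sd):
    """Return the (lower, upper) bounds of mean ± one standard deviation."""
    return round(mean - sd, 2), round(mean + sd, 2)

# Values inferred from the intervals quoted in the text (an assumption,
# not figures copied from Table 2).
pre_interval = one_sd_interval(5.23, 2.32)
post_interval = one_sd_interval(7.32, 2.95)

print(pre_interval)
print(post_interval)
```

The 68% interpretation of these intervals holds only to the extent that the scores are approximately normally distributed.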
Figure 7 illustrates the essential features of the sample for both tests. In the box plot, the box contains three vertical lines that represent the lower quartile, the median, and the upper quartile. The line in the centre of the box is the median, the midpoint that divides the data into two halves. The whiskers extend from the box to the most extreme values that are not classified as outliers, while outliers are drawn as individual points beyond the whiskers. The length of the box and the position of the whiskers offer insights into the variability of the variable; the further the whiskers are from the box, the greater the spread of the data.
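As an illustration of how the elements of a box plot are commonly derived, the sketch below computes the quartiles, whisker ends, and outliers for a small set of scores. It assumes Tukey's 1.5 × IQR rule for outliers; the text does not state which rule DATAtab applies. The function name `box_plot_summary` and the example scores are hypothetical and are not the study data.

```python
from statistics import quantiles

def box_plot_summary(scores, k=1.5):
    """Quartiles, whisker ends and outliers under Tukey's k * IQR rule."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence, upper_fence = q1 - k * iqr, q3 + k * iqr
    # Whiskers end at the most extreme values still inside the fences.
    inside = [s for s in scores if lower_fence <= s <= upper_fence]
    outliers = sorted(s for s in scores if s < lower_fence or s > upper_fence)
    return {
        "q1": q1, "median": median, "q3": q3,
        "whisker_low": min(inside), "whisker_high": max(inside),
        "outliers": outliers,
    }

# Hypothetical scores for illustration only, not the study data.
example_scores = [1, 3, 4, 5, 5, 5, 6, 7, 8, 15]
summary = box_plot_summary(example_scores)
print(summary)
```

In this example, the score of 15 lies above the upper fence and would be drawn as an individual point, while the upper whisker would end at 8.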
The descriptive analysis indicates an improvement in the mean, median, and overall score distribution from the pre-test to the post-test, suggesting that the intervention between these two tests was effective. However, the increased variability in the post-test also indicates that some participants improved considerably while others did not. Hypothesis testing was used to determine whether the difference suggested by the descriptive statistics is statistically significant. To answer research question RQ1 (a), the following hypotheses were formulated:
A paired t-test was used for hypothesis testing. The primary condition for applying the paired t-test is that the sample is normally distributed (strictly, the assumption concerns the differences between the paired scores). This condition is particularly important when the sample size is small (usually n < 30), as the statistical methods of the t-test rely on a normal distribution. Therefore, the normality of the samples was first tested (see Table 3). High p-values, that is, values greater than 0.05, indicate that the data do not differ significantly from a normal distribution.
We used the t-test for paired samples to test our hypothesis. This test is appropriate when the same subjects are measured under two different conditions, such as before and after an intervention. Table 4 presents the results of the paired samples t-test.
In this case, the calculated t-statistic (t) is 4.36. The t-statistic measures the difference between the two tests relative to the variation in the data. A higher absolute value of the t-statistic indicates a larger difference between the pre- and post-tests relative to that variation. The sign of the t-statistic shows the direction of the difference, here indicating that the mean score for the pre-test is lower than that for the post-test. The number of degrees of freedom for this test is 30 (n − 1 for n = 31 pairs). A p-value of less than 0.001 means that, if the null hypothesis were true, the probability of obtaining a result at least as extreme as the one observed would be below 0.1%. Typically, a p-value of less than 0.05 is regarded as statistically significant (Tenny & Abdelgawad, 2023), so there is a statistically significant difference between the means of the pre-test and post-test. Additionally, Cohen’s d of 0.78 suggests a medium-to-large effect size, just below the conventional threshold of 0.8 for a large effect (Cohen, 1988).
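The relationship between the reported t-statistic and effect size can be sketched in a few lines of Python. The sketch assumes one common definition of Cohen’s d for paired samples (the mean difference divided by the standard deviation of the differences, so that d = t / √n); the text does not state which formula DATAtab uses, although the reported values are consistent with this one. The function name `paired_t_and_d` and the example scores are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t_and_d(pre, post):
    """Return (t, df, d_z) for paired samples.

    t   = mean(diff) / (sd(diff) / sqrt(n))
    d_z = mean(diff) / sd(diff), which equals t / sqrt(n)
    Assumes the differences are not all identical (sd > 0).
    """
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    d_z = mean(diffs) / stdev(diffs)
    return d_z * sqrt(n), n - 1, d_z

# Hypothetical scores for illustration only, not the study data.
t_demo, df_demo, d_demo = paired_t_and_d([4, 5, 6, 5], [6, 7, 6, 9])

# Consistency check on the reported values: d = t / sqrt(n).
d_from_report = 4.36 / sqrt(31)
print(round(d_from_report, 2))  # 0.78
```

With t = 4.36 and n = 31, this definition gives d ≈ 0.78, matching the reported effect size.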
We reject the null hypothesis based on the provided data and the p-value. Consequently, the answer to research question RQ1 (a) is that Blue Yeti demonstrates moderately high effectiveness as a tool in the short term, specifically immediately after playing the game.
7.2. Questionnaire Results
All 31 participants in the experimental group completed the online questionnaire, answering all the questions. In the first part, students were asked to indicate their agreement with ten statements on a 5-point Likert scale. The statements are presented in Table 5.
For the evaluation of the results, answers 4 and 5 (agree and strongly agree) are considered agreement, answers 1 and 2 (strongly disagree and disagree) are categorised as disagreement, while response 3 (neither agree, nor disagree) is considered an indicator of uncertainty.
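The grouping rule described above can be expressed as a small helper. This is a minimal sketch; the function name `categorise` and the example responses are hypothetical and are not the study data.

```python
from collections import Counter

def categorise(response):
    """Map a 5-point Likert response to the category used in the text."""
    if response not in (1, 2, 3, 4, 5):
        raise ValueError(f"not a 5-point Likert response: {response!r}")
    if response >= 4:
        return "agreement"      # 4 = agree, 5 = strongly agree
    if response <= 2:
        return "disagreement"   # 1 = strongly disagree, 2 = disagree
    return "uncertainty"        # 3 = neither agree nor disagree

# Hypothetical responses for illustration only, not the study data.
responses = [5, 4, 3, 2, 4, 1, 3, 5]
counts = Counter(categorise(r) for r in responses)
print(counts)
```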
Table 6 contains descriptive statistical data for the sample, and Figure 8 displays the box plots.
Most students in the experimental group agreed with statement S1, expressing a clear preference for problem solving over learning mathematics through definitions and highlighting the inclination of Generation Z students towards a practical approach in their educational experience. The descriptive statistics for statement S2 suggest that, while the average sentiment about practising Calculus at home is only moderate (mean of 3.35), the typical respondent (median and mode both at 4) felt more positive about their capacity to practise. The relatively high standard deviation indicates diverse opinions, suggesting that, while many feel capable, some students may struggle more with practising at home. Moreover, in response to statement S3, 20 students (64.5%) reported experiencing higher levels of anxiety related to midterms and exams in Calculus than in their other subjects. This finding may indicate the presence of mathematical anxiety within the experimental group. Research has shown that students with elevated levels of anxiety in mathematics are more susceptible to underperforming in their math-related studies (Jameson et al., 2022; Lau et al., 2022; Maldonado Moscoso et al., 2022; Scheibe et al., 2023). Consequently, it becomes crucial to explore strategies to alleviate these anxieties and promote a positive learning environment.
The mean score for the statement about playing board games (S4) is 3.81, indicating that, overall, participants held a moderately positive attitude towards playing board games. This score lies between the neutral midpoint of the scale (3) and agreement (4), suggesting a positive but not overwhelmingly strong attitude. The median value is 4, which supports a central tendency towards a slightly positive outlook. The mode, however, is the highest score (5), reflecting a strong preference for board games among a sizeable part of the experimental group. The mean score of 4.1 for statement S5 indicates a strong interest in incorporating skill-building games into practical lessons. This implies that participants find this concept valuable and enjoyable. A median score of 4 and a mode of 5 reinforce the preference for skill-building games, with most responses leaning positively and a considerable number of students strongly favouring this learning method. The standard deviation of 0.98 indicates that the responses are reasonably consistent. This suggests that most ratings are grouped around the mean, pointing to broad agreement about the value of skill-building games. Responses for statement S6 indicate a strong preference for engaging in mathematics skill-building games outside the classroom. The consistent ratings suggest this interest is robust, with most responses clustering positively around the mean. The data reflect a clear enjoyment and value placed on these activities.
Statement S7 measured the general attitude toward GBL. The mean score of 4.48 indicates a very high level of agreement with the effectiveness of GBL. Most participants agreed on the effectiveness of GBL, showcasing a positive overall attitude towards integrating games into the educational experience. The standard deviation of 0.81 indicates that the responses are relatively consistent, with low variability from the mean. This suggests that there is a general agreement in perceptions regarding the effectiveness of GBL.
The last three statements addressed the experience of learning about improper integrals in a traditional classroom environment. The average score of 2.55 for statement S8 indicates a perceived understanding slightly below the midpoint of the scale. Only three students (9.68%) believed that lectures alone provided a sufficient understanding of this course material. The responses to statement S9 highlight the importance of problem solving in practical lessons. The average score of 3.1 marks an improvement over the average score for S8. However, a mean of 3.1, accompanied by a median and mode of 3, indicates that students believe they have only a moderate understanding of improper integrals, even after attending both lectures and practical lessons. In the last statement (S10), participants were asked to share their opinions on the difficulty level of improper integrals. The average score of 3.35 suggests that students generally perceive improper integrals as somewhat difficult but not overwhelmingly so.
A correlation analysis was conducted using Kendall tau non-parametric correlation coefficients to uncover the relationships between the statements. The heatmap visualises the correlation matrix using Kendall tau values, illustrating the strength of the relationships between the variables (Figure 9).
This approach was selected for several reasons. First, the relatively small sample size renders traditional parametric methods less reliable, as they can produce unstable results with limited observations. Additionally, our dataset does not meet the normality assumption (see Table A1 in Appendix B), which is essential for many parametric tests. By employing the Kendall tau, we can analyse correlations without this requirement, leading to a more accurate reflection of the relationships among the variables. The Kendall tau focuses on the ordinal ranks of the data, making it particularly suitable for analysing Likert scale responses, which are ranked but not necessarily equidistant. This method allows us to gain insights into the strength and direction of associations while accommodating the nature of our data. The main diagonal of the matrix contains only values of 1. The colour scale of the heatmap distinguishes strong, moderate, and weak correlations, facilitating the interpretation of the data. Strong correlations are not observed. Moderate correlations, typically defined as those between 0.3 and 0.7, show interesting relationships, such as between S8 and S9 (0.59) and between S5 and S6 (0.58). Other moderate correlations are observed between S5 and S7 (0.45), S4 and S7 (0.38), and S1 and S10 (0.34). The remaining correlations are weak.
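To make the rank-based nature of the coefficient concrete, the following sketch computes Kendall's tau-b, the variant that adjusts for ties and is therefore commonly used for Likert data, and classifies its strength. It is a plain-Python illustration and not the notebook code used in the study; library routines such as `scipy.stats.kendalltau` compute tau-b by default. The names `kendall_tau_b` and `strength` are hypothetical, and the treatment of the boundary values 0.3 and 0.7 is an assumption.

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b for two equally long sequences of ordinal scores."""
    concordant = discordant = only_x_tied = only_y_tied = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                  # tied on both variables
        elif dx == 0:
            only_x_tied += 1
        elif dy == 0:
            only_y_tied += 1
        elif dx * dy > 0:
            concordant += 1           # pair ordered the same way on x and y
        else:
            discordant += 1           # pair ordered in opposite ways
    untied = concordant + discordant
    denom = sqrt((untied + only_x_tied) * (untied + only_y_tied))
    return (concordant - discordant) / denom if denom else float("nan")

def strength(tau):
    """Classify |tau| using the 0.3 and 0.7 thresholds from the text."""
    size = abs(tau)
    if size < 0.3:
        return "weak"
    return "moderate" if size <= 0.7 else "strong"

# Hypothetical responses for illustration only, not the study data.
tau_demo = kendall_tau_b([1, 2, 2, 3], [1, 2, 3, 3])
print(tau_demo, strength(0.59))
```

Because only the ordering of the responses enters the calculation, the result does not depend on the distances between adjacent Likert categories.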
The moderate correlation observed between S8 and S9 refers to a link between students’ perceived understanding of the topic of improper integrals after attending only the lectures and their understanding following both lectures and practical lessons. This finding aligns with expectations, as after the lectures, practical lessons provide a better understanding of the given topic. A correlation coefficient of 0.58 between S5 and S6 suggests that students who express an interest in engaging with skill-building games in the classroom are more likely to want to play beyond the confines of the classroom as well. Additionally, the analysis revealed that students who perceive GBL as effective (S7) often express an interest in playing skill-building games in practical lessons (S5). The findings indicate that personal preferences may influence students’ attitudes toward GBL. Specifically, those who enjoy board games as a pastime (S4) are more likely to consider GBL effective (S7). The moderate correlation between S1 and S10 suggests a relationship between students’ preference for practical problem-solving over theoretical learning and their perception of improper integrals as one of the most challenging topics in their Calculus II course. This observation may indicate that the instruction on this topic lacks sufficient practical applications and is overly focused on theoretical aspects.
The second section of the questionnaire focused on the specific aspects of Blue Yeti. The questions are shown in Table 7. The first nine questions were answered on a 5-point Likert scale. The results indicate high levels of satisfaction among the majority of respondents. Table 8 contains descriptive statistical data for the sample, and Figure 10 shows box plots. The last question (Q10) was open-ended.
Specifically, the game’s design and enjoyability both received positive feedback. The high mean of 4.61 for question Q1, combined with a median and mode of 5, strongly indicates that the majority of respondents viewed the graphics and design of Blue Yeti positively. The relatively low standard deviation (0.67) and variance (0.45) suggest that most respondents have similar views on the topic, reflecting a consensus on the quality of the graphics. The ratings for Q2 suggest that Blue Yeti is considered an enjoyable game by most players, with a good concentration of scores around 4. The mean score of 4.23 indicates that, on average, players find the gameplay quite enjoyable. This is a positive sign for the game’s reception. A standard deviation of 0.72 indicates moderate variability in the ratings.
According to the participants’ responses to question Q3, the game was perceived as neither easy nor difficult. The mean score of 3.39 indicates a moderate level of difficulty. However, the standard deviation of 1.02 reveals considerable variability in the ratings, suggesting that perceptions of the game’s difficulty ranged widely, with some participants finding it easier and others much more challenging. Responses to question Q4, which addressed the competitive aspect of Blue Yeti, reflect a generally positive outlook. The mean rating of 3.87 indicates that participants found this aspect somewhat motivating, while the median of 4 suggests that at least half the respondents rated it positively. However, the standard deviation of 0.88 and the range of responses from a minimum of 2 to a maximum of 5 indicate that not all participants felt equally driven by the competitive element of the game.
Students rated the interest level of Blue Yeti (Q5) very positively, with a mean score of 4.52 and a median of 5, which suggests that at least half the respondents found the game highly interesting. Despite a minimum score of 2, the majority rated the game’s interest level favourably, indicating that Blue Yeti effectively captures participants’ interest, making it a compelling alternative to traditional problem-solving activities.
The descriptive statistical data for Q6 suggests that playing Blue Yeti was perceived as helpful for learning or reviewing the comparison test for improper integrals. The mean, median and mode all indicate that participants rated their experience positively, with most scores clustered around 4 or higher. However, the standard deviation of 0.81 indicates some variability in responses, meaning not everyone had the same level of satisfaction. The minimum score of 2 highlights that while the majority found value in the game, a few respondents were less convinced of its effectiveness. Despite this variability, 24 students (77%) responded to Q6 with a score higher than 3, indicating that Blue Yeti generally had a favourable impact on participants’ learning experiences related to improper integrals.
The responses to Q7, regarding how challenging Blue Yeti would be in the absence of colour coding for convergent and divergent integrals, reveal a strong consensus among participants. Most students agree on the necessity of colour coding for clarity. The mean rating of 4.45 and the variance of 0.52 indicate that most respondents believed the game would become considerably more difficult without this feature. Here, we note that no students affected by colour blindness or other visual impairments participated in this experiment. These results reinforce the idea that the colour marking of cards facilitates the understanding of convergent and divergent integrals during the game. In Blue Yeti GBL, the colour coding of cards advances the learning process, enabling players to bridge the gap between their current understanding and higher levels of comprehension more quickly within the context of ZPD.
Participants generally perceived their experience with Blue Yeti positively. The mean rating of 4.61 for Q8 suggests that most students found the game enjoyable and beneficial. The mode of 5 shows that the highest rating was the most frequently chosen one, while the standard deviation of 0.62 indicates low variability in the responses.
The data for Q9 reflect a strong overall interest in playing Blue Yeti or similar educational games in the future, as evidenced by a mean rating of 4.29. The standard deviation of 0.74 indicates that some participants expressed lower levels of interest, pointing to a degree of diversity in individual preferences.
The second part of the questionnaire was analysed using Kendall tau non-parametric correlation coefficients to examine the relationships among the questions, because the dataset does not meet the normality assumption (see Table A2 in Appendix B). The heatmap (Figure 11) illustrates the correlation matrix of Kendall tau values. Strong correlations were not observed. Several moderate correlations exist, including the following: Q5 and Q8 (0.530), Q6 and Q9 (0.512), Q2 and Q8 (0.439), Q2 and Q5 (0.409), and Q2 and Q4 (0.370). The remaining correlations are weaker.
The correlations observed between Q5 and Q8, Q2 and Q8, and Q2 and Q5 suggest that participants’ overall experience with Blue Yeti is connected to their enjoyment of the game and their levels of interest. Specifically, students who found Blue Yeti more interesting compared to traditional problem solving (Q5) also tended to report higher enjoyment (Q2) and a more positive overall game experience (Q8). Additionally, a relationship was identified between the perceived enjoyability of the game (Q2) and its motivational impact (Q4). The moderate correlation between Q6 and Q9 suggests that students who perceived Blue Yeti as beneficial to their learning process exhibited a greater interest in playing educational games in the future.
7.3. Answer for Research Question RQ2
Regarding Q10, most participants either left the open-ended question unanswered, indicating no additional comments, or used the text field to provide positive feedback about the game, noting that its design and functionality are well developed and require no changes. Only four students proposed minor adjustments to the visual elements of the card deck and the logistics of the gameplay session. No suggestions were made concerning the game’s rules or educational content. Some of the responses received include the following:
“I was very satisfied with the appearance of the game”.
“I don’t think it needs any changes”.
“I think it is skillfully executed and well crafted”.
“The drawing of the Yeti turned out surprisingly well”.
The research question RQ2, concerning student perceptions of Blue Yeti, was answered using participants’ responses to the open-ended question and descriptive statistics from Q1, Q2, Q3, and Q8. For Q1 and Q2, 28 students (90.32%) assigned a score of 4 or 5, rating the design and enjoyability of the game very favourably. The difficulty level of Blue Yeti (Q3) was perceived as moderate by most students, with 18 participants (58%) providing a rating below 4. Additionally, 29 students (93.55%) gave the overall game experience a score of 4 or 5. These data indicate generally favourable student perceptions, with overwhelmingly positive views on the game design (Q1) and the overall experience (Q8), a moderately difficult game experience (Q3) with considerable individual variation, and a high level of enjoyment (Q2).
Additionally, Q6 suggests that the game effectively supports learning, with most students (77.42%) rating the contribution of Blue Yeti to their learning process 4 or 5. This reinforces the favourable student perceptions regarding the game’s value as an educational tool.
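The percentages reported in this subsection follow directly from the respondent counts and the sample size of 31. A minimal sketch of the arithmetic, with the hypothetical helper name `percentage`:

```python
def percentage(count, total=31):
    """Share of `count` respondents out of `total`, in percent (2 decimals)."""
    return round(100 * count / total, 2)

# Counts quoted in the text: 18 (Q3 below 4), 24 (Q6 above 3),
# 28 (Q1 and Q2 at 4 or 5), 29 (Q8 at 4 or 5).
shares = {count: percentage(count) for count in (18, 24, 28, 29)}
print(shares)
```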
In summary, students display strong approval of the Blue Yeti game, with minor suggestions for improvement, affirming its effectiveness in providing an enjoyable and educational experience.
7.5. The Medium-Term Effects of the Blue Yeti GBL and Answer for Research Question RQ1 (b)
To measure the medium-term effects of the Blue Yeti GBL, we examined the distribution of scores obtained by students on a three-part task included in the second midterm exam. The maximum possible score on this task was six points. In order to address research question RQ1 (b), the following hypotheses were formulated:
Null hypothesis: There is no difference between the experimental and control groups with respect to the dependent variable.
Alternative hypothesis: There is a difference between the experimental and control groups regarding the dependent variable.
Table 9 presents the descriptive statistical data for both the experimental and control groups, facilitating a comparison of their performance. This table includes key metrics such as means, standard deviations, and sample sizes, offering insight into the overall performance of each group. The data summarised in Table 9 are important for understanding the distribution of scores and establishing the foundations for the statistical analysis.
The average score of the experimental group (3.03) is considerably higher than that of the control group (1.78), indicating that participants in the experimental group generally outperformed those in the control group on the exam task. Additionally, the medians of the two groups reflect a higher central tendency for the experimental group. The experimental group also had a higher maximum score (6) than the control group (4). Furthermore, with a standard deviation of 1.78 and a variance of 3.17, the experimental group displayed more variability in scores than the control group.
Figure 12 illustrates the box plots for both groups, providing a visual representation of their score distributions.
These descriptive results suggest a difference in performance between the experimental and control groups, which is consistent with the alternative hypothesis that the two groups yield different outcomes for the dependent variable. Statistical testing was conducted to verify this conclusion.
The first step involved determining whether the samples followed a normal distribution. The Kolmogorov–Smirnov, Shapiro–Wilk, and Anderson–Darling tests were used (see Table 10), revealing that the data do not conform to a normal distribution. The low p-values (lower than 0.05) obtained for the Shapiro–Wilk and Anderson–Darling tests suggest that the data deviate significantly from normality.
As the normality assumption was not met, the Mann–Whitney U-test was used to compare the performance of the experimental and control groups. This rank-based test does not require the data to be normally distributed.
Table 11 shows the results of the Mann–Whitney U-test.
The results of the Mann–Whitney U-test reveal a statistically significant difference between the performance of the two groups. The U statistic obtained was 303.5, which corresponds to a z-score of −2.69. The negative sign of the z-value reflects the direction of the difference, that is, the ranks of one group are lower than those of the other. The asymptotic p-value of 0.007 and the exact p-value of 0.009 both fall below the conventional alpha level of 0.05. This evidence allows for the rejection of the null hypothesis, which posited that there would be no difference between the groups. Consequently, these findings support the alternative hypothesis that there is a significant difference in the outcome measure. Furthermore, the effect size, measured by r, is 0.34, indicating a medium effect.
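The quantities reported for the Mann–Whitney U-test can be illustrated with a short sketch. It assumes the normal approximation without tie or continuity corrections and a common definition of the effect size, r = |z| / √N, where N is the total number of participants in both groups; statistical packages may apply corrections, so their values can differ slightly. The function name `mann_whitney` and the example scores are hypothetical and are not the study data.

```python
from math import sqrt

def mann_whitney(group_a, group_b):
    """Return (U, z, r) using the normal approximation without tie correction."""
    n1, n2 = len(group_a), len(group_b)
    # U for group A: pairs where A's value is larger, ties counted as 0.5.
    u_a = sum(1.0 if a > b else 0.5 if a == b else 0.0
              for a in group_a for b in group_b)
    u = min(u_a, n1 * n2 - u_a)                   # the smaller U is reported
    z = (u - n1 * n2 / 2) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    r = abs(z) / sqrt(n1 + n2)                    # effect size r = |z| / sqrt(N)
    return u, z, r

# Hypothetical scores for illustration only, not the study data.
u_demo, z_demo, r_demo = mann_whitney([3, 4, 6], [1, 2, 5])
print(u_demo, round(z_demo, 3), round(r_demo, 3))
```

Because the smaller of the two U values is used, z is negative or zero by construction, and its magnitude, not its sign, determines the effect size r.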
The difference between the experimental and control groups is statistically significant. The Blue Yeti GBL, the intervention received by the experimental group, was associated with a medium-sized effect, which answers research question RQ1 (b). These results highlight the potential positive influence of the Blue Yeti GBL on students’ scores, warranting further investigation and consideration in future research.