Intelligent auxiliary system for music performance under edge computing and long short-term recurrent neural networks
RESEARCH ARTICLE

Yi Wang

* YvonneWang@ku.edu

Citation: Wang Y (2023) Intelligent auxiliary system for music performance under edge computing and long short-term recurrent neural networks. PLoS ONE 18(5): e0285496. https://doi.org/10.1371/journal.pone.0285496

Editor: Muhammad Fazal Ijaz, Sejong University, REPUBLIC OF KOREA

Received: October 31, 2022
Accepted: April 24, 2023
Published: May 8, 2023

Peer Review History: PLOS recognizes the benefits of transparency in the peer review process; therefore, we enable the publication of all of the content of peer review and author responses alongside final, published articles. The editorial history of this article is available here: https://doi.org/10.1371/journal.pone.0285496

Copyright: © 2023 Yi Wang. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Abstract

Music performance action generation, a research hotspot in computer vision and cross-sequence analysis, can be applied in multiple real-world scenarios. However, current methods for generating music performance actions have consistently ignored the connection between music and performance actions, resulting in a strong sense of separation between visual and auditory content. This paper first analyzes the attention mechanism, the Recurrent Neural Network (RNN), and the long and short-term RNN. The long and short-term RNN is suitable for sequence data with a strong temporal correlation. On this basis, the current learning method is improved, and a new model combining the attention mechanism and the long and short-term RNN is proposed, which can generate performance actions based on music beat sequences. In addition, image description generative models with attention mechanisms are adopted technically. Combined with the RNN abstract structure that does not consider recursion, the abstract network structure of RNN-Long Short-Term Memory (LSTM) is optimized. Through music beat recognition and dance movement extraction technology, data resources are allocated and adjusted in the edge server architecture. The metric for the experimental results and evaluation is the model loss function value. The superiority of the proposed model is mainly reflected in the high accuracy and low consumption rate of dance movement recognition. The experimental results show that the loss function value of the model is at least 0.00026, and the video effect is the best when the number of layers of the LSTM module in the model is 3, the node value is 256, and the Lookback value is 15. Compared with the other three cross-domain sequence analysis models, the new model can generate harmonious and rich performance action sequences while ensuring the stability of performance action generation, and it performs excellently in combining music and performance actions. This paper has practical reference value for promoting the application of edge computing technology in intelligent auxiliary systems for music performance.
The rest of this paper is organized as follows. Section 1 provides the background of the discussion on music performance and dance movement recognition. Section 2 introduces the attention mechanism and neural network models. Section 3 analyzes the training and test results of the model through the establishment and experiment of the network model. Section 4 draws research conclusions through systematic induction and summary. The research has practical reference value for promoting the intelligent level of music performance.
2.2 RNN
The earliest research on RNNs dates back to the 1980s and 1990s, and the RNN has since become one of the most common algorithms in DL. An RNN takes sequence data as input: the input is processed recursively along the evolution direction of the sequence, and all recurrent units are connected in a chain. Its primary function is to perform repeated operations on a set of sequential inputs [7]. In a CNN, there is no inherent correlation between one input and the next; the inputs can be treated as relatively independent, and the outputs can likewise be completely uncorrelated. Therefore, the connection between earlier and later elements cannot be captured when processing time-series data. In theory, each layer of an RNN can memorize the information of all previous units, suggesting that it can look back indefinitely. In practice, an RNN can only remember the information of a few steps back, so most scholars improve the memory of earlier information by improving the structure of the RNN [8]. Fig 2 displays the simple structure of the RNN.
In Fig 2, the vectors X, S, and O represent the values of the input, hidden, and output layers. The matrices U and V represent the weights applied to the data from the input layer to the hidden layer and from the hidden layer to the output layer, respectively. If the recurrent layer is not considered, the specific structure is shown in Fig 3.

Fig 3. The specific structure of the RNN network without considering the recurrent layer.
https://doi.org/10.1371/journal.pone.0285496.g003

From Fig 3, this is a simple, fully connected NN. X, the input layer, is a three-dimensional vector; U is the parameter matrix from the input layer to the hidden layer, with dimension 3×4 in the figure; S is the vector of the hidden layer; V is the parameter matrix from the hidden layer to the output layer, with dimension 4×2; and O is the vector of the output layer, with dimension 2 [9, 10]. When Fig 2 is expanded along the timeline, Fig 4 shows its structure.
From Fig 4, when the input value Xt is received at time t, the value of the hidden layer is St, and the output value is Ot. This is how the RNN handles sequence problems in principle: it can remember the information of every moment. The hidden layer St at each moment is determined by the input layer Xt at that moment and the hidden layer St−1 at the previous moment [11]. Each layer of the RNN is calculated as follows.
$O_t = g(V S_t)$ (1)

The RNN output layer can be calculated according to Eq (1), where Ot represents the output at time t and the output layer is fully connected. The hidden layer can be calculated according to Eq (2), where St represents the value of the hidden layer at time t, and U and W represent the corresponding weight matrices [12]:

$S_t = f(U X_t + W S_{t-1})$ (2)

Substituting Eq (2) into Eq (1) repeatedly gives:

$O_t = V f(U X_t + W S_{t-1})$ (3)
In Eqs (3)–(6), finding the value of the output layer Ot of the RNN requires considering the influence of all the previous inputs Xt, Xt−1, Xt−2, Xt−3, . . .. This is exactly why RNNs are suitable for time-series data. However, the scale of the model parameters becomes extremely large as the number of network layers increases, which leads to gradient explosion and gradient vanishing during training. As a result, the memory of the current layer for distant earlier layers decays rapidly [13]. The vanishing gradient problem can be mitigated by adjusting the initialized weights, changing the activation function, or using a variant of the RNN. The most suitable variant is the long and short-term RNN.
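As an illustration of the recurrence in Eqs (1)–(3), the following minimal NumPy sketch unrolls an RNN over a short sequence. The dimensions (3-dimensional input, 4-dimensional hidden state, 2-dimensional output) follow the example of Fig 3; the choice of tanh for f, the identity for g, and the random weights are illustrative assumptions rather than settings used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.normal(size=(4, 3))   # input -> hidden weights
W = rng.normal(size=(4, 4))   # hidden -> hidden (recurrent) weights
V = rng.normal(size=(2, 4))   # hidden -> output weights

def rnn_forward(xs):
    """xs: sequence of 3-dim input vectors; returns the output at each step."""
    s = np.zeros(4)                    # S_0: initial hidden state
    outputs = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)     # Eq (2): hidden state update
        outputs.append(V @ s)          # Eqs (1)/(3): output layer
    return outputs

outs = rnn_forward([rng.normal(size=3) for _ in range(5)])
```

Because each output depends on the hidden state carried forward from all earlier steps, the influence of X1 on the final output passes through repeated multiplications by W, which is the source of the gradient explosion and vanishing problems noted above.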
In the LSTM network structure, the functions of the three control gates can be represented by switches. The three control gates are the forget gate, the input gate, and the output gate. The forget gate determines how much of the cell state ct−1 at the previous moment is retained in the current cell state ct. The input gate determines how much of the current input xt is written into the cell state ct. The output gate determines how much of the long-term cell state ct flows into the model output ht at the current moment. The function of gating is to allow information to pass selectively. The output of the Sigmoid activation function ranges from zero to one and defines the degree to which information passes through the gate: zero means that no information passes, while one means that all information passes [16]. The specific calculation process is as follows.
$f_t = \sigma(w_{fx} x_t + w_{fh} h_{t-1} + b_f)$ (7)

$i_t = \sigma(w_{ix} x_t + w_{ih} h_{t-1} + b_i)$ (8)
Fig 7. The framework structure diagram of the proposed intelligent assistance system.
https://doi.org/10.1371/journal.pone.0285496.g007

The forget and input gates can be calculated according to Eqs (7) and (8). Eq (9) represents the calculation of the updated candidate state. c represents the internal state, h represents the system state, and b represents the bias term. f, i, and o denote the forget gate, the input gate, and the output gate, respectively. σ represents the Sigmoid function, and tanh represents the hyperbolic tangent function. The two factors influencing the current candidate state c̃t are the output of the previous unit and the input given by the current state of the network [17]:

$\tilde{c}_t = \tanh(w_{cx} x_t + w_{ch} h_{t-1} + b_c)$ (9)

On this basis, the cell state ct is obtained according to Eq (10).
$c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t$ (10)

In Eq (10), ct−1 represents the unit state at the previous moment, ft represents the forget gate, and it represents the input gate at the current moment. The symbol ⊙ represents the element-wise multiplication of two vectors. Through the above operations, the current candidate memory c̃t in the long and short-term RNN can be combined with the long-term memory ct−1 accumulated previously. The calculation of the output gate is shown below.

$o_t = \sigma(w_{ox} x_t + w_{oh} h_{t-1} + b_o)$ (11)

The final output of the long and short-term RNN is jointly determined by the output gate and the unit state.

$h_t = o_t \odot \tanh(c_t)$ (12)
From the above calculations, the long and short-term RNN introduces the forget gate, which retains useful information and discards useless information, so it can save capacity to memorize earlier data in the time series. The input gate mainly feeds new information in continuously so that the earlier temporal information is updated efficiently. The output gate weighs the value of the information stored in the previous period. The final output is determined jointly by the output gate and the unit state. Long and short-term RNNs have played a crucial role in the development of RNN variants [18].
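The gate equations (7)–(12) can be summarized in a single step function. The following NumPy sketch is a plain restatement of those equations for one time step; the toy dimensions and random weights are illustrative assumptions, not the configuration used later in the paper.

```python
import numpy as np

def lstm_step(x_t, h_prev, c_prev, p):
    """One LSTM step following Eqs (7)-(12); p holds the weight matrices and biases."""
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    f_t = sigmoid(p["w_fx"] @ x_t + p["w_fh"] @ h_prev + p["b_f"])      # Eq (7): forget gate
    i_t = sigmoid(p["w_ix"] @ x_t + p["w_ih"] @ h_prev + p["b_i"])      # Eq (8): input gate
    c_tilde = np.tanh(p["w_cx"] @ x_t + p["w_ch"] @ h_prev + p["b_c"])  # Eq (9): candidate state
    c_t = f_t * c_prev + i_t * c_tilde                                  # Eq (10): cell state update
    o_t = sigmoid(p["w_ox"] @ x_t + p["w_oh"] @ h_prev + p["b_o"])      # Eq (11): output gate
    h_t = o_t * np.tanh(c_t)                                            # Eq (12): output
    return h_t, c_t

# Toy usage with a 3-dim input and a 4-dim hidden/cell state.
rng = np.random.default_rng(1)
dim_x, dim_h = 3, 4
p = {k: rng.normal(size=(dim_h, dim_x)) for k in ("w_fx", "w_ix", "w_cx", "w_ox")}
p.update({k: rng.normal(size=(dim_h, dim_h)) for k in ("w_fh", "w_ih", "w_ch", "w_oh")})
p.update({k: np.zeros(dim_h) for k in ("b_f", "b_i", "b_c", "b_o")})
h, c = np.zeros(dim_h), np.zeros(dim_h)
h, c = lstm_step(rng.normal(size=dim_x), h, c, p)
```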
This paper regards the dancer's movements as a collection of coordinate points of different human body parts to simplify the actions and obtain critical dance features. Therefore, the action information can be obtained by recognizing the human body posture in each frame of the video, so that the position information can be saved as the dance feature. Here, OpenPose, an open-source system for multi-person pose detection, is used for initial human pose estimation; it can detect the main parts of the human body, including the limbs and the face. Feature extraction is revealed in Fig 9.
From Fig 9, extracting the Mel spectrogram requires adjusting the number of frames so that the music and dance features share the same frame rate. After the OpenPose system processes the video, the 2D coordinate positions of 14 key points, including those of the simulated robot dance actions in each frame, can be obtained, as shown in Fig 10.
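A minimal sketch of this frame-rate matching is given below. It assumes librosa is used for the audio features (the paper does not name a library); the file name, video frame rate, and number of Mel bands are illustrative placeholders. The hop length is chosen so that one spectrogram frame corresponds to one video frame.

```python
import librosa

video_fps = 25                                    # assumed frame rate of the dance video
y, sr = librosa.load("music.wav", sr=None)        # placeholder audio file
hop_length = int(round(sr / video_fps))           # one spectrogram frame per video frame
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80, hop_length=hop_length)
mel_db = librosa.power_to_db(mel)                 # log-scaled Mel spectrogram, shape (80, n_frames)
```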
Fig 10 shows the specific positions of the 14 limb key points of the simulated robot when performing dance movements. The coordinate information is normalized to the interval [-1, 1]: on the one hand, this makes subsequent data processing convenient; on the other hand, it allows the model to converge quickly.
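One plausible way to carry out this normalization is a per-axis min-max mapping onto [-1, 1], as in the sketch below. The paper only states that the coordinates are limited to the interval [-1, 1], so the exact formula and the array layout (frames × 14 keypoints × 2 coordinates) are assumptions.

```python
import numpy as np

def normalize_keypoints(frames):
    """Scale 2D keypoints into [-1, 1] per coordinate axis.

    frames: array of shape (n_frames, 14, 2) holding OpenPose keypoints.
    Sketch of one plausible min-max scheme; the exact formula is an assumption.
    """
    frames = np.asarray(frames, dtype=float)
    lo = frames.min(axis=(0, 1), keepdims=True)   # per-axis minimum over all frames and joints
    hi = frames.max(axis=(0, 1), keepdims=True)   # per-axis maximum
    return 2.0 * (frames - lo) / (hi - lo) - 1.0  # map [lo, hi] -> [-1, 1]

poses = normalize_keypoints(np.random.default_rng(2).uniform(0, 480, size=(100, 14, 2)))
```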
The system is organized into several layers, including the edge layer and the field layer. On the monitoring platform for music and dance movements, musical features are jointly localized through multiple microphones. Communication and data transmission between the individual system modules are carried out via routers. The hardware structure of the established edge server system is shown in Fig 11.
From Fig 12, the NN structure is an improved model based on the encoder-decoder long and short-term RNN. An attention mechanism is introduced to address the problem that the long and short-term RNN ignores the relationships between elements in the music sequence and assigns the same semantic code to every frame of its output. On the one hand, all states in the encoder are preserved and each element is assigned its own weighted average, so the semantic code corresponding to each output frame is different. On the other hand, a new attention mechanism can be added while the sequence is processed to capture the relationships between the sequence elements. The new model is divided into three modules: the LSTM module processes the input information, the Dense module outputs a sequence, and the Attention module changes the decoding process of the decoder.

The whole process is as follows. First, the beat features are extracted from the music, and the position features are extracted from the dance sequences with OpenPose. Then, the feature information of the two is fed into the encoder network and passes through each layer in turn. The encoder result is fed into the decoder network and passes through each layer. When the new model is trained, the music data is used as features and the dance data as labels. The connection between music and dance actions is represented by adding the parameter Lookback. During training, the parameters of the NN are initialized first, and the forward propagation of the NN is performed. The back-propagation calculation of the network is implemented through the obtained loss function value, and the weight parameters of each layer are updated. When the number of iterations exceeds the maximum number of iterations, or the loss function is less than the error threshold, the training ends, and all training samples are saved for prediction.
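A minimal sketch of how such a Lookback pairing could be built is shown below: each dance pose is treated as the label for the preceding window of music features. The array shapes and the specific pairing rule are assumptions for illustration; the paper only states that the Lookback parameter links the music and dance sequences.

```python
import numpy as np

def make_lookback_windows(music_feats, dance_poses, lookback=15):
    """Pair each dance pose with the preceding `lookback` frames of music features.

    music_feats: (n_frames, n_music_features); dance_poses: (n_frames, n_pose_features).
    Returns X of shape (n_samples, lookback, n_music_features) and
    y of shape (n_samples, n_pose_features).
    """
    X, y = [], []
    for t in range(lookback, len(music_feats)):
        X.append(music_feats[t - lookback:t])   # music context window (features)
        y.append(dance_poses[t])                # dance pose at the current frame (label)
    return np.asarray(X), np.asarray(y)

rng = np.random.default_rng(3)
X, y = make_lookback_windows(rng.normal(size=(500, 80)), rng.normal(size=(500, 28)), lookback=15)
# X.shape == (485, 15, 80); y.shape == (485, 28), i.e., 14 keypoints x 2 coordinates
```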
To configure the proposed model, the network structure characteristics of the RNN and LSTM are analyzed. The input size is 50, the tensor size is 100, the kernel size is 2×2, and the stride is 2. The activation function is Sigmoid. The proposed model runs for 500 iterations with an initial learning rate of 0.005 and saturates after about 400 iterations. Combined with the relevant literature [20], the specific settings of the network structure parameters are given in Table 1.
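The reported settings (three LSTM layers with 256 nodes, a learning rate of 0.005, 500 iterations, and training that stops once the loss falls below a threshold) can be expressed as a Keras-style configuration. The sketch below is an interpretation under those settings, not the authors' implementation: the exact layer arrangement, the use of self-attention followed by pooling, the mean-squared-error loss, and the 0.0022 stopping threshold are assumptions, and the input and output sizes follow the earlier sketches.

```python
import tensorflow as tf
from tensorflow.keras import layers

lookback, n_music, n_pose = 15, 80, 28            # assumed sizes from the earlier sketches

inp = layers.Input(shape=(lookback, n_music))
h = layers.LSTM(256, return_sequences=True)(inp)  # LSTM module, layer 1
h = layers.LSTM(256, return_sequences=True)(h)    # LSTM module, layer 2
h = layers.LSTM(256, return_sequences=True)(h)    # LSTM module, layer 3
context = layers.Attention()([h, h])              # attention over the LSTM states
pooled = layers.GlobalAveragePooling1D()(context)
out = layers.Dense(n_pose, activation="tanh")(pooled)  # poses normalized to [-1, 1]

model = tf.keras.Model(inp, out)
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.005), loss="mse")


class StopAtLossThreshold(tf.keras.callbacks.Callback):
    """Stop training once the loss drops below a threshold, as described above."""

    def __init__(self, threshold=0.0022):          # threshold is an illustrative value
        super().__init__()
        self.threshold = threshold

    def on_epoch_end(self, epoch, logs=None):
        if logs and logs.get("loss", float("inf")) < self.threshold:
            self.model.stop_training = True

# model.fit(X, y, epochs=500, batch_size=32, callbacks=[StopAtLossThreshold()])
```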
Fig 13. Loss functions and visual effects with different layers of LSTM modules.
https://doi.org/10.1371/journal.pone.0285496.g013
When the Lookback value is 15, the loss function value is the smallest, and the change is noticeable. Finally, the Lookback value is set to 15 to strengthen the connection between music and dance actions.
Table 2. Loss function values of LSTM modules with different numbers of layers and nodes.
Model  LSTM-1  LSTM-3  LSTM-5  LSTM (nodes = 36)  LSTM (nodes = 128)  LSTM (nodes = 256)
Loss   0.0036  0.0028  0.0028  0.00433            0.00282             0.00279
https://doi.org/10.1371/journal.pone.0285496.t002
Fig 14. Loss functions and visual effects with different numbers of nodes in the LSTM module.
https://doi.org/10.1371/journal.pone.0285496.g014
Table 3 suggests that as the Lookback value increases, the loss values of the different cross-domain sequence models all increase to a certain extent. However, the improved model always maintains a loss value of around 0.0022, which is lower than that of the other models. This indicates that the improved model has a lower loss and better performance in establishing the connection between music and dance movements.
4. Discussion
The edge server structure and resource allocation strategy are analyzed, and the results of this paper are compared with those of previous literature. Sun (2020) [21] studied a vocal music teaching system based on mobile edge computing. Through the allocation and study of system teaching resources, a resource allocation method based on power iteration was proposed, with the throughput of the offloading process set as the objective function. It had practical reference value for promoting the heterogeneity of edge servers and the optimization of teaching resources.
resources. Hong et al. (2022) [22] researched the role of machine learning and artificial intelli-
gence in music education for online games through the optimization of decision support sys-
tems. The results showed that innovative and complex methods based on artificial intelligence
and machine learning were being used to improve music teaching. Hu et al. (2019) [23] stud-
ied artificial intelligence-assisted in-vehicle networks and proposed a strategy that integrated
communication, caching, and computing to achieve cost-effectiveness of in-vehicle networks.
In addition, the statistical test results show that the proposed model structure based on edge
computing and RNN has the highest accuracy of movement recognition. In summary, the pro-
posed edge cache strategy can optimize the structure of the system, and the proposed edge
server hardware structure can improve the network structure model. The model loss value of
the system is greatly reduced, and the accuracy of extraction and recognition of music and
dance movements is greatly improved.
5. Conclusion
This paper focuses on the problem of motion generation in music performances. The rhythm
of music is characterized to realize the interaction between music and dance actions. In addi-
tion, a new network model is constructed based on the attention mechanism and the long and
short-term RNN. Model training and prediction are carried out based on videos from the Music&Dance2019 dataset. The results indicate that the loss function result is the smallest, and the video effect
is the best when the number of layers of the LSTM module in the model is 3, the node value is
256, and the lookback value is 15. Besides, edge computing and long short-term RNNs are
integrated into the optimization of music performance and intelligent assistance systems. The
main contribution is to solve the problems of long access time and limited network bandwidth
during multi-user access through edge server architecture and system resource allocation.
Also, the efficiency of data collection is improved. This paper provides a reference for the optimal design of music teaching systems. The results indicate that the new
model can generate harmonious and rich performance action sequences based on ensuring
stability. However, the research content is still limited and needs further improvement. The
first is that the recognition of music beats is not ideal enough. The second is that the dance
moves are still too monotonous and not smooth. The third is that an extensive database has
not been established to achieve matching with music. The last point is that no suitable action character has been designed, and the dance is too simplistic and abstract. It is hoped that future research in this area can realize the generation of intelligent and humanized music performance actions.
Table 3. Loss function values of different cross-domain sequence models under different Lookback values.
Lookback value  CNN  AM  SA  The improved model
10 0.0026 0.0028 0.0029 0.0021
20 0.0027 0.0029 0.0031 0.0022
30 0.0028 0.0031 0.0033 0.0026
40 0.0031 0.0035 0.0037 0.0022
50 0.0049 0.0052 0.0059 0.0025
https://doi.org/10.1371/journal.pone.0285496.t003
Supporting information
S1 Data.
(ZIP)
Author Contributions
Conceptualization: Yi Wang.
Investigation: Yi Wang.
Supervision: Yi Wang.
Visualization: Yi Wang.
Writing – original draft: Yi Wang.
Writing – review & editing: Yi Wang.
References
1. Zahra S, Gong W, Khattak H A, et al. Cross-domain security and interoperability in internet of things[J].
IEEE Internet of Things Journal, 2021, 9(14): 11993–12000.
2. Wei J, Karuppiah M, Prathik A. College music education and teaching based on AI techniques[J]. Com-
puters and Electrical Engineering, 2022, 100: 107851.
3. Yang Y. Piano performance and music automatic notation algorithm teaching system based on artificial
intelligence[J]. Mobile Information Systems, 2021, 2021: 1–13.
4. Zhuang W, Wang C, Chai J, et al. Music2dance: Dancenet for music-driven dance generation[J]. ACM
Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2022, 18(2): 1–
21.
5. Srinivasu P N, SivaSai J G, Ijaz M F, et al. Classification of skin disease using deep learning neural net-
works with MobileNet V2 and LSTM[J]. Sensors, 2021, 21(8): 2852.
6. Wen X. Using deep learning approach and IoT architecture to build the intelligent music recommenda-
tion system[J]. Soft Computing, 2021, 25: 3087–3096.
7. Hibat-Allah M, Ganahl M, Hayward L E, et al. Recurrent neural network wave functions[J]. Physical
Review Research, 2020, 2(2): 023358.
8. Khan M A. HCRNNIDS: Hybrid convolutional recurrent neural network-based network intrusion detec-
tion system[J]. Processes, 2021, 9(5): 834.
9. Shen Z, Bao W, Huang D S. Recurrent neural network for predicting transcription factor binding sites[J].
Scientific reports, 2018, 8(1): 15270.
10. Wu Q, Ding K, Huang B. Approach for fault prognosis using recurrent neural network[J]. Journal of Intel-
ligent Manufacturing, 2020, 31: 1621–1633.
11. Chu Y, Fei J, Hou S. Adaptive global sliding-mode control for dynamic systems using double hidden
layer recurrent neural network structure[J]. IEEE transactions on neural networks and learning systems,
2019, 31(4): 1297–1309.
12. Muller A T, Hiss J A, Schneider G. Recurrent neural network model for constructive peptide design[J].
Journal of chemical information and modeling, 2018, 58(2): 472–479.
13. Salmela L, Tsipinakis N, Foi A, et al. Predicting ultrafast nonlinear dynamics in fibre optics with a recur-
rent neural network[J]. Nature machine intelligence, 2021, 3(4): 344–354.
14. Yu Y, Si X, Hu C, et al. A review of recurrent neural networks: LSTM cells and network architectures[J].
Neural computation, 2019, 31(7): 1235–1270.
15. Fan H, Jiang J, Zhang C, et al. Long-term prediction of chaotic systems with machine learning[J]. Physi-
cal Review Research, 2020, 2(1): 012080.
16. Tang X, Dai Y, Wang T, et al. Short-term power load forecasting based on multi-layer bidirectional
recurrent neural network[J]. IET Generation, Transmission & Distribution, 2019, 13(17): 3847–3854.
17. Yuan J, Abdel-Aty M, Gong Y, et al. Real-time crash risk prediction using long short-term memory recur-
rent neural network[J]. Transportation research record, 2019, 2673(4): 314–326.
18. Liu N, Han J. A deep spatial contextual long-term recurrent convolutional network for saliency detection
[J]. IEEE Transactions on Image Processing, 2018, 27(7): 3264–3274.
19. Chen K, Tan Z, Lei J, et al. Choreomaster: choreography-oriented music-driven dance synthesis[J].
ACM Transactions on Graphics (TOG), 2021, 40(4): 1–13.
20. Vulli A, Srinivasu P N, Sashank M S K, et al. Fine-tuned DenseNet-169 for breast cancer metastasis
prediction using FastAI and 1-cycle policy[J]. Sensors, 2022, 22(8): 2988.
21. Sun J. Research on resource allocation of vocal music teaching system based on mobile edge comput-
ing[J]. Computer Communications, 2020, 160: 342–350.
22. Hong Yun Z, Alshehri Y, Alnazzawi N, et al. A decision-support system for assessing the function of
machine learning and artificial intelligence in music education for network games[J]. Soft Computing,
2022, 26(20): 11063–11075.
23. Hu R Q, Hanzo L. Twin-timescale artificial intelligence aided mobility-aware edge caching and comput-
ing in vehicular networks[J]. IEEE Transactions on Vehicular Technology, 2019, 68(4): 3086–3099.