Audio Segmentation in AAC Domain for Content Analysis
Rong Zhu, Haojun Ai, Ruimin Hu
National Engineering Research Center for Multimedia Software
Wuhan University
Wuhan, China
zhurong@whu.edu.cn ai.haojun@gmail.com hrm1964@public.wh.hb.cn
Abstract—We focus on audio scene segmentation in the AAC domain for audio-based multimedia indexing and retrieval applications. In particular, an MFCC extraction method is proposed that adapts to the window switching of the AAC encoding process and is independent of the audio sampling frequency. We discuss how to fuse the MFCC features obtained from the different window types so as to keep the balance between frequency and temporal resolution. A series of experiments based on the probability distribution of the MFCC was carried out to test the effectiveness of the approach for audio scene segmentation. The experimental results show that this compression-domain approach can match the performance of a system based on PCM audio while the CPU load decreases dramatically, which is significant for real-time analysis of audio content.

Keywords-Audio Content Analysis; AAC; Compression Domain; MFCC
I. INTRODUCTION

Audio information often plays an essential role in understanding the semantic content of multimedia. Audio content analysis (ACA), i.e. the automatic extraction of semantic information from sounds, arose naturally from the need to manage growing collections of data efficiently and to enhance man-machine communication. ACA typically transforms the audio signal into a set of numerical measures, usually called low-level features to denote that they represent a low level of abstraction [1]. The starting point of audio content analysis for a general time-dependent audio signal is temporal segmentation, which divides the signal into different types of audio. Mel-frequency cepstral coefficients (MFCC) are often used because of their good discriminative capability for a broad range of audio classification tasks [2].
In our research we explore the possibility of working directly in the compression domain so that no decoding is needed, thus lowering the processing requirements. A number of approaches have been proposed for audio segmentation and summarization in the compressed domain [3][4][5]. Because MPEG Advanced Audio Coding (AAC) is widely used in broadcasting, movies, DTV and portable devices, we focus our research on how to identify speakers directly in the AAC domain.

The structure of this paper is as follows. In Section 2 we describe MFCC extraction in the AAC domain and model it for audio classification. The audio content analysis experiments and performance results are presented in Section 3. Conclusions follow in Section 4.

This work was supported in part by the NSFC under Grant 60832002, and the Sci. and Tech. Plan of Hubei Province under Grant 2007AA101C50.

II. FEATURE EXTRACTION AND MODEL

A. Structure of AAC Stream

A fundamental component of the AAC audio coding process is the conversion of the signal from the time domain to a time-frequency representation [6]. This conversion is done by a forward modified discrete cosine transform (MDCT). The MDCT is applied directly to the samples, without first dividing the audio signal into 32 subbands as in MP3 encoding. Two windowing modes are applied in order to achieve a better time/frequency resolution:

X_{i,k} = 2 \sum_{n=0}^{N-1} z_{i,n} \cos\left( \frac{2\pi}{N} (n + n_0) \left( k + \frac{1}{2} \right) \right), \quad 0 \le k < N/2    (1)

where:
z_{i,n} = windowed input sequence
n = sample index
k = spectral coefficient index
i = block index
N = window length of the transform window
n_0 = (N/2 + 1)/2
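As an illustration, the following numpy sketch evaluates Eq. (1) for one windowed block and applies it with the AAC sine window and 50% block overlap. The function and variable names are ours; this is a minimal sketch, not the reference implementation.

# Minimal sketch of the forward MDCT in Eq. (1); names are ours.
import numpy as np

def sine_window(N):
    """Sine window used by AAC for long (N=2048) and short (N=256) blocks."""
    n = np.arange(N)
    return np.sin(np.pi / N * (n + 0.5))

def mdct_block(z):
    """Forward MDCT of one windowed block z[0..N-1], per Eq. (1).

    Returns N/2 spectral coefficients X_k, 0 <= k < N/2.
    """
    N = len(z)
    n0 = (N / 2 + 1) / 2
    n = np.arange(N)
    k = np.arange(N // 2)
    # cos((2*pi/N) * (n + n0) * (k + 1/2)), shaped (N/2, N)
    basis = np.cos(2 * np.pi / N * np.outer(k + 0.5, n + n0))
    return 2.0 * basis @ z

if __name__ == "__main__":
    fs, N = 16000, 2048
    x = np.random.randn(3 * N)                 # stand-in for PCM audio
    w = sine_window(N)
    hops = range(0, len(x) - N + 1, N // 2)    # 50% overlap: advance by N/2
    spectra = np.array([mdct_block(w * x[h:h + N]) for h in hops])
    print(spectra.shape)                       # (num_blocks, N/2)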
In the encoder, the filterbank takes the appropriate block of time samples, modulates it by an appropriate window function, and performs the MDCT. Each block of input samples is overlapped by 50% with the immediately preceding block and with the following block. The transform input block length N can be set to either 2048 or 256 samples. Fig. 1 shows the sequence of blocks for the transition (D-E-F) to and from a frame employing the sine function window, i.e. the use of short windows during transient conditions.

Figure 1. Example of Block Switching During Transient Signal Conditions

We extend the MFCC algorithm from the Fourier domain to the MDCT domain. The triangular filter banks can be obtained by combining Equ. 2 and Equ. 3, where in Equ. 3 N is the frame length, f_s is the sampling frequency and k is the index of the MDCT spectrum:

k_{MDCT} = \left[ N \cdot \frac{f_{linear}}{f_s} \right]    (3)

Because N differs between the long and short transforms (2048 and 256 samples), two sets of filter banks are needed for an AAC audio stream.
Since the MFCC only considers the signal below 8 kHz, we process the MDCT spectrum below 8 kHz with 40 filter banks for all sampling frequencies.
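A sketch of the filter bank construction is given below. Since Equ. 2 is not reproduced here, the standard Mel mapping mel(f) = 2595 log10(1 + f/700) is assumed for it, and the bin mapping follows Equ. 3; all function names and the rounding choice are ours.

# Sketch: 40 triangular filters over MDCT bins, limited to 8 kHz.
import numpy as np

def mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def inv_mel(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mdct_filterbank(n_filters=40, N=2048, fs=16000, f_max=8000.0):
    """Triangular filters over the N/2 MDCT bins, restricted to f <= f_max."""
    n_bins = N // 2
    # n_filters + 2 edge frequencies, equally spaced on the Mel scale
    edges_hz = inv_mel(np.linspace(mel(0.0), mel(f_max), n_filters + 2))
    edges_k = np.round(N * edges_hz / fs).astype(int)      # Equ. (3)
    edges_k = np.clip(edges_k, 0, n_bins - 1)
    fb = np.zeros((n_filters, n_bins))
    for m in range(n_filters):
        lo, ctr, hi = edges_k[m], edges_k[m + 1], edges_k[m + 2]
        for k in range(lo, ctr):
            fb[m, k] = (k - lo) / max(ctr - lo, 1)          # rising edge
        for k in range(ctr, hi):
            fb[m, k] = (hi - k) / max(hi - ctr, 1)          # falling edge
    return fb

# Two filter banks, one per transform length used by AAC.
fb_long = mdct_filterbank(N=2048, fs=16000)
fb_short = mdct_filterbank(N=256, fs=16000)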
The spectra of the I short windows within one frame, S_i(j), can be combined per spectral bin j as

RMS_j = \frac{1}{I} \sum_{i=0}^{I-1} S_i(j)^2    (4)
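A literal numpy sketch of Eq. (4) follows; the array layout and names are ours. As printed, the equation is the mean of the squared spectra, so a true root mean square would additionally take the square root.

# Sketch of Eq. (4), combining the I short-window spectra of one frame.
import numpy as np

def combine_short_windows(S):
    """S: array of shape (I, n_bins), the I short-window spectra of a frame.

    Returns RMS_j = (1/I) * sum_i S_i(j)^2 for each bin j, as in Eq. (4).
    """
    I = S.shape[0]
    return np.sum(S ** 2, axis=0) / I

# Example: 8 short windows of 128 MDCT bins each (N = 256).
rms = combine_short_windows(np.random.randn(8, 128))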
• Because the short windows are sparse, we can simply discard the short-frame information, especially when the measure is based on probability characteristics.
• Another method is to fuse the long-frame and short-frame parameters while preserving their natural temporal sequence; there are 8 MFCC vectors for each short-window frame (a sketch of both strategies is given after this list).
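The following sketch illustrates the two strategies listed above on a frame-ordered stream of (window type, MFCC vector) pairs. The container layout and names are ours, not taken from the paper's implementation.

# Sketch: discarding vs. fusing short-window MFCC vectors.
import numpy as np

def discard_short(frames):
    """Strategy 1: keep only the long-window MFCC vectors."""
    return [m for (wtype, m) in frames if wtype == "long"]

def fuse_in_order(frames):
    """Strategy 2: keep long- and short-window MFCC vectors in their
    natural temporal order, so the fused stream preserves the sequence
    in which the windows occurred."""
    return [m for (_wtype, m) in frames]

# Example stream: long frames with one burst of 8 short-window MFCCs.
frames = [("long", np.random.randn(13)) for _ in range(5)]
frames += [("short", np.random.randn(13)) for _ in range(8)]
frames += [("long", np.random.randn(13)) for _ in range(5)]

X_long_only = np.array(discard_short(frames))   # shape (10, 13)
X_fused = np.array(fuse_in_order(frames))       # shape (18, 13)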
III. EXPERIMENTAL RESULTS

A. Data description and parameterization

In this section we present the evaluation results of the proposed speaker recognition algorithm. In our experiment, 16 kHz audio is first encoded into an AAC stream at 128 kbps by NeroAAC [8]. We built an MDCT spectrum extraction tool based on the FAAD open source decoder [9].
According to (1), the MFCC can be extracted frame by frame. In this paper the MFCC order is 13 and the number of filter banks is 40. Because the MFCC extraction is based on the audio signal below 8 kHz, the algorithm must adapt to different sampling frequencies.
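The per-frame computation can be sketched as follows, assuming a 40-filter triangular bank fb such as the one outlined in Section II. Taking the squared MDCT coefficients as filter bank input and building the DCT-II matrix directly are our choices for the sketch.

# Sketch: 13 MFCCs from one MDCT frame via 40 filter bank energies.
import numpy as np

def mfcc_from_mdct(spectrum, fb, order=13):
    """Filter bank energies -> log -> DCT-II, keeping `order` coefficients."""
    energies = fb @ (spectrum ** 2)                 # 40 filter bank energies
    log_e = np.log(energies + 1e-10)                # avoid log(0)
    n_filters = fb.shape[0]
    m = np.arange(n_filters)
    # DCT-II basis, rows 0..order-1 over the 40 log-energies
    dct = np.cos(np.pi / n_filters * np.outer(np.arange(order), m + 0.5))
    return dct @ log_e                              # 13 coefficients

# Example with random stand-ins for a long-block frame (1024 MDCT bins).
fb = np.abs(np.random.randn(40, 1024))              # placeholder filter bank
mfcc = mfcc_from_mdct(np.random.randn(1024), fb)
print(mfcc.shape)                                    # (13,)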
Many audio segmentation approaches are based on a probability model of the MFCC. The two window types occur alternately in natural audio. Generally the probability of the short window is far lower than that of the long window, although this depends on the audio signal and on the implementation of the codec. To study the fusion of MFCC from different window lengths, the BIC model is used for the segmentation tasks.
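As an illustration of BIC-based change detection over MFCC sequences, the following sketch computes a single-change-point Delta-BIC curve. The penalty weight and the margin are our choices for the sketch and not necessarily the configuration used in the experiments.

# Sketch: Delta-BIC curve for one candidate change point per index.
import numpy as np

def logdet_cov(X):
    """log-determinant of the sample covariance of the rows of X."""
    sign, logdet = np.linalg.slogdet(np.cov(X, rowvar=False))
    return logdet

def delta_bic_curve(X, lam=1.0, margin=15):
    """Delta-BIC for every candidate change point i in the MFCC sequence X
    (shape (n, d)); the maximum of the curve is the hypothesized change."""
    n, d = X.shape
    penalty = 0.5 * lam * (d + 0.5 * d * (d + 1)) * np.log(n)
    full = 0.5 * n * logdet_cov(X)
    curve = np.full(n, -np.inf)
    for i in range(margin, n - margin):          # keep enough samples per side
        left = 0.5 * i * logdet_cov(X[:i])
        right = 0.5 * (n - i) * logdet_cov(X[i:])
        curve[i] = full - left - right - penalty
    return curve

# Example: two speakers simulated by a mean shift in 13-dim MFCC vectors.
X = np.vstack([np.random.randn(80, 13), np.random.randn(70, 13) + 2.0])
curve = delta_bic_curve(X)
print(int(np.argmax(curve)))                     # estimated change point near 80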
Figure 2. The audio PCM and the corresponding window type (Long Window / Short Window) over time (samples)

Figure 3. The Window Type of the audio from two speakers

In Fig. 4, MFCC vectors from the short-window mode and the long-window mode are combined in the BIC model; in Fig. 5, only the MFCC vectors of the long-window mode are used. The speaker changing point is the maximum of the BIC value in Fig. 4, while in Fig. 5 two BIC peaks were produced.

Figure 4. The BIC curve via the fused short- and long-window MFCC frames (x-axis: frames; the change point is marked)

Figure 5. The BIC curve via only the long-window frames

The experimental results indicate that the MFCC fusion is effective for both types of task. In the audio scene segmentation task, the BIC curve increases more quickly with the fused MFCC than with only the long-window parameters.
On the other hand, extracting the MFCC directly from the MDCT spectrum decreases the complexity by avoiding the IMDCT and the FFT. In general, visual-based analysis involves much more computation than audio-based analysis, so using the compressed-domain audio information alone can often provide a good initial solution for further examination based on visual information.

C. Video analysis based on audio scene

A video from YouTube is used to test the scene segmentation ability. YouTube provides a high-quality and a low-quality version of each video; the high-quality stream is used here. The video is encoded with H.264 at a bitrate of 422 kbps and the audio with AAC at 128 kbps. The clip length is 637 seconds.
In the test, all ten audio scene changing points are detected correctly. In Fig. 6, the keyframes corresponding to the changing points are shown.

Figure 6. The video retrieval via audio content analysis

IV. CONCLUSIONS

The primary goal of this study is to implement audio scene segmentation in the AAC domain. Towards this goal, we have investigated an algorithm that extends MFCC extraction from the Fourier spectrum to the MDCT domain. An MFCC extraction algorithm is developed that adapts to the window switching and is independent of the audio sampling frequency.

REFERENCES
[1] S. Kiranyaz, A. F. Qureshi, and M. Gabbouj, "A generic audio classification and segmentation approach for multimedia indexing and retrieval," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 1062-1081, May 2006.
[2] H.-G. Kim and T. Sikora, "Comparison of MPEG-7 audio spectrum projection features and MFCC applied to speaker recognition, sound classification and audio segmentation," in Proc. IEEE ICASSP, 2004, vol. 5, pp. V-925-928.
[3] X. Shao, C. Xu, Y. Wang, and M. S. Kankanhalli, "Automatic music summarization in compressed domain," in Proc. IEEE ICASSP, Montreal, Canada, 2004.
[4] Y. Nakajima, Y. Lu, M. Sugano, A. Yoneyama, H. Yamagihara, and A. Kurematsu, "A fast audio classification from MPEG coded data," in Proc. IEEE ICASSP, 1999, pp. 3005-3008.
[5] Y. Jiao, M. Li, B. Yang, and X. Niu, "Compressed domain robust hashing for AAC audio," in Proc. IEEE International Conference on Multimedia and Expo (ICME), June 2008, pp. 1545-1548.
[6] Coding of Moving Pictures and Audio, IS 13818-7 (MPEG-2 Advanced Audio Coding, AAC), ISO/IEC JTC1/SC29/WG11/N7126, 2005.
[7] B. Logan, "Mel frequency cepstral coefficients for music modeling," in Proc. Int. Conf. on Music Information Retrieval (ISMIR), Plymouth, Massachusetts, 2000.
[8] Nero AAC Codec 1.3.3.0, http://www.nero.com/
[9] FAAD, http://www.audiocoding.com/faad2.html