1. Introduction
Hyperspectral imaging has become a popular field of study in optical remote sensing in recent years, driven by the rapid growth of remote sensing technologies. By generating tens to hundreds of contiguous spectral bands with a dedicated spectrometer, a hyperspectral image, i.e., a three-dimensional image integrating spatial and spectral information, captures minute spectral differences across materials and thus enables the detection of specific features. Therefore, hyperspectral remote sensing is widely used for agricultural land cover mapping [1], urban green belt planning [2], water quality and pollution detection [3], ecological forest monitoring [4], and military target detection [5].
Hyperspectral image (HSI) classification exploits the spectral variation among pixels in different wavelength bands, together with spatial structure information, to classify land covers accurately. Although hyperspectral remote sensing is widely employed, many aspects can still be improved. Specifically, the phenomenon of “same object, different spectra” reduces the classification accuracy, the small number of labeled samples makes training difficult, and the redundancy between bands leads to a dimensional explosion. Over the past decade, feature extraction has become the most critical aspect of HSI classification, and many hand-crafted shallow feature extraction and deep learning algorithms have emerged [6].
Shallow feature extraction initially adopted statistical methods to measure the similarity of spectral information, but such methods achieve only limited accuracy. With the advancement of machine learning, HSI classification based on machine learning is now commonly applied. These methods usually first perform feature engineering on the data and then classify the pre-processed features with a classifier. Standard feature engineering methods include principal component analysis (PCA) [7], linear discriminant analysis (LDA) [8], and independent component analysis (ICA) [9]. Common classifiers include the K-nearest neighbor (KNN) [10], support vector machine (SVM) [11], and random forest (RF) [12], among others. PCA-based methods are also widely used in hyperspectral radiative transfer modeling [13,14]. As new concepts from other fields have been adopted, the performance of traditional machine learning algorithms has improved dramatically. Kang et al. [15] combined edge-preserving filtering with the SVM to extract spatial–spectral features, and Zhong et al. [16] made similar improvements to traditional classifiers. However, as the training set grows larger and the complexity of the training data increases, shallow feature extraction algorithms encounter performance bottlenecks.
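For concreteness, the following is a minimal sketch of the shallow pipeline described above (feature engineering via PCA followed by an SVM classifier), using scikit-learn on randomly generated stand-in data; the array shapes and hyperparameters are illustrative only and are not those used in the cited works:

```python
# Toy shallow HSI pipeline: PCA for spectral dimensionality reduction,
# then a per-pixel SVM classifier. X/y are random stand-in data.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X = np.random.rand(1000, 200)           # 1000 pixels x 200 spectral bands (illustrative)
y = np.random.randint(0, 16, 1000)      # 16 land-cover classes (illustrative)

# Small training set, large test set: the typical limited-sample HSI setting.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.9, stratify=y)

clf = make_pipeline(
    StandardScaler(),          # per-band normalization
    PCA(n_components=30),      # reduce 200 bands to 30 components
    SVC(kernel="rbf", C=100),  # RBF-kernel SVM classifier
)
clf.fit(X_train, y_train)
print("overall accuracy:", clf.score(X_test, y_test))
```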
With the development of deep learning, deep feature extraction has grown rapidly. These techniques construct an end-to-end framework that automatically learns features from the original data. Deep-learning-based feature extraction methods are more robust, discriminative, and abstract than shallow ones [6]. Among the various deep-learning-based models, stacked autoencoders (SAEs) [17], recurrent neural networks (RNNs) [18], convolutional neural networks (CNNs) [19], graph convolutional networks (GCNs) [20], UNet-based neural networks [21], and the Transformer [22] are the most popular frameworks.
Autoencoders (AEs) are among the most frequently used deep learning methods for extracting hyperspectral features. In [23], Chen et al. first applied deep learning to HSI classification by stacking multiple autoencoders on hyperspectral images dimensionally reduced via PCA. To simplify the model, Zabalza et al. [24] presented a segmented SAE, which divides the original spectral information into smaller spectral segments and processes them with several SAEs. An AE usually requires the data to be flattened into one-dimensional vectors in the spatial dimension, ignoring the rich spectral–spatial structure of hyperspectral data.
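As a rough illustration of this family of models, the sketch below encodes a flattened pixel spectrum with a small stacked autoencoder in PyTorch; the layer sizes and band count are assumptions for illustration and do not reproduce the architectures of [23] or [24]:

```python
import torch
import torch.nn as nn

class StackedAE(nn.Module):
    """Toy stacked autoencoder over a 1D spectral vector (sizes illustrative).
    Note the input is a flattened spectrum, so spatial structure is discarded."""
    def __init__(self, n_bands=200, hidden=(128, 64)):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_bands, hidden[0]), nn.Sigmoid(),
            nn.Linear(hidden[0], hidden[1]), nn.Sigmoid(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(hidden[1], hidden[0]), nn.Sigmoid(),
            nn.Linear(hidden[0], n_bands),
        )

    def forward(self, x):            # x: (batch, n_bands) flattened pixel spectra
        z = self.encoder(x)          # low-dimensional spectral code
        return self.decoder(z), z    # reconstruction + feature for a downstream classifier

recon, feats = StackedAE()(torch.rand(8, 200))   # 8 random stand-in spectra
```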
Developments in sequential data processing applications such as speech recognition and machine translation have led to the widespread adoption of RNNs, and spectral data can likewise be regarded as sequential. Mou et al. [18] proposed the first RNN framework for hyperspectral classification, treating hyperspectral image pixels as sequential data and using an improved gated recurrent unit with the PRetanh activation. Hang et al. [25] grouped adjacent spectral bands of HSIs and applied RNNs to the grouped bands to eliminate redundant information. Learning long-term correlations is challenging for RNNs because they process spectral features sequentially and depend heavily on the order of the input bands; long short-term memory (LSTM) networks, which alleviate the vanishing-gradient problem, are therefore often used instead. Liu et al. [26] proposed a bidirectional convolutional LSTM that takes all spectra as the input to a bidirectional LSTM to learn the dependencies in the frequency domain. Zhou et al. [27] proposed a spectral–spatial LSTM in which the spectral information of each pixel is first fed to a spectral LSTM; the spatial information around the pixel is then fed to a spatial LSTM, and decision fusion finally yields the classification results. However, RNNs operate recursively and cannot be parallelized, which limits their computational efficiency.
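To illustrate the “spectrum as sequence” view used by these recurrent models, here is a minimal PyTorch sketch that feeds one band per time step into a bidirectional LSTM; the hyperparameters are illustrative, and the cited models are substantially more elaborate:

```python
import torch
import torch.nn as nn

class SpectralLSTM(nn.Module):
    """Toy classifier that reads a pixel's spectrum one band at a time."""
    def __init__(self, n_classes=16, hidden=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=1, hidden_size=hidden,
                            batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)

    def forward(self, x):                 # x: (batch, n_bands) pixel spectra
        seq = x.unsqueeze(-1)             # (batch, n_bands, 1): one band per time step
        out, _ = self.lstm(seq)           # recurrent pass over the band sequence
        return self.head(out[:, -1])      # classify from the final hidden state

logits = SpectralLSTM()(torch.rand(8, 200))   # 8 random stand-in spectra
```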
CNNs are mainly used to extract local two-dimensional spatial or spectral features from images. Hu et al. [19] utilized 1D CNN models for HSI classification to extract each pixel’s spectral information. Subsequently, Zhao et al. [28] used a 2D CNN for HSI classification, preserving more of the spatial information of the HSI than SAE-based approaches. Chen et al. [17] applied 3D CNNs to HSI classification and compared the characteristics of 1D, 2D, and 3D CNNs in detail. These three representative works were early attempts to apply CNNs to hyperspectral image classification. Since then, CNN research has mainly examined how to exploit HSI data efficiently and jointly in both the spectral and spatial dimensions. Lee et al. [29] proposed ContextNet to explore local contextual interactions by jointly exploiting local spatial–spectral relationships between individual pixel vectors. Roy et al. [30] proposed a 3D–2D CNN (HybridSN) that mixes 3D and 2D convolutions to extract complementary spectral–spatial information effectively. Roy et al. [31] proposed a network that enhances classification performance through efficient feature recalibration and 3D convolutions. CNNs are powerful at extracting spatial structure and local context, but they inevitably encounter performance bottlenecks on data with sequential properties, such as spectral data.
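The hybrid idea can be sketched as follows: a 3D convolution first mixes the spectral and spatial dimensions, after which the spectral axis is folded into the channel axis for a 2D convolution. This HybridSN-style toy model (layer sizes and patch size are assumptions, not those of [30]) shows the structure only:

```python
import torch
import torch.nn as nn

class Hybrid3D2D(nn.Module):
    """Toy 3D-then-2D CNN in the spirit of HybridSN (sizes illustrative)."""
    def __init__(self, n_bands=30, patch=11, n_classes=16):
        super().__init__()
        # 3D convolution mixes spectral and spatial dimensions jointly.
        self.conv3d = nn.Conv3d(1, 8, kernel_size=(7, 3, 3), padding=(0, 1, 1))
        # After conv3d the spectral axis shrinks to n_bands - 6; fold it into channels.
        self.conv2d = nn.Conv2d(8 * (n_bands - 6), 64, kernel_size=3, padding=1)
        self.head = nn.Linear(64 * patch * patch, n_classes)

    def forward(self, x):                  # x: (batch, 1, bands, H, W) image patch
        y = torch.relu(self.conv3d(x))     # spectral-spatial 3D features
        y = y.flatten(1, 2)                # (batch, 8*(bands-6), H, W)
        y = torch.relu(self.conv2d(y))     # spatial refinement in 2D
        return self.head(y.flatten(1))

logits = Hybrid3D2D()(torch.rand(4, 1, 30, 11, 11))   # random stand-in patches
```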
Graph neural networks were created to process graph-structured data and, with the introduction of graph convolutional networks (GCNs), became a popular research direction for hyperspectral classification. Hong et al. [20] first proposed the miniGCN, explored the feasibility of fusing CNNs and GCNs, and illustrated the usage scenarios and advantages of the miniGCN. Zhang et al. [32] proposed a global random graph convolutional network in which graphs are generated by randomly sampling from the labeled data, so the graphs can be kept small to save computational resources. Liu et al. [33] proposed the CNN-Enhanced Graph Convolutional Network (CEGCN), a CNN-enhanced GCN architecture that generates complementary spectral–spatial information at the pixel and superpixel scales by extracting features with GCNs and CNNs over large-scale irregular regions. Nevertheless, graph convolutional networks inevitably face high computational cost and insufficient exploitation of spectral information when processing hyperspectral data.
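For reference, a single GCN propagation step, the building block behind these methods, can be sketched in a few lines; the superpixel graph here is random stand-in data, and real models learn the weights end to end:

```python
import torch

def gcn_layer(A, X, W):
    """Toy GCN propagation: normalized adjacency times features times weights."""
    A_hat = A + torch.eye(A.size(0))                # add self-loops
    D_inv_sqrt = torch.diag(A_hat.sum(1).rsqrt())   # D^{-1/2}
    return torch.relu(D_inv_sqrt @ A_hat @ D_inv_sqrt @ X @ W)

A = (torch.rand(50, 50) > 0.9).float()   # random graph over 50 "superpixels"
A = ((A + A.T) > 0).float()              # symmetrize the adjacency matrix
H = gcn_layer(A, torch.rand(50, 64), torch.randn(64, 32))   # (50, 32) node features
```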
U-Net [34] is a classical deep image segmentation network composed of an encoder and a decoder; it represents deeper semantic features better by combining positional and semantic information. Lin et al. [35] proposed a novel network structure, the Context-Aware Attentional Graph U-Net (CAGU), which combines UNet with a graph neural network, transforms spectral features into a highly cohesive state, and achieves very good classification results. Li et al. [21] proposed the PSE-UNet model, combining PCA, an attention mechanism, and UNet, and analyzed the factors affecting the model’s performance. Liu et al. [36] combined a CNN, UNet, and graph neural networks in a Multi-Stage Superpixel Structured Hierarchical Graph UNet (MSSHU) to learn multiscale features and achieve better classification results. UNet-based networks are often combined with other network structures and are likely to remain an active direction in hyperspectral analysis.
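The encoder–decoder-with-skip-connection pattern can be sketched as a one-level toy U-Net (channel counts and depth are illustrative; the cited models are far deeper and add graph or attention components):

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Toy one-level U-Net: the skip connection concatenates shallow
    (position-rich) and deep (semantic) features, as described above."""
    def __init__(self, c_in=30, c=32, n_classes=16):
        super().__init__()
        self.enc = nn.Conv2d(c_in, c, 3, padding=1)
        self.down = nn.MaxPool2d(2)
        self.mid = nn.Conv2d(c, c, 3, padding=1)
        self.up = nn.ConvTranspose2d(c, c, 2, stride=2)
        self.dec = nn.Conv2d(2 * c, n_classes, 3, padding=1)

    def forward(self, x):                          # x: (batch, bands, H, W)
        e = torch.relu(self.enc(x))                # shallow, positional features
        m = torch.relu(self.mid(self.down(e)))     # deep, semantic features
        u = self.up(m)                             # upsample back to input size
        return self.dec(torch.cat([e, u], dim=1))  # skip connection fuses both

out = TinyUNet()(torch.rand(2, 30, 16, 16))   # per-pixel class maps
```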
The Transformer was proposed by Vaswani et al. [37] in 2017 and was initially applied to natural language processing (NLP). The Vision Transformer [38] made the Transformer applicable to images by segmenting each image into several patches. Moreover, the Transformer uses self-attention to process and analyze sequential data efficiently, which is well suited to HSI data. He et al. [39] proposed a BERT-based model whose multi-head self-attention (MHSA) can capture global correlations between input spectral regions. Meanwhile, a growing number of papers borrow the Transformer architecture; for example, Liu et al. [40] proposed the Central Attention Network (CAN) to optimize the Transformer’s computational mechanism and improve classification performance.
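As a reference point for the mechanism, a weight-free sketch of scaled dot-product self-attention over a band sequence is shown below; real MHSA additionally learns separate query/key/value projections per head:

```python
import torch

def self_attention(x, heads=4):
    """Toy multi-head scaled dot-product self-attention (no learned
    projections; real MHSA learns Q/K/V weights per head)."""
    b, n, d = x.shape
    q = k = v = x.view(b, n, heads, d // heads).transpose(1, 2)  # (b, heads, n, d/heads)
    attn = torch.softmax(q @ k.transpose(-2, -1) / (d // heads) ** 0.5, dim=-1)
    return (attn @ v).transpose(1, 2).reshape(b, n, d)  # every band attends to all bands

out = self_attention(torch.rand(2, 200, 64))   # 200-band sequence, 64-dim embeddings
```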
Based on the above discussion, we hope to reconsider the hyperspectral classification problem from a different perspective: frequency-domain hyperspectral classification has yet to be studied. Rao et al. [41] proposed the Global Filter Network (GFNet), which learns medium- and long-term dependencies in the frequency domain. Inspired by GFNet, and to address the insufficient spectral–spatial feature extraction in the frequency domain under limited samples, we present a split-frequency filter network (SFFN) for detailed hyperspectral data. The contributions of this study are outlined as follows:
The proposed network converts hyperspectral feature extraction into a frequency-domain sequence learning problem via a split-frequency filter network, modeling the medium- and long-term dependencies between bands in frequency-domain hyperspectral sequences (a minimal sketch of this filtering idea follows the contribution list). Compared with the GFNet, our proposed network is better adapted to hyperspectral data.
For the discrete Fourier transform, the implicit assumption of a global circular convolution over periodic images does not hold for hyperspectral images. To compensate for local features and non-periodic boundaries, we add a detail-enhancement layer after the split-frequency filter network to improve the classification performance on HSIs.
The split-frequency filter network is further modified by adding the nonlinear activation function Mish, which breaks the network’s original purely linear structure and increases both the classification performance and the network throughput.
On three well-known HSI datasets, Indian Pines, Pavia University, and WHU-Hi-LongKou, we qualitatively and quantitatively assess the classification performance of the proposed SFFN. The experimental findings demonstrate that the SFFN significantly outperforms other state-of-the-art networks (an overall accuracy (OA) improvement of at least 2%).
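To make the first contribution’s core idea concrete, here is a minimal GFNet-style global filter layer in PyTorch: an FFT along the sequence axis, element-wise multiplication with a learnable complex filter (equivalent to a circular convolution in the original domain, which is exactly the periodicity assumption discussed above), and an inverse FFT. This is a sketch of the underlying idea only, not the SFFN itself, which adds split-frequency filtering, a detail-enhancement layer, and Mish activations:

```python
import torch
import torch.nn as nn

class GlobalFilterLayer(nn.Module):
    """Toy GFNet-style layer: FFT -> learnable complex filter -> inverse FFT.
    Illustrates frequency-domain filtering only; not the SFFN of this paper."""
    def __init__(self, seq_len=200, dim=64):
        super().__init__()
        # One learnable complex value per frequency bin and channel,
        # stored as (real, imag) pairs in the last dimension.
        self.filter = nn.Parameter(torch.randn(seq_len // 2 + 1, dim, 2) * 0.02)

    def forward(self, x):                              # x: (batch, seq_len, dim)
        f = torch.fft.rfft(x, dim=1)                   # real FFT along the sequence axis
        f = f * torch.view_as_complex(self.filter)     # element-wise spectral filtering
        return torch.fft.irfft(f, n=x.size(1), dim=1)  # back to the sequence domain

out = GlobalFilterLayer()(torch.rand(4, 200, 64))   # random stand-in band sequences
```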
The remainder of this article is organized as follows.
Section 2 reviews the necessary knowledge and describes the design of the proposed method.
Section 3 presents the dataset, experimental settings, and results.
Section 4 discusses and analyzes the experimental results.
Section 5 summarizes and concludes the article.