
Artificial Intelligence In Medicine 107 (2020) 101924


Detection of early stages of Alzheimer's disease based on MEG activity with a randomized convolutional neural network
Manuel Lopez-Martin a,*, Angel Nevado b,c, Belen Carro a

a Dpto. TSyCeIT, ETSIT, Universidad de Valladolid, Paseo de Belén 15, Valladolid, 47011, Spain
b Laboratory for Cognitive and Computational Neuroscience, Center for Biomedical Technology, Campus Montegancedo, Pozuelo de Alarcón, Madrid, 28223, Spain
c Department of Experimental Psychology, Complutense University of Madrid, Spain

A R T I C L E   I N F O

Keywords:
Alzheimer's disease detection
Deep learning
Convolutional neural network
Ensemble model
Magnetoencephalography

A B S T R A C T

The early detection of Alzheimer's disease can potentially make eventual treatments more effective. This work presents a deep learning model to detect early symptoms of Alzheimer's disease using synchronization measures obtained with magnetoencephalography. The proposed model is a novel deep learning architecture based on an ensemble of randomized blocks formed by a sequence of 2D-convolutional, batch-normalization and pooling layers. An important challenge is to avoid overfitting, as the number of features is very high (25755) compared to the number of samples (132 patients). To address this issue the model uses an ensemble of identical sub-models all sharing weights, with a final stage that performs an average across sub-models. To facilitate the exploration of the feature space, each sub-model receives a random permutation of features. The features correspond to magnetic signals reflecting neural activity and are arranged in a matrix structure interpreted as a 2D image that is processed by 2D convolutional networks.

The proposed detection model is a binary classifier (disease/non-disease) which, compared to other deep learning architectures and classic machine learning classifiers such as random forest and support vector machine, obtains the best classification performance results with an average F1-score of 0.92. To perform the comparison a strict validation procedure is proposed, and a thorough study of results is provided.

1. Introduction

The early detection of Alzheimer's disease (AD) is a critical medical and social challenge. There have been many approaches to this challenge using machine learning (ML) [1], including current advancements with deep learning [2]. In this paper we present a novel deep learning model for early detection of AD using magnetoencephalography (MEG) signals obtained from patients with Mild Cognitive Impairment (MCI). MCI [3] is a stage between the expected cognitive decline of healthy aging and dementia. Patients with MCI are at an increased risk of developing Alzheimer's Disease [4].

Magnetoencephalography (MEG) has been shown to have good sensitivity to detect AD in its early stages [5]. Previous work has applied Machine Learning (ML) to MEG data for this purpose [1,6,7]. ML models have also been applied to detect AD using other neuroimaging signals such as functional magnetic resonance imaging (fMRI) [8,9], magnetic resonance imaging (MRI) [2] and electroencephalography (EEG) [10–12]. Similarly, neuroimaging data have been analyzed with machine learning techniques to investigate a large variety of brain conditions and processes including rapid eye movement disorders [11], speech tasks classification [13], Parkinson's disease [14], epilepsy [15], spinal cord injury [15], scene representations [16] and ocular and cardiac artifacts detection [17]. The ML models that have been employed in this context are: (1) classic models - Random Forest, Bayesian Network, Decision trees, K-Nearest Neighbors, Logistic Regression, Support Vector Machines and Linear discriminant analysis - [6,7], (2) deep learning models - Convolutional neural networks (CNN) [2,9,11,14,15], Recurrent neural networks (RNN) [10] and custom architectures [8,13,17,18] - and (3) Bayesian ML models [12].

In this work we propose a novel architecture for the detection of Mild Cognitive Impairment (MCI). The architecture is based on an ensemble of randomized learning blocks, each consisting of a sequence of 2D-convolutional layers [19], batch-normalization layers [20] and max-pooling layers [21]. Batch-normalization layers facilitate convergence during training and provide a regularization effect that helps prevent overfitting. Max-pooling layers reduce the number of parameters (weights) of the network, which reduces the possibility of overfitting.


Corresponding author.
E-mail address: mlopezm@acm.org (M. Lopez-Martin).

https://doi.org/10.1016/j.artmed.2020.101924
Received 16 January 2020; Received in revised form 27 April 2020; Accepted 1 July 2020
Available online 02 July 2020
0933-3657/ © 2020 Elsevier B.V. All rights reserved.

A randomization is applied to the input of the different blocks, allowing for maximum exploration of the interaction between features, which is important considering the local-based nature of the convolutional networks. A weight sharing strategy between blocks is used to keep the complexity of the final model low, given the small number of training samples. To compare the detection performance of the proposed model we compare it with several classic ML models: Logistic Regression (LR), Naïve Bayes (NB), Gradient Boosting Machine (GBM), Random Forest (RF), Gaussian Process Classifier (GP Classifier) and Support Vector Machine (SVM), as well as more advanced deep learning models based on convolutional neural networks (CNN) [19], to classify individuals as healthy aging subjects or MCI patients.

The features used to train the different models are based on a measure of statistical dependence, mutual information, between the MEG signals of pairs of sensors, with a total of 102 sensors and 5151 pairs of sensors per subject. After post-processing the signals, we obtain five final features for each pair of sensors. The total number of individuals for the experiment is 132 (59 % with MCI), and the total number of features, after post-processing of the MEG signals, is 25755 per individual.

The high ratio of number of features, 25755, to number of samples, 132, is associated with a high risk of overfitting. We have, therefore, taken special care to guarantee the generalization of results. Deep learning (DL) models are known to need large datasets with a large number of samples. We have circumvented this by using Deep Learning architectures, based on Convolutional Neural Networks, with a very limited number of parameters, by carefully selecting the filter shapes and max-pooling layers and by employing weight-sharing for the more complex architectures.

An exhaustive comparison of the results is provided for all models. Special care has been taken in the selection of classification performance metrics and the test strategy, which is based on a nested cross-validation approach. Considering the small number of samples, it is especially important to report the distribution of the performance metrics. For this reason, the mean, standard deviation, maximum and minimum values and the 25 %, 50 % (median) and 75 % quartiles are provided for all performance metrics and all models. The selected classification performance metrics are accuracy, F1-score, precision and recall. These four metrics provide a comprehensive description of the different aspects and objectives of a classifier. We need to mention the importance of the recall metric for this particular classification problem, since it is essential to avoid false negatives (subjects for which MCI is not detected).

To apply a 2D convolutional network to a one-dimensional vector of features it is necessary to reshape the features into a matrix structure (pseudo-image) with an arbitrary allocation of features to positions within the new matrix. This approach of transforming one-dimensional features into a two-dimensional structure, to be used by a 2D-CNN, has already been applied in other areas with excellent results [22,23]. In this work, we extend this approach by creating an ensemble of CNN models, each with different inputs formed by matrices created by randomly permuting the original features (from the one-dimensional vector of features) to different positions within each matrix. The output of each CNN model is averaged, forming the output of the ensemble. The ensemble has an additional regularization effect that is very helpful in addressing this classification problem.

To understand why the randomized 2D-CNN model yields good results we have to take into account that we are dealing with high-dimensional noisy signals of which only a small number of samples are available. This creates a scenario prone to overfitting. The use of stochastic methods to combat overfitting, as considered in many previous works [24–26], was the basis to apply randomization to the input part of the network [26] instead of the intermediate [25] or final [24] parts. In addition, the fact that CNN models exploit spatial local correlations was a motivation to introduce the random permutation of features, with the goal of increasing the exploration of combinations of variables that were not close to each other in their original locations.

The contributions of this work are:

- To introduce a novel deep learning architecture based on Convolutional Neural Networks for the diagnosis of MCI, a prior stage of Alzheimer's disease.
- To compare the proposed model with other CNN architectures and classic ML models to detect MCI using a large collection of features obtained from MEG signals.
- To handle noisy and scarce training data with a 2D-CNN ensemble model that averages a collection of sub-models based on the stochastic permutation of the input.
- To build a model with excellent classification results in a challenging scenario (scarce samples with a large feature dimensionality).

The paper is organized as follows: Section 2 summarizes previous works. Section 3 describes the dataset and models employed. Section 4 details the results and Section 5 presents the conclusions.

2. Related works

The approach adopted in this work, which consists in the transformation of MEG signals into images to subsequently apply CNN models (which are especially suitable to handle images), has already been explored in other works investigating brain function. In [15] a CNN to classify patients with epilepsy and spinal cord injury was proposed. They calculated relative power spectra from MEG signals. The activity was presented in a channel vs. time matrix and treated as an image by a CNN network. A similar approach was adopted in [13], this time using the raw data from time-series recordings, with a previous feature extraction based on Independent Component Analysis (ICA). The objective of this last work was to predict the age of children based on various speech tasks. This approach is different from using anatomical brain images to identify brain conditions, such as using MRI images [2]. The identification of a data matrix as an image to allow the subsequent application of a CNN has also been explored in other fields, such as network intrusion detection and multimedia quality of experience prediction [22,23], providing, in both cases, an excellent improvement in classification results.

The approach taken in [27] employs a CNN architecture that is used as a feature extraction tool in conjunction with FreeSurfer [27]. The work proposes an MCI detector using magnetic resonance imaging (MRI) data. Features extracted from MRI are further processed with PCA (dimensionality reduction) and Lasso (feature selection). The final processed features are used by an Extreme Learning Machine for MCI classification.

Randomization of the input features has also been applied to AD detection in an ensemble configuration of neural networks (NN), with the NNs trained with randomly selected samples and features obtained after calculating the Pearson correlation coefficient between the fMRI time-series of pairs of voxels [8]. The randomization of the input features or of the intermediate network representations has also been applied in other areas. In [28] an image classifier with a low computational infrastructure was proposed. This classifier employed group convolutions followed by a random combination of intermediate outputs. A different image classifier was presented in [29], where a CNN network was replaced by a series of linear classifiers that were fed by small portions of the image with a final aggregation (averaging) per class. In this case, there was no randomization of the input. Instead, input segregation was employed, followed by separate classifiers with a final aggregation of results. Still in the field of image classification, [30] proposed a simple architecture based on the application of several fixed random perturbations to the original image with a final linear combination of the perturbed intermediate images after a rectified linear unit (ReLU) activation function. Surprisingly, the resulting architecture produces similar classification results to a CNN network.


In [31], an input shifting is performed within the CNN kernels rather than the inputs. The work in [32] points out the effect of the locality property for images and how, when this property is lost (by randomly mixing pixels), the usual performance of classic CNNs is greatly affected, while other CNN architectures (such as dilated convolutions) can provide much better results by integrating non-adjacent pixels into the same receptive field, which is also the purpose, by other means, of randomizing the input positions. Applying a series of dilated convolutions with different dilation rates provides an effect similar to input randomization, in a more deterministic way and also requiring a deeper network.

The work presented here differs from the works mentioned above in that it also employs randomization of the input, but as a means to create an extended dataset by permuting the inputs of the original dataset (treated as a pseudo-image), producing several shuffled pseudo-images that are handled by separate networks with a final aggregated result. This ensemble-like architecture produces better results than individual CNN networks, and is more robust against overfitting, which is a real problem considering the small number of samples and the large number of features in this type of AD detection tasks.

3. Methods description

In this Section we provide details of the dataset used to carry out the experiments, the architecture proposed to detect Mild Cognitive Impairment and the validation/test strategy applied. The employed dataset and the proposed model are presented in Sections 3.1 and 3.2 respectively. The validation/test strategy is provided in Section 3.3.

3.1. Selected dataset

Data was acquired in the Centre for Biomedical Technology (Madrid, Spain). Resting-state data was collected from 78 Mild Cognitive Impairment (MCI) patients and 54 healthy aging participants. The demographic data is provided in Table 1.

MCI patients fulfilled the following criteria [33]: (i) memory complaint, corroborated by an informant, (ii) objective memory impairment for age, (iii) relatively preserved general cognition for age, (iv) essentially intact activities of daily living, and (v) not meeting the criteria for dementia. Exclusion criteria were having a neurological disease, a psychiatric disorder, or any severe illness. The ethics committee at the MEG center approved the study and all participants gave written informed consent.

Participants sat in a 306-channel Elekta MEG system (Elekta Oy, Helsinki, Finland) during 4 min with eyes closed. The sampling rate was 1000 Hz. A temporal signal space separation algorithm (tSSS) was applied to reduce environmental and biological noise components [34]. MEG data preprocessing was performed with FieldTrip [35]. Recordings were band-pass filtered between 2 and 45 Hz. From the available sensors, the analysis focused on the 102 magnetometers. The time-series were subsequently segmented into 2 s epochs. Ocular, muscular and jump artifacts were automatically detected with FieldTrip and the corresponding segments discarded from the analysis.

The statistical dependencies between the signal time-series x(t) from sensor i and y(t) from sensor j were measured with mutual information:

I(X; Y) = \sum_{y \in Y} \sum_{x \in X} p_{(X,Y)}(x, y) \log\left( \frac{p_{(X,Y)}(x, y)}{p_X(x)\, p_Y(y)} \right)    (1)

where p_{(X,Y)} is the joint probability of x(t) and y(t), where value pairs were obtained from the same time instant t, and p_X and p_Y are the marginal probabilities of x(t) and y(t). Prior to the calculation of the probability distributions the time series were discretized into 15 bins to yield the value ranges X and Y. Subsequently, a normalized mutual information value was obtained:

I_{norm}(X; Y) = \frac{I(X; Y)}{\sqrt{H(X; X)\, H(Y; Y)}}    (2)

where H(X; X) and H(Y; Y) are the entropies of X and Y respectively:

H(Z; Z) = -\sum_{z \in Z} p_Z(z) \log(p_Z(z))    (3)

A value of normalized mutual information was obtained for each participant, pair of sensors and epoch. Next, five summary values were derived by collapsing values across epochs: mean, median, standard deviation, mean absolute deviation and range (maximum - minimum). The total number of features (25755) corresponds to the number of ways to select 2 sensors out of 102 sensors multiplied by the 5 different summary statistics of the mutual information per pair of MEG signals (with one MEG signal per sensor). Feature values are continuous in the range [0–1]. Fig. 1 shows a diagram of the process followed to obtain the features for each patient.
experiments, the architecture proposed to detect Mild Cognitive
features for each patient.
Impairment and the validation/test strategy applied. The employed
Considering the small number of samples and the large number of
dataset and the proposed model are presented on Sections 3.1 and 3.2
features it was necessary to perform feature selection with the aim to
respectively. The validation/test strategy is provided in Section 3.3.
achieve an ideal ratio of number of samples to number of features of 10.
To obtain the final set of features we applied three features selection
3.1. Selected dataset techniques to the dataset: (1) Random Forest variable importance [36],
(2) selection based on iterative feature elimination after comparing
Data was acquired in the Centre for Biomedical Technology, with random features created by shuffling the original features (Boruta)
(Madrid, Spain). Resting-state data was collected from 78 Mild [37] and (3) recursive feature elimination (RFE) (caret) [36]. All these
Cognitive Impairment (MCI) patients and 54 healthy aging participants. methods provide a ranking of features from highest to lowest im-
The demographic data is provided in Table 1. portance. The features resulting from the intersection of the feature sets
MCI patients fulfilled the following criteria [33]: (i) memory com- provided by these three selection techniques were arranged into two
plaint, corroborated by an informant, (ii) objective memory impairment final set of features: one set with 870 features and other with 20 fea-
for age, (iii) relatively preserved general cognition for age, (iv) essen- tures (the best ranked features in the intersection set). These feature
tially intact activities of daily living, and (v) not meeting the criteria for sets 1 and 2 (FS1 and FS2) corresponding to the previously defined 20
dementia. Exclusion criteria were having a neurological disease, a and 870 features, respectively, were then used by the different algo-
psychiatric disorder, or any severe illness. The ethics committee at the rithms presented in this work. Which feature set was used in each al-
MEG center approved the study and all participants gave written in- gorithm is specified in Section 4 (Results).
formed consent. The aforementioned scheme for performing feature selection at-
Participants sat in a 306-channel Elekta MEG system (Elekta Oy, tempts to avoid the selection bias problem [38].This problem happens
Helsinki, Finland) during 4 min with eyes closed. The sampling rate was when performing feature selection with a reduced number of samples
1000 Hz. A temporal signal space separation algorithm (tSSS) was ap- and a large number of features. This problem is due to random corre-
plied to reduce environmental and biological noise components [34]. lations between the features and the outcomes, even for unimportant
MEG data preprocessing was performed with FieldTrip [35]. Record- features. The intersection between the best sets of three different se-
ings were band-pass filtered between 2 and 45 Hz. From the available lection techniques is intended to avoid this problem. In addition, two of
sensors, the analysis focused on the 102 magnetometers. The time- the R packages used for feature selection: caret and Boruta [36,37],
series were subsequently segmented into 2 s epochs. Ocular, muscular both take into account the selection bias problem. In caret [36], the
and jump artifacts were automatically detected with Fieldtrip and the feature selection method incorporates the resampling solution proposed
corresponding segments discarded from the analysis. in [38] using cross validation within each iteration of the feature se-
The statistical dependencies between the signal time-series x (t ) lection process. Boruta [37] implements a sophisticated selection
from sensor i and y (t ) from sensor j were measured with mutual in- technique based on comparing each feature with a synthetic shadow
formation: feature made by shuffling the values of the original feature.
p (x , y ) ⎞ The number of 20 features in FS1, was selected after some ad hoc
I (X ; Y ) = ∑ ∑ p(X ,Y ) (x , y) log ⎜⎛ p (X(,xY)) p ⎟ tests, as a good compromise between a reduced number of features and
y∈Y x∈X ⎝ X Y (y ) ⎠ (1) the classification performance obtained. Similarly, a number of 870 was
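The selection itself was performed with the R packages caret and Boruta. Purely as an illustration of the intersection idea, a rough scikit-learn analogue (our own sketch, with hypothetical names and without the resampling safeguards built into caret) could look like this:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

def intersect_selection(X, y, k=2000, seed=0):
    """X: (n_subjects, n_features), y: binary labels. Returns candidate feature indices."""
    rng = np.random.default_rng(seed)

    # (1) Random Forest variable importance.
    rf = RandomForestClassifier(n_estimators=500, random_state=seed).fit(X, y)
    rf_set = set(np.argsort(rf.feature_importances_)[::-1][:k])

    # (2) Boruta-like shadow test: a feature survives if its importance beats the best
    #     importance reached by column-shuffled (shadow) copies of the features.
    rf_sh = RandomForestClassifier(n_estimators=500, random_state=seed)
    rf_sh.fit(np.hstack([X, rng.permuted(X, axis=0)]), y)
    real, shadow = np.split(rf_sh.feature_importances_, 2)
    boruta_set = set(np.where(real > shadow.max())[0])

    # (3) Recursive feature elimination.
    rfe = RFE(RandomForestClassifier(n_estimators=100, random_state=seed),
              n_features_to_select=k, step=0.1).fit(X, y)
    rfe_set = set(np.where(rfe.support_)[0])

    return sorted(rf_set & boruta_set & rfe_set)     # intersection of the three rankings
```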

Table 1
Mean ± standard deviation and range (between brackets) of the age and Mini Mental State Exam (MMSE) for the two diagnostic groups, and proportion of males and females.

Diagnostic Group               Age                      MMSE                     Sex (% Female)
Healthy Aging Participants     69.95 ± 4.40 [61–80]     29.33 ± 0.79 [27–30]     72.73
MCI Patients                   74.02 ± 6.40 [58–87]     27.61 ± 2.17 [22–30]     51.16


Fig. 1. Diagram of the generation of features associated with a generic patient (patient k). Each pair of patient sensors produces five features. The diagram shows two
of these pairs. The sensor layout on the left was created with Fieldtrip [35].

The number of 20 features in FS1 was selected after some ad hoc tests, as a good compromise between a reduced number of features and the classification performance obtained. Similarly, a number of 870 was chosen for FS2 as it can be factored into two similar numbers, namely 29 and 30, allowing the construction of a square-shaped matrix as input for the 2D-CNN models.

3.2. Model description

The objective is to explore and compare several CNN architectures to detect Mild Cognitive Impairment using a large collection of features obtained from MEG signals while patients are at rest. We investigated the following CNN architectures:

- 1-D CNN: A classic 1-D CNN [39] applied to the one-dimensional vector of features derived from the MEG signals.
- 1-D CNN + Gaussian Process Classifier: Similar to the previous architecture with a final layer formed by a Gaussian Process Classifier [40].
- 2-D CNN: A classic 2-D CNN [19] applied to synthetic images formed by an arbitrary arrangement of the features obtained from the MEG signals.
- Randomized 2-D model: This is the novel architecture proposed in this work, achieving the best classification results. It consists of an ensemble of 2D-CNN networks with a permutation of the input values for each network.

The dataset used to train the models was FS2 with 870 features for the 2-D CNNs and FS1 with 20 features for the 1D models. Considering the small number of samples and the large number of features, the application of a CNN poses difficulties, since it typically requires many samples. The number of parameters (weights) of a CNN is usually large, which implies that many samples are necessary to tune the parameters and avoid overfitting. Therefore, special care has been taken to create CNN architectures with a small number of weights and to ensure the generalization of the results to the greatest extent possible.

Starting with the 1-D CNN [39] architecture, Fig. 2 presents the 1-D CNN model. It is a simple model with just one 1-D CNN layer (5 filters, kernel size of 5 and stride of 5) followed by two fully connected layers (FCL) with 5 and 2 nodes respectively. Between the CNN and FCL layers it is necessary to perform a vector flattening of the output tensor of the CNN layer, generating a vector of 20 elements that is the input to the FCL layers. The activation function of the layers is ReLU, with a final softmax activation for the last layer and cross entropy as the loss function. The last layer has a dropout [24] with a drop rate of 0.5. The total number of trainable weights is 147, with a batch size of 10 and 80 epochs of training with an early stopping of 30 epochs. The 1-D CNN model shown in Fig. 2 is the one that presents the best results after testing several variants that included additional 1-D CNN layers, batch-normalization and max-pooling layers. The 1-D CNN model has 147 trainable parameters (weights).

The training samples used for the 1-D CNN models are vectors of 20 features obtained after a feature selection process (Section 3.1) from the 25755 original features. This model is applied to a 1-D vector of features. The reason for using the FS1 dataset (20 features) is the better results obtained compared to the FS2 dataset (870 features) when we apply the 1D-CNN model.
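For reference, this 1-D CNN can be rendered almost literally in Keras; the sketch below is our own illustration (the optimizer and the exact dropout placement are assumptions, as they are not fully specified above) and it reproduces the reported count of 147 trainable weights:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

model_1d = models.Sequential([
    layers.Input(shape=(20, 1)),                                      # the 20 FS1 features, one channel
    layers.Conv1D(5, kernel_size=5, strides=5, activation="relu"),    # 30 weights
    layers.Flatten(),                                                 # 4 positions x 5 filters = 20 values
    layers.Dense(5, activation="relu"),                               # 105 weights
    layers.Dropout(0.5),                                              # "drop rate of 0.5" (placement assumed)
    layers.Dense(2, activation="softmax"),                            # 12 weights -> 147 in total
])
model_1d.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```

Training would then use a batch size of 10 and up to 80 epochs with early stopping, as stated above.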
The model in Fig. 3 is similar to the model in Fig. 2 with the addition of a Gaussian Process Classifier (GPC) [40] at the output of the last layer of the 1-D CNN model. A GPC is a non-parametric Bayesian model, well known for its suitability to classify with a small number of features. Nevertheless, this model did not produce better results than the simpler 1-D CNN model. To train the 1-D CNN + GPC model, we used a two-step process, with a previous training phase of the 1-D CNN model and a subsequent and independent training phase of the GPC model, using as inputs the outputs from the last layer of the already trained 1-D CNN model. The two-phase training could be one of the reasons for the lower performance of this model compared to the 1-D CNN model, since we did not perform an end-to-end training of the complete model. The kernel used for the GPC was an RBF kernel. The 1-D CNN + GPC model has 140 trainable parameters (weights).

The model in Fig. 4 corresponds to a 2-D CNN [19]. In this case the 1-D feature vector of 870 features (Section 3.1) was reshaped into a matrix structure (pseudo-image) of dimension 30 × 29. The arrangement of the features in a pseudo-image allows applying a 2-D CNN as shown in Fig. 4. Several architectures were tested. The architecture presented in Fig. 4 is the one that provides the best results for this type of model, after testing several variants that included additional 2-D CNN layers, batch-normalization and max-pooling layers. This architecture is formed by three 2-D CNN layers with intermixed batch-normalization [20] and max-pooling [21] layers. The last model components are 2 FCL (Fig. 4) with 5 and 2 nodes respectively.


Fig. 2. 1-D CNN model.

The configuration of the three 2D-CNN layers is, in order, as follows: first (2 filters, kernel size of 3 and stride of 2), second (3 filters, kernel size of 3 and stride of 1) and third (4 filters, kernel size of 2 and stride of 1). Between the CNN and FCL layers it is necessary to perform a vector flattening of the output tensor of the last CNN layer, generating a vector of 16 elements, which is the input to the FCL layers. The 2 max-pooling layers have a pool size of 2. The activation function of the layers is ReLU, with a final softmax activation for the last layer and cross entropy as the loss function. The last layer has a dropout [24] with a drop rate of 0.5. The total number of trainable weights is 236, with a batch size of 10 and 80 epochs of training, with an early stopping of 30 epochs. The 2-D CNN model has 236 trainable parameters (weights) and 10 non-trainable parameters, associated to the batch-normalization layers.

The last model is presented in Fig. 5. This is the model that showed the best classification results (Section 4). This model is formed by a set of k sub-models, each identical to the model presented in Fig. 4. The input to each of the k sub-models is formed by the same pseudo-image matrix structure used for the model in Fig. 4, with an additional previous random permutation of the columns (features). The permutation of the columns is applied independently for each of the k inputs of the model. Therefore, the process followed to create the k inputs involves performing k random permutations of all the columns of the original dataset, producing k column-permuted datasets. The random permutations, once defined, are maintained throughout the training and prediction phases, i.e. the position of each feature in each of the k inputs is different and established randomly, but once established, the positions are maintained; otherwise training and prediction would be completely stochastic and impossible. These datasets are the source for generating the k pseudo-images for the k sub-models. Considering the large number of weights of the final model, due to the repetition of sub-models, we have implemented weight sharing among all sub-models, keeping the total number of trainable weights at 236, which is important to reduce overfitting. To incorporate the outputs of all the sub-models, we used an element-wise average operation between the k flattened output vectors produced by the k sub-models. The resulting vector is used as input for the last FCL layers. After testing several values of k, we chose a value of 10, which provides sufficient input randomization while keeping the computational cost under reasonable limits. The randomized 2-D CNN model has 236 trainable parameters and 10 non-trainable parameters (for the batch-normalization layers). This is the same number of parameters (weights) as the 2-D CNN model. This is a desired result, since we achieve a much more versatile architecture with the same model complexity thanks to the weight sharing and the final average function.

It is important to note that our initial dataset consists of 25755 features that correspond to the number of ways to select 2 sensors out of 102 sensors multiplied by the 5 different summary statistics per MEG signal pair. Therefore, the location of these features in the dataset does not follow a clear order (under a distance metric), nor does it represent an easy association with the location of the sensors (Fig. 1). Additionally, the difficulty in creating location-aware features increases with the pre-processing (feature selection) applied to the dataset (Section 3.1). In summary, we have a large number of features with a non-informative position. CNN models exploit the correlation between adjacent features using a location-based metric. Their application to image datasets is fully justified, as smooth changes in pixel properties in local proximity are intrinsic features of images. However, their application to other datasets where the position of the features is purely random creates a problem, since the importance of locality is not satisfied. However, even in these cases, CNNs have proven to be incredibly useful [22,23].

Fig. 3. 1-D CNN + GPC model.


Fig. 4. 2-D CNN models.

To overcome the problem of randomly positioned features for datasets where the location of the features does not provide any information, the solution has been to add successive convolutional layers, so that the deeper layers can integrate more distant features regardless of their spatial (not meaningful) location. In our case, the addition of a large number of convolutional layers is not feasible since we are facing a problem that is extremely prone to overfitting. That is the reason for proposing an alternative solution where, instead of a single input, we create k copies of the input, each of them formed with the same features but with their positions permuted. The permutation performed for each input is set randomly. It is important to note that the randomization of the feature locations is applied independently for each of the k inputs of the model, and that the random permutation of the features, once defined, is maintained throughout the training and prediction phases, i.e. the position of each feature in each of the k inputs is different and randomly established, but once established, the positions are maintained; otherwise training and prediction would be completely stochastic and impossible.

In summary, creating k independent inputs with their features in permuted positions is one way to provide a CNN with different feature layouts, without making it necessary to include a large number of layers, and bringing together in the same receptive field two features that could be highly correlated despite widely separated initial locations. An additional option to deal with non-image data using a CNN is to perform a mapping from the original (non-image) data to image-like data by transferring similarity between the original features (in the non-image data) to location distance between these features in the transformed data, as proposed in [41]. This is an interesting alternative, but with clear challenges, the first and most important being the necessary a-priori definition of similarity, location distance and feature co-location in the new dataset. In summary, with this approach there is a need to manually define an appropriate feature representation (a feature engineering problem), which is precisely the main problem solved by a CNN, i.e. automatic extraction of feature representations. Avoiding this contradiction was the main reason for not following this alternative in this work.

Fig. 5. Randomized 2-D CNN model (proposed model).


Fig. 6. Test strategy details. Several test sets are considered to obtain a distribution of classification performance metrics from which statistics such as the mean and
standard deviation can be obtained.

The creation of k randomly permuted copies of the input comes along with two additional elements:

• A final averaging layer, which serves two purposes: 1) an additional ensemble property to prevent overfitting, and 2) a way to obtain stable results considering the initial random allocation of feature positions.
• Weights shared among all sub-models, which reduces the amount of weights, decreasing the possibility of overfitting and creating a smoothing effect at the training level by imposing an average on gradient updates for all k inputs.

We implemented all the neural network models in Python using Tensorflow/Keras. For all the classic ML models we used the scikit-learn Python package [42]. To perform feature selection, we used the R packages caret [36] and Boruta [37].
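Putting these elements together, the following Keras sketch (our own reconstruction from the description in Section 3.2 and Fig. 5, not the authors' released code; the exact ordering of the batch-normalization and max-pooling layers is assumed so that the reported 236 trainable and 10 non-trainable parameters and the 16-element flattened vector are reproduced) builds one shared convolutional block and applies it to k = 10 permuted copies of the 30 × 29 pseudo-image:

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

K_SUBMODELS, ROWS, COLS = 10, 30, 29                       # 870 features reshaped to 30 x 29

# Fixed random feature permutations, one per sub-model, drawn once and then frozen.
rng = np.random.default_rng(0)
PERMS = [rng.permutation(ROWS * COLS) for _ in range(K_SUBMODELS)]

def permute_and_reshape(x_flat):
    """x_flat: (n_samples, 870) -> list of k permuted pseudo-images."""
    return [x_flat[:, p].reshape(-1, ROWS, COLS, 1) for p in PERMS]

# Shared convolutional block (weight sharing across all k sub-models).
shared_block = models.Sequential([
    layers.Conv2D(2, 3, strides=2, activation="relu", input_shape=(ROWS, COLS, 1)),
    layers.BatchNormalization(),
    layers.Conv2D(3, 3, strides=1, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(2),
    layers.Conv2D(4, 2, strides=1, activation="relu"),
    layers.MaxPooling2D(2),
    layers.Flatten(),                                       # 16 values per sub-model
])

inputs = [layers.Input(shape=(ROWS, COLS, 1)) for _ in range(K_SUBMODELS)]
averaged = layers.Average()([shared_block(x) for x in inputs])   # element-wise ensemble average
h = layers.Dense(5, activation="relu")(averaged)
h = layers.Dropout(0.5)(h)
outputs = layers.Dense(2, activation="softmax")(h)

randomized_2d_cnn = models.Model(inputs, outputs)
randomized_2d_cnn.compile(optimizer="adam", loss="categorical_crossentropy",
                          metrics=["accuracy"])
# Training (illustrative): randomized_2d_cnn.fit(permute_and_reshape(X_train), Y_train, batch_size=10, ...)
```

Because the permutations are drawn once and then kept fixed, the same permute_and_reshape call is used at training and prediction time, exactly as required above.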


3.3. Validation & test strategy

The test strategy followed is a two-stage nested cross-validation (CV) approach. This approach provides a more complete analysis, as a distribution of values under various configurations of sets of samples used for training and testing is obtained. The test strategy is presented graphically in Fig. 6. The inner-CV cycle is used to decide when to stop training, while the outer-CV cycle is used to randomize the training samples employed in the inner-CV. This allows having independent and randomly chosen test samples to produce the final test results. To increase the number of final results, the entire test architecture presented in Fig. 6 is repeated n times. Table 2 presents the specific values for the number of repetitions of the entire process (first numeric column) and the number of outer and inner iterations of the hierarchical CV strategy for each of the models. The total number of final test results obtained for each model is the product of the number of repetitions (Table 2: first numeric column) and the number of outer iterations (Table 2: second numeric column). The total number of test results corresponds to the "count" value in Figs. 7 and 8, which may be different for each model. These test results are those used to obtain the statistics per model presented in Figs. 7 and 8 (mean, standard deviation, quartiles, etc.). Considering the large number of validation cycles and iterations and their impact on computing time, we chose to obtain more test results for the models with the best performance. This is why the number of results for the 2D-CNN and randomized 2D-CNN is 24 and 40, respectively, while it is 16 for the other models (Table 2). The number of epochs and batch sizes used to train the models are also shown in Table 2.
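A schematic version of this procedure is given below (our own sketch with hypothetical helper names; the paper iterates a full inner cycle, whereas a single inner split per outer fold is shown here for brevity):

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import StratifiedKFold

def nested_cv_scores(build_model, X, y, repetitions=5, outer_k=8, inner_k=20,
                     epochs=80, patience=30):
    """One test score per (repetition, outer fold); per-model values come from Table 2."""
    scores = []
    for rep in range(repetitions):                               # repetitions of Fig. 6
        outer = StratifiedKFold(n_splits=outer_k, shuffle=True, random_state=rep)
        for train_idx, test_idx in outer.split(X, y):            # outer CV: independent test set
            inner = StratifiedKFold(n_splits=inner_k, shuffle=True, random_state=rep)
            fit_idx, val_idx = next(inner.split(X[train_idx], y[train_idx]))
            model = build_model()
            model.fit(X[train_idx][fit_idx], y[train_idx][fit_idx],
                      validation_data=(X[train_idx][val_idx], y[train_idx][val_idx]),
                      epochs=epochs, batch_size=10, verbose=0,
                      callbacks=[tf.keras.callbacks.EarlyStopping(       # inner CV decides
                          patience=patience, restore_best_weights=True)])  # when to stop
            scores.append(model.evaluate(X[test_idx], y[test_idx], verbose=0))
    return np.array(scores)
```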
4. Results

In this section we present the classification performance results obtained with all the proposed models. The focus of this work was to investigate the suitability of employing several CNN architectures for the early detection of Alzheimer's disease, with an emphasis on proposing novel models that can be trained in the context of a small number of high-dimensional noisy signals. Here we show that our proposed model (randomized 2D-CNN) provides the best classification results. Special care was taken with the test strategy, to guarantee confidence in the results.

The results presented in this section come from applying the dataset described in Section 3.1 to various types of models: 1) classic ML models: LR, NB, GBM, RF, GP Classifier and SVM with radial basis functions (RBF); 2) CNN architectures: 1D-CNN, 1D-CNN plus GPC, 2D-CNN and Randomized 2-D CNN models; and 3) other neural network architectures: Multilayer Perceptron (MLP) and Extreme Learning Machine (ELM) [43].

An exhaustive comparison of the results is provided for all models (Figs. 7 and 8) with carefully selected classification metrics. Considering the small number of samples, it is especially important to provide the distribution of the performance metrics. For this reason, the mean, standard deviation, maximum and minimum values and the 25 %, 50 % (median) and 75 % quartiles are provided for all performance metrics and all models. The selected classification performance metrics were: accuracy, F1-score, precision and recall. These four metrics provide a good overview of the different aspects and objectives of a classifier.

Table 2
Parameters used to implement the test strategy for the different models.

Model                       Repetitions   Outer iterations   Inner iterations   Epochs   Batch size
Randomized 2D-CNN           5             8                  20                 30       10
2D-CNN                      3             8                  50                 80       10
1D-CNN                      2             8                  50                 80       10
1D-CNN + GPC                2             8                  50                 80       10
Multilayer Perceptron       5             8                  20                 30       10
Extreme Learning Machine    5             8                  20                 30       10
Classic ML models           2             8                  NA                 NA       NA


Fig. 7. Results obtained for all models based on neural networks.

To define these metrics, and using the usual terminology, we consider the presence of MCI as a 'positive' result and the healthy state as a 'negative' result. With these definitions, a true positive (TP) is a prediction that the patient suffers from MCI when that is actually the case. A true negative (TN) is defined as predicting that an actual healthy participant does not suffer from MCI. A false positive (FP) occurs when an actual healthy participant is classified as suffering from MCI. Finally, a false negative (FN) is defined as the instance when an actual patient is classified as not suffering from the condition. With these definitions, the metrics are expressed as:

Accuracy = \frac{\#TP + \#TN}{\#TP + \#TN + \#FP + \#FN}    (4)

Precision = \frac{\#TP}{\#TP + \#FP}    (5)

Recall = \frac{\#TP}{\#TP + \#FN}    (6)

F1\text{-score} = 2 \times \frac{Precision \times Recall}{Precision + Recall}    (7)

where a '#' sign before a symbol denotes the total number of occurrences in that category, e.g. #TN is the total number of true negatives.

All the results presented are based on the dataset described in Section 3.1, where the feature set FS2 (870 features) was used for the 2D-CNN and Randomized 2D-CNN models, and the feature set FS1 (20 features) was used for all the other models.

Figs. 7 and 8 provide separate results for all models, with one table per model. There are four scores: accuracy, F1, precision and recall, and their associated statistics: mean, standard deviation, minimum and maximum values, three quartiles and the number of results used to compute the statistics (first row, named "count"). A box-plot for each particular set of results is also provided below each model. The total number of test results corresponds to the "count" value in Figs. 7 and 8, which may be different for each model.


Fig. 8. Results obtained for all models based on classic ML models.

The recall metric is particularly important since it reflects the number of false negatives, actual patients who have not been diagnosed, which is a number one would like to keep as low as possible. The precision metric provides an indication of the number of false positives, which should also be kept low, as diagnosing a healthy participant with the condition also has costs. This is the reason why the F1-score, which is the harmonic mean of precision and recall, has been adopted as our main indicator of the prediction quality of the classifier. Fig. 7 shows that our proposed model (randomized 2D-CNN) provides the best values for the F1-score. It is superior with respect to all other metrics as well, except for the 1D-CNN model, which provides a better average result for the recall metric, but a worse median result for this metric compared to the randomized 2D-CNN model.

The reason for choosing the other 3 deep learning models (2D-CNN, 1D-CNN and 1D-CNN + GPC) has been to compare the proposed model with other simpler ones, which were also based on a convolutional neural network architecture. In our case, and given the longitudinal character of the original features, it was natural to use a 1D-CNN as one of the alternatives. A 2D-CNN model was also a clear candidate since our proposed model is based on it. Finally, the reason for adding a Gaussian Process Classifier (GPC) to the 1D-CNN was to investigate whether incorporating a non-parametric Bayesian model could improve the results, considering its excellent performance on problems with a limited number of samples. The other two neural network architectures (MLP and ELM) were chosen since MLP is the basis for all neural network models and ELM was used for similar detection of MCI in previous works [27]. Finally, the classic ML models (LR, NB, GBM, RF, GP Classifier and SVM) were chosen to compare the results with well-known methods.

The classic ML classifiers (Fig. 8) are set up with minimal hyper-parameter tuning from their default values in scikit-learn [42]. The Multilayer Perceptron (MLP) is formed by two hidden layers of 15 and 2 neurons with ReLU and softmax as activation functions for the middle and last layers and cross entropy as the loss function.


Fig. 9. Histograms (upper charts) and probability density function approximation (lower chart) for the results of the Randomized 2D-CNN model.

The Extreme Learning Machine consists of a single hidden layer of 20 neurons with random initialization and sigmoid activations. The weights of the hidden layer are set as non-trainable, and the layer operates as a random projection. The end layer has a softmax activation with a cross entropy loss. The entire model is trained end-to-end like a normal neural network. The end layer functions similarly to a logistic regression and its weights are trained as proposed in [43], but without requiring additional least squares optimization.
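A minimal Keras rendering of this baseline, as we read the description above (our own sketch; the optimizer is an assumption):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

elm = models.Sequential([
    layers.Input(shape=(20,)),                                  # the 20 FS1 features
    layers.Dense(20, activation="sigmoid", trainable=False),    # frozen random projection
    layers.Dense(2, activation="softmax"),                      # only these weights are trained
])
elm.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
```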
Considering the interest of the results from the randomized 2D-CNN model, Fig. 9 provides a detailed set of histograms for the four statistics of results for this model (Fig. 9, upper part). Together with the histograms, Fig. 9 also offers an approximation to the probability density function for these four statistics (Fig. 9, bottom).


Fig. 10. Results obtained using data augmentation.

It is interesting to appreciate that no more than 4 values (10 % of the total values) are below 0.8 for any statistic, that is, 90 % of the values obtained for any statistic are larger than 0.8, with most of them being much closer to 1.

Given the small number of samples available for training, we additionally explored whether increasing the number of samples by data augmentation would be of benefit. Fig. 10 provides the results obtained by several models after a 3-fold increase in the number of samples. New samples were created by adding Gaussian noise with small variance to the original features. We can observe from the results in Fig. 10 that there is no improvement for any of the algorithms. We have not considered the use of methods such as the Synthetic Minority Over-sampling Technique (SMOTE) and its variants [44], since these methods are designed for unbalanced datasets, which is not the case here. Additionally, the behavior of these methods when applied to high-dimensional data (large number of features) [45], which corresponds to the present situation, has not been fully investigated.
Furthermore, we built ensembles of models based on various combinations of models: (1) NB, GBM and RF, and (2) NB, GBM, RF and GPC, using a majority voting strategy to select the final label. This ensemble approach did not provide any improvement compared to the best individual model used to form the ensemble.

Figs. 11 and 12 summarize all the results presented previously. Fig. 11 provides the average (mean) values for all metric scores and all models. The figure is color-coded column-wise, with a color palette from green to red, where the greenest color is for the best results and the reddest for the worst results. Given the importance of the F1-score, the bars in Fig. 12 show the corresponding average F1 value for all models. Looking at Figs. 11 and 12, it can be seen how the Randomized 2D-CNN model is the one that provides the best classification results.

To ensure that the good results obtained by the Randomized 2D-CNN model are significantly better than those obtained by other models from a statistical point of view, Fig. 13 presents the application of the Wilcoxon one-sided unpaired rank-sum test for the comparison of performance metrics between the best model (Randomized 2D-CNN) and the next two best (2D-CNN and 1D-CNN). The Wilcoxon test is a non-parametric test that checks whether two groups of values show differences in their population mean ranks. It does not assume any a-priori probability distribution of the values. The Wilcoxon test was applied to compare the results of the best model with each of the other two models separately.

Fig. 11. Average values of all performance metrics for all models evaluated.


Fig. 12. Detailed diagram of the average values of F1-scores for all models evaluated.

Fig. 13 presents these two separate tests as two independent blocks of results (left and right parts of Fig. 13), each providing the p-value resulting from the test and the test result that allows (or not) rejecting the null hypothesis, which is associated with a non-significant difference in the results. The test used was one-sided, to specifically check if one of the groups has a higher ranked mean than the other. The values in Fig. 13 are adjusted for multiple comparisons correction (Bonferroni) [46]. The significance level used was 1 %. From Fig. 13 we can conclude that the results achieved by the best model are significantly better than the others in three of the four metrics. Only the recall metric presents a non-significant result at a level of significance of 1 %, even when the best model has the highest median value for recall (Fig. 7); the greater variance of the recall results (Fig. 9) could be an explanation for this result.
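In outline, this comparison can be reproduced with SciPy, where the one-sided unpaired rank-sum (Mann-Whitney) test is run per metric and the p-values are Bonferroni-adjusted; this is our own sketch with hypothetical variable names, and the exact correction family used in the original analysis may differ:

```python
from scipy.stats import mannwhitneyu

METRICS = ["accuracy", "f1", "precision", "recall"]
N_COMPARISONS = 2 * len(METRICS)          # two competing models x four metrics (assumed family)

def compare(best, other, alpha=0.01):
    """best/other: dict metric -> array of per-test-set results for the two models."""
    for m in METRICS:
        _, p = mannwhitneyu(best[m], other[m], alternative="greater")   # one-sided test
        p_adj = min(1.0, p * N_COMPARISONS)                             # Bonferroni correction
        verdict = "significant" if p_adj < alpha else "not significant"
        print(f"{m}: adjusted p = {p_adj:.4f} ({verdict} at {alpha:.0%})")
```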
5. Conclusion

This work presents a novel deep learning architecture based on CNN networks for the detection of Mild Cognitive Impairment. The proposed model applies stochastic permutation of the training data (synchronization measures extracted from MEG signals of the patients during rest), employing an ensemble of identical learning blocks formed by a combination of CNN, batch-normalization and max-pooling layers. The outputs from the ensemble are aggregated (averaged) and constitute the prediction output of the classifier. We show that the combination of (1) CNN networks, (2) the averaging capabilities of the ensemble structure and (3) the exploration of new feature interactions (provided by their random permutation) creates a final architecture that offers excellent classification results for a challenging scenario characterized by sample scarcity and a high-dimensional feature space.

We provide empirical results that show that the proposed model offers significantly better classification metrics than alternative models based on 2D-CNN and 1D-CNN networks and classic machine learning models (logistic regression, random forest, GBM, etc.). To make the comparison, we apply a hierarchical cross-validation strategy that produces several measurements (based on different test sets) of the classification metrics, whose statistics reinforce confidence in the generalization of the outcomes. A complete and detailed analysis of the results is provided.

The dataset used was real MEG data obtained at the Centre for Biomedical Technology (Madrid, Spain) from 78 Mild Cognitive Impairment (MCI) patients and 54 healthy aging participants. The signal post-processing and data preparation necessary to train the algorithms is presented in detail.

This work shows that it is possible to apply CNN models to scenarios that are highly prone to overfitting with an appropriate architecture based on (1) reducing the number of parameters as much as possible, (2) using an ensemble of similar structures that share their weights, (3) averaging the results of the ensemble and (4) the permutation of entries to create new input data randomly altered and with different co-location of features. Given the good prediction results of the proposed model, this approach could be the basis for practical applications in the early detection of Alzheimer's disease based on MEG activity.

In the present work, the features that were used as input to the classifier were measures of synchronization between MEG channel signals. There are other types of variables, such as demographic variables like age and sex, and neuropsychological variables like the MMSE scores, that can also contribute to diagnosing patients. It is therefore very likely that including some of these variables as input to the classifier would improve the classification performance. This is a possible avenue for future work. On the other hand, while classification was the focus of this work, the fact that brain activity can serve as a basis for diagnosis indicates that the two diagnostic groups show different brain activity profiles, which is of interest from a fundamental point of view. In that context, to the extent that the two groups differ in demographic or neuropsychological values, the identified difference in brain activity profiles may partially reflect differences in these other variables rather than differences in diagnosis.

Fig. 13. Wilcoxon rank-sum test. This test checks if the results for the best model (Randomized 2D-CNN) are statistically significantly better compared to each of the
next two best models: 2D-CNN (left sub-table) and 1D-CNN (right sub-table).


There were some differences in the age distribution, as well as in the proportion of females and males, between patients and controls. These between-group age differences are also likely to be found in a clinical setting. On the other hand, MMSE scores probably require a somewhat different consideration in the present discussion. While the relationship between MMSE scores and diagnosis is not one to one, it is significant [33]. It would therefore not be possible, and probably not desirable, to define groups matched for MMSE score. Nevertheless, at a fundamental level, it could be interesting to investigate in future works to what extent between-group differences in brain activity are more closely related to MMSE scores or diagnosis. Future work could also investigate to what extent the methods described here are robust when data from different scanners is used for diagnosis; future research could also focus on the combination of deep learning models and stochastic methods.

This work is also a proof of concept to assess the suitability of randomized-input 2D-CNN models to make predictions on noisy and small datasets with a large number of features. It also opens up interesting research opportunities for other areas with similar problems, e.g. network intrusion detection. This allows us to validate the algorithms and methods developed from previous research activities on network traffic analysis and prediction, extending the scope of our research to other fields such as neuroscience.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgements

The collection of the data employed in the analysis was partially funded with grant PSI2015-68793-C3-1-R from the Spanish Ministry for Science, Innovation and Universities.

The preparation of the article and the study of algorithms were partially funded with grant RTI2018-098958-B-I00 from Proyectos de I+D+i «Retos investigación», Programa Estatal de I+D+i Orientada a los Retos de la Sociedad, Plan Estatal de Investigación Científica, Técnica y de Innovación 2017-2020, Spanish Ministry for Science, Innovation and Universities.

Appendix A. Supplementary data

Supplementary material related to this article can be found, in the online version, at doi: https://doi.org/10.1016/j.artmed.2020.101924.

References

[1] Khan A, Usman M. Early diagnosis of Alzheimer's disease using machine learning techniques: a review paper. 2015 7th international joint conference on knowledge discovery, knowledge engineering and knowledge management (IC3K) 2015. p. 380–7. https://doi.org/10.5220/0005615203800387.
[2] Ullah HMT, Onik Z, Islam R, Nandi D. Alzheimer's disease and dementia detection from 3D brain MRI data using deep convolutional neural networks. 2018 3rd international conference for convergence in technology (I2CT) 2018. p. 1–3. https://doi.org/10.1109/I2CT.2018.8529808.
[3] Petersen RC. Mild cognitive impairment. N Engl J Med 2011;364(23):2227–34. https://doi.org/10.1056/NEJMcp0910237.
[4] Robertson K, et al. Using varying diagnostic criteria to examine mild cognitive impairment prevalence and predict dementia incidence in a community-based sample. J Alzheimer Dis 2019;68(4):1439–51. https://doi.org/10.3233/JAD-180746.
[5] López-Sanz D, Serrano N, Maestú F. The role of magnetoencephalography in the early stages of Alzheimer's disease. Front Neurosci 2018;12:572. https://doi.org/10.3389/fnins.2018.00572.
[6] Mandal PK, Banerjee A, Tripathi M, Sharma A. A comprehensive review of magnetoencephalography (MEG) studies for brain functionality in healthy aging and Alzheimer's disease (AD). Front Comput Neurosci 2018;12:60. https://doi.org/10.3389/fncom.2018.00060.
[7] Maestú F, Peña JM, Garcés P, González S, Bajo R, Bagic A, et al. A multicenter study of the early detection of synaptic dysfunction in Mild Cognitive Impairment using Magnetoencephalography-derived functional Connectivity. Neuroimage Clin 2015;9:103–9. https://doi.org/10.1016/j.nicl.2015.07.011.
[8] Bi XA, Jiang Q, Sun Q, Shu Q, Liu Y. Analysis of Alzheimer's disease based on the random neural network cluster in fMRI. Front Neuroinform 2018;12:60. https://doi.org/10.3389/fninf.2018.00060.
[9] Meszlenyi R, Buza K, Vidnyanszky Z. Resting state fMRI functional connectivity-based classification using a convolutional neural network architecture. 2017. arXiv:1707.06682 [stat.ML].
[10] Ruffini G, et al. EEG-driven RNN classification for prognosis of neurodegeneration in at-risk patients. Artificial neural networks and machine learning – ICANN 2016. Lecture notes in computer science, vol. 9886. Springer; 2016. https://doi.org/10.1007/978-3-319-44778-0_36.
[11] Ruffini G, Ibañez D, Castellano M, Dubreuil L, Gagnon JF, Montplaisir J, et al. Deep learning using EEG spectrograms for prognosis in idiopathic rapid eye movement behavior disorder (RBD). bioRxiv 2018:240267. https://doi.org/10.1101/240267.
[12] Wu W, Nagarajan S, Chen Z. Bayesian machine learning: EEG/MEG signal processing measurements. IEEE Signal Process Mag 2016;33(1):14–36. https://doi.org/10.1109/MSP.2015.2481559.
[13] Kostas D, Pang EW, Rudzicz F. Machine learning for MEG during speech tasks. Sci Rep 2019;9:1609. https://doi.org/10.1038/s41598-019-38612-9.
[14] Martinez-Murcia FJ, et al. A 3D convolutional neural network approach for the diagnosis of Parkinson's disease. Natural and artificial computation for biomedicine and neuroscience. IWINAC 2017. Lecture notes in computer science, vol. 10337. Springer; 2017.
[15] Aoe J, Fukuma R, Yanagisawa T, Harada T, Tanaka M, Kobayashi M, et al. Automatic diagnosis of neurological diseases using MEG signals with a deep neural network. Sci Rep 2019;9:5057. https://doi.org/10.1038/s41598-019-41500-x.
[16] Cichy RM, Khosla A, Pantazis D, Oliva A. Dynamics of scene representations in the human brain revealed by magnetoencephalography and deep neural networks. NeuroImage 2017;153:346–58. https://doi.org/10.1016/j.neuroimage.2016.03.063.
[17] Hasasneh A, Kampel N, Sripad P, Shah NJ, Dammers J. Deep learning approach for automatic classification of ocular and cardiac artifacts in MEG data. Hindawi J Eng 2018;2018:1350692. https://doi.org/10.1155/2018/1350692.
[18] Zubarev I, Zetter R, Halme HL, Parkkonen L. Adaptive neural network classifier for decoding MEG signals. NeuroImage 2019;197:425–34. https://doi.org/10.1016/j.neuroimage.2019.04.068.
[19] Rawat W, Wang Z. Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput 2017;29(9):2352–449. https://doi.org/10.1162/neco_a_00990.
[20] Ioffe S, Szegedy C. Batch normalization: accelerating deep network training by reducing internal covariate shift. 2015. arXiv:1502.03167 [cs.LG].
[21] Lee Ch-Y, Gallagher PW, Tu Z. Generalizing pooling functions in convolutional neural networks: mixed, gated, and tree. 2015. arXiv:1509.08985 [stat.ML].
[22] Lopez-Martin M, Carro B, Sanchez-Esguevillas A, Lloret J. Network traffic classifier with convolutional and recurrent neural networks for Internet of Things. IEEE Access 2017;5:18042–50. https://doi.org/10.1109/ACCESS.2017.2747560.
[23] Lopez-Martin M, Carro B, Lloret J, Egea S, Sanchez-Esguevillas A. Deep learning model for multimedia quality of experience prediction based on network flow packets. IEEE Commun Mag 2018;56(9):110–7. https://doi.org/10.1109/MCOM.2018.1701156.
[24] Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res 2014;15(1):1929–58. http://jmlr.org/papers/v15/srivastava14a.html.
[25] An G. The effects of adding noise during backpropagation training on a generalization performance. Neural Comput 1996;8(3):643–74. https://doi.org/10.1162/neco.1996.8.3.643.
[26] Bishop CM. Training with noise is equivalent to Tikhonov regularization. Neural Comput 1995;7(1):108–16. https://doi.org/10.1162/neco.1995.7.1.108.
[27] Weiming L, Tong T, Qinquan G, Di G, Xiaofeng D, Yonggui Y, et al. Convolutional neural networks-based MRI image analysis for the Alzheimer's disease prediction from mild cognitive impairment. Front Neurosci 2018;12. https://doi.org/10.3389/fnins.2018.00777.
[28] Zhang X, et al. ShuffleNet: an extremely efficient convolutional neural network for mobile devices. 2017. arXiv:1707.01083 [cs.CV].
[29] Brendel W, Bethge M. Approximating CNNs with bag-of-local-features models works surprisingly well on ImageNet. 2019. arXiv:1904.00760 [cs.CV].
[30] Juefei-Xu F, Boddeti VN, Savvides M. Perturbative neural networks. 2018. arXiv:1806.01817 [cs.CV].
[31] Zhao G, et al. Random shifting for CNN: a solution to reduce information loss in down-sampling layers. Proceedings of the twenty-sixth international joint conference on artificial intelligence (IJCAI-17) 2017:3476–82. https://doi.org/10.24963/ijcai.2017/486.
[32] Ivan C. Convolutional neural networks on randomized data. 2019. arXiv:1907.10935 [cs.CV].
[33] Petersen RC. Mild cognitive impairment as a diagnostic entity. J Intern Med 2004;256(3):183–94. https://doi.org/10.1111/j.1365-2796.2004.01388.x.
[34] Taulu S, Simola J. Spatiotemporal signal space separation method for rejecting nearby interference in MEG measurements. Phys Med Biol 2006;51(7):1759–68. https://doi.org/10.1088/0031-9155/51/7/008.
[35] Oostenveld R, Fries P, Maris E, Schoffelen JM. FieldTrip: open source software for advanced analysis of MEG, EEG, and invasive electrophysiological data. Comput Intell Neurosci 2011;2011:156869. https://doi.org/10.1155/2011/156869.


[36] Kuhn M. caret: classification and regression training. R package; 2019. https://cran.r-project.org/web/packages/caret/caret.pdf.
[37] Kursa MB. Boruta: wrapper algorithm for all relevant feature selection. R package; 2018. https://cran.r-project.org/web/packages/Boruta/Boruta.pdf.
[38] Ambroise C, McLachlan GJ. Selection bias in gene extraction on the basis of microarray gene-expression data. Proc Natl Acad Sci 2002;99(10):6562–6. https://doi.org/10.1073/pnas.102102699.
[39] Kiranyaz S, et al. 1D convolutional neural networks and applications – a survey. 2019. arXiv:1905.03554 [eess.SP].
[40] Williams CKI, Barber D. Bayesian classification with Gaussian processes. IEEE Trans Pattern Anal Mach Intell 1998;20(12):1342–51. https://doi.org/10.1109/34.735807.
[41] Sharma A, Vans E, Shigemizu D, Boroevich KA, Tsunoda T. DeepInsight: a methodology to transform a non-image data to an image for convolution neural network architecture. Sci Rep 2019;9:11399. https://doi.org/10.1038/s41598-019-47765-6.
[42] Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011;12:2825–30.
[43] Huang GB, Zhu QY, Siew ChK. Extreme learning machine: theory and applications. Neurocomputing 2006;70(1):489–501. https://doi.org/10.1016/j.neucom.2005.12.126.
[44] Batista GEAPA, Prati RC, Monard MC. A study of the behavior of several methods for balancing machine learning training data. ACM SIGKDD Explor Newsl 2004;6(1):20–9. https://doi.org/10.1145/1007730.1007735.
[45] Blagus R, Lusa L. SMOTE for high-dimensional class-imbalanced data. BMC Bioinformatics 2013;14:106. https://doi.org/10.1186/1471-2105-14-106.
[46] Chen SY, Feng Z, Yi X. A general introduction to adjustment for multiple comparisons. J Thorac Dis 2017;9(6):1725–9. https://doi.org/10.21037/jtd.2017.05.34.
