## A GENERATIVE MODEL BASED ADVERSARIAL SECURITY OF DEEP LEARNING AND LINEAR CLASSIFIER MODELS

A PREPRINT, October 17, 2020

In recent years, machine learning algorithms have been applied widely in fields such as health, transportation, and autonomous driving. With the rapid development of deep learning techniques, it is critical to take security into account when applying these algorithms. While machine learning offers significant advantages, the issue of security is often ignored. Because machine learning has many real-world applications, security is a vital part of the algorithms. In this paper, we propose a mitigation method for adversarial attacks against machine learning models based on an autoencoder, which is a generative model. The main idea behind adversarial attacks against machine learning models is to produce erroneous results by manipulating trained models. We also present the performance of autoencoder models against attacks on both traditional algorithms and deep neural networks: non-targeted and targeted attacks on multi-class logistic regression, and the fast gradient sign method, the targeted fast gradient sign method, and the basic iterative method on neural networks, all using the MNIST dataset.


### 1 Introduction

With the help of artificial intelligence technology, machine learning has been widely used in classification, decision making, voice and face recognition, games, financial assessment, and other fields [1, 2]. Machine learning methods model players' choices in the game and animation industry and analyze diseases to support decision-making [3–6]. With the successful deployment of machine learning, attacks on the machine learning process, countermeasures against them, and ways of increasing the robustness of learning have become hot research topics in recent years [7–11]. The presence of malicious data samples or an attack on the model can lead to incorrect predictions and classifications even in advanced models.

Because machine learning applications use big data, it is more challenging to recognize an attack than in other cybersecurity fields. Therefore, it is essential to create machine learning components that are resistant to this type of attack. However, recent work in this area has demonstrated that existing defenses are not very robust to attacks [12, 13]. These methods have shown success against a specific set of attack methods and have generally failed to provide complete and generic protection [14].

This field has been growing rapidly, and its dangers have attracted increasing attention, from evading spam and phishing e-mail filters to poisoning the sensor data of a self-driving car or aircraft [15, 16]. Disaster scenarios can occur if precautions are not taken in these systems [17].

The main contribution of this work is to explore autoencoder-based generative models as a defense against adversarial machine learning attacks on models. Adversarial machine learning has been used to study these attacks and reduce their effects [18, 19]. Previous works point out the fundamental equilibrium in algorithm design and aim to create new algorithms and methods that are resistant and robust against attacks that would negatively affect this balance. However, most of these works have been implemented successfully only for specific situations. In Section 2, we present some applications of these works.

This work aims to propose a method that not only resists specific attack methods but also provides robustness to machine learning models in general. Our goal is to find an effective method that can be used by model trainers. For this purpose, we process the data with an autoencoder before it reaches the machine learning model. In our previous works [20, 21], we applied a generative-model-based mitigation approach to attacks on deep learning models.

We apply non-targeted and targeted attacks to multi-class logistic regression models to observe the differences between attack methods, and we apply several attacks to neural networks: the fast gradient sign method (FGSM), the targeted fast gradient sign method (T-FGSM), and the basic iterative method (BIM). We select the MNIST dataset, which consists of handwritten digits, so that readers can easily see and understand the changes in the data.

The study is organized as follows. In Section 2, we present the related works. In Section 3, we introduce several adversarial attack types, attack environments, and the autoencoder. In Section 4, we present the selection of the autoencoder model, activation function, and tuning parameters. In Section 5, we provide some observations on the robustness of the autoencoder for adversarial machine learning with different machine learning algorithms and models. In Section 6, we conclude this study.

### 2 Related Work

In recent years, with the increase in machine learning attacks, various studies have proposed defensive measures against these attacks. Data sterility and learning endurance are recommended as countermeasures in defining a machine learning process [18]. Most of the studies in these fields have focused on specific adversarial attacks and have generally presented a theoretical discussion of the adversarial machine learning area [22, 23].

Bo Li and Yevgeniy Vorobeychik study classification on binary domains. In their work, the approach starts with mixed-integer linear programming (MILP) with constraint generation and builds suggestions on top of it. They also use a Stackelberg game multi-adversary model algorithm and another algorithm, called RAD (Retraining with Adversarial Examples), that feeds the generated adversarial examples back to the training model [24]. Their method achieves successful results. However, it is particular and works only with specific methods, even though it is presented as a general protection method. Similarly, Xiao et al. provide a method to increase the speed of robustness verification for networks with rectified linear units (ReLU) [25]. They use weight sparsity and ReLU stability for robust verification. Their methodology does not provide a general approach.

Yu et al. propose a study that evaluates the features of neural networks under adversarial attacks. They present the connection between the input space and adversarial examples, and they show the connection between network robustness and decision surface geometry as an indicator of the adversarial robustness of the neural network. By extending the loss surface to the decision surface and using various other methods, they provide adversarial robustness through the decision surface. However, the geometry of the decision surface cannot be demonstrated most of the time, and there is no explicit decision boundary between correct and wrong predictions. Robustness can be increased by constructing a good model, but it can change with attack intensity [26].

Madry et al. investigate artificial neural networks that are resistant to adversarial attacks, increase accuracy rates with different methods, mainly optimization, and show that more robust machine learning models are possible [14].

Pinto et al. address this problem with reinforcement learning. In their study, they formulate learning as a zero-sum, minimax objective function. They present machine learning models that are more resistant to disturbances that are hard to model during training and that are less affected by changes between training and test conditions. They generalize reinforcement learning on machine learning models. They propose “Robust Adversarial Reinforcement Learning” (RARL), in which they train an agent to operate in the presence of a destabilizing adversary that applies disturbance forces to the system. However, in their work, RARL may overfit, and it can sometimes mispredict even when no adversary is present [27].

Carlini and Wagner show that the robustness of a machine learning model can be broken by a sufficiently strong attack. They demonstrate that these types of attacks can be used to evaluate the effectiveness of potential defenses. In particular, they examine defensive distillation, which had been proposed as a general-purpose procedure to increase robustness [12].

Harding et al. similarly investigate the effects of adversarial samples produced by targeted and non-targeted attacks on decision making. They report that non-targeted samples are more effective than targeted samples in human perception and categorization decisions [28].

Bai et al. present a convolutional autoencoder model with adversarial decoders to automate the generation of adversarial samples. They produce adversarial examples with the convolutional autoencoder, using pooling computations and sampling tricks, and then an adversarial decoder automates the generation of adversarial samples. Adversarial sampling is useful, but it cannot provide adversarial robustness on its own, and the sampling tricks are too specific [29].

Sahay et al. study the FGSM attack and use an autoencoder, trained with both corrupted and healthy data, to denoise the test data. Then they reduce the dimension of the denoised data. These autoencoders are specifically designed to compress data effectively and reduce dimensions. Hence, the approach may not generalize fully, and training with corrupted data requires many adjustments to get better test results [30].

I-Ting Chen et al. also study the FGSM attack on denoising autoencoders. They analyze the attacks from the perspective of stealth, that is, how covertly they can be applied. They use autoencoders to filter data before it is given to the model and compare the results with a model without an autoencoder filter. However, they focus mainly on the stealth aspect of these attacks and use autoencoders specifically against FGSM with specific parameters [31].

Gondim-Ribeiro et al. propose attacks on autoencoders. In their work, they attack three types of autoencoders: simple variational autoencoders, convolutional variational autoencoders, and DRAW (Deep Recurrent Attentive Writer). As they acknowledge, “No attack can both convincingly reconstruct the target while keeping the distortions on the input imperceptible.” Therefore, this method cannot be used to achieve robustness against adversarial attacks [32].

Table 2 shows the strengths and weaknesses of each paper.

| Research Study | Strength | Weakness |
| --- | --- | --- |
| Adversarial Machine Learning [18] | Introduces the emerging field of adversarial machine learning. | Discusses the countermeasures against attacks without suggesting a method. |
| Evasion-Robust | Demonstrates some methods based on MILP that can be used on binary domains. | Very specific about robustness, even though it is presented as a general method. |
| Training for Faster Adversarial Robustness Verification via Inducing ReLU | Uses weight sparsity and ReLU stability for robust verification. | Does not provide a general approach or universality, as suggested in the paper. |
| Interpreting Adversarial Robustness: A View from Decision Surface in Input Space [26] | By extending the loss surface to the decision surface and using various other methods, they provide adversarial robustness through the decision surface. | The geometry of the decision surface cannot be shown most of the time, and there is no explicit decision boundary between correct and wrong predictions. Robustness can be increased by constructing a good model, but it can change with attack intensity. |
| Robust Adversarial | They try to generalize reinforcement learning on machine learning models. They suggest Robust Adversarial Reinforcement Learning (RARL), in which they train an agent to operate in the presence of a destabilizing adversary that applies disturbance forces to the system. | RARL may overfit, and it may sometimes mispredict even when no adversary is present. |
| Alleviating Adversarial Attacks via Convolutional Autoencoder [29] | They produce adversarial examples via a convolutional autoencoder model. Pooling computations and sampling tricks are used. Then an adversarial decoder automates the generation of adversarial samples. | Adversarial sampling is useful, but it cannot provide adversarial robustness on its own. The sampling tricks are also too specific. |
| Combatting Adversarial Attacks through Denoising and Dimensionality Reduction: A Cascaded Autoencoder Approach [30] | They use an autoencoder, trained with both corrupted and normal data, to denoise the test data. Then they reduce the dimension of the denoised data. | The autoencoders are specifically designed to compress data effectively and reduce dimensions. Therefore, the approach may not generalize completely, and training with corrupted data requires many adjustments for test results. |
| A Comparative Study of Autoencoders against Adversarial Attacks [31] | They use autoencoders to filter data before it is given to the model and compare the results with a model without an autoencoder filter. | They focus mainly on the stealth aspect of these attacks and use autoencoders specifically against FGSM with specific parameters. |
| Adversarial Attacks on Variational Autoencoders [32] | They propose a scheme to attack autoencoders and validate it experimentally on three autoencoder models: simple, convolutional, and DRAW (Deep Recurrent Attentive Writer). | As they acknowledge, “No attack can both convincingly reconstruct the target while keeping the distortions on the input imperceptible.” It therefore cannot provide robustness against adversarial attacks. |
| Understanding Autoencoders with Information Theoretic Concepts [33] | They examine the data processing inequality with stacked autoencoders and two types of information planes with autoencoders. They analyze DNN learning from a joint geometric and information-theoretic perspective, emphasizing the important role that pair-wise mutual information plays in understanding DNNs with autoencoders. | Accurate and tractable estimation of information quantities from large data remains a problem, because quantities under Shannon's definition and other information theories are hard to estimate, which severely limits their power to analyze machine learning algorithms. |
| Adversarial Attacks and Defences Competition [34] | Google Brain organized a NIPS 2017 competition to accelerate research on adversarial examples and the robustness of machine learning classifiers. Alexey Kurakin, Ian Goodfellow et al. present the structure and organization of the competition and the solutions developed by several of the top-placing teams. | We experimented with the proposed methods of this competition, but these methods do not provide a generalized solution for robustness against adversarial machine learning model attacks. |
| Explaining And | Ian Goodfellow et al. make considerable observations about gradient-based optimization and introduce FGSM. | Models may be misled for the sake of optimization efficiency. The paper focuses explicitly on identifying similar types of problematic points in the model. |

### 3 Preliminaries

In this section, we consider attack types (data poisoning attacks and model attacks), attack environments, and the autoencoder.

#### 3.1 Attack Types

Machine learning attacks can be categorized into data poisoning attacks and model attacks. The difference between the two lies in what they influence. Data poisoning attacks mainly focus on influencing the data, while model evasion attacks influence the model to obtain the desired attack outcome. Both attacks aim to disrupt the machine learning structure, evade filters, cause wrong predictions and misdirection, and create other problems for the machine learning process. In this paper, we mainly focus on machine learning model attacks.

##### 3.1.1 Data Poisoning Attacks

Machine learning algorithms are trained and tested with datasets. Poisoning a dataset therefore has a significant impact on the algorithm and can confuse developers. By poisoning the data, adversaries can compromise the whole machine learning process.

##### 3.1.2 Model Attacks

Most adversarial attacks are machine learning model attacks, and evasion attacks are the most extensively used in this category. Adversaries apply model evasion attacks for spam emails, phishing attacks, and executing malware code. Adversaries can also benefit from misclassification and misdirection. In this type of attack, the attacker does not change the training data but disrupts or changes the input data so that it diverges from the training dataset or seems safe. This study mainly concentrates on model attacks.

#### 3.2 Attack Environments

There are two significant threat models for adversarial attacks: the white-box and black-box models.

##### 3.2.1 White Box Attacks

Under the white-box setting, the internal structure, design, and implementation of the tested item are accessible to the adversaries. In this model, attacks are based on an analysis of the internal structure. These are also known as open-box attacks. Programming knowledge and application knowledge are essential. White-box tests provide a comprehensive assessment of both internal and external vulnerabilities and are the best choice for computational tests.

##### 3.2.2 Black Box Attacks

In the black-box model, the internal structure and implementation are hidden from the adversaries. These are also known as behavioral attacks. In these tests, the tester does not have to know the internal structure. They provide a comprehensive assessment of errors. Black-box attacks do not change the learning process; rather than changing the learning algorithm, they produce changes that can be observed as external effects on the learning process. In this study, we select this method mainly because it allows us to observe the learning process.

#### 3.3 Autoencoder

An autoencoder neural network is an unsupervised learning algorithm that takes inputs and sets the target values equal to the input values [33]. Autoencoders are generative models trained with backpropagation, and they do not need labels for their inputs. While a supervised learning model is trained as model.fit(X,Y), an autoencoder is trained as model.fit(X,X). The autoencoder thus tries to learn the identity function, so that the output x′ approximates the input x. The identity function seems to be a particularly trivial function to try to learn; however, placing restrictions on the network, such as limiting the number of hidden units, reveals interesting structure in the data [33]. Structurally, autoencoders are ordinary neural networks with an input layer, hidden layers, and an output layer, but instead of predicting Y they reconstruct X. Because this reconstruction requires no labels, autoencoders are unsupervised learning models. The structure consists of an encoder part and a decoder part. We define the encoding transition as ϕ and the decoding transition as ψ.

ϕ : X → F

ψ : F → X

ϕ,ψ = argmin_{ϕ,ψ}||X - (ψ ∘ ϕ)X||^{2}

With one hidden layer, the encoder takes the input x ∈ ℝ^{d} = χ and maps it to h ∈ ℝ^{p} = F. The h below is referred to as the latent variables. σ is an activation function such as ReLU or sigmoid, both of which were used in this study [36, 37]. b is the bias vector and W is the weight matrix; both are usually initialized randomly and then updated iteratively through training [38].

h = σ(Wx + b)

After the encoder transition is completed, the decoder transition maps h to the reconstruction x′.

x′ = σ′(W′h + b′)

Here σ′, W′, and b′ of the decoder are unrelated to σ, W, and b of the encoder. Autoencoders are trained to minimize the loss, shown as L below.

L(x,x′) = ||x - x′||^{2} = ||x - σ′(W′(σ(Wx + b)) + b′)||^{2}

The loss function thus measures the reconstruction error, which needs to be minimal. During training, this loss is averaged over the input training set x.
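The encoder, decoder, and loss defined above can be sketched in a few lines. This is a minimal illustration that assumes sigmoid for both σ and σ′, toy sizes d = 6 and p = 2, and small random untrained weights; the helper names (`affine`, `encode`, `decode`, `reconstruction_loss`) are hypothetical and are not taken from our implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affine(W, v, b):
    """Return W v + b, with W stored as a list of rows."""
    return [sum(w * vj for w, vj in zip(row, v)) + bi for row, bi in zip(W, b)]


def encode(x, W, b):
    """h = sigma(W x + b)"""
    return [sigmoid(z) for z in affine(W, x, b)]


def decode(h, W_dec, b_dec):
    """x' = sigma'(W' h + b'), with parameters independent of the encoder's."""
    return [sigmoid(z) for z in affine(W_dec, h, b_dec)]


def reconstruction_loss(x, x_rec):
    """L(x, x') = ||x - x'||^2"""
    return sum((a - c) ** 2 for a, c in zip(x, x_rec))


def random_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
d, p = 6, 2  # toy input and latent sizes (assumed for illustration)
W, b = random_matrix(p, d, rng), [0.0] * p
W_dec, b_dec = random_matrix(d, p, rng), [0.0] * d

x = [0.0, 1.0, 1.0, 0.0, 0.5, 0.25]
h = encode(x, W, b)              # latent variables, length p
x_rec = decode(h, W_dec, b_dec)  # reconstruction, length d
loss = reconstruction_loss(x, x_rec)
```

Training would then adjust W, b, W′, and b′ by backpropagation to reduce this loss averaged over the training set; that step is omitted here.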

In conclusion, autoencoders can be seen as neural networks that reconstruct inputs instead of predicting them. In this paper, we will use them to reconstruct our dataset inputs.

### 4 System Model

This section presents the selection of autoencoder model, activation function, and tuning parameters.

#### 4.1 Creating Autoencoder Model

In this paper, we select the MNIST dataset so that changes can be observed easily. Therefore, the layer sizes in the autoencoder model are selected as 28 and its multiples to match the MNIST dataset, which represents each digit as a 28 by 28 matrix. Figure 2 presents the structure of the matrices. The MNIST data modified by the autoencoder is presented in Figure 3. In the training of the model, the encoded data is used instead of the MNIST dataset directly. As the training method, multi-class logistic regression is selected, and attacks are applied to this model. We train the autoencoder for 35 epochs. Figure 4 provides the process diagram.

#### 4.2 Activation Function Selection

In machine learning and deep learning algorithms, the activation function is used for the computations between hidden and output layers [39]. We compare the loss values obtained with different activation functions. Figure 5 shows the comparison of loss values. Sigmoid and ReLU give the best results among the compared functions. Sigmoid has a higher loss than ReLU at lower epochs, but it reaches better final results. We therefore look for the best combination of activation functions across the layers. The model with the lowest loss uses the ReLU function in the encoding part and the exponential and softplus functions, respectively, in the decoding part. These functions are used in our study. Figure 6 illustrates the result of the loss function, and Figure 2 presents the structure of the model with the activation functions.
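For reference, the activation functions compared in this section can be written as scalar functions. The following is a small sketch using only the Python standard library; the numerically stable forms of sigmoid and softplus are our choice for illustration and are not necessarily how the deep learning library implements them.

```python
import math


def relu(z):
    """max(0, z)"""
    return max(0.0, z)


def sigmoid(z):
    """1 / (1 + e^{-z}), written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def exponential(z):
    """e^z"""
    return math.exp(z)


def softplus(z):
    """log(1 + e^z), computed as max(z, 0) + log1p(e^{-|z|}) for stability."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))
```

Note that the exponential and softplus functions are strictly positive, and ReLU is non-negative, so all three produce outputs compatible with non-negative pixel intensities.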

#### 4.3 Tuning Parameters

The tuning parameters for autoencoders depend on the dataset we use and what we try to apply. As previously mentioned, the ReLU and sigmoid functions are selected as activation functions for our model [37, 39]. ReLU is the activation function in the encoding part of the autoencoder, while the exponential function is used in the last hidden layer and softplus in the output layer, which yields the minimal loss. Figure 2 presents the input size as 784, because the MNIST dataset contains 28x28 pixel images [40]. The encoding part of our autoencoder has size 784 × 504 × 28, and the decoding part has size 28 × 504 × 784.
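The layer sizes above follow directly from the image side length of 28. As a short worked example, the sketch below derives them and counts the trainable parameters, assuming fully connected layers with one bias per unit (an assumption for illustration; the function name `dense_param_count` is hypothetical).

```python
side = 28  # MNIST images are 28 x 28 pixels

# 784 = 28 * 28 (input), 504 = 18 * 28 (hidden), 28 (latent code)
layer_sizes = [side * side, 18 * side, side, 18 * side, side * side]


def dense_param_count(sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


total_params = dense_param_count(layer_sizes)   # 820316 under these assumptions
compression = layer_sizes[0] // layer_sizes[2]  # 784 inputs per 28 latent values = 28
```

Under these assumptions, each 784-pixel image is compressed by a factor of 28 at the bottleneck before being reconstructed.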

This structure follows neural network designs that start from the square of the matrix dimension (784), reduce it (504), and end at the dimension itself (28). The last hidden layer of the decoding part, with a size of 504, uses the exponential activation function, and the output layer, with a size of 784, uses the softplus activation function [41, 42]. We use the Adam optimizer with categorical cross-entropy loss [43, 44]. We observe that a small number of epochs is enough for training, so we select 35 as the epoch number for the autoencoder. This is the best epoch value for obtaining meaningful accuracy results for the models both with and without the autoencoder. At lower values, the accuracy scores of the models are too low for us to see the difference between them, even though some models are structurally stronger than others.

### 5 Experiments with MNIST Dataset

#### 5.1 Introduction

We examine the robustness of the autoencoder for adversarial machine learning with different machine learning algorithms and models, to see whether autoencoding can be a generalized and easy-to-use defense mechanism against most adversarial attacks. We test both linear machine learning models and neural network models against adversarial attacks.

#### 5.2 Autoencoding

In this section, we look at the robustness provided with auto-encoding. We select a linear model and a neural network model to demonstrate this effectiveness. In these models, we also observe the robustness of different attack methods. We also use the MNIST dataset for these examples.

##### 5.2.1 Multi-Class Logistic Regression

In linear machine learning model algorithms, we mainly use two attack methods: non-targeted and targeted attacks. The non-targeted attack is not concerned with how the machine learning model makes its predictions; it simply tries to force the model into misprediction. Targeted attacks, on the other hand, focus on leading correct predictions into specific mispredictions. We have three methods for targeted attacks: natural targets, non-natural targets, and one selected target. Firstly, natural targets are derived from the most common mispredictions made by the machine learning model. For example, guessing number 5 as 8 and number 7 as 1 are common mispredictions. Natural targets take the non-targeted attack results into account and attack directly toward these most common mispredictions. So, when number 5 is seen, the attack tries to make it guessed as number 8. Secondly, non-natural targeted attacks are the opposite of natural targeted attacks. They take the least common mispredictions made by the machine learning model, using the feedback provided by non-targeted attacks. For example, if number 1 is least mispredicted as 0, the non-natural target for number 1 is 0. In this way, we can see how much the attack affects the machine learning model beyond its common mispredictions. Lastly, the one-target attack focuses on a randomly chosen number. The aim is to make the machine learning model predict the same number for all inputs.

For linear classification, we select multi-class logistic regression to analyze the attacks. Because we do not interact with these linear classification algorithms aside from calling their defined functions from the scikit-learn library, we use a black-box environment for these attacks. In our study, we use the attack method against multi-class classification models developed for NIPS 2017 [34]. An epsilon value determines the severity of the attack; we select 50 in this study to demonstrate the results better. We apply a non-targeted attack to a multi-class logistic regression model that is trained with the MNIST dataset without an autoencoder. The confusion matrix of this attack is presented in Figure 9.

The findings from Figures 9 and 10 show that an autoencoder model provides robustness against non-targeted attacks. The change in accuracy with epsilon is presented in Figure 13. Figure 11 illustrates the perturbation produced by the selected attack with an epsilon value of 50.

We apply a non-targeted attack to the multi-class logistic regression model with and without the autoencoder. Figure 13 shows the difference in the accuracy metric. The detailed graph of the non-targeted attack on the model with the autoencoder is presented in Figure 14. The changes in the MNIST dataset after the autoencoder are shown in Figure 3. The perturbation of the data at an epsilon value of 50 is shown in Figure 12.

The process is presented in Figure 4. In the examples with the autoencoder, data is passed through the autoencoder and then given to the training model, in our current case a multi-class logistic regression classifier. Multi-class logistic regression uses the encoded dataset for training. Figure 10 shows the improvement as a confusion matrix. For the targeted attacks, we select three methods. The first one is natural targets for the MNIST dataset, which are also defined in NIPS 2017 [34]. Natural targets take the non-targeted attack results into account and attack directly toward the most common mispredictions. For example, the natural target for number 3 is 8. We obtain these results when we apply the non-targeted attack. The heat map for these numbers is shown in Figure 15.

The second method of targeted attacks is non-natural targets, which is the opposite of natural targets. We select the least mispredicted numbers as the targets. These numbers are indicated in the heat map in Figure 15. The third method is to select one number and make the model predict it for all inputs. We randomly choose 7 as that target number. Targets for these methods are presented in Figure 16. The confusion matrices for these methods are presented below.
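The three target-selection rules can be illustrated with a small sketch. It assumes a confusion matrix whose rows are true classes and whose columns are predicted classes, obtained after a non-targeted attack; the 3-class matrix below is made-up toy data, and the function names are hypothetical rather than part of the NIPS 2017 code.

```python
def off_diagonal(row, true_cls):
    """(count, predicted_class) pairs for every wrong prediction of one true class."""
    return [(count, pred) for pred, count in enumerate(row) if pred != true_cls]


def natural_targets(confusion):
    """Most common misprediction for each true class."""
    return {t: max(off_diagonal(row, t))[1] for t, row in enumerate(confusion)}


def non_natural_targets(confusion):
    """Least common misprediction for each true class."""
    return {t: min(off_diagonal(row, t))[1] for t, row in enumerate(confusion)}


def single_target(n_classes, target):
    """The same target class for every true class."""
    return {t: target for t in range(n_classes)}


# Toy confusion matrix (rows: true class, columns: predicted class).
confusion = [
    [50, 3, 7],
    [1, 55, 4],
    [6, 2, 52],
]

natural = natural_targets(confusion)          # {0: 2, 1: 2, 2: 0}
non_natural = non_natural_targets(confusion)  # {0: 1, 1: 0, 2: 1}
one_target = single_target(3, 1)              # {0: 1, 1: 1, 2: 1}
```

With ties, `max` and `min` on the (count, class) pairs fall back to the class index; any other tie-breaking rule would serve equally well for this illustration.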

##### 5.2.2 Neural Networks

We use neural networks with the same principles as multi-class logistic regression and attack the machine learning model. For simplicity, we use the same structure, layers, activation functions, and epochs for these neural networks as in our autoencoder. Although this robustness will also hold for other neural network structures, we do not demonstrate them in this study because structure designs vary among developers. We also compare the results of these attacks on both the original MNIST data and the encoded MNIST data. As attack methods, we select FGSM, T-FGSM, and BIM. The Cleverhans library is used to apply these attack methods to the neural network, which is built with the Keras library.

We examine the differences between the neural network model with the autoencoder and the neural network model that takes data directly from the MNIST dataset, using confusion matrices and classification reports. Firstly, our model without the autoencoder gives the results seen in Figure 25 for the confusion matrix and the classification report. The results with the autoencoder are presented in Figure 26. Note that these confusion matrices and classification reports are obtained before any attack.
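As a reminder of how the reported metrics relate to a confusion matrix, the following sketch computes accuracy and per-class precision and recall. It assumes the scikit-learn convention in which rows are true classes and columns are predicted classes; the 2-class matrix is made-up toy data, not a result from our experiments.

```python
def accuracy(confusion):
    """Fraction of samples on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


def precision_recall(confusion, cls):
    """Precision and recall of one class from column and row sums."""
    true_positive = confusion[cls][cls]
    predicted = sum(row[cls] for row in confusion)  # column sum
    actual = sum(confusion[cls])                    # row sum
    precision = true_positive / predicted if predicted else 0.0
    recall = true_positive / actual if actual else 0.0
    return precision, recall


confusion = [
    [8, 2],
    [1, 9],
]

acc = accuracy(confusion)                          # 17 / 20 = 0.85
precision, recall = precision_recall(confusion, 0)  # 8 / 9 and 8 / 10
```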

Fast Gradient Sign Method:

There is a slight difference between the neural network models with and without the autoencoder. We apply the FGSM attack to both models. The method uses the gradients of the loss with respect to the input image to create a new image that maximizes the loss. For these reasons, FGSM causes a wide variety of models to misclassify their input [35].
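For reference, the FGSM perturbation is usually written as follows, where θ denotes the model parameters, x the input image, y its true label, J the training loss, and ε the attack strength. The symbol names are the conventional ones and are introduced here only for illustration.

```latex
x^{adv} = x + \epsilon \cdot \operatorname{sign}\left( \nabla_{x} J(\theta, x, y) \right)
```

Each pixel therefore moves by exactly ε in the direction that increases the loss.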

As we expect from the multi-class logistic regression results, the autoencoder gives robustness to the neural network model too. After the FGSM attack, the neural network without an autoencoder suffers an immense drop in its accuracy, so FGSM works as intended. But the neural network model with the autoencoder suffers only a 0.01 percent accuracy drop.

Targeted Fast Gradient Sign Method:

T-FGSM is a directed variant of FGSM. It uses the same principle, but the gradient step is computed with respect to a chosen target class, so that different inputs receive the same misprediction.
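In the usual notation, with θ the model parameters, x the input, J the loss, ε the attack strength, and y_target the class the attacker wants to be predicted (symbol names introduced here for illustration), the targeted step can be written as:

```latex
x^{adv} = x - \epsilon \cdot \operatorname{sign}\left( \nabla_{x} J(\theta, x, y_{target}) \right)
```

The minus sign means the step decreases the loss with respect to the target label, pushing the prediction toward that class.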

In the confusion matrix, the target value for this attack is number 5. The neural network model with the autoencoder still has an accuracy of 0.98. The individual differences can be seen by comparing with Figure 26.

Basic Iterative Method:

BIM is an extension of FGSM that applies it multiple times. The gradient of the attack is recalculated at each iteration.
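The iteration can be sketched on a tiny softmax (multi-class logistic) model, for which the gradient of the cross-entropy loss with respect to the input is available in closed form. This is an illustrative sketch only and is not the Cleverhans implementation: the weights, the input, the pixel range [0, 1], and the values of `eps`, `alpha`, and `steps` are all assumptions. After every signed gradient step, the result is clipped back into the ε-neighbourhood of the original input.

```python
import math


def logits(W, b, x):
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def predict(W, b, x):
    z = logits(W, b, x)
    return z.index(max(z))


def input_gradient(W, b, x, y):
    """Gradient of the cross-entropy loss w.r.t. x: W^T (p - onehot(y))."""
    p = softmax(logits(W, b, x))
    err = [pk - (1.0 if k == y else 0.0) for k, pk in enumerate(p)]
    return [sum(err[k] * W[k][j] for k in range(len(W))) for j in range(len(x))]


def sign(v):
    return (v > 0) - (v < 0)


def bim(W, b, x, y, eps, alpha, steps, lo=0.0, hi=1.0):
    """Repeat a signed gradient step of size alpha, recomputing the gradient
    each time and clipping to [x - eps, x + eps] and to the pixel range."""
    x_adv = list(x)
    for _ in range(steps):
        grad = input_gradient(W, b, x_adv, y)
        stepped = [xa + alpha * sign(g) for xa, g in zip(x_adv, grad)]
        x_adv = [min(max(s, x0 - eps, lo), x0 + eps, hi) for s, x0 in zip(stepped, x)]
    return x_adv


# Toy two-class model and input (assumed values, not trained on MNIST).
W = [[2.0, -1.0, 0.5], [-2.0, 1.0, -0.5]]
b = [0.0, 0.0]
x, y = [0.8, 0.2, 0.6], 0

x_adv = bim(W, b, x, y, eps=0.5, alpha=0.1, steps=10)
clean_pred = predict(W, b, x)    # class 0, the true label
adv_pred = predict(W, b, x_adv)  # class 1 after the attack
```

Calling `bim` with `steps=1` and `alpha=eps` reduces to a single FGSM step on the same model.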

This is the most damaging attack for the neural network model that takes its inputs directly from the MNIST dataset without an autoencoder. The findings from Figure 31 show that its accuracy drops to between 0.01 and 0.02. The accuracy of the neural network model with the autoencoder stays at 0.97, losing only 0.1 percent.

Findings indicate that autoencoding the dataset before giving it as input to linear models and neural network models improves robustness against adversarial attacks significantly. So far we have used vanilla autoencoders, which are the basic autoencoders without modification. In the following sections, we apply the same attacks to the same machine learning models with different autoencoder types.
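To make the placement of the filter concrete, the sketch below composes a reconstruction step with a classifier. Both components are toy stand-ins and not the trained models of this study: the "autoencoder" is a hypothetical linear projection onto a 32-dimensional subspace, and the classifier is a random linear scorer.

```python
import numpy as np

def filtered_predict(x, reconstruct, classify):
    """Pass the input through the autoencoder before the classifier sees it."""
    return classify(reconstruct(x))

rng = np.random.default_rng(0)

# Toy stand-in for a trained autoencoder: projection onto a 32-dim subspace.
U, _ = np.linalg.qr(rng.normal(size=(784, 32)))   # orthonormal columns
reconstruct = lambda v: U @ (U.T @ v)

# Toy stand-in for a trained classifier: linear scores, argmax label.
W = rng.normal(size=(10, 784))
classify = lambda v: int(np.argmax(W @ v))

x = U @ rng.normal(size=32)                # an input the toy autoencoder can represent
noise = rng.normal(scale=0.5, size=784)
delta = noise - U @ (U.T @ noise)          # perturbation outside the learned subspace

clean_label = filtered_predict(x, reconstruct, classify)
attacked_label = filtered_predict(x + delta, reconstruct, classify)
```

In this idealized case the perturbation lies entirely outside what the toy autoencoder can reconstruct, so it is removed before classification. Real adversarial perturbations are not guaranteed to have this property; the example only shows where the filter sits in the pipeline.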

#### 5.3 Sparse Autoencoder

Sparse autoencoders present improved performance on classification tasks. A sparse autoencoder may include more hidden units than input units. The significant part is allowing only a small number of hidden units to be active at once to encourage sparsity. This constraint forces the model to respond to the unique statistical features of the input data.

Because of this, sparse autoencoders include a sparsity penalty Ω(h) in their training objective:

$$L(x, x') + \Omega(h)$$

This penalty makes the model activate specific areas of the network depending on the input data while keeping all other neurons inactive. We can create this sparsity with relative entropy, also known as Kullback-Leibler divergence.

$$\hat{\rho}_{j} = \frac{1}{m}\sum_{i=1}^{m}\left[h_{j}(x_{i})\right]$$

Here $\hat{\rho}_{j}$ is the average activation of hidden unit $j$, averaged over $m$ training examples. To increase sparsity, that is, to make the number of active neurons as small as possible, we want $\rho$ close to zero. The sparsity penalty term $\Omega(h)$ penalizes $\hat{\rho}_{j}$ for deviating from $\rho$ by exploiting the Kullback-Leibler divergence. $KL(\rho \,\|\, \hat{\rho}_{j})$ is the Kullback-Leibler divergence between a Bernoulli random variable with mean $\rho$ and a Bernoulli random variable with mean $\hat{\rho}_{j}$.

$$\sum_{j=1}^{s} KL(\rho \,\|\, \hat{\rho}_{j}) = \sum_{j=1}^{s}\left[\rho \log\frac{\rho}{\hat{\rho}_{j}} + (1 - \rho)\log\frac{1 - \rho}{1 - \hat{\rho}_{j}}\right]$$
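The Kullback-Leibler sparsity penalty can be computed directly from a batch of hidden activations. The sketch below assumes activations in (0, 1), for example from a sigmoid layer, and uses a small clipping constant to avoid taking the logarithm of zero; both are choices made for this example.

```python
import numpy as np

def kl_sparsity_penalty(hidden, rho):
    """Sum over hidden units of KL(rho || rho_hat_j) in its Bernoulli form."""
    rho_hat = hidden.mean(axis=0)                  # average activation per hidden unit
    rho_hat = np.clip(rho_hat, 1e-8, 1 - 1e-8)     # avoid log(0)
    return np.sum(rho * np.log(rho / rho_hat)
                  + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))

# Rows are m = 2 training examples, columns are s = 2 hidden units.
hidden = np.array([[0.05, 0.9],
                   [0.05, 0.7]])
penalty = kl_sparsity_penalty(hidden, rho=0.05)
```

The first unit has an average activation equal to the target `rho` and contributes nothing, while the second unit is active on average and accounts for the whole penalty.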

Sparsity can also be achieved in other ways, such as applying L1 or L2 regularization terms to the activations of the hidden layer. With an L1 term, the objective is given below, where $L$ is our loss function and $\lambda$ is our scale parameter.

$$L(x, x') + \lambda\sum_{i}|h_{i}|$$
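The L1 form of the objective is short enough to write out in full. In this sketch the reconstruction loss is taken to be the mean squared error, which is an assumption for the example, and `lam` plays the role of the scale parameter λ.

```python
import numpy as np

def l1_sparse_loss(x, x_rec, hidden, lam):
    """Reconstruction loss plus an L1 penalty on the hidden activations."""
    reconstruction = np.mean((x - x_rec) ** 2)     # assumed choice for L(x, x')
    return reconstruction + lam * np.sum(np.abs(hidden))

x = np.array([0.0, 1.0, 0.5])                 # input
x_rec = np.array([0.1, 0.9, 0.5])             # reconstruction
hidden = np.array([0.0, 0.0, 2.0, -0.5])      # hidden activations, mostly inactive
loss = l1_sparse_loss(x, x_rec, hidden, lam=0.01)
```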

##### 5.3.1 Multi-Class Logistic Regression of Sparse Autoencoder

This section presents multi-class logistic regression with sparse autoencoders. The only difference from the vanilla autoencoder section is the autoencoder type. The findings from Figure 6 and Figure 33 show that the loss of the sparse autoencoder is higher than that of the vanilla autoencoder.

The perturbations are presented in Figure 35 and Figure 36 and can be compared with the perturbations in Figure 11 and Figure 12. The perturbation is sharper with the sparse autoencoder.

Figure 37 indicates that sparse autoencoders perform poorly compared to vanilla autoencoders in multi-class logistic regression.

##### 5.3.2 Neural Network of Sparse Autoencoder

The results for neural networks indicate that the vanilla autoencoder is slightly better than the sparse autoencoder. Sparse autoencoders also do not perform as well with linear machine learning models, in our case multi-class logistic regression.

#### 5.4 Denoising Autoencoder

Denoising autoencoders take a partially corrupted input and are trained to recover the original undistorted input. In this study, the corrupted input is not used. The aim is to achieve a good design by changing the reconstruction principle. To denoise properly, the model needs to extract features that capture useful structure in the distribution of the input. Denoising autoencoders produce the corrupted data through a stochastic mapping. Our input is $x$, the corrupted data is $\tilde{x}$, and the stochastic mapping is $\tilde{x} \sim q_{D}(\tilde{x} \mid x)$.

As in a standard autoencoder, the corrupted data is mapped to a hidden layer:

$$h = f_{\theta}(\tilde{x}) = s(W\tilde{x} + b).$$

From this hidden representation, the model reconstructs $z = g_{\theta'}(h)$.
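The general mechanism of corruption, encoding, and reconstruction can be sketched as a single forward pass. The masking noise used for the stochastic mapping, the sigmoid activation, the layer sizes, and the squared-error loss are all assumptions made for illustration and do not describe the configuration trained in this study.

```python
import numpy as np

def corrupt(x, drop_prob, rng):
    """Masking noise as one possible q_D: zero each entry with probability drop_prob."""
    mask = rng.uniform(size=x.shape) >= drop_prob
    return x * mask

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def denoising_forward(x, W, b, W_out, b_out, drop_prob, rng):
    x_tilde = corrupt(x, drop_prob, rng)      # corrupted input
    h = sigmoid(W @ x_tilde + b)              # hidden code h = s(W x_tilde + b)
    z = sigmoid(W_out @ h + b_out)            # reconstruction z = g(h)
    loss = np.mean((x - z) ** 2)              # compared against the clean input x
    return x_tilde, h, z, loss

rng = np.random.default_rng(0)
x = rng.uniform(size=784)                     # stand-in for a flattened image
W = rng.normal(scale=0.05, size=(64, 784))    # hypothetical encoder weights
b = np.zeros(64)
W_out = rng.normal(scale=0.05, size=(784, 64))  # hypothetical decoder weights
b_out = np.zeros(784)
x_tilde, h, z, loss = denoising_forward(x, W, b, W_out, b_out, 0.3, rng)
```

The reconstruction is scored against the clean input rather than the corrupted one, which is what distinguishes the denoising objective from the standard one.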

##### 5.4.1 Multi-Class Logistic Regression of Denoising Autoencoder

With the denoising autoencoder for multi-class logistic regression, the loss does not improve in every epoch. Although it starts better at lower epoch values, in the end the vanilla autoencoder seems to be better. The sparse autoencoder’s loss is slightly worse.

Just like the sparse autoencoder, the denoising autoencoder also applies a sharp perturbation, which is presented in Figure 48 and Figure 49.

We observe that the accuracy results for the denoising autoencoder with multi-class logistic regression are similar to the sparse autoencoder results. Natural fooling accuracy drops drastically with the denoising autoencoder, but the non-targeted and one targeted attack results are somewhat like those of the sparse autoencoder, with the one targeted attack having lower accuracy with the denoising autoencoder.

##### 5.4.2 Neural Network of Denoising Autoencoder

We observe that the neural network accuracy with the denoising autoencoder is worse than with the sparse autoencoder and the vanilla autoencoder. It is still a useful autoencoder for denoising corrupted data and other purposes; however, it is not the right choice just for robustness against adversarial examples.

#### 5.5 Variational Autoencoder

In this study, we examine variational autoencoders as the final autoencoder type. Variational autoencoders have an encoder and a decoder, although their mathematical formulation differs significantly. They are associated with Generative Adversarial Networks due to their architectural similarity. In summary, variational autoencoders are also generative models. Unlike sparse autoencoders, denoising autoencoders, and vanilla autoencoders, all of which aim at discriminative modeling, generative modeling tries to simulate how the data can be generated and to understand the underlying causal relations. It also considers these causal relations when generating new data.

Variational autoencoders use an estimator algorithm called Stochastic Gradient Variational Bayes for training. This algorithm assumes that the data is generated by a directed graphical model $p_{\theta}(x \mid h)$, where $\theta$ denotes the parameters of the decoder, which in the variational autoencoder’s case are the parameters of the generative model. The encoder learns an approximation $q_{\phi}(h \mid x)$ to the posterior distribution $p_{\theta}(h \mid x)$, where $\phi$ denotes the parameters of the encoder, which in the variational autoencoder’s case are the parameters of the recognition model. We use the Kullback-Leibler divergence again, denoted $D_{KL}$.

$$L(\phi, \theta, x) = D_{KL}(q_{\phi}(h \mid x) \,\|\, p_{\theta}(h)) - E_{q_{\phi}(h \mid x)}(\log p_{\theta}(x \mid h)).$$

The variational and likelihood distributions are chosen to be factorized Gaussians. The encoder outputs are $p(x)$ and $w^{2}(x)$, and the decoder outputs are $\mu(h)$ and $\sigma^{2}(h)$. The two distributions are defined below.

$$q_{\phi}(h \mid x) = N(p(x), w^{2}(x)I)$$

$$p_{\theta}(x \mid h) = N(\mu(h), \sigma^{2}(h)I)$$
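With factorized Gaussians, the variational objective can be evaluated numerically. The sketch below assumes a standard normal prior for $p_{\theta}(h)$, which gives the divergence term a closed form, and estimates the expectation with a single sample; the prior, the single-sample estimate, and the toy linear decoder are assumptions of this example. The arguments `enc_mean` and `enc_var` stand for the two encoder outputs.

```python
import numpy as np

def vae_loss(x, enc_mean, enc_var, decode, rng):
    """KL(q(h|x) || N(0, I)) minus a one-sample estimate of E_q[log p(x|h)]."""
    noise = rng.standard_normal(enc_mean.shape)
    h = enc_mean + np.sqrt(enc_var) * noise        # reparameterized sample from q(h|x)
    dec_mean, dec_var = decode(h)                  # decoder outputs mu(h), sigma^2(h)
    # Closed-form KL between N(enc_mean, enc_var I) and the assumed N(0, I) prior.
    kl = 0.5 * np.sum(enc_mean ** 2 + enc_var - 1.0 - np.log(enc_var))
    # Negative Gaussian log-likelihood of x under N(dec_mean, dec_var I).
    nll = 0.5 * np.sum(np.log(2 * np.pi * dec_var) + (x - dec_mean) ** 2 / dec_var)
    return kl + nll

rng = np.random.default_rng(0)
x = rng.uniform(size=784)                          # stand-in for a flattened image
A = rng.normal(scale=0.1, size=(784, 16))          # hypothetical linear decoder weights
decode = lambda h: (A @ h, np.ones(784))           # unit variance for simplicity
loss = vae_loss(x, np.zeros(16), np.ones(16), decode, rng)
```

When the encoder output matches the assumed prior, the divergence term vanishes and only the reconstruction term remains.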

##### 5.5.1 Multi-Class Logistic Regression of Variational Autoencoder

The findings from Figure 59 show that the variational autoencoder achieves the best loss function result. However, Figure 60 shows that the accuracy is low, especially at low epsilon values, where even autoencoded data gives worse accuracy than the normal learning process.

The perturbation applied with the variational autoencoder is not as sharp as with the sparse autoencoder and the denoising autoencoder. It seems similar to the vanilla autoencoder’s perturbation.

The variational autoencoder has the worst results. It gives poor results at low values of epsilon, where autoencoded data is less accurate than normal data, and only a slight improvement over normal data at high values of epsilon.

##### 5.5.2 Neural Network of Variational Autoencoder

The variational autoencoder with neural networks also shows the worst results compared to the other autoencoder types. Where the other types reach accuracies between 0.96 and 0.99 for autoencoded data under attack, the variational autoencoder reaches accuracies between 0.65 and 0.70.

### 6 Conclusion

In this paper, we have presented the results of pre-filtering the data with an autoencoder before sending it to the machine learning model as a defense against adversarial machine learning attacks. We have investigated how the classifier accuracy changes for linear and neural network machine learning models. We have applied non-targeted and targeted attacks to multi-class logistic regression, and FGSM, T-FGSM, and BIM attacks to the neural network model. The effects of these attacks with an autoencoder used as a filter have been analyzed for both machine learning models. We have observed that, with the autoencoder, adversarial attacks cause an accuracy drop of only between 0.1 and 0.2 percent, while the models without the autoencoder suffer tremendous accuracy drops, reaching accuracy scores between 0.6 and 0.3 and in some cases even 0.1. We have proposed a generic and easy-to-implement protection against adversarial attacks on machine learning models. Note that all autoencoders in this study were trained for 35 epochs with a batch size of 1024, so the results may be improved by increasing the number of epochs.

In conclusion, we have shown that autoencoders provide robustness against adversarial machine learning attacks for both linear models and neural network models. We have also examined other types of autoencoders and found that the basic ones, commonly called vanilla autoencoders, give the best results. The second most accurate autoencoder type is the sparse autoencoder, and the third most accurate is the denoising autoencoder, which gives results similar to the sparse autoencoder. We have observed that the worst autoencoder type for this process is the variational autoencoder, because variational autoencoders are generative models used in different areas.

In summary, the simple practice of placing an autoencoder between the data and the machine learning model can provide considerable defense and robustness against attacks. These autoencoders can be easily implemented with libraries such as TensorFlow and Keras. The results of this study suggest that autoencoders can be added to any machine learning model easily because they are implemented as a separate layer.

### References

[1] O. Vinyals, I. Babuschkin, W. M. Czarnecki, M. Mathieu, A. Dudzik, J. Chung, D. H. Choi, R. Powell, T. Ewalds, P. Georgiev, et al., “Grandmaster level in starcraft ii using multi-agent reinforcement learning,” Nature, vol. 575, no. 7782, pp. 350–354, 2019.

[2] F. S. Board, “Artificial intelligence and machine learning in financial services: Market developments and financial stability implications,” Financial Stability Board, p. 45, 2017.

[3] H. Z. S. S. T. K. J. Saito, “Mode-adaptive neural networks for quadruped motion control,” ACM Trans. Graph., vol. 37, pp. 145:1–145:11, July 2018.

[4] T.-C. Wang, M.-Y. Liu, A. Tao, G. Liu, J. Kautz, and B. Catanzaro, “Few-shot video-to-video synthesis,” arXiv preprint arXiv:1910.12713, 2019.

[5] M. Bakator and D. Radosav, “Deep learning and medical diagnosis: A review of literature,” Multimodal Technologies and Interaction, vol. 2, p. 47, 2018.

[6] B. Baker, I. Kanitscheider, T. Markov, Y. Wu, G. Powell, B. McGrew, and I. Mordatch, “Emergent tool use from multi-agent autocurricula,” arXiv preprint arXiv:1909.07528, 2019.

[7] A. Siddiqi, “Adversarial security attacks and perturbations on machine learning and deep learning methods,” CoRR, July 2019.

[8] K. A. R. T. Kolagari and M. Zoppelt, “Attacks on machine learning: Lurking danger for accountability,” CoRR, Jan. 2019.

[9] M. Jagielski, A. Oprea, B. Biggio, C. Liu, C. Nita-Rotaru, and B. Li, “Manipulating machine learning: Poisoning attacks and countermeasures for regression learning,” in 2018 IEEE Symposium on Security and Privacy (SP), pp. 19–35, 2018.

[10] S. P. H. R. S. L. S. Lee and J. Lee, “Learning predict-and-simulate policies from unorganized human motion data,” ACM Transactions on Graphics, vol. 38, pp. 1–11, Nov. 2019.

[11] L. Y. Z. S. Y. Zheng and K. Zhou, “Dynamic hair modeling from monocular videos using deep neural networks,” ACM Trans. Graph., vol. 38, pp. 235:1–235:12, Nov. 2019.

[12] N. Carlini and D. A. Wagner, “Towards evaluating the robustness of neural networks,” CoRR, vol. abs/1608.04644, 2016.

[13] A. A. N. Carlini and D. A. Wagner, “Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples,” CoRR, vol. abs/1802.00420, Feb. 2018.

[14] A. M. A. M. L. S. D. Tsipras and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” CoRR, vol. abs/1706.06083, 2017.

[15] A. E. R. T. S. G. M. P. M. C. S. Z. N. O. Tippenhauer, “Real-time evasion attacks with physical constraints on deep learning-based anomaly detectors in industrial control systems,” CoRR, vol. abs/1907.07487, July 2019.

[16] M. J. B. G. A. N. Asokan, “Making targeted black-box evasion attacks effective and efficient,” CoRR, vol. abs/1906.03397, 2019.

[17] A. C. A. O. C. Nita-Rotaru and B. Kim, “Are self-driving cars secure? evasion attacks against deep neural networks for steering angle prediction,” CoRR, vol. abs/1904.07370, Apr. 2019.

[18] L. H. A. D. J. B. N. B. Rubinstein and J. D. Tygar, “Adversarial machine learning,” in Proceedings of the 4th ACM Workshop on Security and Artificial Intelligence, AISec ’11, (New York, NY, USA), pp. 43–58, ACM, Oct. 2011.

[19] X. Y. P. H. Q. Z. R. R. Bhat and X. Li, “Adversarial examples: Attacks and defenses for deep learning,” CoRR, vol. abs/1712.07107, July 2017.

[20] S. Sivasloglu, F. O. Catak, and E. Gul, “Incrementing adversarial robustness with autoencoding for machine learning model attacks,” in 2019 27th Signal Processing and Communications Applications Conference (SIU), pp. 1–4, 2019.

[21] M. Aladag, F. O. Catak, and E. Gul, “Preventing data poisoning attacks by using generative models,” in 2019 1st International Informatics and Software Engineering Conference (UBMYK), pp. 1–5, 2019.

[22] J. G. Y. Z. X. H. Y. Jiang and J. Sun, “Rnn-test: Adversarial testing framework for recurrent neural network systems,” CoRR, Nov. 2019.

[23] M. Isakov, V. Gadepally, K. M. Gettings, and M. A. Kinsy, “Survey of attacks and defenses on edge-deployed neural networks,” in 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–8, 2019.

[24] B. Li and Y. Vorobeychik, “Evasion-robust classification on binary domains,” ACM Trans. Knowl. Discov. Data, vol. 12, no. 4, pp. 50:1–50:32, 2018.

[25] K. Y. X. V. T. N. M. Shafiullah and A. Madry, “Training for faster adversarial robustness verification via inducing relu stability,” CoRR, vol. abs/1809.03008, Sept. 2018.

[26] F. Y. C. L. Y. W. L. Zhao and X. Chen, “Interpreting adversarial robustness: A view from decision surface in input space,” CoRR, vol. abs/1810.00144, Sept. 2018.

[27] L. P. J. D. R. Sukthankar and A. Gupta, “Robust adversarial reinforcement learning,” CoRR, vol. abs/1703.02702, Mar. 2017.

[28] S. Harding, P. Rajivan, B. I. Bertenthal, and C. Gonzalez, “Human decisions on targeted and non-targeted adversarial sample.,” in CogSci, 2018.

[29] W. Bai, C. Quan, and Z. Luo, “Alleviating adversarial attacks via convolutional autoencoder,” in 2017 18th IEEE/ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing (SNPD), pp. 53–58, IEEE, 2017.

[30] R. S. R. Mahfuz and A. E. Gamal, “Combatting adversarial attacks through denoising and dimensionality reduction: A cascaded autoencoder approach,” CoRR, vol. abs/1812.03087, Dec. 2018.

[31] I. Chen and B. Sirkeci-Mergen, “A comparative study of autoencoders against adversarial attacks,” Int’l Conf. IP, Comp. Vision, and Pattern Recognition, 2018.

[32] G. G. P. Tabacof and E. Valle, “Adversarial attacks on variational autoencoders,” CoRR, vol. abs/1806.04646, 2018.

[33] S. Y. J. C. Príncipe, “Understanding autoencoders with information theoretic concepts,” CoRR, vol. abs/1804.00057, Mar. 2018.

[34] A. K. I. J. G. S. B. Y. D. F. L. M. L. T. P. J. Z. X. H. C. X. J. W. Z. Z. Z. R. A. L. Y. S. H. Y. Z. Y. Z. Z. H. J. L. Y. B. T. A. S. Tokui and M. Abe, “Adversarial attacks and defences competition,” CoRR, vol. abs/1804.00097, Mar. 2018.

[35] I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” arXiv preprint arXiv:1412.6572, 2014.

[36] J. Han and C. Moraga, “The influence of the sigmoid function parameters on the speed of backpropagation learning,” in International Workshop on Artificial Neural Networks, pp. 195–201, Springer, 1995.

[37] A. F. Agarap, “Deep learning using rectified linear units (relu),” CoRR, vol. abs/1803.08375, 2018.

[38] J. Schmidhuber, “Deep learning in neural networks: An overview,” CoRR, vol. abs/1404.7828, Apr. 2014.

[39] C. N. W. I. A. Gachagan and S. Marshall, “Activation functions: Comparison of trends in practice and research for deep learning,” CoRR, vol. abs/1811.03378, Nov. 2018.

[40] F. C. N. C. H. Mao and H. Hu, “Assessing four neural networks on handwritten digit recognition dataset (MNIST),” CoRR, vol. abs/1811.08278, Nov. 2018.

[41] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (elus),” arXiv preprint arXiv:1511.07289, 2015.

[42] H. Zheng, Z. Yang, W. Liu, J. Liang, and Y. Li, “Improving deep neural networks using softplus units,” in 2015 International Joint Conference on Neural Networks (IJCNN), pp. 1–4, IEEE, 2015.

[43] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” International Conference on Learning Representations, Dec. 2014.

[44] Z. Zhang and M. R. Sabuncu, “Generalized cross entropy loss for training deep neural networks with noisy labels,” CoRR, vol. abs/1805.07836, May 2018.