Classification with Boosting of Extreme Learning Machine Over Arbitrarily Partitioned Data

abstract: | Machine learning based computational intelligence methods are widely used to analyze large scale data sets in this age of big data. Extracting useful predictive modeling from these types of data sets is a challenging problem due to their high complexity. Analyzing large amount of streaming data that can be leveraged to derive business value is another complex problem to solve. With high levels of data availability (i.e. Big Data) automatic classification of them has become an important and complex task. Hence, we explore the power of applying MapReduce based Distributed AdaBoosting of Extreme Learning Machine (ELM) to build a predictive bag of classification models. Accordingly, (i) data set ensembles are created; (ii) ELM algorithm is used to build weak learners (classifier functions); and (iii) builds a strong learner from a set of weak learners. We applied this training model to the benchmark knowledge discovery and data mining data sets.

Introduction

It is clear that there has been an unexpected increase in the quantity and variety of data generated worldwide by computers, mobile phones, and sensors. Just as computer technology evolved, the quantity and variety of data has also increased, becoming more focused on storing every type of data, the so-called Big Data. As the volume of data to build a predictive model increases, the complexity of that training increases too. As a result, building actionable predictive modeling of a large scale unstructured data set is a definitive Big Data problem. Predictive learning models try to discover patterns of training data and label new data instances to the correct output value. To efficiently handle unstructured large scale big data sets, it is critical to develop new machine learning methods that combine several boosting and classification algorithms.

Extreme Learning Machine (ELM) was proposed by (G.-B. Huang, Zhu, and Siew 2006) based on generalized Single-hidden Layer Feedforward Networks (SLFNs). Main characteristics of ELM are small training time compared to traditional gradient-based learning methods, high generalization property of predicting unseen examples with multi-class labels and parameter free with randomly generated hidden nodes. ELM algorithm is used in many different areas including document classification (Zhao et al. 2011), bioinformatics (Wang, Zhao, and Wang 2008) multimedia recognition (Zong and Huang 2011; Lan et al. 2013).

In recent years, much computational intelligence research has been devoted to building predictive modeling of distributed and parallel frameworks. In this research, the proposed learning model creates data chunks with varying size and bag of classifier functions using ELM algorithm trained with these arbitrary chosen sub data set with AdaBoosting method for large scale predictions. By creating data chunks from the training data set using the MapReduce paradigm, each subset of the training data set is used to find out the set of ELM ensembles as a single global classifier function.

The main objective of this work is to train large scale data sets using ELM and AdaBoost. Another objective is to achieve the model’s classification performance with same or close to the conventional ELM method. Conventional ELM training cannot be applied to large scale data sets on a single computer because of their complexity. Then experiments section is split into two subsections: \“commonly used data sets\” in Section 5.1.1 and \“large scale data sets\” in Section 5.1.2. Commonly used data sets are suitable for training on a single computer with the conventional ELM algorithm. We trained these data sets both conventional and proposed methods to show the classification performance changes of the proposed method. Classification performance results are shown in Section 5.3.

The contributions of this paper are as follows:

  • A generative MapReduce technique based AdaBoosted ELM classification model is proposed for learning, and thus, faster classification model training is achieved.

  • This research proposes a new learning method for AdaBoosted ELM that achieves parallelization both in large scale data sets and reduced computational time of learning algorithm.

  • Training computations of working nodes are independent from each other thus minimizing the data communication. The other approaches, including Support Vector Machine training need data communication for the support vector exchange. (Lu, Roychowdhury, and Vandenberghe 2008; Sun and Fox 2012; Catak and Balaban 2013).

The rest of the paper is organized as follows: Section 2 briefly introduces some of the earlier works related to our problem. Section 3 describes algorithm ELM, AdaBoost and MapReduce technique. Section 4 and Section 5 evaluates the proposed learning model. Section 6 concludes this paper.

Related work

In this section, we describe the general overview of literature review. Section 2.1 describes the general distributed ELM methods. Section 2.2 shows the MapReduce based ELM training methods.

Literature Review Overview

MapReduce based learning algorithms from distributed data chunks has been studied by many researchers. Many different MapReduce based learning solutions over arbitrary partitioned data have been proposed recently. Some popular MapReduce based solutions to train machine learning algorithms in the literature include the following. Panda et al. proposed a learning tree model which is based on series of distributed computations, and implements each one using the MapReduce model of distributed computation (Panda et al. 2009). Zhang et al. develops some algorithms using MapReduce to perform parallel data joins on large scale data sets (Zhang, Li, and Jestes 2012). Sun et al. use batch updating based hierarchical clustering to reduce computational time and data communication (Sun et al. 2009). Their approach uses co-occurence based feature selection to remove noisy features and decrease the dimension of the feature vectors. He et al. proposed parallel density based clustering algorithm (DBSCAN). They developed a partitioning strategy for large scale non-indexed data with a 4-stages MapReduce paradigm (He et al. 2011). Zhao et al. proposed parallel k-means clustering based on MapReduce (Zhao, Ma, and He 2009). Their approaches focus on implementing k-means with the read-only convergence heuristic in the MapReduce pattern.

MapReduce Based ELM Training Methods {#sec:mr-elm-methods}

Section 2.2.1 - Section 2.2.5 describe five different MapReduce training methods of ELM algorithm.

ELM $\star$

Xin et al. proposed MapReduce based ELM training method called as ELM $^\ast$ (Xin et al. 2014). Main idea behind this method is to calculate matrix multiplication of ELM to find weight vector. They show that Moore-Penrose generalized inverse operator is the most expensive computation part of the algorithm. As we know, matrix multiplication can be divide into smaller part. Using this property, they proposed an efficient implementation of training phase to manage massive data sets. The final output of this method is a single classifier function. In this paper, they proposed two different versions of ELM $^\ast$, naive and improved. In naive-ELM $^\ast$, the algorithm has two classes, Class Mapper and Class Reducer. Both classes contain only one method. In improved ELM $^\ast$, they decompose the calculation of matrix multiplication using MapReduce framework. Moreover, the proposed algorithm decreases the computation and communication cost. In the experimental platform, they used their synthetic data sets to evaluate the performance of the proposed algorithms with MapReduce framework.

OS-ELM based Classification in Hierarchical P2P Network

Sun et al. proposed OS-ELM (Liang et al. 2006) based distributed ensemble classification in P2P networks (Sun, Yuan, and Wang 2011). They apply the incremental learning principle of OS-ELM to hierarchical P2P network. They proposed two different versions of the ensemble classifier in hierarchical P2P, one-by-one ensemble classification and parallel ensemble classification. In one-by-one learning method, each peer, one by one, calculates the classifier with all the data. Therefore, this approach has a large network delay. In the parallel ensemble learning, all the classifiers are learnt from all the data in parallel manner. Conversely to ELM $^\ast$, their experimental results are based on three different real data sets downloaded from the UCI repository.

Parallel online sequential ELM: POS-ELM

Wang et al. have been proposed parallel online sequential extreme learning machine (POS-ELM) method (Wang et al. 2015). Main idea behind in this approach is to analyze the dependency relationships and the matrix calculations of OS-ELM (Liang et al. 2006). Their experimental results are based on nine different real data sets downloaded from the UCI repository.

Distributed and Kernelized ELM: DK-ELM

Bi et al. have been proposed both distributed and kernelized ELM (DK-ELM) based on MapReduce (Bi et al. 2015). The difference between ELM and Kernelized ELM is that K-ELM applies kernels opposite to create random feature mappings. They provide a distributed implementation RBF kernel matrix calculation in massive data learning applications. Their experimental results are based on four different real data sets downloaded from the UCI repository and four synthetic data sets.

ELM-MapReduce

Chen et al. have been proposed MapReduce based ELM ensemble classifier called ELM-MapReduce, for large scale land cover classification of remote sensing data (Chen, Zheng, and Chen 2013). Their approach contains two sequential phases: parallel training of multiple ELM classifiers and voting mechanism. In parallel training phase of proposed method, each $Map$ function computes an ELM classifier with a given training data set. In second phase called voting mechanism, a new MapReduce job is executed with a new partitioned test set into each $Map$ function with notation $data_j$. In $Reduce$ function of this phase, each $data_j$ is predicted with each ELM classifier trained in parallel training phase. Final classification predictions are the output of final $Reduce$ function. Therefore, this approach has a high communication cost. Their experimental results are based synthetic remote sensing image of training data.

The Differences Between Proposed Model and Literature Review

The main differences are:

  • In ELM $\star$, they use matrix multiplication decomposition. Each $Map$ function is responsible to calculate the Moore-Penrose generalized inverse operation. And their method produces one single classifier. In the proposed model in our paper, each $Reduce$ function produces ensemble classifier based on AdaBoost method. The final output ensemble classifier is a voting based combination of ensemble classifier trained in each $Reduce$ phase.

  • In OS-ELM based classification in hierarchical P2P Network, POS-ELM and DK-ELM, they propose ensemble classifier that combines multiple classifier trained with data chunks. Each peer classifier is learned from the local data. Therefore, each peer produces a single ELM classifier. In our method, each node (or peer) produces ensemble classifier to increase the classification accuracy.

  • In ELM-MapReduce, they propose ensemble classifier with two different MapReduce jobs. In first MpaReduce job, their approach produces a single ELM classifier in each $Map$ function. In second MapReduce job, the test set is partitioned into each $Map$ function and produces final predicted labels based on the voting mechanism of ELM classifiers that are trained in the first MapReduce job. In our method, prediction is not included, our aim is to create a final ensemble classifier in only one MapReduce job.

Table 1 shows the main differences of all proposed methods. There are five different columns that are ensemble methods, single pass MapReduce, matrix multiplication, entire data set and network communication. Ensemble column shows that the method builds a set of classifier function (i.e. ensemble model) to improve the accuracy performance of the final classification model. If an ensemble method is applied, then the performance of final model will have better accuracy result (Kuncheva and Whitaker 2003). Single Pass MapReduce column shows that an iterative approach is not applied to the model. Entire learning phase is performed in a single pass of data through the job. Matrix Multiplication column shows the hidden layer matrix is calculated in each $Map$ function. The hidden layer matrix computation is a compute intensive operation. Entire Data Set column shows each $Map$ operation needs entire data set to build a final classifier model. Network Communication column shows that each $MapReduce$ job needs to communicate with another job. Network communication will affect negatively on training time of the algorithm.

Table: The Differences Between Proposed Model and Literature Review.

Method Ensemble Single Pass MR Matrix Mult. Entire DS Network Comm.
ELM $\star$ No Yes No Yes No
OS-ELM Yes Yes No No Yes
POS-ELM Yes Yes No Yes No
DK-ELM Yes Yes No Yes No
ELM-MapReduce Yes No No Yes Yes
Proposed Method Yes Yes No No No

Preliminaries

In this section, we introduce preliminary knowledge of ELM, AdaBoost and MapReduce briefly.

Extreme learning machine

ELM was originally proposed for the single-hidden layer feedforward neural networks (G.-b. Huang, Zhu, and Siew 2006; G.-B. Huang, Chen, and Siew 2006; G.-B. Huang, Zhu, and Siew 2006) . Then, ELM was extended to the generalized single-hidden layer feedforward networks where the hidden layer may not be neuron like (Huang and Chen 2007; G.-B. Huang and Chen 2008). The main advantages of the ELM classification algorithm are that ELM can be trained hundred times faster than traditional neural network or support vector machine algorithm since its input weights and hidden node biases are randomly created and output layer weights can be analytically calculated by using a least-squares method (Tang et al. 2015; G.-B. Huang et al. 2008). The most noticeable feature of ELM is that its hidden layer parameters are selected randomly.

Given a set of training data $\mathcal{D}={(\mathbf{x}_i, y_i)\mid i=1,…,n},\mathbf{x}_i \in \mathbb{R}^p, y_i \in {1, 2,…,K}$ sampled independently and identically distributed (i.i.d.) from some unknown distribution. The goal of a neural network is to learn a function $f:\mathcal{X} \rightarrow \mathcal{Y}$ where $\mathcal{X}$ is instance and $\mathcal{Y}$ is the set of all possible labels. The output label of an single hidden-layer feedforward neural networks (SLFNs) with $N$ hidden nodes can be described as equation where $\mathbf{a}_i$ and $b_i$ are the learning parameters of hidden nodes and $\beta_i$ is the weight connecting the $i$th hidden node to the output node.

The output function of ELM for generalized SLFNs can be identified by equation

For the binary classification applications, the decision function of ELM becomes equation

Equation [eq:slfnsgen]can be written in another form as $$\label{eq:elm} H\beta=T$$ $H$ and $T$ are respectively hidden layer matrix and output matrix. Hidden layer matrix can be described as

equation

where $\tilde{a}=a_1,…,a_L$, $\tilde{b}=b_1,…,b_L$, $\tilde{x}=x_1,…,x_N$. Output matrix can be described as $T= \begin{bmatrix} t_1 \cdots t_N \end{bmatrix}^T$. The hidden nodes of SLFNs can be randomly generated. They can be independent of the training data.

AdaBoost

The AdaBoost (Freund and Schapire 1995) is a supervised learning algorithm designed to solve classification problems (Freund, Schapire, and Abe 1999). The algorithm takes as input a training set $(\mathbf{x}_1, y_1),…,(\mathbf{x}_n, y_n)$ where the input sample $\mathbf{x}_i \in \mathbb{R}^p$, and the output value, $y_i$, in a finite space $y\in {1,…K}$. AdaBoost algorithm assumes, like ELM, a set of training data sampled independently and identically distributed (i.i.d.) from some unknown distribution $\mathcal{X}$.

Given a space of feature vectors $X$ and two possible class labels, $y \in {-1,+1}$, AdaBoost goal is to learn a strong classifier $H(\mathbf{x})$ as a weighted ensemble of weak classifiers $\mathbf{h}$ predicting the label of any instance $\mathbf{x} \in X$ (Landesa-Vázquez and Alba-Castro 2013). $$\label{eq:adaboost} H(\mathbf{x}) = sign(f(\mathbf{x}))=sign\left(\sum_{t=1}^{T}\alpha_t h_t(\mathbf{x}) \right)$$ Pseudocode for AdaBoost is given in Alg. [alg:adaboost].

Algorithm-1

MapReduce

MapReduce is a new programming model to run parallel applications for large scale data sets processing to support data-intensive applications. It is derived from the map and reduce function combination from functional programming. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. The MapReduce was originally developed by Google and built on principles in parallel manner (Dean and Ghemawat 2008). The MapReduce framework first takes the input, divides it into smaller data chunks, and distributes them to worker nodes. MapReduce is divided into three major phases called map, reduce and a separated internal shuffle phase. The MapReduce framework automatically executes all those functions in a parallel manner over any number of processors/servers (Schatz 2009).

Pseudo code of MapReduce framework is shown in Eq. [eq:mapreduce]{reference-type=“ref” reference=“eq:mapreduce”}. $$\label{eq:mapreduce} \begin{split} map(key_1,value_1) & \rightarrow list(key_2,value_2)
reduce(key_2,list(value_2)) & \rightarrow list(key_3,value_3) \end{split}$$

Mapreduce programming technique is widely used on different scientific fields, i.e. cyber-security (Choi et al. 2014; Ogiela, Castiglione, and You 2014), high energy physics (Bhimji, Bristow, and Washbrook 2014), biology (Xu et al. 2014).

Proposed Approach

In this section we provide the details of the MapReduce based distributed AdaBoosted ELM algorithm. The basic idea of AdaBoost-ELM based on MapReduce technique is introduced in Section 4.1. The MapReduce implementation of AdaBoosted ELM is described in Section 4.3.

Basic Idea

Our main task is to parallel and distributed execute the computation of AdaBoosted ELM classification method. AdaBoosted ELM’s basic idea is to calculate ensemble of classifier functions over partitioned data $(X_m,Y_m)$ in parallel manner. In Table 2{reference-type=“ref” reference=“tbl:notation”}, a summary of commonly used variables and notations to assess the classifier model performance of the AdaBoosted ELM method is given for convenience.

Table: Commonly used variables and notations.

Variables/Notation Description
$M$ Data chunck split size
$h$ A single classifier function
$X_m$ Data chunck $m$ of input values of $\mathcal{D}$
$Y_m$ Data chunck $m$ of output values of $\mathcal{D}$
$\epsilon$ Error rate
Chunk Number of data chunk
$T$ AdaBoost $T$ size
H. Nodes Number of hidden nodes used in ELM
Acc Accuracy of classifier hypothesis
$k$ Number of classes

:

Analysis of the proposed algorithm

Barlett showed that the size of the weights is more important than the size of the neural network (Bartlett 1998). Kragh et al. also showed that ensemble methods of neural networks get better accuracy performance over unseen examples (Krogh and Vedelsby 1995). The main motivation of the this work is the idea that small size ELM ensembles can obtain more accurate classifier model that are comparable to individual classifiers.

In the proposed model, at every data chunk, there is a set of classifier functions that acts as a single classification model. The single model at every data chunk $m$ is defined as follows: equation The selected ensemble ELM classifier models from the reduce phase of MapReduce algorithm are combined into one single classification model. equation

Implementation of the Model

The pseudocodes of MapReduce-based AdaBoost ELM are shown in Algorithm [alg:map] and Algorithm [alg:reduce]. The $Map$ procedure of our training model is implemented based on random assignment of each row of the training data set with split size of data, $M$, in line 2 of Algorithm [alg:map]. The input, $\mathbf{x}$ , is a row of traing data set $\mathcal{D}$. $Map$ procedure partition the input matrix by row, producing $(randomSplitId,\mathbf{x})$ key-value pairs. $randomSplitId$ is the identifier of the data chunk and is transferred as the input key to $Reduce$ phase.

$k \gets rand(0,M)$ $Output(k,(\mathbf{x},y))$

The pseudo code of $Reduce$ phase is shown in Algorithm [alg:reduce]. $Reduce$ procedure is implemented based on the for-loop of lines 3 — 8 of Algorithm [alg:reduce]. The output ELM classifier of sub data set $(\mathbf{X}_k,\mathbf{y}_k)$ is calculated using AdaBoost constantly block by block, so every reduce task completes training phase and outputs an AdaBoosted set of classifier functions. The $mapper$’s input $k$ is the $randomSplitId$ to create the data chunk and created in the $Map$ phase of our training model.

Algorithm-2 .

Experiments

In this section, we perform experiments on real-world data sets from the public available data set repositories. Public data sets are used to evaluate the proposed learning method. Then, classification models of each data set are compared for accuracy results with the single instance of learning algorithm performance.

In Section 5.1 we explain the data sets and parameters that are used in experiments. The conventional ELM is applied all data sets and we find the accuracy performance over number of hidden nodes in Section 5.3. In Section 5.2, we show the empirical results of proposed distributed adaboost ELM training algorithm.

Experimental setup

In this section we apply our approach to five different data sets to verify its effectivity and efficiency. To demonstrate the effectiveness and performance of the proposed model, we apply it on various classification data sets from public data set repositories. To obtain an optimal value of Mapper size, $m$, we range it in the range from 20 to 100.

Commonly Used Classification Data Sets

We experiment on five public data sets which are summarized in Table 3, including Pendigit, Letter, Statlog, Page-blocks and Waveform. They are all multiclass data sets. All experiments are repeated 5 times and the results are averaged. All data sets are publicly available in svmlight format on the LIBSVM web site (LIBSVM 2015).

Pendigit data set is a collection of pen-based recognition of handwritten digits (Alimoglu and Alpaydin 1996). The data set contains 250 samples from 44 people. The first 7494 instances written by 30 people are used for the training data set, and the digits written by other 14 people are used for the independent testing purpose.

Skin data set is a collection of skin segmentation constructed over R, G, B color space (Bhatt et al. 2009). The data set contains face images of different age groups (young, middle, old), genders and racial groups (White, Black, Asian). The data set contains 245057 instances; out of which 50859 is the skin labeled instances and 194198 is non-skin instances.

Statlog / Shuttle data set is a collection of space shuttle created by NASA (Hsu and Lin 2002). The data set contains 43500 training instances and 14500 testing instances. 80% of the data belongs to class 1.

Page Blocks data set is a collection of page layout of a document that has been detected by a segmentation process (Malerba, Esposito, and Semeraro 1996). The data set contains 4500 training instances and 973 testing instances.

Waveform data set is a collection of Breiman’s waveform domains of CART book’s (Breiman et al. 1984). The data set contains 4400 training instances and 600 testing instances.

Table: Description of the testing data sets used in the experiments.

Data set #Train #Test #Classes #Attributes
Pendigit 7494 3498 10 16
Skin 220543 24507 2 3
Statlog / Shuttle 43500 14500 7 9
Page-blocks 4500 973 5 10
Waveform 4400 600 3 21

Large Scale Classification Data Sets

We experiment on three public large scale data sets which are summarized in Table 4, including \”Record Linkage Comparison Patterns (Donation) \“, \”SUSY\” and \”HIGGS\“. All experiments are repeated 5 times and the results are averaged.

Donation represent individual data, including first and family name, sex, date of birth and postal code, which were collected through iterative insertions in the course of several years. The comparison patterns in this data set are based on a sample of 100.000 records dating from 2005 to 2008 (Schmidtmann et al. 2009). The data set contains 5,749,132 training instances and 1,000,000 testing instances. The data set is available on UCI web site (UCI 2011).

SUSY is a classification data set that distinguish between a signal process which produces supersymmetric particles and a background process which does not (Baldi, Sadowski, and Whiteson 2014). The first 8 features are kinematic properties measured by the particle detectors in the accelerator. The last ten features are functions of the first 8 features. The data set contains 5,000,000 training instances and 50,000 testing instances. The data set is available on UCI web site (UCI 2014b).

HIGSS is a classification problem to distinguish between a signal process which produces Higgs bosons and a background process which does not (Baldi, Sadowski, and Whiteson 2014). The first 21 features (columns 2-22) are kinematic properties measured by the particle detectors in the accelerator. The last seven features are functions of the first 21 features. The data set contains 11,000,000 training instances and 500,000 testing instances. The data set is available on UCI web site (UCI 2014a).

Table: Description of the testing large scale data sets used in the experiments.

Data set #Train #Test #Classes #Attributes
Donation 5,749,132 1,000,000 2 12
SUSY 5,000,000 50,000 2 18
HIGSS 11,000,000 1,000,000 2 28

Evaluation

Since the data sets that are used in our experiments are highly imbalanced, traditional accuracy based performance evaluation is not enough to find out an optimal classifier. We used four different metrics, the overall prediction accuracy, average recall, average precision (Turpin and Scholer 2006) and $F$-score, to evaluate the classification accuracy which are common measurement metrics in information retrieval (Manning, Raghavan, and Schütze 2008; Makhoul et al. 1999).

Precision is defined as the fraction of retrieved samples that are relevant. Precision is shown in Eq. [eqn:prec]. $$\label{eqn:prec} Precision = \frac{Correct}{Correct + False}$$ Recall is defined as the fraction of relevant samples that is retrieved. Recall is shown in Eq. [eqn:recall]. $$\label{eqn:recall} Precision = \frac{Correct}{Correct + Missed}$$ The proposed evaluation model calculates the precision and recall for each class from prediction scores then finds their mean. Average precision and recall is shown in Eq. [eqn:avgprec] and Eq. [eqn:avgrecall]. $$Precision-avg = \frac{1}{n-{classes}}\sum{Prec}$$

$$Recall-avg = \frac{1}{n-{classes}}\sum{Recall}$$ $F$-measure is defined as the harmonic mean of precision and recall. The $$F_1 = 2 \times \frac{Prec-{avg} \times Recall-{avg}}{Prec-{avg} + Recall-{avg}}$$

Data set results with conventional ELM

Figure [fig:convelm] shows that the accuracy performance of ELM for experimental data sets becomes steady-state after a threshold value of $N$. The testing classification performance is measured through accuracy, precision, recall and $F_1$ measure. $N$ varies from 150 to 500.

Statlog data set.

Skin data set.

Pen digit data set.

Waveform data set.

Page blocks data set.

Table 5 shows the best performance of the conventional ELM method of each data set.

Table: Data set results with conventional ELM.

Data set $F_1$ Recall Precision Accuracy
Pendigit 0.8404 0.8393 0.8416 0.8407
Skin 0.9754 0.9956 0.9583 0.9894
Statlog 0.8871 0.8556 0.9237 0.9757
Page-blocks 0.9873 0.9764 0.9988 0.9977
Waveform 0.8372 0.8368 0.8375 0.8376

The conventional ELM training algorithm can be applied only in Section 5.1.1. The large scale data sets in Section 5.1.2 are not feasible to train on a single computer.

Testing Accuracy Analysis

Because of two different data set type (\“commonly used\“, \“large scale\“) are used, the results are divided into two different sections. In Section 5.4.1, the figures and the plots show the implementation results of commonly used classification data sets. Section 5.4.2 shows the large scale data sets results.

Commonly Used Classification Data Sets

The results of accuracy and performance tests with real data are shown in Table 6 and Figure [fig:statlogres] - Figure [fig:waveformres]. According to the these results, AdaBoost $T$ size and Mapper size have more impact on the accuracy of ensemble ELM classifier than number of hidden nodes in ELM network.

Accuracy of classification models are visualized by heatmap color coding according to

  • Map size ($M$) - AdaBoost size ($T$)
  • Map size ($M$) - Number of hidden nodes ($nh$)
  • AdaBoost size ($T$) - Number of hidden nodes ($nh$)

Figure [fig:statlogres] - Figure [fig:waveformres] are used to plot the quantitative differences in accuracy score of each data set. Heatmaps are two dimensional graphical representations of data with a pre-defined colormap to display values of a matrix (Khomtchouk, Van Booven, and Wahlestedt 2014). Heatmaps can be used to understand what parameters affect the accuracy of the classification model. The figures are used to comparatively illustrate accuracy levels across a number of different parameters including Map size, AdaBoost size and the number of hidden nodes in ELM algorithm obtained from the proposed learning method.

Figure: Split size and adaboost $T$ size

Figure: Split size and number of $nh$.

Figure: Adaboost $T$ size and number of $nh$.

Figure: Split size and adaboost $T$ size

Figure: Split size and number of $nh$.

Figure: Adaboost $T$ size and number of $nh$.

Figure: Split size and adaboost $T$ size

Figure: Split size and number of $nh$.

Figure: Adaboost $T$ size and number of $nh$.

Split size and adaboost $T$ size

Figure: Split size and number of $nh$.

Figure: Adaboost $T$ size and number of $nh$.

Figure: Split size and adaboost $T$ size

Figure: Split size and number of $nh$.

Figure: Adaboost $T$ size and number of $nh$.

Table: Best performance results of data sets

Data set # C. T # H.N. Acc Prec. Recall $F_1$
Pendigit 20 10 21 0,8256 0,8369 0,8234 0,8301
Skin 21 5 21 0,9892 0,9773 0,9913 0,9842
Statlog 11 2 21 0,9103 0,7486 0,5069 0,6045
Page Blocks 1 1 340 0,9404 0,9027 0,5756 0,7030
Waveform 19 6 40 0,862 0,8680 0,8605 0,8642

According to Table 7{reference-type=“ref” reference=“tbl:comparison”}, classification performance results of the proposed method have almost the same values with the conventional ELM method.

Large Scale Classification Data Sets

Figure 21 shows the speed up on mapper size over proposed method on large scale data sets. To asses the effectiveness of the learning algorithm, the time is measured with varying mapper size. Because of high dimensionality, the data sets cannot be trained on a single computer. Then, the standart speed up percentage is modifed such that:

equation

where $t_{\mathop{\mathrm{arg\,min}}{m} \in M }$ is the total time on minimum mapper that can be achieved to build a classifier model.

As can be seen from the figure, the data sets achives performance improvement in learning time of the algorithm. By examining the trends observed as the number of mappers increases, one can see that non-linear speed up is achieved.

Stability analysis of ensemble ELM classifiers with Mapper size.

Stability Analysis

Standard deviation of testing accuracy of the method is shown in Figure 22 and Figure 23. We analyzed the stability of ensemble ELM classifier with two aspects, Mapper size and AdaBoost $T$ size. Mapper size is the most important variable for the model stability according to the Figure 22. From Figure 22 and Figure 23, we can find that standard deviation of testing accuracy decreases enormously with the increasing of Mapper function size. Through this analysis, one can argue that a model with high Mapper function size do has higher stability than low Mapper function size.

Stability analysis of ensemble ELM classifiers with Mapper size.

Figure: Stability analysis of ensemble ELM classifiers with AdaBoost $T$ size.

Conclusion and Future Works

In this paper,a parallel AdaBoost extreme learning machine algorithm implementation has been proposed for massive data learning. By creating the overall data set into data chunks, MapReduce based learning algorithm reduces the training time of ELM classification. To overcome the accuracy performance decreasing, distributed ELM is enhanced with AdaBoost method. The experimental results show that AdaBoosted ELM not only reduce the training time of large-scale data sets, but also evaluation metrics of accuracy performance compared with the conventional ELM.

The proposed AdaBoost based ELM has three different trade-off parameters which are (i) data chunk split size, $M$, (ii) maximum number of iterations, $T$,in AdaBoost Algorithm and lastly (iii) number of hidden layer nodes $nh$ in ELM algorithm. The empirical results in heatmap figures show that parameters $M$ and $T$ are more dominant than parameter $nh$ for the classification accuracy of the hypothesis.

The algorithm is designed to deal with large scale data set ELM training problems. Another objective is to achieve the model’s classification performance with same or close to the conventional ELM method. Classification performance results are shown in Section 5.3. The empirical results show us that classification performance results of the proposed method have almost the same values with the conventional ELM method.

References

Alimoglu, Fevzi, and Ethem Alpaydin. 1996. “Methods of Combining Multiple Classifiers Based on Different Representations for Pen-Based Handwritten Digit Recognition.” In Proceedings of the Fifth Turkish Artificial Intelligence and Artificial Neural Networks Symposium (Tainn 96.

Baldi, Pierre, Peter Sadowski, and Daniel Whiteson. 2014. “Searching for Exotic Particles in High-Energy Physics with Deep Learning.” Nature Communications 5.

Bartlett, P. L. 1998. “The Sample Complexity of Pattern Classification with Neural Networks: The Size of the Weights Is More Important Than the Size of the Network.” Information Theory, IEEE Transactions on 44 (2): 525–36. https://doi.org/10.1109/18.661502.

Bhatt, R. B., G. Sharma, A. Dhall, and S. Chaudhury. 2009. “Efficient Skin Region Segmentation Using Low Complexity Fuzzy Decision Tree Model.” In India Conference (Indicon), 2009 Annual Ieee, 1–4. https://doi.org/10.1109/INDCON.2009.5409447.

Bhimji, W, T Bristow, and A Washbrook. 2014. “HEPDOOP: High-Energy Physics Analysis Using Hadoop.” In Journal of Physics: Conference Series, 513:022004. IOP Publishing.

Bi, Xin, Xiangguo Zhao, Guoren Wang, Pan Zhang, and Chao Wang. 2015. “Distributed Extreme Learning Machine with Kernels Based on Mapreduce.” Neurocomputing 149, Part A (0): 456–63. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2014.01.070.

Breiman, Leo, Jerome Friedman, Charles J Stone, and Richard A Olshen. 1984. Classification and Regression Trees. CRC press.

Catak, F.Ozgur, and M.Erdal Balaban. 2013. “CloudSVM: Training an Svm Classifier in Cloud Computing Systems.” In Pervasive Computing and the Networked World, edited by Qiaohong Zu, Bo Hu, and Atilla Elçi, 7719:57–68. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-37015-1_6.

Chen, Jiaoyan, Guozhou Zheng, and Huajun Chen. 2013. “ELM-Mapreduce: MapReduce Accelerated Extreme Learning Machine for Big Spatial Data Analysis.” In Control and Automation (Icca), 2013 10th Ieee International Conference on, 400–405. https://doi.org/10.1109/ICCA.2013.6565081.

Choi, Junho, Chang Choi, Byeongkyu Ko, and Pankoo Kim. 2014. “A Method of Ddos Attack Detection Using Http Packet Pattern and Rule Engine in Cloud Computing Environment.” Soft Computing 18 (9): 1697–1703. https://doi.org/10.1007/s00500-014-1250-8.

Dean, Jeffrey, and Sanjay Ghemawat. 2008. “MapReduce: Simplified Data Processing on Large Clusters.” Commun. ACM 51 (1): 107–13. https://doi.org/10.1145/1327452.1327492.

Freund, Yoav, Robert Schapire, and N Abe. 1999. “A Short Introduction to Boosting.” Journal-Japanese Society for Artificial Intelligence 14 (771-780): 1612.

Freund, Yoav, and Robert E Schapire. 1995. “A Desicion-Theoretic Generalization of on-Line Learning and an Application to Boosting.” In Computational Learning Theory, 23–37. Springer.

He, Yaobin, Haoyu Tan, Wuman Luo, Huajian Mao, Di Ma, Shengzhong Feng, and Jianping Fan. 2011. “MR-Dbscan: An Efficient Parallel Density-Based Clustering Algorithm Using Mapreduce.” In Parallel and Distributed Systems (Icpads), 2011 Ieee 17th International Conference on, 473–80. https://doi.org/10.1109/ICPADS.2011.83.

Hsu, Chih-Wei, and Chih-Jen Lin. 2002. “A Comparison of Methods for Multiclass Support Vector Machines.” Trans. Neur. Netw. 13 (2): 415–25. https://doi.org/10.1109/72.991427.

Huang, Guang-Bin, and Lei Chen. 2007. “Convex Incremental Extreme Learning Machine.” Neurocomputing 70 (16–18): 3056–62. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2007.02.009.

———. 2008. “Enhanced Random Search Based Incremental Extreme Learning Machine.” Neurocomputing 71 (16–18): 3460–8. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2007.10.008.

Huang, Guang-Bin, Lei Chen, and Chee-Kheong Siew. 2006. “Universal Approximation Using Incremental Constructive Feedforward Networks with Random Hidden Nodes.” Neural Networks, IEEE Transactions on 17 (4): 879–92. https://doi.org/10.1109/TNN.2006.875977.

Huang, Guang-Bin, Ming-Bin Li, Lei Chen, and Chee-Kheong Siew. 2008. “Incremental Extreme Learning Machine with Fully Complex Hidden Nodes.” Neurocomputing 71 (4–6): 576–83. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2007.07.025.

Huang, Guang-Bin, Qin-Yu Zhu, and Chee-Kheong Siew. 2006. “Extreme Learning Machine: Theory and Applications.” Neurocomputing 70 (1–3): 489–501. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2005.12.126.

Huang, Guang-bin, Qin-yu Zhu, and Chee-kheong Siew. 2006. “Extreme Learning Machine: A New Learning Scheme of Feedforward Neural Networks.” In IN Proc. INT. JOINT Conf. NEURAL Netw, 985–90.

Khomtchouk, BohdanB, DerekJ Van Booven, and Claes Wahlestedt. 2014. “HeatmapGenerator: High Performance Rnaseq and Microarray Visualization Software Suite to Examine Differential Gene Expression Levels Using an R and C++ Hybrid Computational Pipeline.” Source Code for Biology and Medicine 9 (1). https://doi.org/10.1186/s13029-014-0030-2.

Krogh, Anders, and Jesper Vedelsby. 1995. “Neural Network Ensembles, Cross Validation, and Active Learning.” In Advances in Neural Information Processing Systems, 231–38. MIT Press.

Kuncheva, Ludmila I, and Christopher J Whitaker. 2003. “Measures of Diversity in Classifier Ensembles and Their Relationship with the Ensemble Accuracy.” Machine Learning 51 (2): 181–207.

Lan, Yuan, Zongjiang Hu, Yeng Chai Soh, and Guang-Bin Huang. 2013. “An Extreme Learning Machine Approach for Speaker Recognition.” Neural Computing and Applications 22 (3-4): 417–25.

Landesa-Vázquez, Iago, and José Luis Alba-Castro. 2013. “Double-Base Asymmetric Adaboost.” Neurocomputing 118 (0): 101–14. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2013.02.019.

Liang, Nan-Ying, Guang-Bin Huang, P. Saratchandran, and N. Sundararajan. 2006. “A Fast and Accurate Online Sequential Learning Algorithm for Feedforward Networks.” Neural Networks, IEEE Transactions on 17 (6): 1411–23. https://doi.org/10.1109/TNN.2006.880583.

LIBSVM. 2015. “LIBSVM Data: Classification, Regression, and Multi-Label.” http://ntucsu.csie.ntu.edu.tw/.

Lu, Yumao, V. Roychowdhury, and L. Vandenberghe. 2008. “Distributed Parallel Support Vector Machines in Strongly Connected Networks.” Neural Networks, IEEE Transactions on 19 (7): 1167–78. https://doi.org/10.1109/TNN.2007.2000061.

Makhoul, John, Francis Kubala, Richard Schwartz, and Ralph Weischedel. 1999. “Performance Measures for Information Extraction.” In In Proceedings of Darpa Broadcast News Workshop, 249–52.

Malerba, Donato, Floriana Esposito, and Giovanni Semeraro. 1996. “A Further Comparison of Simplification Methods for Decision-Tree Induction.” In In d. Fisher and H. Lenz (Eds.), Learning, 365–74. Springer-Verlag.

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. New York, NY, USA: Cambridge University Press.

Ogiela, MarekR., Aniello Castiglione, and Ilsun You. 2014. “Soft Computing for Security Services in Smart and Ubiquitous Environments.” Soft Computing 18 (9): 1655–8. https://doi.org/10.1007/s00500-014-1380-z.

Panda, Biswanath, Joshua S. Herbach, Sugato Basu, and Roberto J. Bayardo. 2009. “PLANET: Massively Parallel Learning of Tree Ensembles with Mapreduce.” Proc. VLDB Endow. 2 (2): 1426–37. https://doi.org/10.14778/1687553.1687569.

Schatz, Michael C. 2009. “CloudBurst: highly sensitive read mapping with MapReduce.” Bioinformatics (Oxford, England) 25 (11): 1363–9. https://doi.org/10.1093/bioinformatics/btp236.

Schmidtmann, Irene, Gaël Hammer, Murat Sariyar, Aslihan Gerhold-Ay, and Körperschaft des öffentlichen Rechts. 2009. “Evaluation Des Krebsregisters Nrw–Schwerpunkt Record Linkage.” Abschlußbericht Vom 11.

Sun, Tianyang, Chengchun Shu, Feng Li, Haiyan Yu, L. Ma, and Yitong Fang. 2009. “An Efficient Hierarchical Clustering Method for Large Datasets with Map-Reduce.” In Parallel and Distributed Computing, Applications and Technologies, 2009 International Conference on, 494–99. https://doi.org/10.1109/PDCAT.2009.46.

Sun, Yongjiao, Ye Yuan, and Guoren Wang. 2011. “An Os-Elm Based Distributed Ensemble Classification Framework in {P2p} Networks.” Neurocomputing 74 (16): 2438–43. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2010.12.040.

Sun, Zhanquan, and Geoffrey Fox. 2012. “Study on Parallel Svm Based on Mapreduce.” In International Conference on Parallel and Distributed Processing Techniques and Applications, 16–19. Citeseer.

Tang, Jiexiong, Chenwei Deng, Guang-Bin Huang, and Baojun Zhao. 2015. “Compressed-Domain Ship Detection on Spaceborne Optical Image Using Deep Neural Network and Extreme Learning Machine.” Geoscience and Remote Sensing, IEEE Transactions on 53 (3): 1174–85. https://doi.org/10.1109/TGRS.2014.2335751.

Turpin, Andrew, and Falk Scholer. 2006. “User Performance Versus Precision Measures for Simple Search Tasks.” In Proceedings of the 29th Annual International Acm Sigir Conference on Research and Development in Information Retrieval, 11–18. SIGIR ‘06. New York, NY, USA: ACM. https://doi.org/10.1145/1148170.1148176.

UCI. 2011. “Record Linkage Comparison Patterns Data Set.”

———. 2014a. “Higgs Data Set.”

———. 2014b. “Susy Data Set.”

Wang, Botao, Shan Huang, Junhao Qiu, Yu Liu, and Guoren Wang. 2015. “Parallel Online Sequential Extreme Learning Machine Based on Mapreduce.” Neurocomputing 149, Part A (0): 224–32. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2014.03.076.

Wang, Guoren, Yi Zhao, and Di Wang. 2008. “A Protein Secondary Structure Prediction Framework Based on the Extreme Learning Machine.” Neurocomputing 72 (1–3): 262–68. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2008.01.016.

Xin, Junchang, Zhiqiong Wang, Chen Chen, Linlin Ding, Guoren Wang, and Yuhai Zhao. 2014. “ELM: Distributed Extreme Learning Machine with Mapreduce.” World Wide Web 17 (5): 1189–1204. https://doi.org/10.1007/s11280-013-0236-2.

Xu, Lei, Hanyee Kim, Xi Wang, Weidong Shi, and Taeweon Suh. 2014. “Privacy Preserving Large Scale Dna Read-Mapping in Mapreduce Framework Using Fpgas.” In Field Programmable Logic and Applications (Fpl), 2014 24th International Conference on, 1–4. IEEE.

Zhang, Chi, Feifei Li, and Jeffrey Jestes. 2012. “Efficient Parallel kNN Joins for Large Data in Mapreduce.” In Proceedings of the 15th International Conference on Extending Database Technology, 38–49. EDBT ‘12. New York, NY, USA: ACM. https://doi.org/10.1145/2247596.2247602.

Zhao, Weizhong, Huifang Ma, and Qing He. 2009. “Parallel K-Means Clustering Based on Mapreduce.” In Cloud Computing, edited by MartinGilje Jaatun, Gansen Zhao, and Chunming Rong, 5931:674–79. Lecture Notes in Computer Science. Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-10665-1_71.

Zhao, Xiang-guo, Guoren Wang, Xin Bi, Peizhen Gong, and Yuhai Zhao. 2011. “XML Document Classification Based on Elm.” Neurocomputing 74 (16): 2444–51.

Zong, Weiwei, and Guang-Bin Huang. 2011. “Face Recognition Based on Extreme Learning Machine.” Neurocomputing 74 (16): 2541–51. https://doi.org/http://dx.doi.org/10.1016/j.neucom.2010.12.041.