EVALUATING THE PERFORMANCE OF SUPPORT VECTOR MACHINES BASED ON DIFFERENT KERNEL METHODS FOR FORECASTING AIR POLLUTANTS

Abstract. The alarming level of air pollution in urban centres is an urgent threat to human health. Its consequences can be measured in terms of health problems experienced by children, an increasing number of heart and lung diseases and, most importantly, the number of pollution-related deaths. For this reason, considerable attention has recently been paid to air pollution monitoring and prediction modelling. To develop prediction models, the study uses Support Vector Machines (SVM) with linear, polynomial, radial basis function, normalised polynomial, and Pearson VII function kernels to predict the hourly concentration of pollutants in the air. The paper analyses a monitoring dataset of air pollutants and meteorological parameters as input variables to predict the concentrations of various air pollutants. The prediction performance of the models was assessed using four evaluation metrics: the correlation coefficient, root mean squared error, relative absolute error, and relative root squared error. To validate the models, the accuracy of the predictive algorithm was tested against two widely applied regression approaches, multilayer perceptron and linear regression. Furthermore, a back check prediction test was performed to examine the consistency of the models. According to the results, the Pearson VII function and normalised polynomial kernels yield the most accurate predictions of atmospheric pollutant concentrations, in terms of the correlation coefficient and error values, compared with the other SVM kernels and traditional prediction models.


INTRODUCTION
According to the World Health Organization (WHO), air pollution is one of the biggest health risks because, besides killing millions of people every year, it also shortens life expectancy [1]. A report from the American Lung Association [2] concludes that: a) a slight increase of 10 parts per billion (ppb) in the ozone (O3) mixing ratio can cause over 3700 premature deaths; b) increased concentrations of PM2.5 in the air are a serious concern because these particles, owing to their tiny size, can deposit in the windpipe and in the gas-exchange region of the lungs, the alveoli; and c) SO2 is also an important precursor owing to its strong association with respiratory diseases [3]. Beyond a certain limit, all air pollutants become dangerous to human health; however, for a number of reasons, atmospheric concentrations of SO2 and NO2 in particular are considered extremely harmful to public health, because even a short exposure to such pollutants can aggravate the human respiratory system [4]. Therefore, for air quality management and effective policy making, the development of accurate prediction models is as essential as strict monitoring.
A. Masih, A. N. Medvedev
According to several studies, environmental parameters and regional and synoptic meteorology can strongly influence air pollutant concentrations [5]. For example, according to Holloway [6], the ground-level O3 concentration over Chicago is sensitive to air temperature, wind speed and direction, relative humidity, incoming solar radiation, and cloud cover. Meteorological conditions and their interaction with sunlight are important: higher ambient temperature and solar intensity speed up the photochemical reactions that lead to the formation of air pollutants, while wind speed and humidity directly affect the dispersion of air pollutants [7].
Pollution prediction modelling deals with the concentration of atmospheric gases and their connection with regional meteorological parameters for scientific applications [8]. It helps to measure the level of air pollution and to assess its impact on living beings [9]. Considering the strong association of emission sources with air pollutants as well as with regional and meteorological parameters, the role of such models is indispensable [10]. They not only help to determine the actual emission sources but also contribute to future mitigation solutions [11].
Two types of approaches are used to model atmospheric concentrations of air pollutants: a) Chemical Transport Models (CTMs) and b) data-driven approaches. CTMs generally deal with the emission, mixing and transport of atmospheric gases with respect to regional weather parameters [12]. These models are based on multiprocessing techniques that use real-time, up-to-date emission and meteorological records. However, their implementation is at times hindered by the lack of primary emission and meteorological data for the initial and boundary conditions [13]. The accuracy achieved by regression-based models is reasonable; however, several studies have revealed that the non-linear behaviour of air pollutants and other influential regional features leads to a complex pollution system, especially in regions with complex terrain, and traditional deterministic models find it difficult to capture this non-linear system [14]. To deal with these problems, data-driven approaches based on machine learning, such as Artificial Neural Networks (ANN) and SVM, seem promising because of their ability to handle non-linearity efficiently. These approaches are generally based on statistical techniques that use historical data to make future predictions. The models are trained on emission data, meteorological conditions, land use, anthropogenic activity, etc. [15].
The literature review conducted for this study suggests that recent research in environmental science based on machine learning techniques such as ANNs and SVMs shows superior predictive performance over classic statistical models, without requiring knowledge of the chemical mixing, dispersion and transport of atmospheric gases [16][17][18][19]. It further revealed that, although traditional machine learning tools can handle the non-linearity and complexity of emission datasets, ANN-based algorithms often suffer from overfitting, local minima and the difficulty of selecting the best network architecture [20,21]. SVM-based algorithms can therefore be used as an alternative because of their capability to deal with these drawbacks of ANNs [22,23]. Although these techniques are efficient and are reported to perform well for prediction in other research areas, a lack of research was identified on SVMs with different kernels, especially the Pearson VII universal and normalized polynomial kernel functions, in the field of atmospheric pollution modelling.
Following these observations, this work comprehensively examines the performance of SVMs with different kernels against the leading regression approaches based on classic statistical and traditional machine learning algorithms.

Study area
Kazakhstan is a growing economy that relies largely on the extraction of natural resources such as oil and natural gas. It has a long history of environmental issues. After 28 years of independence, air pollution remains one of the major urban problems in most cities of the country. In recent years, air pollution has become a key focus due to its serious health effects. Because NO2 and SO2 concentrations enter directly into the Air Quality Index value, both are deemed extremely dangerous to human health. Hence, accurate prediction will help policy makers to take precautionary measures and formulate effective policies in time.

Data collection and pre-processing
The study makes use of a historical atmospheric dataset gathered in Ust-Kamenogorsk, an administrative centre in East Kazakhstan where several air quality and meteorological monitoring stations are installed. The dataset was provided by the Institute of Industrial Ecology, Ural Branch of the Russian Academy of Sciences. The data were recorded at 8 air monitoring sites located near Gastello Street (station 1), Delegatskaya Street (station 2), Kazakhstan Street (station 3), Auezov Avenue (station 4), Pogranichnaya Street (station 5), Kuybysheva Street (station 6), Mendeleev Street (station 7), and Abay Avenue (station 8). These monitoring stations measure the concentrations of various pollutants such as nitrogen dioxide (NO2), carbon monoxide (CO), sulphur dioxide (SO2), hydrochloric acid (HCl), formaldehyde (HCOH) and total hydrocarbons (CXHY). Wind speed and wind direction are reported to play a vital role in the transport of pollutants, while regional temperature, precipitation and relative humidity aid the chemical mixing and dilution of air pollutants; therefore, meteorological records gathered at a meteorological station near the Astana Agro-technika building were also taken into account. This station records wind speed (m/s), wind direction (degrees), amount of precipitation (mm/h), ambient temperature (°C), relative humidity (%), and atmospheric pressure (mm Hg).
A careful preliminary analysis revealed several important features of the dataset: the concentration values of CXHY were too low to be considered for pollution modelling, and the atmospheric pressure values recorded during the study period were more or less static. Therefore, the features considered for prediction modelling, used to characterize the hourly concentrations of air pollutants, were: (1) concentration of SO2; (2) concentration of NO2; (3) concentration of CO; (4) concentration of HCl; (5) concentration of HCOH; (6) ambient temperature; (7) wind speed; (8) wind direction; (9) amount of precipitation; and (10) relative humidity. Moreover, around 9 % of the meteorological values were missing, and a number of outliers were found in the air quality dataset, which necessitated a thorough cleaning of the data. Incomplete or missing values and values with discrepancies were carefully replaced, and noisy records containing outliers were removed. After pre-processing, a dataset containing 18000 instances was prepared for modelling. Lastly, the dataset was split into two subsets containing 80 % and 20 % of the instances for training and testing, respectively.
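The cleaning and splitting steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used in the study: the text does not say how missing values were replaced or how outliers were identified, so linear interpolation and an interquartile-range rule are assumed here, the split is assumed to preserve row order, and the function names and sample readings are hypothetical.

```python
import statistics


def fill_missing(values):
    """Replace None entries by linear interpolation between known neighbours.

    Gaps at either end copy the nearest observed value (assumed rule).
    """
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    filled = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = values[right]
        elif right is None:
            filled[i] = values[left]
        else:
            frac = (i - left) / (right - left)
            filled[i] = values[left] + frac * (values[right] - values[left])
    return filled


def outlier_mask(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (assumed outlier rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [not (low <= v <= high) for v in values]


def split_80_20(rows, train_fraction=0.8):
    """Split rows, in their given order, into training and testing subsets."""
    cut = round(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


# Hypothetical hourly readings, not values from the study.
temperature = fill_missing(
    [12.0, None, 14.0, 15.0, None, 17.0, 18.0, 19.0, 20.0, 21.0]
)
so2 = [10.0, 11.0, 12.0, 11.0, 10.0, 12.0, 11.0, 500.0, 11.0, 12.0]

# Drop every row whose SO2 reading is flagged as an outlier, then split.
rows = [(t, s) for t, s, bad in zip(temperature, so2, outlier_mask(so2)) if not bad]
train, test = split_80_20(rows)
```

Whether a chronological or a random split is more appropriate depends on how the hourly records are meant to be used; the text does not specify which was applied.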

Experiment setting
Support vector machines are among the best-known data mining tools for solving classification and regression problems. The technique works on the principle of optimizing a hyperplane in a kernel-induced feature space so as to maximize the margin between two classes. To achieve the maximum margin, the hyper-parameters (or kernel parameters) need to be selected carefully. The hyperplane (boundary line) itself depends largely on the selection and values of the support vectors.
To handle the non-linear behaviour of data points, SVMs generally adopt the kernel trick. Different kernels provide different non-linear mappings, which implies that SVM performance often hinges on the right choice of kernel. A number of kernels are commonly adopted for regression purposes; however, this work only analyses the prediction performance of the linear, polynomial, normalized polynomial, Gaussian radial basis function (RBF), and Pearson VII universal function (PUFK) kernels, given by equations (1)-(5) respectively. The prediction performance of the models was evaluated on the basis of the predicted and observed pollutant concentrations in the test dataset, using the Coefficient of Determination (R²), Root Mean Squared Error (RMSE), Relative Absolute Error (RAE) and Relative Root Squared Error (RRSE), given by equations (6)-(9); γ is the peak height at the center, σ is the tailing factor of the peak and ω is the Pearson width in equations (2)-(5).
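The equations cited above are not reproduced in this text. For orientation, the block below restates the five kernels and the four evaluation measures in their commonly used forms. It is a reconstruction under stated assumptions rather than the authors' exact notation: u and v denote two input vectors (symbols introduced here), the offset of 1 in the polynomial kernel is an assumed convention, and the placement of d, γ, σ and ω follows widely cited formulations and may differ from the paper's own equations. In the Pearson VII form shown, σ scales the distance between the two vectors and ω is the exponent. R denotes the correlation coefficient, and R² is its square.

```latex
\begin{align*}
K_{\text{linear}}(u, v) &= u^{\top} v \\
K_{\text{poly}}(u, v) &= \left(u^{\top} v + 1\right)^{d} \\
K_{\text{npoly}}(u, v) &= \frac{K_{\text{poly}}(u, v)}{\sqrt{K_{\text{poly}}(u, u)\, K_{\text{poly}}(v, v)}} \\
K_{\text{RBF}}(u, v) &= \exp\!\left(-\gamma \lVert u - v \rVert^{2}\right) \\
K_{\text{PUFK}}(u, v) &= \frac{1}{\left[1 + \left(\dfrac{2 \lVert u - v \rVert \sqrt{2^{1/\omega} - 1}}{\sigma}\right)^{2}\right]^{\omega}} \\[1ex]
R &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^{2} \, \sum_{i=1}^{n} (y_i - \bar{y})^{2}}} \\
\text{RMSE} &= \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - x_i)^{2}} \\
\text{RAE} &= \frac{\sum_{i=1}^{n} \lvert y_i - x_i \rvert}{\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert} \\
\text{RRSE} &= \sqrt{\frac{\sum_{i=1}^{n} (y_i - x_i)^{2}}{\sum_{i=1}^{n} (x_i - \bar{x})^{2}}}
\end{align*}
```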
where y_i and x_i are the predicted and observed values, respectively; x̄ and ȳ are the averages of the observed and predicted values, respectively; and n is the number of observations in equations (6)-(9).
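As a sketch of how these four measures can be computed from paired predicted and observed values, the functions below use the standard definitions of the correlation coefficient, RMSE, RAE and RRSE. The function names and the sample numbers are illustrative assumptions, not results from the study. RAE and RRSE are returned as fractions and can be multiplied by 100 to express them as percentages.

```python
import math


def _mean(values):
    return sum(values) / len(values)


def correlation_coefficient(predicted, observed):
    """Pearson correlation between predictions and observations."""
    mp, mo = _mean(predicted), _mean(observed)
    cov = sum((p - mp) * (o - mo) for p, o in zip(predicted, observed))
    var_p = sum((p - mp) ** 2 for p in predicted)
    var_o = sum((o - mo) ** 2 for o in observed)
    return cov / math.sqrt(var_p * var_o)


def rmse(predicted, observed):
    """Root mean squared error."""
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / len(observed))


def rae(predicted, observed):
    """Absolute error relative to always predicting the observed mean."""
    mo = _mean(observed)
    absolute = sum(abs(p - o) for p, o in zip(predicted, observed))
    return absolute / sum(abs(o - mo) for o in observed)


def rrse(predicted, observed):
    """Root of the squared error relative to predicting the observed mean."""
    mo = _mean(observed)
    squared = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(squared / sum((o - mo) ** 2 for o in observed))


# Hypothetical concentrations, chosen only to make the arithmetic easy to follow.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]

r = correlation_coefficient(predicted, observed)
scores = {
    "R": r,
    "R2": r ** 2,
    "RMSE": rmse(predicted, observed),
    "RAE": rae(predicted, observed),
    "RRSE": rrse(predicted, observed),
}
```

An RAE or RRSE below 1 means the model does better than a baseline that always predicts the mean of the observed values, which is why these two measures are useful alongside RMSE.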
For fair prediction modelling, appropriate adjustment of the kernel parameters is key to the efficient performance of SVM kernels. Therefore, the kernel parameters, namely the degree (d) of the polynomial kernel, the width γ of the RBF kernel, and σ and ω of PUFK, were all tuned during training using a technique called "grid search". The technique determines the optimal values of the SVM parameters over a specified search range; hence, the study reports the best prediction results achieved with the different SVM-based algorithms. All prediction models were trained and tested using a train/test ratio of 80 %-20 %. To evaluate the proficiency of the adopted prediction models, the experimental design calculates the correlation coefficient (R²) and the error values RMSE, RAE and RRSE.
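The idea behind grid search can be sketched as follows: every combination of candidate parameter values is evaluated on held-out data, and the combination with the lowest error is kept. The sketch makes several assumptions that are not taken from the study. The Pearson VII kernel is written in its commonly cited form, a simple kernel-weighted average stands in for the SVM so that the example needs no external library, and the search ranges and synthetic data are hypothetical.

```python
import itertools
import math


def puk_kernel(u, v, sigma, omega):
    """Pearson VII universal kernel in its commonly cited form."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    scaled = 2.0 * dist * math.sqrt(2.0 ** (1.0 / omega) - 1.0) / sigma
    return 1.0 / (1.0 + scaled ** 2) ** omega


def predict(train_x, train_y, query, sigma, omega):
    """Stand-in model (not an SVM): kernel-weighted average of the targets."""
    weights = [puk_kernel(x, query, sigma, omega) for x in train_x]
    return sum(w * y for w, y in zip(weights, train_y)) / sum(weights)


def validation_rmse(train_x, train_y, val_x, val_y, sigma, omega):
    """RMSE on the held-out points for one (sigma, omega) pair."""
    errors = [
        predict(train_x, train_y, q, sigma, omega) - y
        for q, y in zip(val_x, val_y)
    ]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def grid_search(train_x, train_y, val_x, val_y, sigmas, omegas):
    """Evaluate every (sigma, omega) pair; return (best_rmse, sigma, omega)."""
    return min(
        (validation_rmse(train_x, train_y, val_x, val_y, s, w), s, w)
        for s, w in itertools.product(sigmas, omegas)
    )


# Synthetic one-dimensional data, used only to exercise the search.
xs = [[i / 20.0] for i in range(20)]
ys = [math.sin(2.0 * math.pi * x[0]) for x in xs]
train_x, train_y = xs[0::2], ys[0::2]
val_x, val_y = xs[1::2], ys[1::2]

sigmas = [0.05, 0.1, 0.5, 1.0]  # hypothetical search range
omegas = [0.5, 1.0, 2.0]        # hypothetical search range
best_rmse, best_sigma, best_omega = grid_search(
    train_x, train_y, val_x, val_y, sigmas, omegas
)
```

In this form of the kernel, the value falls to one half when the distance between the two vectors equals σ/2, whatever the value of ω, which makes σ easy to interpret when choosing a search range.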
In addition to SVM, the study also employs the Multilayer Perceptron (MLP), a supervised machine learning algorithm ranked among the leading current algorithms for classification and regression. The approach is robust and has several attractive characteristics, such as the ability to manage large datasets and to generalize. The study thus aims at a comprehensive investigation involving the development of several SVM-based algorithms using different kernels.
The main contributions of the work are: (1) the application of SVM with the PUFK function to predict atmospheric concentrations of air pollutants in the Ust-Kamenogorsk region; and (2) a comparison of its prediction performance with that of other well-known machine learning tools such as linear regression (LR), MLP and SVMs with other kernels.

Model evaluation
Support vector machines with different kernels have been applied successfully in a number of fields such as textiles, electricity, bioinformatics, and atmospheric sciences for purposes such as regression analysis, time series prediction, condition monitoring, optimal control and fault diagnosis [23][24][25][26][27]. However, the review conducted suggests that the application of the Pearson VII universal function kernel and the normalized polynomial kernel in environmental sciences has been limited, especially for the prediction of air pollutant concentrations. Therefore, this work aims to assess the prediction accuracy of SVM kernels, PUFK in particular, against other kernels such as the RBF, polynomial and normalized polynomial kernels, as well as against the most commonly adopted approaches, i.e. MLP and LR.
Besides choosing the right pollutants and environmental factors, to develop a robust prediction ...
Similarly, for MLP a series of trials was performed to determine the number of neurons in the hidden layer, and the MLP was optimized using 80 % of the data points for training and 20 % for testing. The initial MLP architecture uses 10 input variables, 5 neurons in the hidden layer, a learning rate of 0.3, a momentum coefficient of 0.2 and 500 epochs to predict the concentration of gaseous pollutants. The prediction accuracies of all models are presented in Table 1. Although the prediction performance of an algorithm depends largely on the predictive variables, the results compiled in Table 1 show that, overall, SVM with PUFK clearly achieved better performance, with a high correlation coefficient and low error values, than the other widely adopted prediction models. This reflects the ability of PUFK to capture the complex relationship between air pollutants and meteorological records and so to make precise predictions of air pollution. The normalized polynomial kernel, with slightly lower accuracy, registered the second-best performance, while the performances of LR and of SVM with linear and polynomial kernels were mediocre.
To assess the practical application of PUFK in air pollution modelling, its prediction performance was tested using different test datasets acquired at 3 different monitoring locations to predict two air pollutants, SO2 and NO2. Table 1 shows that SVM based on PUFK is the only modelling approach that stays among the top performers regardless of the pollutant predicted or the location of the test dataset used, whereas the performance of the other algorithms fluctuated as the test dataset changed.
The prediction models using the normalized polynomial kernel and MLP showed some notable performances in predicting the concentrations of both air pollutants on a number of occasions: (1) at station 3, the normalized polynomial kernel achieved an accuracy of 96.6 % in predicting SO2, nearly equal to that of PUFK (97.05 %) at this station; (2) the accuracy of the normalized polynomial kernel in predicting NO2 concentrations (92.8 %) was slightly better than that of PUFK (92.5 %); (3) at station 1, the normalized polynomial kernel once again performed almost as well as PUFK in predicting NO2 concentrations; and (4) MLP and the normalized polynomial kernel secured the second and third best spots in predicting atmospheric SO2 concentrations, with accuracies of 95.5 % and 94.5 % respectively. However, the prediction performance of SVM with the PUFK kernel remains the best overall, with correlation coefficient values that no other algorithm selected in this study could match across all 3 test stations.
To show the overall performance of each model clearly, the authors averaged the correlation coefficient values obtained at the 3 test stations for predicting SO2 and NO2 in air, as presented in Fig. 1. On average, the PUFK kernel in support vector machines attains the best accuracy in predicting SO2 concentrations, with R² = 0.91, followed by the normalized polynomial kernel with an average R² value of 0.90. For predicting the concentration of NO2 in air, on the other hand, the PUFK and normalized polynomial kernels show similar efficiency, with a coefficient value of 0.89, which is significantly better than that of the other algorithms selected in this work. Although MLP also predicts SO2 with a reasonable coefficient of determination of 0.89, not far behind PUFK and the normalized polynomial kernel, it fails to predict NO2 concentrations equally well. Interestingly, for NO2 concentrations, the average accuracy of the polynomial kernel (0.85) is better than that of LR (0.84), which in turn is exactly equal to that of MLP (0.84).

Fig. 1. Averaged R² value of 3 test data stations achieved by different prediction models
In addition to the accuracy measurements, algorithm performance was also gauged with respect to three important error functions, RMSE, RAE and RRSE, using equations (7), (8) and (9). This section analyses the average RMSE, RAE and RRSE values obtained by the different algorithms in predicting SO2 and NO2 concentrations, which are presented in Fig. 2. The performance of MLP in Fig. 1 is reasonably good, but Fig. 2 shows that its error values place it among the bottom performers, alongside LR and SVMs with linear, RBF and polynomial kernels.

Back check prediction test
The more attributes a dataset has, the more difficult it is for an algorithm to learn the complicated relationship between input and output variables correctly. Therefore, for the back check prediction (self-consistency) test, the study divides the main dataset of 17,216 instances (acquired at 8 pollution monitoring stations) into 5 sub-datasets for 5 different tests, using data collected at 1, 2, 4, 6 and 8 monitoring stations respectively. In other words, in addition to the meteorological data, the 1st test uses only the data gathered at station 1 (2152 instances) to predict the amount of pollutants in air, and the 2nd test assesses the prediction performance of the algorithms using data from 2 stations (4304 instances). Similarly, the 3rd, 4th and 5th tests train on pollution records from 4, 6 and 8 stations, with 8608, 12912 and 17216 instances respectively, to construct prediction models. The test was aimed at examining the consistency of MLP, LR, and SVM with PUFK, RBF, normalized polynomial and polynomial kernels in predicting the concentrations of SO2 and NO2 in air.
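The construction of the five sub-datasets can be sketched as below. The instance counts quoted above are consistent with each station contributing 2152 records, and the sketch relies on that reading. Which stations enter each sub-dataset is not stated in the text, so taking the first k stations in numerical order is an assumption, as are the function names and the placeholder records.

```python
INSTANCES_PER_STATION = 2152      # figure quoted in the text for station 1
STATION_COUNTS = [1, 2, 4, 6, 8]  # number of stations used in tests 1 to 5


def subset_sizes(per_station=INSTANCES_PER_STATION, station_counts=STATION_COUNTS):
    """Number of instances in each sub-dataset of the consistency test."""
    return [per_station * k for k in station_counts]


def build_subsets(records_by_station, station_counts=STATION_COUNTS):
    """Merge the records of the first k stations for each k (assumed ordering).

    records_by_station maps a station number to its list of records.
    """
    ordered = sorted(records_by_station)
    subsets = []
    for k in station_counts:
        merged = []
        for station in ordered[:k]:
            merged.extend(records_by_station[station])
        subsets.append(merged)
    return subsets


# Placeholder records: (station number, record index) pairs, not real measurements.
records = {
    station: [(station, i) for i in range(INSTANCES_PER_STATION)]
    for station in range(1, 9)
}
subsets = build_subsets(records)
sizes = [len(s) for s in subsets]  # 2152, 4304, 8608, 12912, 17216
```

Each model would then be trained and evaluated once per sub-dataset, so that its correlation coefficient can be tracked as the amount of data grows.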
The results presented in Fig. 3 show the average correlation coefficient obtained by the different modelling schemes in predicting NO2 and SO2 concentrations. They clearly demonstrate the dominance and consistency of SVM with the Pearson VII Universal Function Kernel, which predicts the atmospheric concentrations of SO2 and NO2 with high accuracy, low variance and low error values irrespective of the size of the dataset. Under the same conditions the other algorithms underperformed. The prediction performance of the normalized polynomial kernel is comparatively low in the first experiment, when the dataset of only 1 station was used; however, its consistent upward trend and the smoothness of its curve explain why this kernel should be regarded as the second-best model in this experimental setting. Although SVM with the RBF kernel did not perform well overall, the smoothness of its curve confirms the stability of the kernel as the number of data instances increases. Fig. 3 also illustrates the inability of LR and the linear kernel to capture the non-linear relationship between input and output variables when the dataset is large. Although the performance of the MLP algorithm was average, it has shown several times (Table 1, Fig. 1 and Fig. 3) that it is capable of attaining a reasonably high correlation coefficient irrespective of the modelling scheme adopted and the size of the dataset. In contrast, MLP produced surprisingly high error values (Fig. 2), which place it among the models with the worst prediction accuracies, such as linear regression and SVMs with linear, polynomial and RBF kernels, and which limit its adoption for practical purposes.

CONCLUSIONS
In this work, 7 prediction models were developed: LR, MLP, and SVMs with PUFK, RBF, linear, polynomial, and normalized polynomial kernels. The study uses air pollutant data for NO2, SO2, CO, HCl, and HCOH gathered at 8 different monitoring stations, together with weather-related regional parameters such as ambient temperature, wind speed, wind direction, amount of precipitation, relative humidity, and atmospheric pressure, to predict the atmospheric concentrations of NO2 and SO2. The results suggest that, overall, the support vector machine with the Pearson VII Universal Function Kernel predicts atmospheric concentrations of SO2 and NO2 with high accuracy, low error values and a low probability of overfitting. The work also affirms that PUFK can outperform both the classic approach (LR) and state-of-the-art machine learning algorithms such as MLP and SVM with polynomial, normalized polynomial and RBF kernels, irrespective of data size or the type of modelling technique applied. The normalized polynomial kernel and MLP both showed high accuracies under specific conditions, but MLP does not qualify as one of the best predictors because of overtraining and significantly high error values. Other notable observations include the occasional winning performance of the normalized polynomial kernel in predicting NO2 and SO2 concentrations with high accuracy, and the stability of the RBF and PUFK kernels, which remained steady during the consistency test.