Performance Analysis of different Machine Learning Models for Intrusion Detection Systems

In recent years, the world has witnessed rapid growth in attacks on the internet, degrading network performance. The growth is in both the quantity and the versatility of attacks. To cope with this, new detection techniques are required, especially ones that use Artificial Intelligence, such as machine-learning-based intrusion detection and prevention systems. Many machine learning models are used for intrusion detection, each with its own pros and cons, and this is where this paper falls in: a performance analysis of different machine learning models for intrusion detection systems based on supervised learning algorithms. Using the Python Scikit-Learn library, KNN, Support Vector Machine, Naïve Bayes, Decision Tree, Random Forest, Stochastic Gradient Descent, Gradient Boosting, and Ada Boost classifiers were designed. A performance analysis using confusion-matrix metrics was carried out, and the classifiers were compared. As a case study, Information Gain, Pearson correlation, and F-test feature selection techniques were used, and the obtained results were compared to models that use all the features. One notable outcome is that the Random Forest classifier achieves the best performance, with an accuracy of 99.96% and an error rate of 0.038%, surpassing the other classifiers. With an 80% reduction in features, and by extracting parameters from the packet header rather than the payload, a large performance advantage is achieved, especially in online environments.


INTRODUCTION
In recent years, communication technologies have grown rapidly and become available everywhere, to everyone, at an affordable price. This availability and accessibility make it easier for any user to become an attacker, and raise both the number and the types of attacks. In addition to the increased number of users, many smart devices have been introduced and connected to the internet. Attackers regularly invent new attack methods that network managers and security programs are not familiar with. Due to this massive use of networking and internet systems, security issues have become predominant and a major challenge for ordinary users, organizations, enterprises, and government agencies. Thousands of cyber security attacks are launched on the internet, resulting in losses of money, business, and reputation (Salih & Abdulazeez, 2021). To provide confidentiality, integrity, and availability to countless users and corporations, in addition to keeping them safe from threats, robust and powerful security systems are needed. Observing communication and data transfer through internet networks is now a major service for internet providers. Intrusion Detection Systems (IDSs) play a prominent role on the frontlines and are considered a second line of defense against intruders. To cope with the spread of attacks (in amount and versatility), new detection techniques are required, especially ones that use Artificial Intelligence (AI), namely machine-learning-based intrusion detection and prevention systems (Salih & Abdulazeez, 2021) (Daniya, et al., 2021). The objectives of this work are to evaluate the performance of different AI-based classifier algorithms used to build IDSs and to nominate the best one for intrusion detection. The metrics of concern are the confusion matrix, accuracy, recall, precision, F-score, specificity, and sensitivity.
As a case study, three feature selection (reduction) techniques are used, and comparisons are made with classifiers that use all the features. The performance of each classifier with different tuning parameters is analyzed, and comparisons are made using analytical and non-statistical techniques. The outcomes show important details related to each classifier used in IDS design.
The organization of this paper is as follows. Section 2 presents the theoretical background. Section 3, Related Works, presents the most recent relevant works on IDSs. Section 4, Methodology and System Description, gives details of the dataset and the supervised machine learning models used, and explains the different stages procedurally. Section 5, Performance Analysis, presents the experimental results. Section 6, Outcomes, highlights the findings of Section 5. Section 7, Comparisons and Discussion, analyzes the results of the whole work. Section 8, Conclusions and Future Works, highlights the contributions of this work, and the References section closes the paper.

Intrusion Detection System (IDS)
With the large-scale usage of internet networks, information security became a major concern for both organizations and regular users. Securing network communication devices against numerous threats and attacks is an urgent task for network administrators. Different techniques are used to secure communication and to protect the privacy of organizations against attacks, such as cryptography, firewalls, and access control. At the same time, attackers also evolve their techniques and innovate new methods and tricks to breach systems' security. Hence, the IDS has the major responsibility of protecting and securing networks by sustaining their confidentiality, integrity, and availability for all authorized users (Kaur & Kumar, 2020). An IDS can be implemented in hardware or software to automate the intrusion detection process. According to its settings and configuration, an IDS can constantly observe a system's condition and take the necessary action by generating alarms to alert system administrators about possible attacks. The observation covers incoming and outgoing data reaching or leaving the network, to detect suspicious activities efficiently and to guarantee optimal security in any part of the networking system (Kaur & Kumar, 2020).
A typical IDS generates and sends alert signals for any illegitimate conditions the networking system may be exposed to, such as illegal emails or audio and video messages. Its role is to examine IP (Internet Protocol) packets spreading in the network for irregular patterns, to collect data about attacks, and to apply countermeasures to confront these attacks (if both detection and prevention are in place). The basic structure of an IDS is shown in Fig. 1 (Gupta & Agrawal, 2020).
At present, a large number of internet systems are generally unprotected, which provides precious moments for hackers to illegitimately disclose authorized information. Attackers are attracted to confidential information and constantly attempt denial-of-service attacks against authorized users. IDSs are categorized based on structure and detection method; they can also be classified based on features, as shown in

Structure Based IDS
Intrusion Detection Systems can be divided into three different types based on its structure: Host-Based, Network-Based and Application Based (Kaur & Kumar, 2020).

Network-Based IDS (NIDS)
An NIDS fundamentally monitors and analyzes network data flow in real time, searching for attacks or intrusions and identifying suspicious activities between any two networks through installed sensors, and it notifies the system administrators about them (Li, et al., 2018). The most important features of this type are its cost effectiveness and its ability to capture attacks that slip past an HIDS (host-based intrusion detection system), without further modification to the network. On the other hand, it cannot inspect encrypted data, and it requires observing the network over time (Anwar, et al., 2017). Network-based intrusion detection systems are usually deployed inside the router. The NIDS network interface cards are placed into promiscuous mode; therefore, the system receives and monitors all data packets transferred through the network, irrespective of their destination. It performs most of its scrutiny at the application layer, e.g., Hypertext Transfer Protocol (HTTP), Simple Mail Transfer Protocol (SMTP), and Domain Name System (DNS). If there is an attack, a message or an alarm to the system administrator is initiated (Azhagiri, et al., 2015).

Host-Based Intrusion Detection System
Host-based IDSs are fundamentally installed on single specific devices, such as servers or host computers, to analyze and monitor the computer system. All host-based IDSs have software detectors known as agents. Every agent checks activity on a single device and might take necessary actions. Some agents check a single explicit request service; these agents are called application-based IDSs (Anwar, et al., 2017).

Application Based IDS
An application-based IDS examines specific application protocols used in the system. It monitors the files of these applications to distinguish any type of invasion or misuse of protocols and keep the network safe from intrusions. It also observes anomalies such as negated file execution, exceeded authorization, or changed protocol behavior (Agrawal & Agrawal, 2020).

Detection Based IDS
An IDS's core goal is to reduce false alarms, which means reducing both false-positive and false-negative ones. The fundamental types are signature-based and anomaly-based detection.

Signature Detection
Also known as misuse-based detection, a signature IDS is preconfigured to match the signatures of known attack types on the received route. These signatures are stored in a database to assist in discovering attacks in a highly precise way. Signature detection is very effective in detecting prominent types of attacks, but inefficient in detecting unknown or new attacks due to the lack of signatures; hence, it suffers a high false negative rate for such attacks (Kaur & Kumar, 2020).

Anomaly IDS (AIDS)
It is also named the "behavior-based IDS". It depends on learning the normal operation of authorized users and storing the signature patterns of normal data in a database, so that any abnormal or malicious activity in the network can be detected efficiently by comparison with the stored normal patterns. The drawback of this type lies in the difficulty of such comparisons with large network data: it is a time-consuming online operation and requires large storage memory. The performance of an anomaly IDS is degraded by generating false alarms and by having to examine each suspected data packet. It is also known as a statistical IDS. Distinguishing an ordinary packet from an intrusive one takes enormous effort, and in a typical process, if there is any sort of discrepancy, the system activates an alarm automatically. This type is usually associated with proactive intervention (Ahmad, et al., 2021).

Attack Based
This type is classified as follows:

Normal Attack
This can be seen as a passive state without any attack pattern. It represents a state where the network shows no signs of change and no abnormal attack takes place in the status of the network (Kaur & Kumar, 2020).

DOS-Attack
This is usually called the denial of service (DOS) attack, and it has many sub-types. In this type of attack, the intruder (hacker) carries out diverse unauthorized actions, such as illegal computation or flooding the victim's computer memory with invalid network packets. This makes the system unable to respond to authorized requests issued by authentic users. The hacker may carry out a botnet attack and benefit from the remoteness (Kaur & Kumar, 2020).

Probe Attack
This kind of attack comprises data collection through analytical processes to extract a usable list of IP addresses related to privileged services. The goal is to carry out an intelligent and effective attack on these services (Kaur & Kumar, 2020).

Remote to Local Attack (R2L)
R2L attacks typically gain access to authorized resources through attack software that allows the hackers to perform an incompatible order of operations on the network server. The rest of the R2L attacks rely on password guessing (Kaur & Kumar, 2020).
User to Root Attack (U2R)
A user to root attack is an activity where a hacker uses a spoofed address to make the network vulnerable, or spreads malicious programs into the network to dissipate the victims' resources. The well-known U2R attack is the buffer overflow, where the hacker takes advantage of a flaw in a system program to write extra data into a buffer through implanted malware (Kaur & Kumar, 2020).

RELATED WORKS
In 2014, Deeman Y. Mahmood et al. worked on an intrusion detection system with binary classification using unsupervised machine learning. They used the unsupervised K-means clustering algorithm with k=2 to classify the input data into two classes, normal and attack. The KDD dataset was used with 41 features, and using information gain (IG) these features were reduced to the 23 most important ones. The NSL-KDD dataset was separated into 60% training and 40% test sets. The experimental results showed that the proposed approach achieved a high accuracy of 97.22% with a low false positive rate of 2%. In 2017, Decision Table, MLP (multilayer perceptron), Naïve Bayes, and Bayes Network models were evaluated. The models were implemented using the KDD dataset, with a focus on the false negative and false positive rates achieved by the applied models. The performance metrics showed that the Decision Table achieved a low false negative rate of 0.2% and a higher false positive rate of 7.3%, which means that 7.3% of the data packets were falsely classified as attacks (Almseidin, et al., 2017). In 2018, Rahul Vigneswaran et al. proposed classical and deep neural networks for network intrusion detection in cyber security. The KDDCup99 dataset was used for both the training and testing sets. Comparisons were made between DNNs (Deep Neural Networks) and classical machine learning algorithms, such as boosting-based binary classifiers, Decision Tree, K-Nearest Neighbor, Linear Regression, Naïve Bayes, Random Forest, SVM-Linear (Support Vector Machine), and SVM-rbf. The DNN was used with one to five layers, a learning rate of 0.1, and one thousand epochs. The study showed that the DNN with 3 layers achieved better performance than all other models used in the tests (K, et al., 2018). In 2019, S. Sandosh et al.
proposed an enhanced intrusion detection system using agent clustering and K-Nearest Neighbor classifiers with outlier detection in preprocessing. The KDDCup99 dataset was preprocessed first, to remove unwanted outlier data instances. The unlabeled data was clustered by the K-means clustering algorithm using agent-based clustering sub-groups. Attacks were identified by K-Nearest Neighbor (KNN), classifying the received data into known (normal) and unknown (attack) data. The empirical results showed that the enhanced system performed better than other classifier models; the proposed model achieved 92.23% accuracy with a false negative rate of 0.7%, which is higher compared with the other models used. In 2020, Random Forest (RF), Extra-Tree Classifier (ETC), and Decision Tree (DT) models were used to classify data into five classes: one for normal data and four for intrusive data. The aim was to improve detection rates while reducing highly complex features. First, the NSL-KDD dataset was preprocessed into four different sub-groups of reduced features (dimensionality reduction). Experimental results showed that each of Random Forest (RF), Extra-Tree Classifier (ETC), and Decision Tree (DT) achieved over 99% accuracy for all intrusive classes in all sub-groups (Abrar, et al., 2020). Also in 2020, Alif Nur Iman et al. presented an improvement to intrusion detection using optimal Random Forest parameters to solve the infinite-loop problem in the Boruta algorithm. Using estimated selected features with the NSL-KDD dataset, entropy and the Gini index were employed in preprocessing. The Random Forest classifier model was used with different depth parameters and numbers of trees.
The empirical results revealed that the proposed design mitigates the infinite loop in the Boruta algorithm with a depth parameter equal to 7; at the same time, the running period and the number of iterations were improved (Iman & Ahmad, 2020). Also in 2020, Gurbani Kaur et al. proposed an Artificial Neural Network (ANN) algorithm for intrusion detection based on the Gray Wolf Optimization (GWO) algorithm. The ANN was used to classify input data into normal and different types of attacks based on the KDDCup99 dataset. In addition to GWO, PSO (Particle Swarm Optimization) and GA (Genetic Algorithm) were used to optimize the ANN parameters. The ANN with the GWO optimizer showed better performance in comparison with the other techniques, namely the ANN without optimization, the ANN with PSO, and the ANN with GA (Kaur & Kumar, 2020). In 2021, Chao Liu et al. proposed a hybrid intrusion detection system using a combination of K-means, Random Forest, and deep learning. They used a multi-stage design with the unsupervised K-means clustering algorithm and a Random Forest binary classifier, implemented on the Spark platform. The NSL-KDD and CIC-IDS2017 datasets were used for training and testing the model. A deep learning stage was added for further classification of the data labeled by the first and second stages as normal or attack. The empirical results showed that the presented approach achieved a high true positive rate for all types of attacks, with quick response and less training time, combined with a significant improvement in accuracy.

METHODOLOGY AND SYSTEM DESCRIPTION
An overview of the proposed classifier model is shown in Fig. 3. It consists of several blocks in cascade: the KDD99 dataset, preprocessing, model training, test-set classification, and performance evaluation.

Dataset Model
The KDD99 (Knowledge Discovery and Data Mining) dataset is used in this work due to the need for a large, credible dataset for intrusion detection systems. It is a well-known standard benchmark dataset used to evaluate the performance of machine-learning-based intrusion detection systems, and it can be accessed from the below link:

Content Group
These features use domain information to scrutinize attacks in the content segments of the tcpdump files. They assist in detecting R2L and U2R attacks. The group consists of 13 features, from the tenth to the twenty-second feature (Xin, et al., 2018).

Time Group
These features inspect connections within a 2-second time window and record statistical information about all connections. The group has 9 feature attributes, from feature twenty-three to feature thirty-one (Xin, et al., 2018) (Zhang, et al., 2018).

Host Group
These features provide statistical data about a window of 100 connections with the same host and same service. The group comprises 10 feature attributes, from thirty-two to forty-one (Xin, et al., 2018). Although the KDD99 dataset is older than 20 years, it is still the most trusted benchmark dataset for intrusion detection research, for several reasons. Firstly, it is open source, available online, and was extensively used by researchers in more than 142 studies (from 2010 to 2015). Secondly, about 24% of research in the intrusion detection field uses this dataset (Ahmad, et al., 2021). Thirdly, it allows this work to be compared with others that used the same dataset.

Preprocessing
Preprocessing is required to prepare the KDD99 dataset, embedded with 41 features and target labels. Firstly, the categorical attributes are converted into 9 attributes using one-hot encoding. Secondly, the input feature values are scaled into the range 0 to 1 using the min-max normalization of equation (1) (Farhana, 2020).
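These two steps can be sketched in plain Python on hypothetical toy values (the column names and values below are invented for illustration, not the real KDD99 columns; the paper itself uses scikit-learn). The min-max rule of equation (1) is assumed to be the standard form x' = (x − min)/(max − min).

```python
def one_hot(values):
    """Map each categorical value to a 0/1 indicator vector (one-hot encoding)."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

def min_max(values):
    """Min-max normalization: rescale numeric values into the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

protocols = ["tcp", "udp", "icmp", "tcp"]  # hypothetical protocol_type column
src_bytes = [0, 150, 300, 600]             # hypothetical src_bytes column

print(one_hot(protocols)[0])  # "tcp" with categories sorted as icmp, tcp, udp -> [0, 1, 0]
print(min_max(src_bytes))     # [0.0, 0.25, 0.5, 1.0]
```

In practice the same effect is obtained with library encoders and scalers; the sketch only makes the two transformations explicit.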

Information Gain Feature
Mutual information (MI) is estimated for each input feature. The mutual information between two random variables is a non-negative value related to the dependency between the variables: it equals zero when the two variables are independent and grows with stronger dependency. It measures the amount of information one can obtain about one random variable by observing the other. Using Information Gain (IG), the forty-one (41) features of the dataset are ranked from the most important to the least, as shown in Table 3. The most important features are shown graphically in Fig. 4.
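A from-scratch sketch of information gain on a toy column can make the ranking idea concrete: IG(feature) = H(labels) − H(labels | feature). The `feature` and `labels` values below are invented for illustration; the actual Table 3 ranking comes from the dataset.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """IG = H(labels) - weighted H(labels | feature value)."""
    n = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

# Hypothetical toy column: a feature that perfectly separates normal/attack
feature = ["tcp", "tcp", "udp", "udp"]
labels = ["normal", "normal", "attack", "attack"]
print(information_gain(feature, labels))  # 1.0: the feature fully determines the label
```

A feature with IG near the label entropy is maximally informative; a feature independent of the label has IG near zero.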

Pearson Correlation
Pearson correlation measures the strength and direction (positive or negative) of the linear relationship between two variables. The correlation between each input feature and the target output (outcome) is shown in Table 4. Features that are highly correlated with the response (outcome), irrespective of the sign, are good features for predicting the output. High correlation can be observed in the color hues of Fig. 5.
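A minimal sketch of the Pearson coefficient, r = cov(x, y)/(σx·σy), on hypothetical values (the feature values below are invented; the real Table 4 numbers come from the dataset):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical feature vs. binary outcome (1 = attack, 0 = normal)
count = [1, 2, 8, 9]        # assumed values of the `count` feature
outcome = [0, 0, 1, 1]
print(round(pearson(count, outcome), 3))  # close to 1: strong positive correlation
```

Either sign is useful for prediction; only magnitudes near zero indicate a weak linear relationship.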

F-test
The ANOVA F-test (analysis of variance), available in the scikit-learn library as f_classif, outputs F-statistics and p-values. A comparison of each feature with the response variable (outcome) is shown in Table 5. In scikit-learn, the F-test assists in univariate feature selection. This is helpful with a large number of features, where many are useless in the current scope and a quick way to short-list the most useful ones is required. For example, retrieving only 20% of the features with the highest F-statistics leaves 8 useful input features (41 × 0.2 ≈ 8), for all used types (Müller & Guido, 2016). The most important features selected by each method are shown in Table 6. Comparing the three methods, Pearson correlation and the F-test produce the same ranking for all input features; information gain differs from them only in the last two features, src_bytes and dst_bytes (as clearly indicated in Table 6). Input feature selection and reduction for binary classification of intrusion detection with different classifier models resulted in different performance values.
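The one-way ANOVA F-statistic behind f_classif can be sketched from scratch for a single feature split by class label: F is the between-group variance divided by the within-group variance, so a large F means the feature separates the classes well. The group values below are hypothetical, for illustration only.

```python
def anova_f(groups):
    """One-way ANOVA F-statistic: between-group over within-group mean squares."""
    k = len(groups)                              # number of classes
    n_total = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ssb / (k - 1)) / (ssw / (n_total - k))

# Hypothetical values of one feature, split by class (normal vs. attack)
normal = [1.0, 2.0, 1.5, 1.5]
attack = [8.0, 9.0, 8.5, 8.5]
print(anova_f([normal, attack]))  # large F -> the feature discriminates the classes
```

Ranking all features by this statistic and keeping the top 20% reproduces the short-listing idea described above.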

Models Description
Classification is one of the purposes of supervised machine learning. The model is trained to enable intrusion detection and classification: the input data is classified into two classes, normal and abnormal (attack). The classifier models are built using different supervised machine learning algorithms based on:
- Distance approaches: KNN, SVM, and Logistic Regression (Mukhopadhyay, 2018).
- Probability approaches: Naïve Bayes.
- Rule approaches: Decision Tree and Random Forest.
The classifiers models are described in the next subsections.

KNN
KNN stands for K-nearest neighbors, a lazy learning algorithm based on the Euclidean distance between an input data point and nearby points. For each input data point, the model measures the distances between this point and several neighboring data points (the k value); the class of the point then depends on the majority class among those neighbors. Simplicity and high efficiency are significant characteristics of this model; slowness and long time consumption are its drawbacks.
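The distance-and-vote procedure can be sketched in plain Python (the 2-feature points below are invented for illustration; in the experiments the scikit-learn implementation is used):

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, labels, point, k=3):
    """Classify `point` by majority vote among its k nearest training points."""
    ranked = sorted(range(len(train)), key=lambda i: dist(train[i], point))
    votes = Counter(labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature points: low values = normal, high values = attack
train = [(0.1, 0.2), (0.2, 0.1), (0.9, 0.8), (0.8, 0.9)]
labels = ["normal", "normal", "attack", "attack"]
print(knn_predict(train, labels, (0.15, 0.15), k=3))  # "normal"
```

The slowness mentioned above comes from this distance computation against the whole training set at prediction time.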

Logistic Regression
Logistic regression is a linear supervised machine learning model used in classification. It is a powerful model, especially for high-dimensional data, and it counteracts overfitting through a tuning parameter that controls its behavior: the regularization parameter C. When the value of C is high, there is less regularization (a more complex model) and higher training performance; low values of C lead to a simpler model with lower training performance.

Linear SVM (Support Vector Machine)
Linear SVM is implemented as support vector classification. Kernelized support vector machines are an extension that permits building more complex classifiers. One way to make a linear model more flexible is to add more features, e.g., interactions or polynomials of the input features; with these added, the model is no longer linear in the original inputs. Here, the similarity between data points is measured by the Gaussian kernel:
k(x1, x2) = exp(−γ ∥x1 − x2∥²)
where ∥x1 − x2∥ is the Euclidean distance between the points, and gamma (γ) is a parameter that controls the width of the Gaussian kernel.
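A quick numerical check of the kernel on toy points (invented for illustration) shows how gamma controls the width: larger gamma makes the kernel narrower, so distant points get similarity near zero.

```python
from math import exp, dist

def rbf_kernel(x1, x2, gamma=1.0):
    """Gaussian (RBF) kernel: exp(-gamma * ||x1 - x2||^2)."""
    return exp(-gamma * dist(x1, x2) ** 2)

a, b = (0.0, 0.0), (1.0, 1.0)        # hypothetical data points
print(rbf_kernel(a, a))               # 1.0: identical points, maximum similarity
print(rbf_kernel(a, b, gamma=1.0))    # exp(-2), about 0.135
print(rbf_kernel(a, b, gamma=10.0))   # near 0: a narrow kernel ignores distant points
```

This is why tuning gamma together with C changes the SVM results reported later: gamma sets how locally the decision boundary bends around the training points.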

Naïve Bayes Classifier
It is a well-known, widely used supervised machine learning classifier based on Bayes' theorem, and it is known for its simplicity. The posterior probability P(A|B) is calculated from Bayes' rule:
P(A|B) = P(B|A) P(A) / P(B)
where P(A) is the prior probability and P(B) is the evidence; the classifier assumes the input features are conditionally independent.
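A worked numeric example of Bayes' rule with hypothetical probabilities (the numbers below are invented, not taken from the dataset) makes the posterior computation concrete:

```python
def posterior(prior_attack, p_evidence_given_attack, p_evidence_given_normal):
    """Bayes' rule: P(attack | evidence) = P(e|a) P(a) / P(e)."""
    prior_normal = 1.0 - prior_attack
    evidence = (p_evidence_given_attack * prior_attack
                + p_evidence_given_normal * prior_normal)
    return p_evidence_given_attack * prior_attack / evidence

# Hypothetical numbers: 20% of traffic is attacks; a suspicious flag appears
# in 90% of attacks but only in 10% of normal packets.
print(posterior(0.2, 0.9, 0.1))  # 0.18 / (0.18 + 0.08), about 0.69
```

The Naïve Bayes classifier applies this rule per class, multiplying one such likelihood term per feature under the conditional-independence assumption.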

Decision Tree (DT)
It is a rule-based classifier that sorts input data by attribute values. Each internal node of the tree represents an input feature, and its branches represent the feature's values. Classification starts at the root level (according to the features' values) and splits the data by different measures used in sample identification, e.g., information gain and the Gini index.
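As a sketch of one such split measure, the Gini index of a node and the impurity decrease of a candidate split can be computed as follows (toy labels, invented for illustration):

```python
def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions. Zero for a pure node."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gain(labels, left, right):
    """Impurity decrease when `labels` is split into `left` and `right` children."""
    n = len(labels)
    weighted = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted

parent = ["normal", "normal", "attack", "attack"]
print(gini(parent))                                                   # 0.5
print(split_gain(parent, ["normal", "normal"], ["attack", "attack"]))  # 0.5: a perfect split
```

At each node the tree chooses the feature and threshold whose split maximizes this gain.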

Random Forest (RF)
It ensembles several Decision Tree classifiers, combined to obtain accurate and robust predictions with an overall improvement in outcomes.

Stochastic Gradient Descent (SGD)
An iterative algorithm that starts from an initial value and minimizes the cost (error) function, obtaining a new value X_new from the current value X_old using the update: X_new = X_old − learning_rate × derivative(X_old), where derivative(X_old) is the gradient value. This algorithm works well with high-dimensional datasets (Klosterman, 2021).
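The update rule can be sketched on a toy one-dimensional cost function (f(x) = x², whose derivative is 2x; the starting point and learning rate below are chosen only for illustration):

```python
def gradient_descent(derivative, x0, learning_rate=0.1, steps=100):
    """Iterate x_new = x_old - learning_rate * derivative(x_old)."""
    x = x0
    for _ in range(steps):
        x = x - learning_rate * derivative(x)
    return x

# Minimize f(x) = x^2 (derivative 2x); the minimum is at x = 0.
x_min = gradient_descent(lambda x: 2 * x, x0=5.0)
print(x_min)  # very close to 0
```

The stochastic variant used by the SGD classifier applies the same update per (mini-batch of) training samples rather than over the full dataset, which is what makes it scale to high-dimensional data.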

Gradient Boosting Classifier
A powerful ensemble that combines decision tree classifiers for precise prediction. This type suits data arranged in tabular format (Klosterman, 2021).

Ada Boost Classifier
It consists of an ensemble of machine learning models (estimators) trained and combined in a sequential way, where each estimator focuses on the samples the previous ones misclassified.

Evaluation Metrics For Binary Classification
Binary classification is the process by which data is divided into two classes: one for normal data and the other for attacks. The metrics used with this classifier are built from four counts, arranged in a confusion matrix as follows:
- TP (True Positive) is an attack truly detected by the model.
- TN (True Negative) is normal data correctly recognized by the model.
- FP (False Positive) is normal data that the model considered an attack.
- FN (False Negative) is actually an attack but not recognized correctly by the classifier, which may cause a breach of security and be catastrophic.
Table 7 shows the details of the confusion matrix.

Journal of Engineering Volume 28 May 2022 Number 5

Accuracy
Accuracy relates all data correctly classified by the model to the total data.

Precision
Precision is the truly classified attack data over all data predicted as attacks.

Recall
Recall is the correctly recognized attack data divided by all actual attacks. The normal data falsely classified as attacks, divided by all normal data, is the false alarm rate (FAR).
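The metrics defined in this section follow directly from the four confusion-matrix counts; a small sketch with hypothetical counts (not results from the paper):

```python
def metrics(tp, tn, fp, fn):
    """Standard binary-classification metrics from the confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)        # true positive rate / sensitivity
    far = fp / (fp + tn)           # false alarm rate
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, far, f_score

# Hypothetical counts for one classifier run
acc, prec, rec, far, f1 = metrics(tp=90, tn=95, fp=5, fn=10)
print(acc, prec, rec, far, f1)  # 0.925, ~0.947, 0.9, 0.05, ~0.923
```

The error rate reported in the tables is simply 1 − accuracy.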

PERFORMANCE ANALYSIS
In this section, the performance analysis of all classifiers under the current scope is presented, based on the obtained results.

KNN
The KNN model is implemented with 8 features selected by information gain, using different training/testing split ratios: 60%, 70%, and 80%. The results are shown in Table 8. The results of KNN with 8 features selected by the F-statistic, with the same three split ratios, are shown in Table 9. The KNN classification results with a 70% train-test split and n_neighbors equal to five are shown in Table 11.

Logistic Regression Model
The performance of the Logistic Regression model is calculated by changing the tuning parameter C from 1 to 100, as shown in Table 12. More regularization (low values of C) leads to a simple model with lower performance; less regularization (high values of C) leads to a complex model with higher performance.

Linear SVM Model
The performance of the Linear SVM model with different tuning parameters, C and gamma, is evaluated and shown in the next tables. Table 13 shows different scenarios for the performance of the Linear SVM model with different values of the tuning parameters, using information gain in a reduced feature space (8 features) and a data split of 70% (30% for the test set); here, both C and gamma are increased together. Table 14 shows the performance of the Linear SVM model when gamma is kept constant at one and C is varied from 0.1 to 10. In the second and third columns, C is kept constant at one while gamma is changed from 0.1 to 10; the fourth column is for C equal to one thousand (1000) and gamma equal to 0.1. Keeping the train-test and attack-normal ratios constant, the classifications of the Linear SVM are shown in Table 16. Separation-wise (train-test) results for the SGD, Naïve Bayes, and Decision Tree models are shown in Table 18, using information gain with 8 features, a 70% train set, and a 30% test set.

RF-GB-AB
This part focuses on the results obtained from the Random Forest (RF), Gradient Boosting (GB), and Ada Boost (AB) classifiers (the title concatenates their initials). The performance metrics are shown in Table 19, which indicates that decreasing the model complexity reduces the training set accuracy (as expected); lowering the maximum depth of the tree provides a significant parametric improvement, while lowering the learning rate only slightly increases generalization.

OUTCOMES
In the next subsections, important outcomes obtained using the different classifiers are presented, starting with the KNN classifier and ending with Ada Boost.

KNN
In this subsection, based on the values obtained in the Performance Analysis section, key points related to the KNN classifier are highlighted:
- With 8 features, better performance was achieved using information gain (IG) than the F-statistic: the accuracy is higher and the error rate lower with information gain, as shown in Table 8 and Table 9.
- Using a 70% separation rate, better performance was obtained compared to the other ratios, for both feature selection techniques (IG and F-statistic), in terms of accuracy and error rate.
- The n_neighbors parameter controls the performance of the classifier. With n_neighbors equal to one (1), the boundaries between training data are tight and the model is complex. Fig. 6 shows the performance of the KNN model with n_neighbors ranging from 1 to 10; with n_neighbors greater than one (n_neighbors>1), a smoother decision boundary is obtained and the model is classified as simple.
- Two important parameters control the classification: the number of neighbors (which works well for values 3 to 5, as shown in Fig. 6) and the rule used to calculate the distances between the data points (by default, the Euclidean distance is used).
- Using the eight features nominated by IG gave better results, in terms of accuracy and error rate, than using all the features, as shown in Fig. 7.

Decision Trees
In the Decision Tree classifier, the performance is evaluated using the receiver operating characteristic (ROC) curve, as shown in Fig. 8. The depth parameter controls the number of levels in the tree. Limiting the depth of the tree to 4 (max_depth=4) decreases overfitting; this leads to lower accuracy on the training set, but an improvement on the test set.
The most important feature selected by the Decision Tree (DT) classifier is the count feature, followed by src_bytes, dst_bytes, and service, with a big margin between count and the rest. Fig. 9 shows that the value for count reaches 0.88, while the next feature's value is less than 0.05. A value of 0.88 means that most of the necessary information has been taken from the count feature; at the same time, this does not mean that the other features are useless, but that the information they contain is either repeated or the same. The main drawback of DT is that it tends to overfit the training data; random forests are one way to counteract this problem.

Figure 7. Comparison of all and 8 features.

Ensemble Classifiers
The ensemble classifiers within the scope of this work are Random Forest, Gradient Boosting and Ada Boost. Fig. 10 (a, b, and c) shows the feature importance for each classifier, respectively. In Random Forest, all eight (8) features are used, with the highest importance value, around 0.35, belonging to the count feature, as shown in Fig. 10a. In Gradient Boosting, only two features are used, with a big margin in between; the importance value of the count feature reaches above 0.85, as shown in Fig. 10b. The result of the last ensemble classifier, Ada Boost, is shown in Fig. 10c. For this classifier, all eight (8) features are important, with the highest value, reaching 0.3, belonging to the service feature. Next to service is dst_bytes, followed by src_bytes, protocol_type, count, srv_count and logged_in. Contrary to the other classifiers, the count feature is the least important one.
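A comparison of this kind can be sketched as follows. The first seven feature names are those mentioned above; the eighth ("flag") is a placeholder drawn from the full KDD99 feature set, since the paper's text names only seven, and the data here is a synthetic stand-in.

```python
# Sketch: comparing feature_importances_ across the three ensembles used in
# the paper. Synthetic stand-in data; "flag" is a placeholder feature name.
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)

names = ["count", "src_bytes", "dst_bytes", "service",
         "protocol_type", "srv_count", "logged_in", "flag"]
X, y = make_classification(n_samples=1500, n_features=8, n_informative=6,
                           random_state=1)

importances = {}
for model in (RandomForestClassifier(random_state=1),
              GradientBoostingClassifier(random_state=1),
              AdaBoostClassifier(random_state=1)):
    model.fit(X, y)
    # feature_importances_ is normalized to sum to 1 for each ensemble.
    importances[type(model).__name__] = dict(zip(names,
                                                 model.feature_importances_))
```

Sorting each inner dictionary by value reproduces per-classifier rankings like those in Fig. 10 (a, b, and c).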

COMPARISONS AND DISCUSSION
A binary classification procedure is employed for intrusion detection using different classifier models based on supervised machine learning algorithms. The performance of each model is measured by several evaluation metrics: accuracy, precision, recall, f-score, error rate, true positive rate, false positive rate and the confusion matrix. The experimental results are arranged in tables for the several models based on different rules, and the major key points are listed below:
- Different dimensionality reduction techniques were used to shrink the input feature space, so that only 20% of all features in the KDD99 dataset were used. From 41 features, only 8 were selected, which significantly improved the models' performance while minimizing the time consumption and memory storage required. Information gain showed a better performance outcome than the other methods used in this work.
- The train-test ratio of 70% showed better results than the other ratios in terms of accuracy and error rates.
- The numbers of attack and normal data instances were evaluated for all implemented models, as shown in the tables. In all cases, the number of normal instances was kept constant in the training and test sets for all implemented classifiers.
- Different tuning parameters were used to improve and control the performance of the employed models. The best outcome among all classifiers is achieved by the Random Forest classifier with 8 features, as it ensembles a number of decision trees to minimize overfitting.
- The contribution of this work is emphasized in Table 20; the second column (This work) shows the better achieved values.
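The 41-to-8 feature reduction can be sketched with Scikit-Learn's selectors. `mutual_info_classif` is scikit-learn's information-gain-style criterion and `f_classif` its F-Statistic counterpart; the data below is a synthetic stand-in for KDD99.

```python
# Sketch: selecting 8 of 41 features (an 80% reduction), as in the paper.
# mutual_info_classif approximates information gain; f_classif is the
# F-Statistic alternative. Synthetic stand-in data replaces KDD99.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif

X, y = make_classification(n_samples=1000, n_features=41, n_informative=10,
                           random_state=7)

ig = SelectKBest(mutual_info_classif, k=8).fit(X, y)   # information gain
fs = SelectKBest(f_classif, k=8).fit(X, y)             # F-Statistic

X_ig = ig.transform(X)    # 41 columns reduced to the 8 highest-scoring ones
```

The boolean mask from `ig.get_support()` identifies which columns survive, so any classifier in the study can be retrained on the reduced matrix.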

CONCLUSIONS AND FUTURE WORKS
With the advancement of intrusions and attacks, machine learning based classification has become a necessity, and this is where this work falls in: increasing the performance of intrusion detection systems. The main contributions of this work are listed below:
- Among the techniques employed, information gain showed the best performance for feature selection.
- The research work considered several models with varied tuning parameters to control their performance. The Random Forest classifier achieved the best performance, with an accuracy of 99.96% and an error rate of 0.038%.
- One valuable result of this work (in addition to the aforementioned one) is that using only 8 features from the dataset (an 80% reduction in features), a very good performance value of 99.96% was obtained. These features are extracted from the packet header and not the payload, which means little processing and fast detection in online intrusion detection systems directly connected to the internet. The result is a dual speed-up: the first factor comes from working with only 8 features, and the second from extracting data from the packet header rather than the payload.
- Another point is that the train-test ratio of 70% showed better results than the other ratios in terms of accuracy and error rates. A similar finding has not been highlighted by other researchers working in this field. Also, the false negative alarm rate was reduced to 0.037%.
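Rates such as accuracy, error rate and false negative rate all derive from the confusion matrix. The sketch below shows the derivation with illustrative counts, not the paper's data.

```python
# Sketch: deriving accuracy, error rate and false negative (missed attack)
# rate from a confusion matrix. The label vectors here are illustrative.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]   # 1 = attack, 0 = normal
y_pred = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

# ravel() on a binary confusion matrix yields (tn, fp, fn, tp).
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
error_rate = 1 - accuracy
fnr = fn / (fn + tp)        # false negative rate: attacks missed by the IDS
```

With these example labels the matrix gives tp=4, tn=4, fp=1, fn=1, so the accuracy is 0.8 and the false negative rate 0.2.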
For future work:
- This work could be extended to other machine learning tasks such as regression and multi-class classification.
- The dataset used was an unbalanced one: 75% attacks and 25% normal. Working with a balanced one, with a 50% split between normal and attacks, will be a future direction.
- Other datasets could be used in order to design a model with high robustness against possible attacks in a real-time environment, as the goal is to cope with various new intrusions and attacks and to store all signatures in an updated database.
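The balancing idea mentioned above could be realized, for instance, by undersampling the majority (attack) class. This is a minimal sketch with illustrative class proportions matching the 75%/25% split described in the text.

```python
# Sketch of the future-work idea: undersampling the majority class of a
# 75% attack / 25% normal label vector down to a 50/50 split.
import numpy as np

rng = np.random.default_rng(0)
y = np.array([1] * 750 + [0] * 250)        # 1 = attack (75%), 0 = normal (25%)

idx_attack = np.flatnonzero(y == 1)
idx_normal = np.flatnonzero(y == 0)

# Randomly keep as many attack instances as there are normal ones.
keep = rng.choice(idx_attack, size=idx_normal.size, replace=False)
balanced_idx = np.sort(np.concatenate([keep, idx_normal]))
y_bal = y[balanced_idx]                    # now 50% attacks, 50% normal
```

The same index vector would be applied to the feature matrix, so that every classifier in the study could be retrained on the balanced set.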