An Intelligent Technique to Predict the Autism Spectrum Disorder Using Big Data Platform

Autism, or autism spectrum disorder (ASD), is considered a psychiatric disorder. It is a condition that constrains the use of linguistic, cognitive, communicative, and social skills and abilities. Recently, many data mining techniques have been developed to help autism patients by discovering the main features of the condition and the correlations between them. In this paper, we employ the association classification (AC) technique as a data mining approach to predict whether or not an individual has autism. The Intelligent Classification Based on Association rules (ICBA) algorithm is proposed for finding the correlations between the features to decide whether an individual has autism at an early stage, especially in childhood. The ICBA algorithm incorporates the chi-square method to select the best features for making the decision, in addition to proposing new techniques in all phases and increasing the number of folds to 2^(size of data/10). The proposed algorithm is compared against four well-known AC algorithms in terms of accuracy to evaluate their behavior in the prediction task on a big data platform. The results show better performance for the ICBA algorithm in most experiments. Moreover, all of the considered algorithms achieved a higher level of accuracy when the chi-square method was used.

easily understood to select the most important ones. However, the number of rules and their types may play a dominant role in the prediction phase (Liu et al., 1998; Wa'el et al., 2010; Abdelhamid et al., 2014; Ma et al., 2014; Abdelhamid et al., 2015; Alazaidah et al., 2015; Taware et al., 2015; Shahin et al., 2019). Sorting the generated association rules is one of the most critical issues in deciding which rules have the highest importance and which have the lowest importance and can hence be eliminated. Support and confidence are the main measures used to differentiate between association rules (Tan et al., 2006; Hadi, 2013; Alwidian et al., 2020).
The big data solution offers two strengths for any machine learning approach: 1) increasing the size of the data exponentially in the training phase may enhance the performance of the machine learning approach, and 2) increasing the number of computations exponentially on data of normal size can enhance the training and testing phases (Xing et al., 2015). In our proposed solution, we employ the second strength to obtain accurate measurements.
In this paper, a statistical measure is investigated and tested to see how it affects the accuracy of AC techniques. In addition, the accuracy level is computed using an exponential number of folds to make the measurement more accurate. A set of experiments is conducted using autism datasets selected from the UCI repository to evaluate the most common AC algorithms: Classification Based on Association Rules (CBA), Multi-class Classification based on Association Rule (MCAR), Fast Associative Classification Algorithm (FACA), and Fast Classification Based on Association rules (FCBA). All of these algorithms are compared against the proposed Enhanced FCBA algorithm (ICBA) in terms of accuracy in relation to autism patients.
The paper is organized as follows: Section 2 presents the background on AC. Section 3 gives details of autism. The related work is presented in Section 4. Section 5 presents the proposed big data platform. The proposed technique is described in detail in Section 6. Extensive experiments and their results are presented in Section 7. Finally, the conclusions and future research suggestions are presented in Section 8.

AC Background
In Alwidian et al. (2018), the AC approach combined association rules with the classification task: for example, in a rule such as X → c, c is a class value. The training dataset T has m distinct attributes A1, A2, ⋯, Am and C as a list of class values. A training object in T can be described as a set of attributes A1, A2, ⋯, Am and a class (c), and an item is described as an attribute, Ai, and its value, ai, where an itemset is a set of combined items contained in a training object.
A rule item r is formed as <itemset, c>, where c is the class value. The actual occurrence of a rule item r in T is the number of tuples in T that match the itemset defined in r, whereas the support-count of rule item r is the number of tuples in T that match r's itemset and belong to the class c of r, as shown in Equation 1.

Equation (3)
Any rule item whose support exceeds the minimum-support value is a frequent rule item and is described as: (A1, a1) ∧ (A2, a2) ∧ … ∧ (Ak, ak) → c.

Autism Spectrum Disorder (ASD) Background
ASD is a brain development disorder that limits communication and social behaviors (Bolton et al., 1994; Thabtah, 2017). A number of tools are used for ASD diagnosis. Examples of clinical diagnosis approaches are the Autism Diagnostic Interview (ADI) (Lord et al., 1994) and the Autism Diagnostic Observation Schedule-Revised (ADOS-R) (Lord et al., 2014). To enhance the accuracy of ASD diagnosis, researchers have recently adopted machine learning approaches (Bone et al., 2014; Duda et al., 2016; Wall et al., 2012a; Wall et al., 2012b). The main goals of these approaches are to: (1) improve the classification accuracy, (2) reduce the screening time, and (3) identify the smallest number of ASD codes to reduce the complexity of this problem.
mas.ccsenet.org Modern Applied Science Vol. 17, No. 1; 2023
Data mining offers automated classification models for ASD that are effective and efficient. These models combine various mathematical and search methods adopted from the field of computer science (Thabtah, 2007; Thabtah, 2017). Researchers have recently developed a number of data mining techniques for the ASD problem, e.g. support vector machines (Platt, 1998), decision trees (Quinlan, 1993), neural networks (Mohammad et al., 2014), and rule classifiers (Abdelhamid and Thabtah, 2014). ASD diagnosis is regarded as a typical data mining classification problem, in that known classified instances can be used to build a model. The diagnosis of a new instance (ASD, No-ASD) can then be predicted using this model.
Currently, data scientists use existing open source software to achieve this; WEKA (Hall et al., 2009) is an example of such software. The processed dataset is firstly loaded and the data mining algorithm is then applied. Various measures can be used to determine the effectiveness of the selected data mining method for predicting the diagnosis. Examples include accuracy, false positive rates, false negative rates, the model building time, and true negative rate. Data mining software packages often incorporate such evaluation measures.
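The evaluation measures listed above can be made concrete with a minimal sketch. It assumes the standard confusion-matrix definitions (the paper does not spell them out), treats ASD as the positive class, and uses made-up counts; the function name `screening_metrics` is hypothetical.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute common evaluation measures from confusion-matrix counts.

    tp/fn: ASD cases predicted as ASD / No-ASD.
    tn/fp: No-ASD cases predicted as No-ASD / ASD.
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "false_positive_rate": fp / (fp + tn),
        "false_negative_rate": fn / (fn + tp),
        "true_negative_rate": tn / (tn + fp),
    }


# Illustrative counts only; they are not taken from the paper's experiments.
metrics = screening_metrics(tp=40, fp=5, tn=50, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Note that the false positive rate and the true negative rate always sum to 1, so reporting both is mainly a matter of convenience.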

Related Work
Different AC algorithms have been developed to improve classifier accuracy and model building time, each relying on a particular rule generation technique and prediction method. These algorithms include CBA (Liu et al., 1998; Alsahlee et al., 2019), Classification based on Multiple Class Association Rules (CMAR) (Li et al., 2001), and MCAR (Thabtah et al., 2005; Alwedyan et al., 2011). They share common steps in the way they work, but vary in their rule generation process.
In Nagarajan and Babu (2019), the Spark platform was used to evaluate six machine learning algorithms on five health datasets. The evaluation was conducted in terms of accuracy and computational time: the random forest and logistic regression algorithms performed well in terms of accuracy, while Naïve Bayes was the best in terms of computational time.
In Ajayi et al. (2019), big data technologies were used for health safety and risk analytics on a very large dataset. The solution focused on building a big data platform to serve data of this size and the lifecycle of health risk analytics, while the architecture prototype, which interfaced different technology artefacts, was implemented in the Java programming language to predict the likelihood of health hazard occurrences. The proposed architecture was able to find relevant features and enhance the explanatory capacities and preliminary prediction accuracies.
In Liu et al. (1998), the CBA algorithm was proposed to merge the classification task with association rules. This algorithm functions in three phases: rule generation, pruning, and prediction. In the rule generation phase, the Apriori algorithm is used to identify the frequent rule items that represent the Class Association Rules (CARs), i.e. those that pass the minimum-support and minimum-confidence measures. The following steps explain how this is done.
(1) Find the candidate single-item rules.
(2) Find the frequent single-item rules. Select the items for which the support is greater than or equal to a given minimum-support of the candidate set, where the support of an item can be calculated using Equation 4.

$$\mathrm{support}(x \rightarrow y) = \frac{\mathrm{count}(x \cup y)}{n}$$ Equation (4)
where x is the attribute, y is the name of the class, count(x ∪ y) is the number of rows that contain both x and y, and n is the number of rows in the dataset.
(3) Find the two-item candidate rules (i.e. each rule should have two items on the left-hand side, for example (x1, x2 → y)).
(4) Find the frequent two-item rules that satisfy the minimum-support.
(5) Repeat to find the next rules until the candidate set is empty.
(6) Generate the CARs from the produced set by selecting the rules with confidence values greater than or equal to a given minimum-confidence. The confidence of an item can be calculated using Equation 5.
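The steps above can be sketched in a few lines of Python. This is a simplified, level-wise illustration rather than the CBA implementation: support follows Equation 4, while confidence is assumed to be the usual ratio count(x ∪ y) / count(x), since Equation 5 is not reproduced here. The function name `generate_cars`, the `max_len` cap, the candidate-generation shortcut, and the toy rows are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def generate_cars(rows, min_support, min_confidence, max_len=2):
    """rows: list of (items, label); items is a frozenset of (attribute, value) pairs."""
    n = len(rows)
    # Level 1 candidates: every (attribute, value) pair seen in the data.
    frequent_items = sorted({item for items, _ in rows for item in items})
    cars = []
    for k in range(1, max_len + 1):
        next_items = set()
        for cond in combinations(frequent_items, k):
            # An itemset may use each attribute at most once.
            if len({attr for attr, _ in cond}) < k:
                continue
            cond_set = frozenset(cond)
            labels = [label for items, label in rows if cond_set <= items]
            for label, joint in Counter(labels).items():
                support = joint / n  # Equation 4
                if support < min_support:
                    continue
                next_items.update(cond)  # frequent: reuse its items at the next level
                confidence = joint / len(labels)  # assumed form of Equation 5
                if confidence >= min_confidence:
                    cars.append((cond, label, support, confidence))
        frequent_items = sorted(next_items)
        if not frequent_items:
            break
    return cars


# Toy data for illustration only.
rows = [
    (frozenset({("outlook", "sunny"), ("windy", "false")}), "no"),
    (frozenset({("outlook", "sunny"), ("windy", "true")}), "no"),
    (frozenset({("outlook", "overcast"), ("windy", "false")}), "yes"),
    (frozenset({("outlook", "rainy"), ("windy", "false")}), "yes"),
    (frozenset({("outlook", "rainy"), ("windy", "true")}), "no"),
]
cars = generate_cars(rows, min_support=0.4, min_confidence=0.6)
for cond, label, support, confidence in cars:
    print(cond, "->", label, f"support={support:.2f}", f"confidence={confidence:.2f}")
```

On this toy data only three single-item rules pass both thresholds, and no two-item candidate reaches the minimum-support, so the loop stops after the second level.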
After generating the rules, the M1 method is used in the pruning phase to choose the best rules that cover the entire dataset. Finally, to predict the class value for any given instance, the class of the first rule that matches this instance is assigned as its predicted class.
Li et al. (2001) developed a new association classification algorithm (CMAR). This algorithm adopts new approaches in rule generation and classification, which are considered its two main steps. In the rule generation step, FP-tree and CR-tree are employed to generate rules. The classification step in the CMAR algorithm finds the class value for its input by finding all the rules that can predict this input and then evaluating all of these rules to predict the class value. CMAR was compared with some AC algorithms, and the results showed that CMAR outperformed the other algorithms.
Thabtah et al. (2005) proposed the MCAR algorithm to overcome CBA's repeated dataset scanning when generating the rules. In MCAR, the single itemsets are selected using the Tid-list approach. In addition, the occurrences for each rule are kept, facilitating the next itemset generation step without scanning the dataset many times.
In , the Apriori algorithm was optimized using general rule generation to overcome the long time needed in the generation phase to achieve incremental application. The authors proposed the FCBA algorithm and compared this with a set of AC algorithms in terms of accuracy, recall, precision, F1, and building time model measures.
A new FACA algorithm was proposed in Hadi et al. (2016). The Diffset method is used to generate rules to enhance the efficiency of the classifier. It also sorts the generated association rules according to the minimum number of values on the left-hand side. The FACA algorithm uses a multi-rule method in the prediction step to enhance the accuracy level of the classifier. In this phase, the algorithm splits the rules that match the instance into groups based on the class and then selects the class that has the strongest rules. The authors evaluated the efficiency of their algorithm by comparing it with well-known AC algorithms in terms of a set of measures. Table 2 shows the main stages and the internal techniques for the CBA, MCAR, FACA, and FCBA algorithms.
In the rule discovery stage, all of these algorithms generate the same rules based on minimum-support and minimum-confidence as estimated measures; the main difference between them is the data structure used to store the data. In the CBA algorithm, the rules are generated by visiting the database directly without any changes to its structure, which leads to more time being spent at this stage. The MCAR and FACA algorithms, meanwhile, convert the database to lists to solve the multi-scan database problem, which otherwise makes the rule generation process slow.
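To illustrate why a list-based representation avoids repeated database scans, the following minimal sketch builds one transaction-id (tid) list per item and per class in a single pass, then obtains rule counts by set intersection. It is an illustration of the general Tid-list idea, not the MCAR or FACA implementation; the helper names `build_tid_lists` and `rule_counts` and the toy rows are hypothetical.

```python
from collections import defaultdict


def build_tid_lists(rows):
    """Scan the data once, recording which row ids contain each item and each class."""
    tid_lists = defaultdict(set)
    class_tids = defaultdict(set)
    for tid, (items, label) in enumerate(rows):
        for item in items:
            tid_lists[item].add(tid)
        class_tids[label].add(tid)
    return tid_lists, class_tids


def rule_counts(cond, label, tid_lists, class_tids):
    """Return (actual occurrences, support count) of cond -> label by intersecting tid lists."""
    cond_tids = set.intersection(*(tid_lists[item] for item in cond))
    return len(cond_tids), len(cond_tids & class_tids[label])


# Toy data for illustration only.
rows = [
    ([("outlook", "sunny"), ("windy", "false")], "no"),
    ([("outlook", "sunny"), ("windy", "true")], "no"),
    ([("outlook", "overcast"), ("windy", "false")], "yes"),
    ([("outlook", "rainy"), ("windy", "false")], "yes"),
    ([("outlook", "rainy"), ("windy", "true")], "no"),
]
tid_lists, class_tids = build_tid_lists(rows)
single = rule_counts([("windy", "false")], "yes", tid_lists, class_tids)
pair = rule_counts([("outlook", "sunny"), ("windy", "false")], "no", tid_lists, class_tids)
print(single, pair)
```

Counts for longer itemsets come from intersecting lists that are already in memory, so the original rows are read only once.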
Ranking is the most critical stage in these algorithms. It sorts the generated rules, based on a suggested set of measures, from the highest priority to the lowest. Based on the pruning technique in the next stage, some of these rules are then eliminated and the others retained. Thus, the number and type of rules can differ between these algorithms for the same dataset. Finally, at the prediction stage, the type of prediction method used plays a very important role in increasing or decreasing the accuracy level.
The considered algorithms have the following weaknesses:
(1) The CBA algorithm needs more than one scan of the dataset, thus requiring more memory and time.
(2) The MCAR and CBA algorithms do not split the dataset based on the classes, and this can lead to the generation of more unnecessary rules, negatively affecting the classifier speed.
(3) The FACA algorithm prefers specific rules over general rules, which affects the accuracy of the prediction method.
(4) All of these algorithms use minimum-support and minimum-confidence thresholds assigned by the user. Thus, any rule whose confidence or support is less than the minimum-confidence or minimum-support will not be selected in the generated rules; e.g. if minimum-support = 0.2 and minimum-confidence = 0.6, a rule with support = 0.6 and confidence = 0.55 will not be selected.
These weaknesses motivated us to propose a new algorithm to serve autism patients. This algorithm generates a set of rules by using the harmonic mean (HM) measure to enhance the accuracy of the classifier.

Big Data Platform
Our proposed solution builds a big data platform to enhance WEKA performance in the training and testing phases based on 2^(size of data) folds, where size of data represents the number of records in the dataset (e.g. if we have a dataset of 1000 records, then the number of folds will be 2^1000). The huge number of combinations that could be produced from this assumption leads us to use the parallel programming technique that is already embedded in Apache Spark. Our big data platform is the Cloudera platform, which contains selected components that serve the main functionality of our proposed solution, as shown in Figure 1. Distributed WEKA has been integrated with Apache Spark to enhance the model building time for the machine learning algorithms, which are evaluated on a huge number of folds to enhance the accuracy measure (Meng et al., 2016).
Furthermore, Apache Hive is used to store our data on top of the Hadoop Distributed File System (HDFS), which uses MapReduce and Yet Another Resource Negotiator (YARN) to run multiple tasks and manage the resources at the same time within the environment (Vavilapalli et al., 2013). Finally, Hadoop User Experience (HUE) is employed to investigate the data on the HDFS through a user-friendly interface.
Figure 1. Our proposed big data platform

Proposed Model
The proposed Intelligent Classification Based on Association (ICBA) algorithm aims to overcome the limitations of the estimated measures used in association classification algorithms, thus increasing the accuracy of the classifier. Moreover, this algorithm supports the incremental application needed to rebuild the classifier for each new instance in order to reflect the changes on the classifier, thus enhancing the accuracy measure.
Assumption 1: We assume that the ICBA algorithm differentiates between the attributes in the selected dataset based on the weight assigned to each attribute by using the chi-square method to eliminate the weak attributes that affect the accuracy of the classifier.
Assumption 2: We assume that the ICBA algorithm differentiates between the generated rules in the selected dataset based on the HM value for each rule. In addition, it generates general rather than specific rules, improving the accuracy of the model and covering a large portion of the dataset. Furthermore, general rules work properly with the voting prediction method.

Detailed Description of the ICBA Algorithm
The ICBA algorithm contains four phases, as shown in Figure 2. The ICBA algorithm employs the chi-square (χ2) test to select the features in any given dataset based on statistical measures that show the dependencies between the features. Chi-square is a very commonly used method (Wall et al., 2012b). It evaluates the strongest features by finding the value of the chi-square statistic with regard to the class value. The initial hypothesis H0 is that the two features are independent, and this is tested using the chi-square equation:

$$\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$$

where O_ij is the actual frequency and E_ij is the estimated frequency asserted by the null hypothesis. The greater the value of χ2, the greater the evidence against the hypothesis H0.
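As a minimal sketch of the chi-square scoring described above, the following pure-Python function builds the contingency table between one feature and the class, derives the expected frequencies under independence, and sums the standard Pearson terms. The function name `chi_square` and the toy values are hypothetical, and the paper's own cut-off handling is not reproduced.

```python
def chi_square(feature_values, class_values):
    """Pearson chi-square statistic between one categorical feature and the class."""
    rows = sorted(set(feature_values))
    cols = sorted(set(class_values))
    n = len(feature_values)
    # Observed frequencies O_ij.
    observed = {(r, c): 0 for r in rows for c in cols}
    for f, c in zip(feature_values, class_values):
        observed[(f, c)] += 1
    row_total = {r: sum(observed[(r, c)] for c in cols) for r in rows}
    col_total = {c: sum(observed[(r, c)] for r in rows) for c in cols}
    stat = 0.0
    for r in rows:
        for c in cols:
            # Expected frequency E_ij under the independence hypothesis H0.
            expected = row_total[r] * col_total[c] / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    return stat


# Toy values for illustration only.
dependent = chi_square(["a", "a", "b", "b"], ["yes", "yes", "no", "no"])
independent = chi_square(["a", "a", "b", "b"], ["yes", "no", "yes", "no"])
print(dependent, independent)
```

A feature that perfectly separates the classes receives a high score, while a feature that is unrelated to the class scores zero and would be a candidate for elimination.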

Rule Generation Phase
After applying the chi-square method as the feature selection technique to select the best attributes from the original dataset based on the dependencies between these attributes, the ICBA algorithm begins the rule generation process.
Algorithm 2 uses D' and T' as input for this phase, where D' is the dataset selected by the chi-square method and T' is the training data. The first step in this algorithm is to compute the minimum HM value based on the given minimum-support and minimum-confidence. The ICBA algorithm then generates the single itemsets and computes the support, confidence, and HM values for each item (Line 6).
In the next step, the ICBA algorithm generates the single-item rules: the generated rules must have an HM value greater than or equal to the minimum HM value (Line 9), and the remaining items are considered for generating the next item rules. Finally, the ICBA algorithm evaluates the remaining items based on support and confidence: the items that have support and confidence less than the minimum-support and minimum-confidence are eliminated (Line 12), and the others are used to generate the next itemset. This process is repeated until S is empty (Line 20).
A good reason to employ the HM measure in the first phase in the association classification technique is to overcome the problem of the given measures that are used by the AC techniques. In AC algorithms, rules that have confidence or support less than the estimated measures, even by very slight values, will be eliminated. As an example, if the minimum-confidence = 0.5 and the minimum-support = 0.3, then if there are rules with support = 0.29 and confidence = 0.8 (or vice versa), these rules will be eliminated. Therefore, the HM measure is used by the ICBA algorithm to produce a harmonic value.
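The example above can be checked numerically. The sketch below assumes that the HM measure is the ordinary harmonic mean of support and confidence, 2·s·c / (s + c); the paper does not state the formula explicitly, so this is an assumption, and the function name `harmonic_mean` is hypothetical.

```python
def harmonic_mean(support, confidence):
    """Harmonic mean of support and confidence (assumed form of the HM measure)."""
    if support + confidence == 0:
        return 0.0
    return 2 * support * confidence / (support + confidence)


# Thresholds and rule values taken from the example in the text.
min_hm = harmonic_mean(0.3, 0.5)     # minimum-support = 0.3, minimum-confidence = 0.5
rule_hm = harmonic_mean(0.29, 0.8)   # rule that narrowly misses the support threshold
kept = rule_hm >= min_hm

print(f"minimum HM = {min_hm:.4f}, rule HM = {rule_hm:.4f}, kept = {kept}")
```

Under this assumption the rule with support = 0.29 and confidence = 0.8 is retained, because its HM value exceeds the minimum HM value even though its support is slightly below the threshold. The same formula gives about 0.2857 for thresholds of 0.2 and 0.5, which is consistent with the minimum HM value of 0.285 quoted in the running example, although the thresholds used there are not stated.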
Furthermore, using the HM measure instead of support and confidence leads to the generation of general rules. For example, if we have three rules such as x1 → y, (x1, x2) → y, and (x1, x2, x3) → y, we can observe that x1 → y covers (x1, x2) → y and (x1, x2, x3) → y, so there is no need to generate these rules.
Algorithm 3 presents how the ICBA algorithm prunes the generated rules based on the HM value. All selected rules are sorted in ascending order based on their HM value; any rules with the same HM value are sorted by confidence value, support value, and first occurrence, respectively (Line 2). The first occurrence refers to the rule that has been produced first. Finally, our algorithm removes the conflicting rules based on the class majority criteria, where the majority class is the one with maximum frequency in the dataset.
The ICBA algorithm predicts the class of an unknown instance by selecting the rules that match the instance from the rule-set and categorizing these rules based on the class name. The category that has the most rules is assigned to the instance. If there is more than one category with the same number of rules, the default class is assigned, where the default class in this context is the class that has maximum frequency in the dataset.

Running Example
This example demonstrates how the ICBA algorithm works; it can be applied to any domain using the same phases. To begin, assume there is a weather dataset (T), as shown in Table 3.
Table 3. Weather dataset T
(1) Preprocessing phase
In the preprocessing phase, we apply the chi-square method to rank the attributes of the weather dataset, and the results are shown in Table 4. A cut-off point value of 0.8 is used in the chi-square method, which means that the temperature attribute is removed and the weather dataset contains only three attributes, as shown in Table 5.
(2) Rule Generation Phase
In this phase, the ICBA algorithm computes the HM measure for each value in the dataset based on the support and confidence values for each candidate rule, as shown in Table 6. The HM values of rules 3, 5, 8, 9, and 11 are less than the minimum HM value (0.285), so these rules are evaluated using the support and confidence values. Rule 8 is the only rule that passes the evaluation process, which means that this rule is used in the next generation process, while the remaining rules are removed from the list. Furthermore, the next generation process stops as it contains only one rule.
(3) Pruning Phase
The ICBA algorithm sorts the rules in the CARs based on the HM value, confidence, support, and first occurrence, respectively, in ascending order, as shown in Table 7. Regarding the issue of conflicting rules, we can observe that there is a conflict between rules 7 and 8 in Table 7, and the ICBA algorithm eliminates rule 8 based on the majority class criteria. As a final result, the CARs contain only seven rules, as shown in Table 8.
The following illustrates the prediction phase:
(1) For the instance "Overcast, High, False," the rules that can classify this instance are the three rules matching the values Overcast, High, and False. Two of these rules give "yes" and one gives "no." Thus, the class for this instance is "yes," which is correct as shown in Table 3.
(2) For the instance "Sunny, High, False," the rules that can classify this instance are the three rules matching the values Sunny, High, and False. Two of these rules give "no" and one gives "yes." Thus, the class for this instance is "no," which is correct as shown in Table 3.
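The voting step used in these two predictions can be sketched as follows. The rule set below is hypothetical: it is chosen only so that the vote counts match the two instances described above and is not a reproduction of Table 8. The function name `predict`, the default class "yes", and the fallback to the default class when no rule matches are likewise assumptions.

```python
from collections import Counter


def predict(instance, rules, default_class):
    """Majority vote over the rules whose left-hand side matches the instance."""
    votes = Counter(label for cond, label in rules if cond <= instance)
    if not votes:
        return default_class  # assumption: fall back when no rule matches
    ranked = votes.most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return default_class  # tie between classes: use the default class
    return ranked[0][0]


# Hypothetical single-item rules (left-hand side, class), for illustration only.
rules = [
    (frozenset({"overcast"}), "yes"),
    (frozenset({"sunny"}), "no"),
    (frozenset({"high"}), "no"),
    (frozenset({"false"}), "yes"),
]

first = predict({"overcast", "high", "false"}, rules, default_class="yes")
second = predict({"sunny", "high", "false"}, rules, default_class="yes")
print(first, second)
```

With these illustrative rules the first instance receives two "yes" votes against one "no" vote, and the second receives two "no" votes against one "yes" vote, mirroring the outcomes in the running example.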

Experimental Results
The CBA, MCAR, FACA, and FCBA algorithms are compared against ICBA in terms of accuracy, precision, and recall. We use autism datasets from the UCI repository (Shirabad and Menzies, 2005). To obtain fair results and reduce the error rate, a 2^70-fold cross-validation process is employed for all experiments, where 70 is the number of records (instances) in the dataset divided by 10.
All experiments are performed on a cluster with 31 nodes (1 master and 30 workers). Each node has a 3 GHz i7 processor with 32 GB of main memory and 1 TB of storage. The CBA, MCAR, FACA, and FCBA algorithms are used as implemented by their respective authors. The parameters of all algorithms are set as pairs of minimum-support and minimum-confidence values as follows: (0.1, 0.5), (0.2, 0.5), (0.3, 0.5), and (0.4, 0.5). The ICBA algorithm is implemented in the Java programming language under the WEKA tool (Hall et al., 2009).

Dataset
To test our proposed algorithm, an autism dataset is used from the UCI repository. The dataset contains 21 attributes and 704 instances, where 515 instances have no autism and 189 instances have autism, as follows: , and / . Figures 3, 4, 5, and 6 visualize the distribution of the autism dataset attributes. The ICBA algorithm employs the chi-square method to show the correlations between the attributes and their importance, as shown in Table 9. According to the chi-square scores, we choose the cut-off point of 10, which leads us to eliminate six attributes: , , _ _ , , _desc, and . Therefore, the autism dataset will contain 15 attributes with strong correlation.

Experiment I: AC Algorithms against ICBA Algorithm Using the Chi-square Method
We compare the ICBA algorithm against four AC algorithms (CBA, MCAR, FACA, and FCBA) based on the accuracy measure. All of these algorithms are tested on the autism dataset after applying the chi-square method to the dataset as the preprocessing phase described in the previous section.
Different values for the minimum-support and minimum-confidence are selected to generate four extensive experiments. These values are (0.1, 0.5), (0.2, 0.5), (0.3, 0.5), and (0.4, 0.5), as shown in Table 10 and Figures 6, 7, 8, and 9. Figure 6 shows the performance of all the considered AC algorithms with minimum-support = 0.1 and minimum-confidence = 0.5. In this experiment, the ICBA algorithm outperforms the other AC algorithms in terms of accuracy, with the MCAR algorithm in second place and the CBA algorithm in last position.
The best accuracy value for the ICBA occurs in the third experiment: it is in first place with a value of 98.6223%. The MCAR is in second place with an accuracy of 91.0511%, while the CBA has the lowest accuracy value of 73.1534%, as shown in Figure 8.
In the final experiment in this section, the ICBA achieves first place with an accuracy of 93.608%, as shown in Figure 9. However, compared with the previous experiments, this is the lowest value for the ICBA algorithm, owing to the small number of rules generated in the classifier that satisfy the high minimum-support value.
Figure 9. Accuracy of CBA, MCAR, FACA, FCBA, and ICBA
Table 10 summarizes all of these experiments. The ICBA algorithm outperforms the other AC algorithms due to the type of rules that are generated in the classifier. Most of the rules are general rules, as mentioned in Assumption 2, which employs the HM measure in the first phase and helps the classifier to generate the rules in the CARs directly.
To show the impact of using the chi-square method on the considered AC algorithms, we compare the original AC algorithms that do not use the chi-square method with those that use chi-square. In this experiment, we use different values for the minimum-support and minimum-confidence from those in Experiment I to generate extensive analysis for these algorithms.
Meanwhile, the MCAR algorithm is enhanced when using the chi-square method in all runs except the third one, which has the same accuracy value, as shown in Figure 11.
Figure 11. Accuracy of original and modified MCAR algorithm
The FACA algorithm also achieves improved performance when the chi-square method is used. When the minimum-support value is 0.1, 0.3, and 0.4, the FACA algorithm achieves better performance with the chi-square method than without it, as shown in Figure 12. Figure 13 shows the improved performance of the FCBA algorithm in all runs when using the chi-square method.
Figure 13. Accuracy of original and modified FCBA algorithm
Our proposed algorithm is also affected when the chi-square preprocessing phase is eliminated. The ICBA algorithm is negatively affected in three runs, while it achieves the same accuracy value in the second run, as shown in Figure 14. Furthermore, the ICBA algorithm outperforms the considered AC algorithms in most runs both with and without the chi-square method, due to the type of rules generated in the classifier and the voting technique used in the prediction phase, as shown in Table 11.
To identify the reasons behind the good performance achieved using the chi-square method in the AC algorithms, we show the top rules that are generated by the ICBA algorithm in the final classifier both with and without the chi-square method (Figures 15 and 16). Figure 15 shows the top rules that are used in the classifier without using the chi-square method. We can observe that the generated rules contain many of the attributes that are eliminated by the chi-square method due to their weak relationship with other attributes and their possible effect on the classification process, such as , _desc, and _ _ . Figure 16 shows how the ICBA algorithm eliminates these weak attributes from the top rules that are generated in the classifier, leading to increased accuracy of the classifier.
Furthermore, most of the rules that are generated in the ICBA classifier have a small number of attributes, which reflects the correctness of Assumption 2. Assumption 2 states that the ICBA algorithm generates general rather than specific rules that can cover a huge number of instances in the dataset. Figure 17 presents the distribution of the 3_ , attributes over the class attribute, which has two values ("No," shown in blue, and "YES," shown in red). Here, class (no) can be predicted if the 3_ attribute has a value with label 0 and the attribute has a value with label 0, as shown in the lower left corner of the figure. The same point is emphasized in Figures 18 and 19, where Figure 18 generates two rules: autism(no), result(0) → class(no) and autism(no), result(10) → class(yes).
Figure 19, in turn, generates one rule: Relation(self), A10_score(0) → class(no). Finally, the visualization process clarifies how the generated rules with a small number of attributes have high confidence and support values, which enhances the accuracy level of the classifier.

Conclusions and Future Work
Data mining techniques can be used to improve the decision-making process in many critical areas, such as the medical field, website phishing, text analysis, social media, and many others. The AC techniques are among the most important techniques in data mining that use association rules in the classification process to enable more accurate decisions to be taken in many areas. The main challenge faced by this technique is that of obtaining a high level of accuracy. The proposed ICBA algorithm was built on two main factors: differentiating between the attributes by using a statistical approach, and generating general rules by using the harmonic mean measure to select the stronger association rules. Both of these factors contribute to the improved decision-making process in the field of ASD discovery in terms of accuracy measures. The proposed algorithm was compared against four well-known AC algorithms in terms of accuracy to evaluate their behavior in the prediction task on a big data platform. It is worth mentioning that the proposed algorithm showed better performance in most experiments.