Modified Group Method of Data Handling for Flood Quantile Prediction at Ungauged Site

Predicting flood quantiles with little data is essential in managing water resources. In this paper, we propose a model called the Modified Group Method of Data Handling (MGMDH) to predict the flood quantile at ungauged sites in Malaysia. In the proposed MGMDH model, principal component analysis (PCA) is combined with the group method of data handling (GMDH) and several transfer functions. The MGMDH model uses four transfer functions: polynomial, sigmoid, radial basis function, and hyperbolic tangent sigmoid. The prediction performance of the MGMDH model is compared to that of the conventional GMDH model. The appropriateness and effectiveness of the proposed model are demonstrated with a simulation study in which the Cauchy distribution is used as the disturbance error. Using the Cauchy distribution as the error disturbance in the artificial data illustrates how the proposed model performs when extreme values or extreme events occur in the data set. The simulation study indicates that the MGMDH model is superior to the comparison models, namely the LR, NLR, GMDH and ANN models. Another strength of the proposed model is that it shows strong prediction performance when multicollinearity is absent from the data set.


Introduction
Prediction at ungauged stations has become a challenging topic in hydrology (Grimaldi et al., 2021). In Malaysia especially, most gauged stations are located only at strategic locations or in developing areas. According to Sivapalan et al. (2013), a site is ungauged when hydrological data are unavailable or only partially available. Therefore, there are insufficient data to test a hydrological model's actual capability to predict the flood quantile at ungauged stations. Regionalization is the most common approach in ungauged situations. It transfers hydrological information from gauged stations to the ungauged station using a hydrological model (Guo et al., 2021; Desai et al., 2021; Golian et al., 2021). Although hydrological data are not available at the ungauged station, physiographical and meteorological data are available. A previous study stated that the best hydrological model for flood quantile prediction at ungauged stations is the modified group method of data handling (MGMDH) model. The MGMDH model combines principal component analysis (PCA) with the group method of data handling (GMDH) model. In addition, the MGMDH model employs four different transfer functions in a single model, which makes it more robust for flood quantile prediction at ungauged stations. The PCA step is intended to enhance the performance of the GMDH model.
In multivariate problems, multicollinearity usually exists in the data set and degrades the prediction performance of a hydrological model. Applying PCA to the data set removes the multicollinearity and is expected to improve the prediction performance of the hydrological model (Barth et al., 2021; Al-Ashkar et al., 2021). A previous study showed that the combination of PCA and various transfer functions outperformed other machine learning models such as the artificial neural network (ANN) model. In that study, the MGMDH model performed better than the ANN model, the GMDH model, the nonlinear regression model and the multiple linear regression model for flood quantile prediction at ungauged sites.
This study focuses on simulation studies in which artificial data are generated to test the accuracy of the MGMDH model. The artificial data mimic the characteristics of the real data used in the previous study: multicollinearity and extreme values are both present in those data. The simulation study is a continuation of that previous work and is intended to support its findings. It aims to show that the MGMDH model can perform well for flood quantile prediction at ungauged sites when multicollinearity and extreme values are present in the data. The simulation study employs five models: the MGMDH model, the GMDH model, the ANN model, nonlinear regression, and multiple linear regression. Multiple linear regression and the ANN model are the most common models for flood quantile prediction at ungauged sites (Alobaidi et al., 2021; Campos et al., 2021; Desai and Ouarda, 2021). In addition, three conditions are configured in this study: sample size, multicollinearity level, and outlier percentage. The simulation data are generated using the technique of Lawrence and Arthur (2019).

Artificial Data Generation
Simulation studies are carried out to generate data for evaluating the proposed MGMDH model. The data generation technique of Lawrence and Arthur (2019) is used in this study. The experimental design involves generating data from the following multivariate stochastic models:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{1}$$

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon \tag{2}$$

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon \tag{3}$$

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon \tag{4}$$

Using Eq. 1 to Eq. 4, four types of dataset can be generated, with two, three, four and five input variables, respectively. The sample sizes generated are 30, 50, 70 and 100. The data generation was performed using $\beta_0 = \beta_1 = \beta_2 = \beta_3 = \beta_4 = \beta_5 = 1$. The explanatory variables were generated as

$$x_{ij} = (1 - \rho^2)\, z_{ij} + \rho\, z_{ij}, \quad i = 1, \ldots, n;\; j = 1, 2, 3, 4, 5 \tag{5}$$

where $z_{ij}$ are independent standard normal random variables, $N(0,1)$. The value of $\rho^2$ represents the correlation between the explanatory variables. The chosen values were $\rho^2 = 0.0$, $\rho^2 = 0.5$ and $\rho^2 = 0.99$, which represent no, medium and high correlation between the explanatory variables, respectively.

The final factor was the disturbance distribution. The Cauchy distribution was chosen as the disturbance distribution in these simulation studies based on residual fitting of the real data. To broaden the case study, the normal distribution is also used as a disturbance distribution, because the Cauchy distribution is non-normal. The normal distribution also makes it possible to assess the performance of the model when no extreme values are present in the data. The percentages of outliers used are 5% and 10%. The outlier design follows that suggested by Rana et al. (2012) and Midi and Zahari (2012). Normal random numbers are generated by applying the inverse CDF of the normal distribution to uniform random numbers on (0,1). Cauchy random numbers are generated from the inverse cumulative distribution function of the Cauchy distribution (Yao and Liu, 1996).
For the standard Cauchy distribution, the inverse cumulative distribution function is

$$c = \tan\left[\pi\left(u - \tfrac{1}{2}\right)\right] \tag{6}$$

where $u$ is a random number uniformly distributed on (0,1) and $c$ is the random number generated from the inverse CDF of the Cauchy distribution. The artificial data generation technique for ungauged problems has been discussed previously.
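The generation procedure described above can be illustrated with a minimal Python sketch. It covers only the no-correlation setting ($\rho^2 = 0$), in which the explanatory variables are independent standard normals, and it assumes a standard Cauchy disturbance (location 0, scale 1). The function names `cauchy_from_uniform` and `simulate_dataset` are hypothetical, and the outlier contamination design of Rana et al. (2012) is not reproduced here.

```python
import numpy as np


def cauchy_from_uniform(u):
    """Standard Cauchy variates from uniform(0,1) numbers via the inverse CDF."""
    return np.tan(np.pi * (np.asarray(u, dtype=float) - 0.5))


def simulate_dataset(n, p, rng, disturbance="cauchy"):
    """Generate y = b0 + sum_j b_j * x_j + e with every coefficient equal to 1.

    Assumption: rho^2 = 0, so the p explanatory variables are independent N(0,1).
    """
    X = rng.standard_normal((n, p))
    if disturbance == "cauchy":
        u = rng.uniform(0.0, 1.0, size=n)   # uniform numbers fed to the inverse CDF
        e = cauchy_from_uniform(u)
    else:
        e = rng.standard_normal(n)          # normal disturbance for comparison
    y = 1.0 + X.sum(axis=1) + e
    return X, y, e


rng = np.random.default_rng(0)
X, y, e = simulate_dataset(n=70, p=5, rng=rng)
```

Because the Cauchy distribution is heavy-tailed, a few entries of `e` are typically much larger in magnitude than the rest, which is the behaviour the simulation relies on to mimic extreme events.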

Modified Group Method of Data Handling
The Modified Group Method of Data Handling (MGMDH) model is an improvement of the GMDH model established by Zadeh et al. (2002). The PCA method is applied to reduce the complexity of the GMDH model. A shortcoming of the GMDH model is that it tends to produce a complex polynomial network even when the input data are reasonably simple. Onwubolu (2009) stated that the GMDH model's complexity increases at each training stage and with the selection of each new layer because new input variables are added. As input variables are added at each layer of the GMDH model, the number of partial descriptions (PDs) constructed also increases, and so does the complexity of the model. Because the complexity of the GMDH network depends on the number of inputs, PCA, which is most commonly employed for dimensionality reduction, can reduce it. The inputs of the GMDH model are converted into principal components (PCs) using the PCA method, and the least important PCs are discarded according to the rule set in this study. The PCA method produces as many principal components as there are input variables. The principal components selected must together explain more than 90% of the total variance. Only these selected principal components become the inputs for the MGMDH model. The number of selected principal components is defined as $m$, and the components are described below, where $p$ is the number of input variables:

$$
\begin{aligned}
PC_1 &= a_{11} x_1 + a_{12} x_2 + \cdots + a_{1p} x_p \\
PC_2 &= a_{21} x_1 + a_{22} x_2 + \cdots + a_{2p} x_p \\
PC_3 &= a_{31} x_1 + a_{32} x_2 + \cdots + a_{3p} x_p \\
&\;\;\vdots \\
PC_m &= a_{m1} x_1 + a_{m2} x_2 + \cdots + a_{mp} x_p
\end{aligned}
\tag{7}
$$

The first principal component is required to have the largest variance. The second component must be orthogonal to the first component while capturing the largest remaining variance in the data set.
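The 90% selection rule described above can be sketched in Python as follows. This is an illustrative sketch, not the authors' implementation: it assumes PCA on the covariance matrix of the centred inputs (standardising the inputs first would be an equally plausible choice), and the function name `select_principal_components` is hypothetical.

```python
import numpy as np


def select_principal_components(X, threshold=0.90):
    """Keep the smallest number of leading PCs whose cumulative
    explained variance exceeds `threshold`."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                      # centre each input variable
    cov = np.cov(Xc, rowvar=False)               # p x p covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]            # reorder to descending variance
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    explained = np.cumsum(eigvals) / eigvals.sum()
    m = int(np.argmax(explained > threshold)) + 1  # first index above threshold
    scores = Xc @ eigvecs[:, :m]                 # PC scores used as model inputs
    return scores, eigvecs[:, :m], explained
```

The returned scores are mutually uncorrelated by construction, which is how the PCA step removes multicollinearity before the inputs reach the network.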
More generally, when there are more than two principal components, the first few PCs capture most of the variation, while the last PCs capture the least. Besides reducing the complexity of the GMDH model, the PCA method also removes the multicollinearity in the dataset, which is another problem of the GMDH model established by Zadeh et al. (2002). The GMDH model established by Zadeh et al. (2002) used only a single transfer function, the quadratic polynomial. Quadratic, or second-order, polynomials can make a good transfer function for linking the input and output of the GMDH model. The proposed MGMDH model employs four types of transfer function: polynomial, sigmoid, radial basis, and hyperbolic tangent. The polynomial transfer function is the same as in the GMDH model established by Zadeh et al. (2002). The sigmoid and radial basis transfer functions were introduced by Euno (2002, 2006). The hyperbolic tangent transfer function is a new addition in the MGMDH model. According to Kordik (2009), a model with mixed transfer functions usually performs better than a model that uses a single transfer function because each data set is unique. The MGMDH model automatically selects the suitable transfer function for each data set during its selection process. The transfer functions are shown in Table 1.
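As an illustration of the four transfer functions, the Python sketch below applies each one to a quadratic combination $z$ of two inputs. The exact functional forms are an assumption, since they are not stated in the text here: the sketch uses forms commonly seen in the GMDH literature (identity on $z$ for the polynomial, logistic sigmoid, $e^{-z^2}$ for the radial basis, and $\tanh z$). The names `quadratic` and `TRANSFER_FUNCTIONS` are hypothetical.

```python
import numpy as np


def quadratic(xi, xj, w):
    """Second-order polynomial of two inputs with six coefficients w[0..5]."""
    return (w[0] + w[1] * xi + w[2] * xj
            + w[3] * xi * xj + w[4] * xi ** 2 + w[5] * xj ** 2)


# Assumed forms of the four transfer functions, each applied to z = quadratic(...)
TRANSFER_FUNCTIONS = {
    "polynomial": lambda z: z,                            # plain quadratic output
    "sigmoid": lambda z: 1.0 / (1.0 + np.exp(-z)),        # logistic sigmoid
    "radial_basis": lambda z: np.exp(-z ** 2),            # Gaussian-type radial basis
    "hyperbolic_tangent": np.tanh,                        # tanh
}

# Example: evaluate every transfer function on the same pair of inputs
xi = np.array([0.2, -1.0, 0.5])
xj = np.array([1.0, 0.3, -0.7])
w = np.array([0.1, 0.5, -0.4, 0.2, 0.05, -0.1])
outputs = {name: f(quadratic(xi, xj, w)) for name, f in TRANSFER_FUNCTIONS.items()}
```

Keeping the functions in one dictionary makes it straightforward to fit a candidate neuron with each of them and let the selection step retain whichever gives the lowest error.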
In Table 1, the inputs are the selected principal components and the output is the target variable. Each layer of the MGMDH model produces $4\,(m(m-1)/2)$ partial descriptions (PDs), where $m$ is the number of inputs to that layer: one PD for every pair of inputs and every transfer function. To select new inputs for the next layer of the MGMDH model, the selection criterion is the mean squared error (MSE). Once the candidate input variables for the next layer have been constructed, they are ranked by their MSE values; the best candidates are selected and the weakest are eliminated. After the new input variables are determined, the entire procedure is repeated until the minimum MSE satisfies $MSE_k \ge MSE_{k-1}$, where $k$ is the number of the current layer. In other words, the process stops when the MSE value for the current layer is no smaller than that of the previous layer. The MSE is defined by Eq. 10.
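The bookkeeping of the layer-building procedure can be sketched in Python as follows. The sketch only counts the partial descriptions per layer and applies the stopping rule to a list of per-layer minimum MSE values; it does not fit the network itself. The function names are hypothetical and layers are indexed from 0.

```python
from math import comb


def count_partial_descriptions(m, n_transfer_functions=4):
    """Number of PDs in a layer: one per pair of inputs per transfer function."""
    return n_transfer_functions * comb(m, 2)


def last_improving_layer(min_mse_per_layer):
    """Index of the layer to keep under the stopping rule.

    Growth stops at the first layer k whose minimum MSE is greater than or
    equal to that of layer k - 1, and layer k - 1 is retained.
    Assumes a non-empty list.
    """
    for k in range(1, len(min_mse_per_layer)):
        if min_mse_per_layer[k] >= min_mse_per_layer[k - 1]:
            return k - 1
    return len(min_mse_per_layer) - 1


# Five inputs give 4 * (5 * 4 / 2) = 40 candidate PDs in the layer
n_pd = count_partial_descriptions(5)
```

The count grows quadratically with the number of inputs, which is why reducing the inputs with PCA beforehand keeps the network manageable.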

Comparison Model
Four models are used for comparison: the artificial neural network (ANN), the group method of data handling (GMDH), nonlinear regression, and linear regression. These methods have been described previously.

Performance Criteria
The performance of each model is evaluated with two error indices: the root mean squared error (RMSE) and the mean absolute percentage error (MAPE). The definitions of RMSE and MAPE are provided in Eq. 11,
where $y_i$ is the observed flow, $\hat{y}_i$ is the predicted flow, $\bar{y}$ is the mean of the observed flows and $n$ is the number of flow series that have been modelled.
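For reference, the two error indices can be computed as in the short Python sketch below. It assumes the usual textbook definitions, with MAPE expressed as a percentage and all observed values non-zero; these may differ in detail from the definitions in Eq. 11.

```python
import numpy as np


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted flows."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((observed - predicted) ** 2)))


def mape(observed, predicted):
    """Mean absolute percentage error, in percent (observed values must be non-zero)."""
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(100.0 * np.mean(np.abs((observed - predicted) / observed)))


# Small worked example with two flows
example_rmse = rmse([100.0, 200.0], [110.0, 180.0])
example_mape = mape([100.0, 200.0], [110.0, 180.0])
```

In this example both predictions are off by 10% of the observed value, so the MAPE is 10, while the RMSE is the square root of 250 because the larger absolute error is weighted more heavily.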

Result and Discussion
Historical hydrological records of streamflow carry vital information for the decisions involved in planning, designing, and managing water projects. A long record of past data allows future events to be extrapolated well and thus yields accurate flood estimates. However, in a country like Malaysia, hydrological data and information are limited. In such cases, simulation serves as a statistical problem-solving tool. The simulation technique generates data according to the characteristics of the actual data used in the study, and it makes it possible to evaluate the performance of existing and proposed models under configured conditions. In the actual data used in the previous study, the variance inflation factor (VIF) is 19.4803 for catchment area, 1.4887 for elevation, 20.2307 for longest drainage path, 1.2618 for river slope, and 1.0147 for annual mean maximum total rainfall. The VIF values for catchment area and longest drainage path exceed 10, a commonly used rule-of-thumb threshold, so it can be concluded that multicollinearity is present in the data. Multicollinearity is a phenomenon in which two or more predictor variables in a model are highly correlated. The Cauchy distribution is used to mimic catchments that have extreme values for a given return period. The data for the simulation are generated using the method discussed in the Artificial Data Generation section. Five models are implemented in this study: the MGMDH, LR, NLR, GMDH and ANN models. The results of the simulation studies are shown in Tables 2 to 7.
Table 2 shows the prediction performance when multicollinearity is not present in the simulated data. The results show that the best-performing prediction model is the MGMDH model, followed by the ANN model. The MGMDH model performs well when the number of input variables increases. The ANN model shows good prediction performance when the sample size is small.
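Variance inflation factors such as those quoted above can be computed with the following Python sketch. It uses the standard definition $VIF_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing predictor $j$ on the remaining predictors with an intercept. The function name `variance_inflation_factors` is hypothetical, and the sketch is not tied to the catchment data set.

```python
import numpy as np


def variance_inflation_factors(X):
    """VIF of each column of X, from regressing it on all other columns."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    vifs = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)  # least-squares fit
        residual = target - others @ coef
        ss_res = np.sum(residual ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        r_squared = 1.0 - ss_res / ss_tot
        vifs[j] = 1.0 / (1.0 - r_squared)
    return vifs
```

A predictor that is nearly a linear combination of the others has $R_j^2$ close to 1 and therefore a large VIF, whereas an unrelated predictor has a VIF close to 1.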
Table 3 shows the prediction performance when the level of multicollinearity increases to 0.5. Based on Table 3, the proposed MGMDH model performs well when the number of input variables increases. Table 4 shows the prediction performance when the level of multicollinearity increases to 0.99. Based on Table 4, the proposed MGMDH model performs well when the number of input variables increases. Table 5 shows the prediction performance when multicollinearity is set to 0 and the outlier percentage is set to 10%. Table 5 shows that the proposed MGMDH model performs well when the number of input variables increases. Table 6 shows the prediction performance when multicollinearity is increased to 0.5 and the outlier percentage is 10%. Table 6 shows that the proposed MGMDH model performs well when the number of input variables increases. Besides the MGMDH model, the ANN model also shows good prediction performance. Table 7 shows the prediction performance when multicollinearity is increased to 0.99 and the outlier percentage is 10%. Table 7 shows that the proposed MGMDH model performs well when the number of input variables increases.
The case study is designed to investigate the prediction performance with and without multicollinearity. In addition, the Cauchy distribution is used as the error disturbance, or noise. The Cauchy distribution has a strong tendency to produce extreme values because it is heavy-tailed: it places substantial probability on values far from its centre. The random numbers it produces can therefore be very large, which makes it suitable for simulating extreme values. Various numbers of input variables and sample sizes are also used. The sample sizes were chosen because the previous study used only 70 stations in its data set. Therefore, the generated sample sizes are set between 30 and 100.
The number of input variables is varied to simulate the catchment characteristics. The winning model is the model that has the lowest MAPE for prediction. The summary of the simulation study is shown in Table 7, which shows the win frequency of each model in each case study design. In the real data, the flood quantile characteristics are highly skewed, indicating that the three flood quantiles contain extreme values. Another observation is that when the outlier percentage increased from 5% to 10%, the results still show that MGMDH is superior to the other models in all cases. Based on this observation, similar results are expected if the outlier percentage is increased to 20% or more, although this was not tested. This simulation study showed that when extreme values were present in the output variable, MGMDH outperformed the other models in terms of average winning frequency percentage. The MGMDH model also has a higher winning percentage than the GMDH model: 50% for the MGMDH model compared with 8% for the GMDH model. This shows that implementing PCA and four types of transfer function improved the GMDH model's prediction performance. Therefore, based on these results, the MGMDH model has the most efficient and robust prediction performance among the compared models when the data set contains extreme values. For future research, it is suggested to hybridize the GMDH model with optimization tools such as the artificial bee colony algorithm.

Conclusions
This study explores the potential of the MGMDH model for prediction at ungauged sites. The modified group method of data handling (MGMDH), which combines principal component analysis (PCA) and the group method of data handling (GMDH) with various transfer functions, is proposed for prediction at ungauged catchments. Unlike the conventional GMDH model, the MGMDH model uses four types of transfer function: polynomial, sigmoid, radial basis function, and hyperbolic tangent sigmoid. A simulation study was carried out to demonstrate the appropriateness and effectiveness of the proposed model. The simulation study used the Cauchy distribution as the disturbance error for the simulated data, which made it possible to evaluate the prediction performance of the model when extreme values or extreme events occur in the data set. The simulation study showed that the MGMDH model is superior to the comparison models, namely the LR, NLR, GMDH and ANN models. In addition, the MGMDH model shows strong prediction performance when multicollinearity is not present in the data set.