Improving Automatic Extraction of FA Word Tendency Using a Non-linear Approach

Field association (FA) terms are used to identify the subject of a text (the document field) by extracting specific words from that text. In this paper, we use FA terms to study the effect of time on specific terms by calculating the frequency of these terms, which are associated with the document field, in a specific period. This paper also introduces a new approach for automatic evaluation of the stabilization classes using a non-linear approach. The stabilization classes describe how FA terms change with time in a specific period. By using a non-linear approach, the new approach improves the performance of the decision tree compared with the linear approach. The corpus used in this approach contains 1,356 files and is about 7.49 MB in size. After comparing the presented approach with the traditional one, we conclude that the new approach enhances the F-measure for the increment, steady, and decrement classes by 7.7%, 3.1%, and 2.2%, respectively.


Introduction
With the development of information technology (IT), a huge amount of data is available that belongs to different fields. This information can be used in classification, information retrieval, clustering, and so on. Each field has some words that distinguish it, and these words occur repeatedly in that field. Information retrieval is used to extract these words, but it is a hard process. Information retrieval (IR) is the task, given a set of documents and a user query, of finding the documents relevant to an information need in a set of information resources and retrieving them (Samuel et al. 2001). In each period of time, some words are repeated in each field, and they can be identified by calculating their frequency. For example, "world cup" is more widespread in the period when the World Cup competition is played, and "heatstroke" is more widespread in summer.
Old approaches (Atlam et al. 2017(a,b); Atlam et al. 2006; Atlam et al. 2003; Atlam et al. 2002) neglect the time factor when extracting the words. Atlam et al. (2001) considered how the popularity of words changes with time, based on keywords in documents. However, these keywords are not the most representative of the texts, and no relation is established between words and fields.
The contributions of this paper are as follows:
i. A study of the effect of FA term frequency within a specific period using a non-linear approach.
ii. A new approach that enhances the evaluation of the changing classes by applying the decision tree (DT) algorithm C4.5 (Quinlan 1993; Lima 1996) to FA terms, using a non-linear relation between parameters.
This paper focuses on the nature of the change of words over time and applies the closest mathematical relationship, which increases DT precision and solves the problem of data scattering. Section 2 discusses related work. Section 3 introduces our new upgraded algorithm and methodology. Section 4 presents the experimental evaluation. Section 5 gives the conclusion and future work.

Related Work
There are many studies in IR. All of these studies presented different methods that were useful in classifying, clustering, and analyzing documents; however, they ignore the relation between the frequency of words and the change of time in a given period.
A method for automatically extracting popular topics with important keywords in web texts is presented by Morita et al. (2012). Morita's approach judges how words and their fields in the input texts change with time, and groups them into two groups that are related to the same subjects. Rokaya et al. (2008) and Atlam et al. (2018) presented another ranking model that uses specific words called field association terms (FA terms). Ranking documents in this way gives a more accurate ordering of results than older strategies. That work presents an arrangement of search results that uses relations between the FA terms in the query, where a higher co-occurrence of two words implies a closer relation between them. Samuel et al. (2001) presented a method for partitioning text into field-coherent sections and then extracting FA terms or phrases from the text by determining how topics develop. However, Samuel's approach could not effectively determine the importance of FA terms in a specific period.
Hashim et al. (2019) introduce a new method to extract Arabic keywords from corpora based on the changes of their frequencies in a document over given periods of time using a decision tree. The new approach is applied to a new data set field (computer science), which makes it different from traditionally used methods.
In this paper, we focus on the relation between the frequencies of FA words and time variation. Moreover, we study the tendency of words related to a field using a new non-linear model.

Field Association Terms
This section will introduce more details about the field association terms and their levels using the field tree.

FA Terms
It is natural that anyone can recognize the field of a document when they notice some specific words. These specific words are called field association terms (FA terms), which are defined as the smallest words that can determine a document field in a structure called the field tree. The tree structure represents knowledge in a hierarchical form, and it also ranks the relations between document fields (Dozawa 1999; Fukumoto et al. 1996; Ding et al. 2001; Azzopardi 2012). In this paper, a document field represents common knowledge that can be easily used in human communication. For example, <MEDICINE/Diseases/Cancer> expresses a path in the tree with super field <MEDICINE>, which has subfield <Diseases> and terminal field <Cancer>.
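As an illustrative sketch only (not the authors' implementation), a field path such as the one above can be split into its levels and stored in a nested dictionary that plays the role of the field tree. The function names `parse_field_path` and `add_path`, and the second path used below, are assumptions introduced here for illustration.

```python
def parse_field_path(path):
    """Split a path like '<SUPER/Sub/Terminal>' into its list of field names."""
    return [part.strip() for part in path.strip("<> ").split("/") if part.strip()]


def add_path(tree, path):
    """Insert a field path into a nested-dict field tree and return the tree."""
    node = tree
    for name in parse_field_path(path):
        # Create the child node if it does not exist yet, then descend into it.
        node = node.setdefault(name, {})
    return tree


field_tree = {}
add_path(field_tree, "<MEDICINE/Diseases/Cancer>")
add_path(field_tree, "<MEDICINE/Diseases/Flu>")  # hypothetical second path
print(field_tree)
```

A nested dictionary keeps the sketch short; a real system would probably attach FA terms and frequencies to each node as well.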

FA Terms Levels
FA terms differ in the scope of their association with a field. This means that some FA terms identify only one field, whereas others may identify two or more fields. Five distinct levels are defined to place each FA term in its correct fields. These levels are defined as follows:
i. Perfect FA terms: words associated with one subfield (e.g., cancer, flu, etc.).
ii. Imperfect FA terms: words associated with one or more subfields in one super field (e.g., arthritis, asthma, etc.).
iii. Super FA terms: words associated with one super field (e.g., patient, nurse, and hospital).
iv. Various FA terms: words associated with more than one subfield of more than one super field (e.g., treat, winner, etc.).
v. Non FA terms: words that specify neither subfields nor super fields, including stop words (e.g., size, pronouns, etc.).
These levels are determined by the algorithm of Atlam et al. (2002); the algorithm also automatically extracts FA terms by using the normalized frequency of each term after calculating it. In this paper, we consider the algorithm that determines the effective FA terms and the effect of the time series on word tendency using the new (non-linear) approach.
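The five levels can be illustrated with a small sketch. This is not the algorithm of Atlam et al. (2002), which works from normalized frequencies; it only shows how a level follows once the fields associated with a term are already known. The function name `fa_level`, the `(super_field, subfield)` representation, and all example associations are assumptions made for illustration. To keep the levels disjoint, the sketch treats an imperfect term as one with two or more subfields in a single super field.

```python
def fa_level(associations):
    """Return the FA level for a set of (super_field, subfield) pairs.

    A subfield of None means the term is tied to the super field as a whole.
    """
    if not associations:
        return "non"          # no field association at all
    super_fields = {sup for sup, _ in associations}
    if len(super_fields) > 1:
        return "various"      # spans more than one super field
    subfields = {sub for _, sub in associations}
    if None in subfields:
        return "super"        # identifies only the super field
    return "perfect" if len(subfields) == 1 else "imperfect"


# Hypothetical associations, chosen only to exercise each level.
examples = {
    "cancer": {("MEDICINE", "Cancer")},
    "arthritis": {("MEDICINE", "Bones"), ("MEDICINE", "Immunology")},
    "hospital": {("MEDICINE", None)},
    "winner": {("SPORTS", "Football"), ("POLITICS", "Elections")},
    "size": set(),
}
levels = {term: fa_level(assoc) for term, assoc in examples.items()}
print(levels)
```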

System Outline
The outline of the new system is shown in Fig. 1.

Data Set
The new approach is trained using a corpus collected from the web. In particular, the corpus is collected from Independent News and Medical News Today in different fields, as shown in Fig. 3.

Figure 3. Number and size of Data Set in each Field
Our corpus contains 1,356 files and is about 7.49 MB in size. We used the corpus to find FA terms with their levels. In addition, we selected perfect, semi-perfect (imperfect), and super FA terms related to the medical field to study the effect of time change by using their frequencies. We focused on the first three levels (perfect, semi-perfect, and super) because they best represent the document field. The collected data are divided into two groups: one is the training data, which is the input to the DT (C4.5), and the other is the test data, which differs completely from the input data. The features of the two groups are calculated twice, once using the traditional linear method and once using the new non-linear method, based on the idea that the frequency of FA terms changes with time, as shown in Table 1.
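The difference between a linear and a non-linear description of frequency over time can be sketched as follows. The text does not state which non-linear function is used, so the quadratic fit below is purely an assumption, as are the function names `polyfit` and `sse` and the sample frequencies. The sketch fits both models by least squares and compares their sums of squared errors.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial fit via the normal equations (pure Python).

    Returns coeffs such that coeffs[k] multiplies x**k.
    """
    n = degree + 1
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for k in range(col, n):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(a[i][j] * coeffs[j] for j in range(i + 1, n))
        coeffs[i] = (b[i] - s) / a[i][i]
    return coeffs


def sse(xs, ys, coeffs):
    """Sum of squared errors of the fitted polynomial on the data."""
    return sum(
        (y - sum(c * x ** k for k, c in enumerate(coeffs))) ** 2
        for x, y in zip(xs, ys)
    )


periods = [0, 1, 2, 3, 4]               # hypothetical time periods
freqs = [0.10, 0.12, 0.18, 0.30, 0.50]  # hypothetical normalized frequencies

linear = polyfit(periods, freqs, 1)
quadratic = polyfit(periods, freqs, 2)
sse_linear = sse(periods, freqs, linear)
sse_quadratic = sse(periods, freqs, quadratic)
print(f"linear SSE={sse_linear:.5f}, quadratic SSE={sse_quadratic:.5f}")
```

On training data a quadratic fit can never have a larger error than a linear one, because the linear model is a special case of it. A lower error here therefore does not show by itself that the non-linear features help the classifier; that has to be checked on held-out test data.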

Experimental Result
The general motivation for using a DT is to create a training model that can be used to predict a class by learning decision rules inferred from prior data (the training data). The DT is therefore learned from the training data, and the result of the DT is then compared with the human results based on the classification of the stability (SB) classes of the test data. Table 2 shows the final result of the DT using the traditional method. Values with a superscript letter represent the intersection between correct human decisions and correct DT decisions. These numbers are the numbers of FA terms that are classified correctly both manually and by the DT system. Fig. 4 shows the rates of precision, recall, and F-measure used to evaluate the SB classes that result from applying the DT to FA term frequencies in a specific period using the new approach.
Figure 4. Recall, Precision and F-measure using three stability classes
From Fig. 4, the rates of precision, recall, and F-measure show the accuracy level of the new system in correctly classifying FA terms that are evaluated automatically by the DT C4.5 based on frequency change with time.
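For reference, the three measures can be computed per class from the human labels and the DT labels as in the sketch below. The label lists are hypothetical and are not taken from Table 2; the function name `precision_recall_f` is likewise an assumption. The class names IC, CC, and DC follow the abbreviations used in the conclusion.

```python
def precision_recall_f(human, predicted, label):
    """Precision, recall, and F-measure of one class against human labels."""
    pairs = list(zip(human, predicted))
    tp = sum(1 for h, p in pairs if h == label and p == label)  # agreed hits
    fp = sum(1 for h, p in pairs if h != label and p == label)  # DT only
    fn = sum(1 for h, p in pairs if h == label and p != label)  # human only
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Hypothetical labels for five FA terms.
human_labels = ["IC", "IC", "CC", "CC", "DC"]
dt_labels = ["IC", "CC", "CC", "CC", "DC"]

for cls in ("IC", "CC", "DC"):
    p, r, f = precision_recall_f(human_labels, dt_labels, cls)
    print(f"{cls}: precision={p:.3f} recall={r:.3f} F={f:.3f}")
```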

Comparison with the Traditional Method
In this section, the non-linear approach is used to evaluate the accuracy of the new method, and the resulting DT precision is compared with that of the linear trend model of older methods. Those methods relied on common words, neglected the important relation between words and fields, and used a trend line to represent the relation between normalized frequency and time, as shown in Table 3. Table 3 shows the comparison between the rates of recall, precision, and F-measure for the new (non-linear) model and the linear trend model. From the evaluation results shown in Table 3, it is clear that the rates of recall, precision, and F-measure for the new (non-linear) model are higher by 10% than the rates obtained using the linear trend model.

Conclusion
In this paper, a new model, called the non-linear model, is introduced to automatically produce SB classes for ordered FA terms. The effectiveness of the new (non-linear) model is confirmed by F-measures of 83.4% for (IC), 92.8% for (CC), and 6.6% for (DC). In contrast, the F-measure is 75.9% for (IC), 89.7% for (CC), and 4.4% for (DC) using the old (linear) model. In conclusion, the new approach improved the F-measure for the increment, steady, and decrement classes by 7.7%, 3.1%, and 2.2%, respectively. The performance is therefore better when using our new approach: by using a non-linear relation, it improves the performance of the decision tree compared with the linear approach and other traditional approaches. Future work could focus on applying the new approach to Arabic and other languages.