New Path Filling Method on Data Preprocessing in Web Mining

This article discusses the importance of data preprocessing in web mining and describes the topology of a website as it occurs in practice. It analyzes the limitations of the method in reference [3] and proposes a data structure based on an adjacency list, together with a path filling algorithm built on that structure. The data structure matches the topology commonly found in existing websites, and the algorithm has a lower time complexity.


Introduction
With the rapid development of the Internet, the amount of information on it has grown steadily. It was estimated that there were 350 million web pages in 1999, with the number increasing at a rate of one million per day. Google has recently declared that it has indexed 3000 million web pages. The World Wide Web is the largest database at present, and accessing these data effectively is a challenging task [1].
An effective approach to this problem is web mining. Web mining applies data mining techniques to web data in order to discover interesting usage patterns and implicit information [2].
In practice, however, data mining places strict quality requirements on the data it processes. A key step of data mining is the establishment of appropriate data sets [7], so carrying out data preprocessing before mining is very important. According to statistics, two-thirds of data mining analysts consider that complete data preprocessing takes about sixty percent of the whole mining time [8].
Web mining is classified into three categories: web content mining, web log mining, and web structure mining [6]. For web content mining and web structure mining, user identity is not critical. For web log mining, however, the local cache and the proxy server cache mean that a page reached by pressing the "backward" button of the browser has no corresponding record in the server log [2]. Path filling must therefore be carried out; otherwise the mining results will be seriously affected.
Article [3] proposed the STT algorithm, which transforms the topology of a website into a binary tree. The method is indeed innovative, but I think the algorithm still has some limitations.

STT principle and limitation
The website topology described in article [3] is shown in figure 2-1. It is a tree structure, and compared with the actual condition it has the following three problems:
(1) The real topology of a current website is a graph, as shown in figure 2-2, in which a path such as E→F can exist.
(2) Generally, the depth is much larger than the width in the topology of a website. Article [3] transforms the tree into a binary tree, which inevitably increases the search depth and decreases the efficiency of the algorithm.
(3) Existing references generally assume that a path starts from the root node. This does not hold in practice: a path can start from any node.
Data structure based on adjacency list and path filling algorithm

Data structure of topology of web
After analysis and selection, the adjacency list is used as the data structure for the website topology. It is constructed as follows.
Definition 3.1 The set of website nodes can be described as a sequence L = {i | 1 ≤ i ≤ n}, where n is the total number of nodes. Numbering starts from the root node and proceeds from the inner layers to the outer layers, and from left to right within each layer; that is, the nodes are sorted in breadth-first order.
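The numbering in Definition 3.1 can be illustrated with a minimal Python sketch. The adjacency list `adj` and the helper name `number_nodes` are assumptions made for illustration; the edges are hypothetical and are not a transcription of figure 2-2.

```python
from collections import deque

# Hypothetical site topology as an adjacency list: page -> pages it links to.
adj = {
    "A": ["B", "C"],
    "B": ["E", "F"],
    "C": ["G"],
    "E": ["I"],
}

def number_nodes(adj, root):
    """Assign numbers 1..n to the nodes reachable from root in breadth-first order."""
    order = {root: 1}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for child in adj.get(node, []):
            if child not in order:          # a graph may reach a node twice
                order[child] = len(order) + 1
                queue.append(child)
    return order

numbering = number_nodes(adj, "A")
```

Because the topology is a graph rather than a tree, the sketch records each node only the first time it is reached.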
For example, suppose the user path in figure 2-2 is ABEIF and the filled path is ABEIBF; B is clearly a node in the searched path.
Proof: A discontinuous point arises because the user pressed the "backward" button of the browser, and the node reached by going backward is necessarily in the searched path, so the theorem holds.
Theorem 3.2 If a node in the searched path has at most one sub-node, the node cannot be a filling node.
Proof: If a node in the searched path has no sub-nodes, it is a leaf node and cannot be reached by the "backward" button, so the node cannot be taken as a filling node.
If a node in the searched path has exactly one sub-node and is reached by the "backward" button, it is the direct parent of its only sub-node. Because it has no other sub-node to lead to, the node cannot be taken as a filling node.
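Theorem 3.2 can be read as a pruning rule: when looking for a filling node, any node of the searched path with fewer than two sub-nodes can be skipped. A minimal Python sketch follows; the adjacency list and the function names `can_be_filling_node` and `filling_candidates` are hypothetical and chosen only for illustration.

```python
# Hypothetical adjacency list: page -> pages it links to.
adj = {
    "A": ["B", "C"],
    "B": ["E", "F"],
    "E": ["I"],
}

def can_be_filling_node(adj, node):
    """Theorem 3.2: a node with at most one sub-node cannot be a filling node."""
    return len(adj.get(node, ())) > 1

def filling_candidates(adj, searched_path):
    """Keep only the nodes of the searched path that pass the Theorem 3.2 test."""
    return [n for n in searched_path if can_be_filling_node(adj, n)]

candidates = filling_candidates(adj, list("ABEI"))
```

In this hypothetical graph, E has one sub-node and I has none, so only A and B remain as candidates.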
Suppose i is the scan variable of the original path, j is the scan variable of the object path, and pre_i is the backtracking scan variable of the original path.
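One way these variables could drive path filling is sketched below in Python. This is an illustrative sketch under stated assumptions, not the algorithm as published: the adjacency list is hypothetical (its edges are chosen so that the ABEIF example yields ABEIBF), the function name `fill_path` is invented, and j is represented implicitly by the length of the output list.

```python
# Hypothetical adjacency list: page -> pages it links to.
adj = {
    "A": ["B", "C"],
    "B": ["E", "F"],
    "E": ["I"],
}

def fill_path(adj, path):
    """Insert the node the user went back to wherever the logged path is discontinuous."""
    if not path:
        return []
    filled = [path[0]]                      # object path; j == len(filled)
    for i in range(1, len(path)):           # i scans the original path
        target = path[i]
        if target not in adj.get(path[i - 1], ()):
            # Discontinuity: backtrack over the nodes already searched.
            pre_i = i - 2
            while pre_i >= 0:
                children = adj.get(path[pre_i], ())
                # Theorem 3.2: skip nodes with at most one sub-node.
                if len(children) > 1 and target in children:
                    filled.append(path[pre_i])
                    break
                pre_i -= 1
            # If no filling node is found, the target is appended unchanged.
        filled.append(target)
    return filled

result = "".join(fill_path(adj, list("ABEIF")))
```

With the hypothetical edges above, E is skipped because it has a single sub-node, and B is inserted because it links to F.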