A Generalized Tree Matching Algorithm Considering Nested Lists for Web Data Extraction

SDM(2010)

引用 43|浏览60
暂无评分
摘要
This paper studies structured data extraction from Web pages. One of the effective methods is tree matching, which can detect template patterns from web pages used for extraction. However, one major limitation of existing tree matching algorithms is their inability to deal with embedded lists with repeated patterns. In the Web context, lists are everywhere, e.g., lists of products, jobs and publications. Due to the fact that lists in trees may have different lengths, the match score of the trees can be very low although they follow exactly the same template pattern. To make the matter worse, a list can have nested lists in it at any level. To solve this problem, existing research uses various heuristics to detect candidate lists first and then applies tree matching to generate data extraction patterns. This paper proposes a generalized tree matching algorithm by extending an existing tree matching algorithm with the ability to handle nested lists through a novel grammar generation algorithm. To the best of our knowledge, this is the first tree matching algorithm that is able to consider lists. In addition, it is well-known that there are two problem formulations for Web data extraction: (1) pattern generation based on multiple pages following the same template, and (2) pattern generation based on a single page containing lists of data instances following the same templates (each list may use a different template). These two problems are currently solved using different algorithms. The proposed (single) algorithm is able to solve both problems effectively. Extensive experiments show that the new algorithm outperforms the state-of-the-art existing systems for both problems considerably.
更多
查看译文
关键词
we data extraction,web mining,structured data,web pages,generic algorithm
AI 理解论文
溯源树
样例
生成溯源树,研究论文发展脉络
Chat Paper
正在生成论文摘要