Page Segmentation

PAGE SEGMENTATION via FACTOR GRAPH MODEL

Sen Wu, Tao Lei, Jie Tang, Cheng Chen, and Haixun Wang

Knowledge Engineering Group, Tsinghua University; Microsoft Research Asia

Introduction

Page segmentation is the task of automatically identifying (and extracting) visually and semantically distinct regions of a given webpage. We consider the page segmentation problem in the heterogeneous Web space, where page structures are very diverse, ranging from plain text to complex script-based content. To address this problem, we integrate common spatial and textual features into an unsupervised factor graph model, which efficiently determines the number of blocks and extracts them.

Datasets and demo software are available below. Please email us if you run into any problems.

Datasets

The datasets used in the experiments of our KDD 2011 paper are publicly available here. They include three genres of real-world webpages: Personal Homepages, Course Homepages and News Pages.

Personal homepages and course homepages are generally simple, flat webpages. We obtained their URLs from the Arnetminer database and crawled the pages from the Web. News pages were crawled from typical news websites such as CNN, BBC and ESPN; they have more complex page structure and richer content than the homepages.

Figure 1: Two distinctive webpages from the segmentation datasets

In addition, we tested our page segmentation method on a content extraction application, using the CleanEval content extraction task to evaluate performance. Additional information and the CleanEval dataset can be found at http://cleaneval.sigwac.org.uk/.

Software

We implemented a Windows Forms application as demo software for our page segmentation model. The demo uses fewer features than the full model (e.g. the topic similarity feature is omitted because it requires preprocessing to train a topic model). The application is based on the open source project csEXWB, which supports advanced customization and full control of the WebBrowser control. All programs are implemented on .NET Framework 4 with Visual Studio 2010. The demo software can be downloaded here.

Figure 2: Demo screenshots. Each page block is displayed using HTML borders of the same color.

How to use? 

  • Before running the program, first register ComUtilities.dll using regsvr32.exe (make sure you have administrator rights to do this), for example:

regsvr32.exe C:\PageSegmentation\ComUtilities.dll

  • Run HTMLParser.exe in the demo package.
  • Type in a webpage URL in the top textbox and press ENTER.
  • After the webpage is fully loaded, click the "Segment" button and wait for the segmentation results. The program may take a few seconds to analyze the webpage.
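
The first two steps above (registering the DLL and starting the demo) can also be scripted. The following is a minimal Python sketch, not part of the demo package. It assumes the package was unpacked to C:\PageSegmentation, as in the regsvr32 example, and that HTMLParser.exe sits in the same folder as ComUtilities.dll. The function names are hypothetical helpers introduced only for illustration.

```python
import subprocess
import sys
from pathlib import PureWindowsPath

# Assumption: the demo package was unpacked to the folder used in the
# regsvr32 example; change this to your own location.
DEMO_DIR = PureWindowsPath(r"C:\PageSegmentation")


def register_command(demo_dir=DEMO_DIR):
    """Build the regsvr32 command line that registers ComUtilities.dll."""
    return ["regsvr32.exe", str(demo_dir / "ComUtilities.dll")]


def launch_command(demo_dir=DEMO_DIR):
    """Build the command that starts the demo.

    Assumption: HTMLParser.exe is in the same folder as the DLL.
    """
    return [str(demo_dir / "HTMLParser.exe")]


def register_and_launch(demo_dir=DEMO_DIR):
    """Register the DLL, then start the demo (Windows only)."""
    if sys.platform != "win32":
        raise RuntimeError("The demo is a Windows application.")
    # Registration fails unless the script runs with administrator rights.
    subprocess.run(register_command(demo_dir), check=True)
    # Start the demo without waiting for it to exit.
    subprocess.Popen(launch_command(demo_dir))
```

Calling register_and_launch() from an elevated (administrator) prompt performs both steps. Typing the URL and clicking "Segment" are still done by hand in the demo window.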


FAQ

We will publish a FAQ list later. If you have any questions, please contact us. We will be more than happy to hear your feedback and update the FAQ accordingly.

 

We are still updating this page. Last updated: March 3, 2011, by Tao Lei.