Introduction
Page segmentation is the task of automatically identifying (and extracting) visually and semantically distinct regions of a given webpage. We consider the page segmentation problem in the heterogeneous Web space, where page structures are highly diverse, ranging from plain text to complex script-based content. To address this problem, we integrate common spatial and textual features into an unsupervised factor graph model that efficiently determines the number of blocks and extracts them.
Datasets and demo software are available below. Please email us if you run into any problems.
Datasets
The datasets used in the experiments of our KDD 2011 paper are publicly available here. They cover three genres of real-world webpages: Personal Homepages, Course Homepages and News Pages.
Personal homepages and course homepages are generally simple, flat webpages. We obtained their URLs from the Arnetminer database and crawled the pages from the Web. News pages were crawled from typical news websites such as CNN, BBC and ESPN; they have more complex page structures and richer content than the homepages.
Figure 1: Two distinctive webpages in the segmentation datasets
In addition, we tested our page segmentation on a content extraction application, using the CleanEval content extraction task to evaluate performance. Additional information and the CleanEval dataset can be found at http://cleaneval.sigwac.org.uk/.
Software
We implemented a Windows Forms application as the demo software for our page segmentation model. The demo uses fewer features than the full model (e.g. the topic similarity feature is omitted because it requires preprocessing to train a topic model). The application is based on the open source project csEXWB, which supports advanced customization and full control of the WebBrowser control. All programs are implemented on .NET Framework 4 with Visual Studio 2010. The demo software can be downloaded here.
Figure 2: Demo screenshots. Each page block is displayed using HTML borders of the same color.
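The border-based display described in the caption can be illustrated with a minimal Python sketch. This is not the demo's actual code (the demo is a .NET application); the function name `border_styles`, the palette, and the element ids are hypothetical and only show the idea of giving every element of a block the same border color.

```python
from itertools import cycle

# Hypothetical palette; the colors used by the actual demo are not documented here.
PALETTE = ["red", "blue", "green", "orange", "purple"]


def border_styles(block_of_element):
    """Map each element id to a CSS border style, one color per block.

    block_of_element: dict of element id -> block label (illustrative names).
    Colors repeat if there are more blocks than palette entries.
    """
    palette = cycle(PALETTE)
    color_of_block = {}
    styles = {}
    for element_id, block in block_of_element.items():
        # The first time a block is seen, assign it the next palette color.
        if block not in color_of_block:
            color_of_block[block] = next(palette)
        styles[element_id] = f"border: 2px solid {color_of_block[block]};"
    return styles


if __name__ == "__main__":
    # Assumed segmentation output: "nav" and "logo" belong to the same block.
    segmentation = {"nav": 0, "logo": 0, "article": 1, "footer": 2}
    for element_id, style in border_styles(segmentation).items():
        print(element_id, style)
```

Elements that share a block label receive identical style strings, so a block appears on the page as a group of regions outlined in one color.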
How to use?
regsvr32.exe C:\PageSegmentation\ComUtilities.dll
FAQ
We will publish a FAQ list later. If you have any questions, please contact us. We will be more than happy to hear your feedback and update the FAQ accordingly.
We are still updating this page. Last updated: March 3, 2011, by Tao Lei.