
to Advance Knowledge for Humanity

VisHue: Web Page Segmentation for an Improved Query Interface for MedlinePlus Medical Encyclopedia

Aastha Madaan, Wanming Chu, Subhash Bhalla
University of Aizu

12/10/2011


Outline
1. Introduction
2. Background
   a) Hierarchical structure
   b) Page-Level Segmentation
3. Web Page Segmentation Algorithms
   a) Features
   b) Main focus
   c) Comparison
4. The Proposal: The VisHue Algorithm
5. Query by Segment
6. Performance Analysis
7. Discussions
8. Summary and Conclusions


1. Introduction
The WWW is the largest and most commonly used source of information.
Deep querying is gaining importance.
Understanding web page semantics improves the user's search experience.
Within a web page, the task is to identify semantic groups.
Discovering these semantic blocks is important.



1(i). The Statement [1]


A. Large variety of HTML pages: are they suitable for query and search?
B. Basic requirements: searching and querying
   From simple querying and searching to semantic querying and searching
C. Significant: recognize the semantic and coherent segments
   From page level to segment level
D. Case example: medical encyclopedia
   MedlinePlus: various choices of medical encyclopedias


1(i). The Statement [2]

[Figure: UML class diagram]


2. Background: MedlinePlus
Web page:
   i. Relevant content
   ii. Irrelevant content
a. Relevant content:
   i. Topic headings
   ii. Topic-wise contents
b. Irrelevant content:
   Navigation bars, header, footer, advertisements

Headings: identify the hierarchical structure
Distinct blocks: what a user's perception identifies
Main focus: skilled and semi-skilled users
Assumption: headings serve as query attributes

2(a). Hierarchical Structure


1. Hierarchical structure: the logical structure within the page (document)
2. Indicates the binary relationships (belongingness) between a pair of segments
3. An accurate hierarchical representation yields user-level query attributes (in segments)
4. The proposed hierarchical structure is based on domain knowledge (skilled and semi-skilled users) and captures the user's perception
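
The belongingness relation in point 2 can be pictured as a parent-child tree. The following is a minimal sketch, not the VisHue algorithm itself: it assumes every heading has already been assigned a depth, and the `level` values, the helper names and the sample labels are all hypothetical. It only shows how a flat list of headings becomes a hierarchy.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str
    level: int
    children: list = field(default_factory=list)


def build_hierarchy(headings):
    """Build a tree from (level, label) pairs listed in document order."""
    root = Node("page", 0)
    stack = [root]  # path from the root to the most recently added node
    for level, label in headings:
        # Pop until the top of the stack is strictly shallower: that is the parent.
        while stack[-1].level >= level:
            stack.pop()
        node = Node(label, level)
        stack[-1].children.append(node)
        stack.append(node)
    return root


def belongs_to(root, child_label, parent_label):
    """Return True if child_label is a direct child of parent_label."""
    todo = [root]
    while todo:
        node = todo.pop()
        if node.label == parent_label:
            return any(c.label == child_label for c in node.children)
        todo.extend(node.children)
    return False


# Hypothetical headings for one encyclopedia page.
headings = [(1, "Hypertension"), (2, "Causes"), (2, "Symptoms"), (2, "Treatment")]
tree = build_hierarchy(headings)
```

Here `belongs_to(tree, "Symptoms", "Hypertension")` holds, while the reverse does not, which is the pairwise relation a query interface can later expose as attributes.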

2(a).(i). Segmentation Semantic Query

[Figure: Common Web User; User; Semantic query and search (in future)]


2 (b). Page-Level Segmentation


Definition
A segment is a self-contained logical region within a Web page that is:
(i) not nested within any other segment;
(ii) represented by a pair (l, c),
where l is the label of the segment and c is the portion of text of the segment [1].
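
A minimal sketch of this definition in code. The definition itself needs only the pair (l, c); the `start` and `end` character offsets are a hypothetical addition, assumed here only so that condition (i) can be checked mechanically.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    label: str    # l: label of the segment
    content: str  # c: portion of text of the segment
    start: int    # assumed character offset where the segment begins
    end: int      # assumed character offset where the segment ends (exclusive)


def is_page_level(segments):
    """Check condition (i): no segment lies inside another segment."""
    for a in segments:
        for b in segments:
            if a is not b and b.start <= a.start and a.end <= b.end:
                return False
    return True


# Hypothetical segments of one page.
flat = [Segment("Causes", "text", 0, 40), Segment("Symptoms", "text", 40, 90)]
nested = flat + [Segment("Note", "text", 45, 60)]
```

The `flat` list satisfies the definition, whereas `nested` does not, because the third segment falls inside the span of "Symptoms".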


3. Segmentation algorithms
i. History: segmentation traces back to the year 2001 (and continues through 2011)
ii. Various application domains
iii. Various techniques for segmenting
iv. Various terminologies used
v. Proposed: for MedlinePlus, the items of the user's focus become query attributes

3 (a). Features of Segmentation Algorithm


A. Match and identify a user's points of focus
B. Discover informative segments
   i. Better search and query
   ii. Segments become queryable attributes
   iii. Skilled users aim to query the informative areas (only)
C. Generate a true hierarchical structure
D. Segmentation process: low space and time complexity

3(b). Main Focus


Find the algorithm best suited to:
1. Generate the hierarchical structure
2. Convert segments to attributes in a database
3. Facilitate in-depth querying
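
Goals 2 and 3 can be pictured as a table with one column per segment label. This is a sketch under stated assumptions, not the authors' schema: the table name `encyclopedia`, the column names and the sample rows are hypothetical, and Python's built-in `sqlite3` module stands in for whatever database is actually used.

```python
import sqlite3


def load_segments(pages):
    """Store each page as one row, with one column per segment label."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE encyclopedia (title TEXT, causes TEXT, symptoms TEXT)"
    )
    conn.executemany(
        "INSERT INTO encyclopedia VALUES (:title, :causes, :symptoms)", pages
    )
    return conn


# Hypothetical pages, already split into labelled segments.
pages = [
    {"title": "Page A", "causes": "cause text a", "symptoms": "fever, cough"},
    {"title": "Page B", "causes": "cause text b", "symptoms": "headache"},
]
conn = load_segments(pages)

# In-depth query: restrict the search to a single segment (attribute).
rows = conn.execute(
    "SELECT title FROM encyclopedia WHERE symptoms LIKE ?", ("%fever%",)
).fetchall()
```

Because the keyword is matched against the `symptoms` column only, a page that mentions "fever" in some other segment would not be returned, which is the difference from page-level keyword search.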



3 (b). (i). Segmentation Methods Web Technologies


3 (b). (ii). Classification of Algorithms


3(b). (iii). Timeline Techniques


Technique                          Algorithm          Year
Template Detection                 [9], [6]           2002, 2007
DOM-Node Recognition               [8], [11], [10]    2001, 2002, 2006
Visual-DOM based Rendering         [2]                2003
Visual-Heuristics based Method     Proposed
Graph-theoretic Method             [3]                2008
Linguistics based Method           [7]                2008
Image of the Web Page              [4], [5]           2010, 2009
Site-Oriented Method               [1]                2011


3(c). Comparison


3(c).(i). Main Focus


3(c).(ii). Comparison: Vision-based Methods


3(c).(iii). Content Structure by VisHue


4. The Proposal: VisHue Algorithm


4. (i). Query Interfaces


Querying vs. Searching
Searching: Recent Trends
1. Object based search
2. Block based search
3. Entity based search

Querying: Recent Trends
Very few efforts have been made

5. Query by Segment
Query by Segment is realized as Query by Tag (Heading), QBT
Based on the content structure (VisHue algorithm): Query by Attributes
MedlinePlus medical encyclopedia: 3886 web pages
Target: focused and explicit querying
i. Beneficial for skilled and semi-skilled users
ii. The medical encyclopedia is the result of years of effort by experts


5. (i). The QBT interface


[Figure: QBT interface alongside traditional search on the MedlinePlus medical encyclopedia; DB attributes shown: Title, Causes, Symptoms, PostCare]


5. (ii). QBT Interface Hierarchical Structure


Labels become query attributes
QBT interface: search and query
Child nodes become search attributes
Left siblings limit the scope of search of right siblings in the interface
Segments become attributes for deep query over all pages of MedlinePlus
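
The sibling scoping rule can be read as a cascade of filters: each attribute chosen on the left narrows the set of pages in which the next attribute is searched. A minimal sketch under that reading; the `narrow` helper and the page records are hypothetical, and plain substring matching stands in for whatever matching the interface actually uses.

```python
def narrow(pages, conditions):
    """Apply (attribute, keyword) conditions from left to right.

    Each condition is evaluated only on the pages that survived the
    conditions to its left, so earlier choices limit later ones.
    """
    scope = list(pages)
    for attribute, keyword in conditions:
        scope = [
            p for p in scope
            if keyword.lower() in p.get(attribute, "").lower()
        ]
    return scope


# Hypothetical pages, keyed by segment label.
pages = [
    {"Title": "Page A", "Causes": "infection", "Symptoms": "fever"},
    {"Title": "Page B", "Causes": "infection", "Symptoms": "rash"},
    {"Title": "Page C", "Causes": "injury", "Symptoms": "fever"},
]
result = narrow(pages, [("Causes", "infection"), ("Symptoms", "fever")])
```

In this example the "Symptoms" condition is checked only against pages A and B, so page C is never considered even though its symptoms match.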

6. Performance Analysis
i. Qualitative comparison with traditional keyword search
ii. Query formulation and interpretation
iii. Quantitative performance analysis of the interface

6.(i). QBT vs. Keyword Search


6. (ii). Query Formulation: A Comparison


6. (iii). Query Example


Query 1: Cases where the patient has hypertension but not high blood pressure
QBT query:
Symptoms: Hypertension
Symptoms: NOT High Blood Pressure
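
Read literally, this query keeps pages whose Symptoms segment mentions the first term and drops those that mention the second. A minimal sketch of that evaluation, assuming each page is a dictionary from segment label to text and each condition is a hypothetical `(attribute, term, negated)` triple; the sample pages are invented for illustration only.

```python
def matches(page, conditions):
    """Return True if the page satisfies every condition."""
    for attribute, term, negated in conditions:
        found = term.lower() in page.get(attribute, "").lower()
        # Fail when a required term is missing or a negated term is present.
        if found == negated:
            return False
    return True


query = [
    ("Symptoms", "hypertension", False),
    ("Symptoms", "high blood pressure", True),
]

# Hypothetical pages, keyed by segment label.
pages = [
    {"Title": "Page A", "Symptoms": "Hypertension, headache"},
    {"Title": "Page B", "Symptoms": "Hypertension (high blood pressure)"},
    {"Title": "Page C", "Symptoms": "Dizziness"},
]
hits = [p["Title"] for p in pages if matches(p, query)]
```

Only the first page is kept: the second mentions the negated term and the third lacks the required one. A keyword search box cannot express this distinction, because both terms would have to be matched against the whole page rather than one segment.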


6. (iv). Query Attributes


6. (v). Query Results


6. (vi). Quantitative Performance Analysis


QBT Query
Symptom: Hypertension
Symptom: NOT High Blood Pressure
Before Procedure: Stop
After Procedure: Normal
Cause: High Blood Pressure
Symptom: Heart Attack
Food Source: Fish
Side Effect: Poisoning


7. Discussions
Content fragments, as perceived by skilled and semi-skilled domain users, are determined by the web page segmentation process.
Proposed effort: formulating a generic algorithm based on heuristic design rules and visual features.
The QBT interface: query over user-identified segments (attributes).
Aim: convert MedlinePlus pages into a DB.
Contention: a web page is a good source for an easy-to-use new query language interface for segments.

8. Summary and Conclusions


A. Heuristics + visual features based segmentation is a turning point:
   A. Provides an independent solution
   B. Improves query interfaces for the chosen domain
B. The medical domain: need to make the information accessible to end users
C. Query by Segment or Tag (QBT): an attempt
   A. Aim: return query-able attributes to the users


References
1. A Site Oriented Method for Segmenting Web Pages, David Fernandes, Edleno S. de Moura, Altigran S. da Silva, Berthier Ribeiro-Neto, Edisson Braga, SIGIR'11, July 24-28, 2011.
2. Extracting Content Structure for Web Pages based on Visual Representation, Deng Cai, Shipeng Yu, Ji-Rong Wen and Wei-Ying Ma, Web Technologies and Applications: 5th Asia-Pacific Web Conference, APWeb 2003, Xi'an, China, April 23-25, 2003. Proceedings (2003), pp. 596-596.
3. Graph-Theoretic Approach to Webpage Segmentation, Deepayan Chakrabarti, Ravi Kumar, Kunal Punera, WWW 2008 / Refereed Track: Search - Corpus Characterization & Search Performance, Beijing, China.
4. A segmentation method for web page analysis using shrinking and dividing, Jiuxin Cao, Bo Mao and Junzhou Luo, International Journal of Parallel, Emergent and Distributed Systems, 25:2, 93-104, 2010.
5. Web Page Layout via Visual Segmentation, Ayelet Pnueli, Ruth Bergman, Sagi Schein, Omer Barkol, HP Laboratories, 2009.
6. Page-level template detection via isotonic smoothing, D. Chakrabarti, R. Kumar, and K. Punera, In 16th WWW, pages 61-70, 2007.
7. A Densitometric Approach to Web Page Segmentation, Christian Kohlschütter, Wolfgang Nejdl, CIKM'08, October 26-30, 2008.
8. HTML Page Analysis Based on Visual Cues, Yudong Yang and HongJiang Zhang, IEEE, 2001.
9. Template Detection via Data Mining and its Applications, Ziv Bar-Yossef, Sridhar Rajagopalan, In Proceedings of WWW'02, May 7-11, 2002, Honolulu, Hawaii, USA.
10. DeSeA: A Page Segmentation based Algorithm for Information Extraction, He Juan, Gao Zhiqiang, Xu Hui, Qu Yuzhong, Proceedings of the First International Conference on Semantics, Knowledge, and Grid (SKG 2005).
11. Reverse Engineering for Web Data: From Visual to Semantic Structures, Christina Yip Chung, Michael Gertz, Neel Sundaresan, In Proceedings of the 18th International Conference on Data Engineering (ICDE'02).


Thank you
Questions
