
TEXT: Automatic Template Extraction from Heterogeneous Web Pages

Abstract:
The World Wide Web is the most useful source of information. To achieve high publishing productivity, the web pages of many websites are populated automatically by filling common templates with content. Templates give readers easy access to the content through consistent structure. For machines, however, templates are considered harmful: the irrelevant terms they contain degrade the accuracy and performance of web applications. Template detection techniques have therefore received a lot of attention recently as a way to improve the performance of search engines, clustering, and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents generated from heterogeneous templates. We cluster the web documents based on the similarity of their underlying template structures, so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure with a fast approximation for clustering and provide a comprehensive analysis of our algorithm. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm compared to state-of-the-art template detection algorithms.

Existing System:
In the existing system, templates have a consistent structure even though they are not explicitly announced. For machines, unknown templates are harmful: their irrelevant terms degrade accuracy and performance. The number of templates handled is limited.

Web documents must be clustered so that the documents in the same group belong to the same template; the correctness of the extracted templates therefore depends on the quality of the clustering.
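To make the clustering requirement concrete, the following minimal Python sketch compares pages by structure. It assumes that a page's structure can be approximated by its set of root-to-element tag paths and that similarity is measured with the Jaccard coefficient; both choices, and the names `PathCollector`, `tag_paths`, and `jaccard`, are illustrative assumptions rather than the representation defined in the paper.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not stay on the stack.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class PathCollector(HTMLParser):
    """Collects the set of root-to-element tag paths of an HTML document."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open tags, outermost first
        self.paths = set()   # e.g. {"html", "html/body", "html/body/div"}

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; ignore stray closing tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


def tag_paths(html):
    """Return the set of tag paths found in an HTML string."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two sets (1.0 when both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Hypothetical pages: the first two share a layout, the third does not.
page_a = "<html><body><div><h1>News</h1><p>Story one</p></div></body></html>"
page_b = "<html><body><div><h1>News</h1><p>Story two</p></div></body></html>"
page_c = "<html><body><table><tr><td>Price</td></tr></table></body></html>"

print(jaccard(tag_paths(page_a), tag_paths(page_b)))  # 1.0
print(jaccard(tag_paths(page_a), tag_paths(page_c)))  # 0.25
```

Pages built from the same template score close to 1 even when their text differs, which is the property a structure-based clustering relies on.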

Proposed System:
We introduce a novel approach to template detection from heterogeneous web documents. We employ the MDL (minimum description length) principle to manage the unknown number of clusters, and we introduce our extended MinHash technique to speed up the clustering process. Experimental results with real-life data sets confirm the effectiveness of our algorithms.
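As background for the MinHash step, the sketch below shows standard MinHash in Python, not the extended variant the authors describe. It assumes each document is represented as a set of strings (for example, structural paths); the helper names `minhash_signature` and `estimate_jaccard`, the seeded BLAKE2b hashing, and the signature length are illustrative assumptions.

```python
import hashlib


def _hash(item, seed):
    """Deterministic 64-bit hash of a string under a given seed."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=256):
    """For each seeded hash function, keep the minimum hash over the set."""
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_hash(x, seed) for x in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching positions."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(1 for a, b in zip(sig_a, sig_b) if a == b)
    return matches / len(sig_a)


# Hypothetical path sets sharing 50 of 150 distinct items (true Jaccard = 1/3).
doc_a = {f"path{i}" for i in range(100)}
doc_b = {f"path{i}" for i in range(50, 150)}

estimate = estimate_jaccard(minhash_signature(doc_a), minhash_signature(doc_b))
print(f"estimated similarity: {estimate:.2f} (true value is about 0.33)")
```

The signatures have a fixed length regardless of document size, so comparing two documents costs the same no matter how many paths they contain. That is the general reason MinHash can speed up similarity-based clustering.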

KEYWORDS:
Generic Technology Keywords: Database, User Interface, Programming
Specific Technology Keywords: C#.Net, ASP.Net, MS SqlServer-08
Project Keywords: Presentation, Business Object, Data Access Layer, Database
SDLC Keywords: Analysis, Design, Code, Testing, Implementation, Maintenance

SYSTEM CONFIGURATION
HARDWARE CONFIGURATION
S.NO   HARDWARE                 CONFIGURATIONS
1      Operating System         Windows 2000 & XP
2      RAM                      1 GB
3      Processor (with Speed)   Intel Pentium IV (3.0 GHz) and upwards
4      Hard Disk Size           40 GB and above
5      Monitor                  15" CRT

SOFTWARE CONFIGURATION
S.NO   SOFTWARE     CONFIGURATIONS
1      Platform     Microsoft Visual Studio
2      Framework    .Net Framework 4.0
3      Language     C#.Net
4      Front End    Windows application
5      Back End     SQL Server 2008
