Alarte-Aleixandre, J.; Insa Cabrera, D.; Silva, J.; Tamarit Muñoz, S. (2016). Site-Level Web Template Extraction Based on DOM Analysis. Lecture Notes in Computer Science. 9609:36-49. https://doi.org/10.1007/978-3-319-41579-6
Please use this identifier to cite or link to this item: http://hdl.handle.net/10251/82004
Title:
|
Site-Level Web Template Extraction Based on DOM Analysis
|
Authors:
|
Alarte-Aleixandre, Julián
Insa Cabrera, David
Silva, Josep
Tamarit Muñoz, Salvador
|
UPV entity:
|
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació
Universitat Politècnica de València. Escola Tècnica Superior d'Enginyeria Informàtica
|
Release date:
|
|
Abstract:
|
Web templates are one of the main development resources for website
engineers. Templates allow them to increase productivity by
plugging content into already formatted and prepared pagelets. For the
end user, templates are also useful, because they provide uniformity and
a common look and feel across all webpages. However, from the point of view
of crawlers and indexers, templates are an important problem, because
they usually contain irrelevant information such as advertisements,
menus, and banners. Processing and storing this information leads to a
waste of resources (storage space, bandwidth, etc.). It has been measured
that templates represent between 40% and 50% of the data on the Web.
Therefore, identifying templates is essential for indexing tasks. In this
work we propose a novel method for automatic web template extraction
that is based on similarity analysis between the DOM trees of a collection
of webpages that are detected using a hyperlink analysis. Our implementation
and experiments demonstrate the usefulness of the technique.
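
As a rough illustration of what comparing the DOM trees of several pages of one site can look like, the following minimal sketch collects the root-to-node tag paths of each page and keeps the paths shared by all pages as a template candidate. It is a simplified stand-in, not the method of the paper: the names `PathCollector`, `dom_paths`, `jaccard`, and `template_candidate` are hypothetical, the path-set comparison replaces a real tree similarity analysis, and the pages are assumed to have already been gathered (the hyperlink analysis step is not shown).

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class PathCollector(HTMLParser):
    """Collects the root-to-node path of every element in a page (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.stack = []     # labels of the currently open elements
        self.paths = set()  # every root-to-node path seen so far

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        # A node is labeled by its tag plus its id attribute, when present.
        label = tag + (f"#{attributes['id']}" if attributes.get("id") else "")
        self.paths.add("/".join(self.stack + [label]))
        if tag not in VOID_TAGS:
            self.stack.append(label)

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, tolerating sloppy HTML.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i].split("#")[0] == tag:
                del self.stack[i:]
                break


def dom_paths(html):
    """Return the set of root-to-node paths of one page."""
    collector = PathCollector()
    collector.feed(html)
    collector.close()
    return collector.paths


def jaccard(a, b):
    """Jaccard similarity of two path sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def template_candidate(pages):
    """Paths present in every page: a crude approximation of the shared template."""
    return set.intersection(*(dom_paths(page) for page in pages))


page_a = ("<html><body><div id='menu'><a>Home</a></div>"
          "<div id='content'><p>First article</p></div></body></html>")
page_b = ("<html><body><div id='menu'><a>Home</a></div>"
          "<div id='content'><table><tr><td>Data</td></tr></table></div></body></html>")

print(sorted(template_candidate([page_a, page_b])))
print(jaccard(dom_paths(page_a), dom_paths(page_b)))
```

In this toy input the menu structure appears in both pages and survives the intersection, while the page-specific nodes under the content block do not. A real extractor would also need to compare node attributes and text, and to choose which linked pages to compare.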
|
Keywords:
|
Information retrieval, Content extraction, Template extraction
|
Usage rights:
|
All rights reserved
|
Source:
|
Lecture Notes in Computer Science (ISSN: 0302-9743)
|
DOI:
|
10.1007/978-3-319-41579-6
|
Publisher:
|
Springer Verlag (Germany)
|
Publisher's version:
|
https://link.springer.com/chapter/10.1007/978-3-319-41579-6_4
|
Conference title:
|
10th International Andrei Ershov Informatics Conference in Memory of Helmut Veith (PSI)
|
Conference location:
|
Russia
|
Conference date:
|
Aug 24-27, 2015
|
Project code:
|
info:eu-repo/grantAgreement/MINECO//TIN2013-44742-C4-1-R/ES/VALIDACION ASISTIDA DE PROGRAMAS MEDIANTE METODOS PRECISOS Y RIGUROSOS PARA UNA INGENIERIA DEL SOFTWARE ROBUSTA/
info:eu-repo/grantAgreement/GVA//PROMETEOII%2F2015%2F013/ES/SmartLogic: Logic Technologies for Software Security and Performance/
info:eu-repo/grantAgreement/ME//AP2010-4415/ES/AP2010-4415/
|
Description:
|
The final publication is available at Springer via http://dx.doi.org/10.1007/978-3-319-41579-6_4
|
Acknowledgements:
|
This work has been partially supported by the EU (FEDER) and the Spanish Ministerio de Economía y Competitividad (Secretaría de Estado de Investigación, Desarrollo e Innovación) under grant TIN2013-44742-C4-1-R and by the Generalitat Valenciana under grant PROMETEOII/2015/013. David Insa was partially supported by the Spanish Ministerio de Educación under FPU grant AP2010-4415.
|
Type:
|
Article
Conference paper
|