
Information extraction from Webpages based on DOM distances

RiuNet: Institutional Repository of the Universidad Politécnica de Valencia

dc.contributor.author Castillo, Carlos es_ES
dc.contributor.author Valero Llinares, Héctor es_ES
dc.contributor.author Guadalupe Ramos, José es_ES
dc.contributor.author Silva Galiana, Josep Francesc es_ES
dc.date.accessioned 2014-02-24T07:29:47Z
dc.date.issued 2012
dc.identifier.isbn 978-3-642-28600-1
dc.identifier.issn 0302-9743
dc.identifier.uri http://hdl.handle.net/10251/35896
dc.description.abstract Retrieving information from the Internet is a difficult task, as demonstrated by the lack of real-time tools able to extract information from webpages. The main cause is that most webpages on the Internet are implemented in plain (X)HTML, a language that lacks structured semantic information. For this reason, much of the effort in this area has been directed at developing techniques for URL extraction. This field has produced good results, which are implemented by modern search engines. In contrast, extracting information from a single webpage has produced poor results or very limited tools. In this work we define a novel technique for information extraction from single webpages or collections of interconnected webpages. The technique uses DOM distances to retrieve information, which allows it to work with any webpage and thus to retrieve information online. Our implementation and experiments demonstrate the usefulness of the technique. es_ES
dc.format.extent 13 es_ES
dc.language English es_ES
dc.publisher Springer Verlag (Germany) es_ES
dc.relation.ispartof Computational Linguistics and Intelligent Text Processing es_ES
dc.relation.ispartofseries Lecture Notes in Computer Science;7182
dc.rights All rights reserved es_ES
dc.subject.classification COMPUTER LANGUAGES AND SYSTEMS es_ES
dc.title Information extraction from Webpages based on DOM distances es_ES
dc.type Book chapter es_ES
dc.embargo.lift 10000-01-01
dc.embargo.terms forever es_ES
dc.identifier.doi 10.1007/978-3-642-28601-8_16
dc.rights.accessRights Open es_ES
dc.contributor.affiliation Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació es_ES
dc.description.bibliographicCitation Castillo, C.; Valero Llinares, H.; Guadalupe Ramos, J.; Silva Galiana, JF. (2012). Information extraction from Webpages based on DOM distances. In Computational Linguistics and Intelligent Text Processing. Springer Verlag (Germany). 181-193. doi:10.1007/978-3-642-28601-8_16 es_ES
dc.description.accrualMethod S es_ES
dc.relation.conferencename 13th International Conference, CICLing 2012 es_ES
dc.relation.conferencedate March 11-17, 2012 es_ES
dc.relation.conferenceplace New Delhi, India es_ES
dc.relation.publisherversion http://link.springer.com/chapter/10.1007%2F978-3-642-28601-8_16 es_ES
dc.description.upvformatpinicio 181 es_ES
dc.description.upvformatpfin 193 es_ES
dc.type.version info:eu-repo/semantics/publishedVersion es_ES
dc.relation.senia 214743