Mostrar el registro sencillo del ítem
dc.contributor.author | Stateva, Galya | es_ES |
dc.contributor.author | Cierpiał-Wolan, Marek | es_ES |
dc.date.accessioned | 2022-11-15T08:06:04Z | |
dc.date.available | 2022-11-15T08:06:04Z | |
dc.date.issued | 2022-09-20 | |
dc.identifier.isbn | 9788413960180 | |
dc.identifier.uri | http://hdl.handle.net/10251/189763 | |
dc.description.abstract | [EN] The aim of the first part of presentation is to tap into the potential of new web data sources, which will have the potential to be integrated in the Web Intelligence Hub, developed by Eurostat. Parallel to the exploration of the data sources, we aspire to produce experimental statistics, using these new web data sources, given that they meet the quality criteria. The presentation will delve deeper into Work Package 3, part of the European Statistical System Collaborative Network (ESSnet) Web Intelligence Network (WIN) project, dedicated to the exploration of nontraditional data sources for official statistical production. Work package 3’s activities are divided into six use cases, each having distinct characteristics and specific goals: • Use Case 1 aims to explore new data sources and monitor the real estate market. • Use Case 2 aims to derive early estimates of construction activities, pertaining to both already built and planned buildings, based on real estate web portals. • Use Case 3 aims to collect data about online prices of household appliances and audio visual, photographic and information processing equipment by web scraping of online shops and at a later stage compare the data with scanner data for the shop’s sales. • Use Case 4 aims to develop new indices for tourism statistics, using the data from booking portals, air traffic portals, travel agencies portals and portals related to quality of life. • Use Case 5 is concentrated on mass web scraping, primarily for the enhancement of the quality of the business register via linking URLs of enterprises and predicting main economic activity codes (NACE) • Use Case 6 aims to explore the use of publicly available traffic camera data in order to produce new indicators. In this use case a peculiar data source is used – pictures from traffic cameras and induction loops. Use cases 1-4 share similar characteristics in terms of data sources and expected experimental indicators and adhere to pre-defined process steps in compliance with Big data life cycle, which include “New data sources exploration”, “Programming, production of software”, “Data acquisition and recording”, “Data processing”, “Modelling and interpretation” and “Dissemination of the experimental statistics and results”. Use cases 5 and 6 take a slightly different approach due to their extraordinary data sources and do not adhere to the aforementioned process steps. During the first project’s year, the Work package 3 achieved meaningful results, such as a Checklist used as a tool for assessment and justification of web data sources, defined a set of mandatory and optional variables to be extracted from the data sources, sets of minimal indicators, based on the mandatory variables, successfully set up and tested their working environment and software solutions for the upcoming data collection, literature review focused on URL finding methodology and tools and the use of business websites to predict economic activity of enterprises, preparation of training and tests sets and accompanying methodology for URL finding, preparation of the upcoming NACE prediction and classification, exploration of the available assessment of the model results, implementation of Machinelearning pipeline for publicly accessible traffic camera data. We are also scheduled to begin testing of Eurostat’s Web Intelligence Hub for specific use cases from our Work package, which volunteered in the endeavor. While we have successfully implemented our initial planned activities for the first project year we continue our work, constantly monitoring the available resources, arising issues and quality of the data, which is to be collected and processed during the second project year. The different use cases have already encountered potential and expected issues like the possible changes in the source of web data structure and web site changes, checks for legal and copyright constraints, non-standard variables, mechanisms blocking extraction of data (e.g. javascript, captchas, etc), viability of training and test sets for both URL finding and NACE prediction, difficulties when comparing results with other partners, since NACE code classification is knowledge-intensive and language-specific sources have to be used, regular update of the data source. Due to the peculiar data sources for some use cases we have also encountered unsolvable issues like weather variation (e.g. snow,rain, darkness). Some of the issues have been solved, while others still remain. A Case Study for innovative tourism statistics aims to show the achievements of two projects: ESSnet Big Data II and ESSnet WIN concerning the use of unstructured data sources in the field of tourism. The work in the Big Data II project started with an inventory of data sources related to tourism statistics, which can be used for research of tourist accommodation establishments as well as for estimating tourist traffic and related expenditures. The VisNet tool was developed to visualise the links between the identified sources. The gathering of data from digital sources required the preparation of a scalable solution for data retrieval using web scraping techniques. The developed author's method allowed for continuous and non-invasive extraction of data from selected accommodation booking portals. The process of integrating statistical databases with data derived from web scraping required the development of a fully automated innovative tool, which unified the structure of identification data and assigned them geographical coordinates. The preparation of appropriate structures allowed the implementation of methods of combining data from different sources. The project also developed a methodology for estimating the volume of tourist traffic and tourist expenditures using spatial-temporal disaggregation methods or the method of flash estimates of accommodation establishments. As a result of the work carried out, a prototype of the Tourism Integration and Monitoring System (TIMS) was prepared, together with dedicated micro services, which will support statistical production in the area of tourism statistics and assist in monitoring changes in the tourism sector. The continuation of the work initiated in ESSnet Big Data II is the ESSnet WIN project, in which new methods for assessing the quality of external data sources have been introduced and web scraping has been expanded to other types of portals related to tourism. The main objective of the project is to develop new indicators, which will be an integral part of the developed prototype. | es_ES |
dc.format.extent | 3 | es_ES |
dc.language | Inglés | es_ES |
dc.publisher | Editorial Universitat Politècnica de València | es_ES |
dc.relation.ispartof | 4th International Conference on Advanced Research Methods and Analytics (CARMA 2022) | |
dc.rights | Reconocimiento - No comercial - Sin obra derivada (by-nc-nd) | es_ES |
dc.title | Exploration and experience with new web data sources. A Case Study for innovative tourism statistics | es_ES |
dc.type | Capítulo de libro | es_ES |
dc.type | Comunicación en congreso | es_ES |
dc.rights.accessRights | Abierto | es_ES |
dc.description.bibliographicCitation | Stateva, G.; Cierpiał-Wolan, M. (2022). Exploration and experience with new web data sources. A Case Study for innovative tourism statistics. En 4th International Conference on Advanced Research Methods and Analytics (CARMA 2022). Editorial Universitat Politècnica de València. 272-274. http://hdl.handle.net/10251/189763 | es_ES |
dc.description.accrualMethod | OCS | es_ES |
dc.relation.conferencename | CARMA 2022 - 4th International Conference on Advanced Research Methods and Analytics | es_ES |
dc.relation.conferencedate | Junio 29-Julio 01, 2022 | es_ES |
dc.relation.conferenceplace | Valencia, España | |
dc.relation.publisherversion | http://ocs.editorial.upv.es/index.php/CARMA/CARMA2022/paper/view/15779 | es_ES |
dc.description.upvformatpinicio | 272 | es_ES |
dc.description.upvformatpfin | 274 | es_ES |
dc.type.version | info:eu-repo/semantics/publishedVersion | es_ES |
dc.relation.pasarela | OCS\15779 | es_ES |