This README.txt file was generated on 28-09-2024 by

-------------------
GENERAL INFORMATION
-------------------

Title of Dataset: Spanish Institutional repositories: An A-SEO data collection

Author Information:
Author #1: Orduña-Malea, E., Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia (Spain), enorma@upv.es, https://orcid.org/0000-0002-1989-8477
Author #2: Font-Julián, Cristina I., Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia (Spain), crifonju@upv.es, https://orcid.org/0000-0003-2351-4816
Author #3: Serrano-Cobos, J., Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia (Spain), enorma@upv.es, https://orcid.org/0000-0002-4394-4883

Date of data collection: SEO data collection: from October to December 2022; ChatGPT classification: May 2024.

Geographic location of data collection: Valencia (Spain). 39.48512,-0.34134

Information about funding sources or sponsorship that supported the collection of the data: Grant PID2022-142569NA-I00 funded by MCIN/AEI/10.13039/501100011033 and by “ERDF A way of making Europe”, by the “European Union”. Grant GV/2021/141, funded by the Regional Government of Valencia (Spain).

General description: This dataset includes supplementary material (code, raw data) created and collected to support a study on the visibility of Spanish institutional repositories in Google Search results.

Keywords: Academic search engine optimization; institutional repositories; Spain; Universities; Altmetrics; Open Access; search engines; Google Search; SEO metrics.

--------------------------
SHARING/ACCESS INFORMATION
--------------------------

Open Access to data: Open.

Date end Embargo: N/A

Licenses/restrictions placed on the data, or limitations of reuse: Creative Commons (CC-BY)

Citation for and links to publications that cite or use the data:
Orduña-Malea, E.; Font-Julián, Cristina I. and Serrano-Cobos, J. (accepted for publication).
Open access publications drive few visits from Google Search results to institutional repositories. Scientometrics. https://doi.org/10.1007/s11192-024-05175-0

Links/relationships to previous or related data sets: N/A

Links to other publicly accessible locations of the data: N/A

--------------------
DATA & FILE OVERVIEW
--------------------

File list:

dataset.zip
-> Code: this folder includes the scripts used to collect data.
-> code/MetaDC.py: a Python script that extracts DC metadata from HTML pages.
-> Data: this folder includes the raw data collected from Ubersuggest for each repository.
-> data/Universities.csv: This file includes the names and URLs of all the universities that hold institutional repositories in Spain. For each university, it includes the total staff (from SIIU) and publications (from Scopus). In addition, for each repository, it gives the number of records indexed in Google Scholar according to the Transparent Ranking (CSIC).
-> data/Repositories.csv: This file includes all metrics gathered for each institutional repository and the data collection date.
-> data/Objects.csv: This file includes bibliographic details (title, year of publication, document type, and language) for each record hosted in the institutional repositories.
-> data/Subjects.csv: This file includes the first 20 DC subject fields of each bibliographic record from the Spanish institutional repositories that was indexed in Google Search in October 2022. In addition, it gives the thematic classification of each record (main category and secondary category) provided by ChatGPT.
-> data/Keywords_Oct.csv: This file includes all the keywords for which at least one institutional repository appears in the top 100 results offered by Google Search. The measurement corresponds to October 2022. SEO-related metrics are offered for each keyword.
-> data/Keywords_Nov.csv: This file includes all the keywords for which at least one institutional repository appears in the top 100 results offered by Google Search. The measurement corresponds to November 2022. SEO-related metrics are offered for each keyword.
-> data/Keywords_Dec.csv: This file includes all the keywords for which at least one institutional repository appears in the top 100 results offered by Google Search. The measurement corresponds to December 2022. SEO-related metrics are offered for each keyword.
-> Text: this folder includes the prompts used to query ChatGPT to classify each record through its DC subject metadata fields.
-> Text/Prompter-1.txt: This file includes the prompt submitted to ChatGPT v4 to preprocess each record's first 20 DC subject metadata fields.
-> Text/Prompter-2.txt: This file includes the prompt submitted to ChatGPT v4 to classify each record's first 20 DC subject metadata fields into broad research disciplines (formal sciences, natural sciences, applied sciences, health and medicine sciences, social sciences, human sciences, and Art).

Relationship between files: The script (code/MetaDC.py) was used to collect DC metadata from a pool of URLs (data/Objects.csv). These URLs belong to Spanish institutional repositories (data/Repositories.csv), hosted by public and private Spanish universities (data/Universities.csv). Finally, these URLs appeared in Google Search's top 100 results when a search term was queried in Google's search box in October 2022 (data/Keywords_Oct.csv), November 2022 (data/Keywords_Nov.csv), and December 2022 (data/Keywords_Dec.csv).

Type of version of the dataset: raw data

Total size: dataset (176 MB); Data (176 MB); Code (230 KB); Text (6.57 KB)

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data: The SIIU was used to gather all the official universities recognized in the Spanish university system.
Then, ROAR, OpenDOAR, and manual inspection were used to discover all the institutional repositories (IR). For each IR, the Ubersuggest tool was used to collect SEO-related data through the repository's domain name, especially the keywords that trigger the appearance of each repository in the top 100 results in Google Search and the specific URLs ranked. All URLs were normalized, and those pointing to publications were identified, using PID-based URLs as a proxy. Then, a Python script was written to collect DC metadata from each publication. These data were then cleaned and statistically analyzed. Finally, each publication was thematically classified into broad disciplines using the DC subject metadata fields that describe it. This process was conducted through two prompts submitted to ChatGPT v4.

Methods for processing the data: Descriptive statistics were used to analyze SEO-related metrics. The classification of DC subject metadata fields was performed by ChatGPT v4.

Software- or Instrument-specific information needed to interpret the data, including software and hardware version numbers: Data files are stored in CSV format, text files in TXT, and code in PY, so the contents can be opened without proprietary software.

Standards and calibration information, if appropriate: Does not apply.

Environmental/experimental conditions: Does not apply.

Describe any quality-assurance procedures performed on the data: Does not apply.

--------------------------
DATA-SPECIFIC INFORMATION
--------------------------

data/Universities.csv
Number of variables: 7
Number of cases/rows: 73 (header not counted)
Variable list: URL_id; Repository; Uni_ID; University; SIIU_PDI; SCOPUS_Output; TransparentRanking_Records
Missing data codes: void cell.
Specialized formats or other abbreviations used: Does not apply.
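
The variable list documented above for data/Universities.csv can be checked against the file header before any analysis. The following is a minimal sketch, not part of the dataset's code folder. The helper name check_header is hypothetical, and the comma delimiter is an assumption, since this README does not state which delimiter the CSV files use.

```python
import csv
import io

# Variable list as documented in this README for data/Universities.csv.
UNIVERSITIES_COLUMNS = [
    "URL_id", "Repository", "Uni_ID", "University",
    "SIIU_PDI", "SCOPUS_Output", "TransparentRanking_Records",
]


def check_header(stream, expected, delimiter=","):
    """Compare the first row of a CSV stream with the documented columns.

    Returns (missing, unexpected): documented names absent from the header,
    and header names that are not documented.
    The default delimiter is an assumption; pass ";" if the file uses it.
    """
    reader = csv.reader(stream, delimiter=delimiter)
    header = next(reader, [])
    # Drop surrounding whitespace and a possible byte-order mark.
    header = [name.strip().lstrip("\ufeff") for name in header]
    missing = [name for name in expected if name not in header]
    unexpected = [name for name in header if name not in expected]
    return missing, unexpected


# Usage with an in-memory sample instead of the real file.
sample = io.StringIO(
    "URL_id,Repository,Uni_ID,University,SIIU_PDI,"
    "SCOPUS_Output,TransparentRanking_Records\n"
)
print(check_header(sample, UNIVERSITIES_COLUMNS))
```

With the real file, the stream would come from open("data/Universities.csv", newline="", encoding="utf-8"), where the encoding is likewise an assumption. Two empty lists mean the header matches the documentation; the same helper applies to the other CSV files with their own variable lists.
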

data/Repositories.csv
Number of variables: 22
Number of cases/rows: 888 (header not counted)
Variable list: URL_id; RepositoryURL; Date; Keywords_100; Keywords_10; Keywords_3; Keywords_4-10; Keywords_10-50; Keywords_50-100; Visits_All; Visits_Spain; Objects_All; Visits_Objects; Null-Visits_Objects; Null-Visits%_Objects; i10-index_visits; Objects_Variability; Keywords_Variability; Keywords_Strength; Keywords_DL; Links_Objects; Null-Links%_Objects.
Missing data codes: void cell.
Specialized formats or other abbreviations used: Does not apply.

data/Subjects.csv
Number of variables: 23
Number of cases/rows: 143755 (header not counted)
Variable list: URL; Subject_1; Subject_2; Subject_3; Subject_4; Subject_5; Subject_6; Subject_7; Subject_8; Subject_9; Subject_10; Subject_11; Subject_12; Subject_13; Subject_14; Subject_15; Subject_16; Subject_17; Subject_18; Subject_19; Subject_20; Main_Category; Secondary_Category.
Missing data codes: Keywords: no missing data; Categories: no data.
Specialized formats or other abbreviations used:

data/Keywords_Oct.csv
Number of variables: 12
Number of cases/rows: 265201 (header not counted)
Variable list: KEY_id; Key; URL_id; Repository; Date; Coverage; Volume; Difficulty; BestPosition; BestVisits; BestURL; UniCounts.
Missing data codes: no missing data.
Specialized formats or other abbreviations used: Does not apply.

data/Keywords_Nov.csv
Number of variables: 12
Number of cases/rows: 264747 (header not counted)
Variable list: KEY_id; Key; URL_id; Repository; Date; Coverage; Volume; Difficulty; BestPosition; BestVisits; BestURL; UniCounts.
Missing data codes: no missing data.
Specialized formats or other abbreviations used: Does not apply.

data/Keywords_Dec.csv
Number of variables: 12
Number of cases/rows: 264789 (header not counted)
Variable list: KEY_id; Key; URL_id; Repository; Date; Coverage; Volume; Difficulty; BestPosition; BestVisits; BestURL; UniCounts.
Missing data codes: no missing data.
Specialized formats or other abbreviations used: Does not apply.
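
The methods above describe collecting DC metadata from each publication page and keeping the first 20 DC subject fields per record (data/Subjects.csv). The sketch below illustrates that kind of extraction using only the Python standard library. It is not the code in code/MetaDC.py. The class and function names are hypothetical, and it assumes that repositories expose subjects as meta tags named DC.subject, a common Dublin Core convention that this README does not confirm for every repository.

```python
from html.parser import HTMLParser


class DCSubjectParser(HTMLParser):
    """Collect the content of <meta name="DC.subject" ...> tags.

    The tag name "DC.subject" is an assumption about how the pages
    expose Dublin Core subjects; it is matched case-insensitively.
    """

    def __init__(self, limit=20):
        super().__init__()
        self.limit = limit      # keep at most this many subjects
        self.subjects = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or len(self.subjects) >= self.limit:
            return
        attributes = dict(attrs)
        name = (attributes.get("name") or "").lower()
        content = (attributes.get("content") or "").strip()
        if name == "dc.subject" and content:
            self.subjects.append(content)


def extract_dc_subjects(html, limit=20):
    """Return up to `limit` DC subject values found in an HTML string."""
    parser = DCSubjectParser(limit=limit)
    parser.feed(html)
    return parser.subjects


# Usage with an in-memory page instead of a downloaded one.
page = """
<html><head>
<meta name="DC.title" content="An example record">
<meta name="DC.subject" content="Open Access">
<meta name="DC.subject" content="Institutional repositories" />
</head><body></body></html>
"""
print(extract_dc_subjects(page))
```

The limit of 20 mirrors the Subject_1 to Subject_20 columns documented for data/Subjects.csv; downloading the pages and writing the CSV rows are left out of the sketch.
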