This README.txt file was generated on <2022-09-25> by <Enrique Orduna-Malea>

-------------------
GENERAL INFORMATION
-------------------

Title of Dataset: The Eurekalert! project: dataset of mentions to press releases

Author Information:

	Author #1: Orduña-Malea, E., Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia (Spain), enorma@upv.es, https://orcid.org/0000-0002-1989-8477
	Author #2: Costas, R., Centre for Science and Technology Studies (CWTS), Leiden University (the Netherlands), rcostas@cwts.leidenuniv.nl, https://orcid.org/0000-0002-7465-6462. Extraordinary Associate Professor at the Centre for Research on Evaluation, Science and Technology (CREST) of Stellenbosch University (South Africa).

Date of data collection: From March 2021 to May 2021.

Geographic location of data collection: Valencia (Spain). 39.48512,-0.34134 

Information about funding sources or sponsorship that supported the collection of the data: Generalitat Valenciana, Posicionamiento académico web de las universidades españolas: diseño y aplicación de un modelo de análisis multinivel y multidimensional (UniverSEO), GV/2021/141.

General description: This dataset includes the raw data used to carry out a long-term study of EurekAlert! press releases. The first work that uses the dataset is a book chapter titled "A Scientometric-inspired framework to analyze EurekAlert! press releases". The dataset includes the bibliometric and webometric indicators collected to characterize all the press releases published by the EurekAlert! platform.

Keywords: Press releases, EurekAlert, Bibliometrics, Research evaluation, Science communication, Altmetrics, Social media, Twitter, Link analysis.


--------------------------
SHARING/ACCESS INFORMATION
-------------------------- 

Open Access to data: Open.

Date end Embargo: N/A

Licenses/restrictions placed on the data, or limitations of reuse: Creative Commons (CC-BY)

Citation for and links to publications that cite or use the data: 

Orduña-Malea, E. & Costas, R. (in press). A Scientometric-inspired framework to analyze EurekAlert! press releases. In Irene Broer, Steffen Lemke, Athanasios Mazarakis, Isabella Peters and Christian Zinke-Wehlmann (Eds.). The Science-Media-Interface: On the relation between internal and external science communication. De Gruyter.

Links/relationships to previous or related data sets: N/A
Links to other publicly accessible locations of the data: N/A



--------------------
DATA & FILE OVERVIEW
--------------------

File list: 

dataset.zip
-> data: this folder includes files with raw data from EurekAlert!, Twitter and Majestic.
	-> eurekalert-data.csv: includes metadata for 455702 press releases. The metadata fields are as follows: url, title, description, keywords, funder, journal, type, institution, meeting, zone, day and hour of publication.
	-> majestic-data.csv: includes metadata for 239809 domain names linking to press releases. The metadata fields are as follows: links, Trust Flow (TF) and Citation Flow (CF). It also provides metadata (external backlinks and referring domains) for 748227 internal EurekAlert! webpages.
	-> twitter-data.json: includes a collection of 1496125 tweets mentioning EurekAlert! press releases. Data obtained from the Academic Twitter API. For each tweet, descriptive and engagement data are included.
-> scripts: this folder includes three Python scripts: metaextractor, twitter-search, and unshortener.
	-> metaextractor.py: a Python script that extracts metadata for each press release. The press release must first be downloaded as HTML.
	-> twitter-search.py: a Python script that collects tweets from the Twitter Academic API (historic search endpoint). It requires user authentication.
	-> unshortener.py: a Python script that expands a shortened URL to its full form.
-> supplements: this folder includes supplementary material created to accompany specific publications. Each file corresponds to one publication, whose metadata is given on the file's cover page.
	-> supplement_1.pdf: includes the supplementary material related to a book chapter to be published by De Gruyter (see Sharing/Access information for details).
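
As an illustration of the kind of extraction that metaextractor.py performs, the following is a minimal sketch, not the script shipped in dataset.zip. It reads <meta> tags and the <title> element from a downloaded HTML page using only the Python standard library. It assumes that the page exposes its metadata as <meta name="..." content="..."> or <meta property="..." content="..."> tags; the names MetaTagParser, extract_meta and SAMPLE_HTML, as well as the sample page itself, are hypothetical.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect name/property -> content pairs from <meta> tags, plus <title>."""

    def __init__(self):
        super().__init__()
        self.meta = {}
        self.title = ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            # Assumption: metadata is keyed by "name" or "property".
            key = attrs.get("name") or attrs.get("property")
            content = attrs.get("content")
            if key and content is not None:
                self.meta[key.lower()] = content.strip()
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def extract_meta(html_text):
    """Return a dict of metadata found in one HTML document."""
    parser = MetaTagParser()
    parser.feed(html_text)
    parser.close()
    record = dict(parser.meta)
    # Fall back to the <title> element if no <meta name="title"> was present.
    record.setdefault("title", parser.title.strip())
    return record


# Hypothetical page, invented for illustration only.
SAMPLE_HTML = """<html><head><title> Sample release </title>
<meta name="description" content="A hypothetical press release.">
<meta name="keywords" content="biology, genetics">
</head><body></body></html>"""

record = extract_meta(SAMPLE_HTML)
```

To process a folder of downloaded pages, extract_meta could be called on the text of each file and the resulting dictionaries written out with csv.DictWriter.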

Relationship between files: the metaextractor.py script was used to obtain the metadata included in eurekalert-data.csv, while twitter-data.json and majestic-data.csv include online metrics related to the press releases listed in eurekalert-data.csv. The supplements folder contains publication-specific analyses based on the data in the previous files.

Type of version of the dataset: raw data

Total size: dataset.zip (1.78 GB); eurekalert-data.csv (274 MB); majestic-data.csv (24.3 MB); metaextractor.py (4 KB); twitter-data.json (1.49 GB); supplements (1 MB).



--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data: Press releases were downloaded with the SocSciBot software. Majestic data was retrieved directly through a PRO subscription by exporting all data related to eurekalert.org at the domain level. Finally, Twitter data was obtained via the Academic Twitter API: all tweets containing the string eurekalert.org were retrieved.
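
The Twitter collection step can be sketched as follows. This is not the twitter-search.py script included in the dataset; it only builds a request for the full-archive search endpoint of the Twitter API v2 (GET /2/tweets/search/all), which was offered under Academic Research access. The exact query and date range used by the authors are not stated here, so the query string, the dates, the field selection and the helper name build_search_request are assumptions made for illustration, and the bearer token is a placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

SEARCH_URL = "https://api.twitter.com/2/tweets/search/all"


def build_search_request(bearer_token, query, start_time, end_time,
                         next_token=None, max_results=500):
    """Build (but do not send) one page request for the full-archive search."""
    params = {
        "query": query,
        "start_time": start_time,    # ISO 8601, e.g. 2021-03-01T00:00:00Z
        "end_time": end_time,
        "max_results": max_results,  # this endpoint accepts 10 to 500
        "tweet.fields": "id,text,author_id,created_at,entities,"
                        "public_metrics,attachments",
    }
    if next_token:
        # Pagination token taken from the previous response, if any.
        params["next_token"] = next_token
    return Request(
        SEARCH_URL + "?" + urlencode(params),
        headers={"Authorization": "Bearer " + bearer_token},
    )


# Placeholder token, query and dates (assumptions, not the authors' values).
request = build_search_request(
    "YOUR_BEARER_TOKEN", "eurekalert.org",
    "2021-03-01T00:00:00Z", "2021-03-02T00:00:00Z",
)
```

Sending the request (for example with urllib.request.urlopen) requires valid credentials. Each response is expected to carry a meta.next_token value while further pages remain; passing it back as next_token retrieves the next page.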

Methods for processing the data: all data previously collected was exported to spreadsheets to generate descriptive statistics.

Software- or Instrument-specific information needed to interpret the data, including software and hardware version numbers: any spreadsheet application.

Standards and calibration information, if appropriate: N/A

Environmental/experimental conditions: N/A

Describe any quality-assurance procedures performed on the data: N/A



--------------------------
DATA-SPECIFIC INFORMATION
--------------------------

eurekalert-data.csv

Number of variables: 12

Number of cases/rows: 455702

Variable list:
url, title, description, keywords, funder, journal, type, institution, meeting, zone, day, hour.
   
Missing data codes: missing data is indicated by an empty value.

Specialized formats or other abbreviations used: N/A
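
Because missing data appears as empty values, a quick completeness check can be useful before analysis. The sketch below counts empty cells per variable using the standard csv module. It assumes that the file has a header row with the twelve variable names listed above, is comma-delimited and is UTF-8 encoded; none of this is confirmed here. The helper name count_missing and the sample rows are invented for illustration.

```python
import csv
import io

FIELDS = ["url", "title", "description", "keywords", "funder", "journal",
          "type", "institution", "meeting", "zone", "day", "hour"]


def count_missing(rows, fields=FIELDS):
    """Return (row_count, {field: number of rows with an empty value})."""
    missing = {field: 0 for field in fields}
    total = 0
    for row in rows:
        total += 1
        for field in fields:
            # An absent key, None, or a blank string all count as missing.
            if not (row.get(field) or "").strip():
                missing[field] += 1
    return total, missing


# Two invented rows standing in for the real file.
SAMPLE = io.StringIO(
    "url,title,description,keywords,funder,journal,type,institution,meeting,zone,day,hour\n"
    "https://example.org/1,First,Text,,,Journal A,Research,Inst A,,Europe,2020-01-01,10:00\n"
    "https://example.org/2,Second,,,Funder B,,News,Inst B,,Europe,2020-01-02,11:30\n"
)
total, missing = count_missing(csv.DictReader(SAMPLE))

# For the real file (encoding and delimiter are assumptions):
# with open("eurekalert-data.csv", newline="", encoding="utf-8") as fh:
#     total, missing = count_missing(csv.DictReader(fh))
```

Reading row by row keeps memory use low, which may matter for a file of this size.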

majestic-data.csv

Number of variables: sheet 1: 4; sheet 2: 3.

Number of cases/rows: sheet 1: 239809 (domains); sheet 2: 748227 (pages).

Variable list:
sheet 1: domain, links, trust flow, citation flow
sheet 2: url, referring external backlinks, referring external domains.


Missing data codes: N/A (Not Applicable); absence of value (missing data).

Specialized formats or other abbreviations used: N/A

twitter-data.json

Number of variables: 30

Number of cases/rows: 1496125

Variable list:
id, text, author_id, created_at, entities url (url, end, start, display url, expanded_url, title, status, description, unwound url, images width, images height), entities hashtags (start, end, tag), entities mentions (start, end, username), entities annotations (type, start, probability, normalized text), public metrics (like_count, reply_count, quote_count, retweet_count), attachments (media keys).
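
To show how the nested variables above can be used, the following sketch reduces one tweet object to its linked URLs and a simple engagement total. It assumes that the key names follow the Twitter API v2 layout (entities.urls with expanded_url and unwound_url, and public_metrics with the four counts); how the tweets are arranged inside twitter-data.json (one array or one object per line) is not stated here, so the function works on a single, already parsed tweet. The function name summarize_tweet and the sample tweet, including its URL, are hypothetical.

```python
import json

METRIC_KEYS = ("like_count", "reply_count", "quote_count", "retweet_count")


def summarize_tweet(tweet):
    """Extract linked URLs and total engagement from one parsed tweet."""
    entities = tweet.get("entities") or {}
    urls = []
    for item in entities.get("urls", []):
        # Prefer the fully resolved URL, then the expanded one, then the short one.
        url = item.get("unwound_url") or item.get("expanded_url") or item.get("url")
        if url:
            urls.append(url)
    metrics = tweet.get("public_metrics") or {}
    return {
        "id": tweet.get("id"),
        "created_at": tweet.get("created_at"),
        "urls": urls,
        "eurekalert_urls": [u for u in urls if "eurekalert.org" in u.lower()],
        "engagement": sum(metrics.get(key, 0) for key in METRIC_KEYS),
    }


# Invented tweet for illustration only.
SAMPLE_TWEET = json.loads("""
{"id": "1", "text": "New study https://t.co/abc", "author_id": "42",
 "created_at": "2021-04-01T12:00:00.000Z",
 "entities": {"urls": [{"start": 10, "end": 26, "url": "https://t.co/abc",
   "expanded_url": "https://www.eurekalert.org/sample-release",
   "display_url": "eurekalert.org/sample-release"}]},
 "public_metrics": {"like_count": 3, "reply_count": 1,
                    "quote_count": 0, "retweet_count": 2}}
""")

summary = summarize_tweet(SAMPLE_TWEET)
```

The engagement total is only one possible aggregate; the four counts can equally be kept separate.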