This README.txt file was generated on 2023-10-31 by Enrique Orduña-Malea and Núria Bautista-Puig

-------------------
GENERAL INFORMATION
-------------------

Title of Dataset: DORA Declaration Tweet Collection

Author Information:

Author #1: Orduña-Malea, E., Universitat Politècnica de València, Camino de Vera s/n, 46022, Valencia (Spain), enorma@upv.es, https://orcid.org/0000-0002-1989-8477

Author #2: Bautista-Puig, N., Departamento de Biblioteconomía y Documentación, Universidad Complutense de Madrid, Calle Santísima Trinidad, 37, 28010, Madrid (Spain), nuriabau@ucm.es, https://orcid.org/0000-0003-2404-0683

Date of data collection: From 24 April 2015 to 31 May 2022.

Geographic location of data collection: Valencia (Spain). 39.48512,-0.34134 

Information about funding sources or sponsorship that supported the collection of the data: Generalitat Valenciana, Posicionamiento académico web de las universidades españolas: diseño y aplicación de un modelo de análisis multinivel y multidimensional (UniverSEO), GV/2021/141.

General description: This dataset includes the raw data used in a study analyzing the DORA Declaration on Twitter.
The dataset includes the tweets collected from the Academic Twitter API (comprising three collections: tweets published by DORA, tweets mentioning DORA, and tweets including a DORA-related hashtag), supplementary material (including extra tables and figures), and the script used to collect data from the Twitter API.

Keywords: DORA Declaration; Scientometrics; social media metrics; Twitter; research evaluation.

--------------------------
SHARING/ACCESS INFORMATION
-------------------------- 

Open Access to data: Open.

Date end Embargo: N/A

Licenses/restrictions placed on the data, or limitations of reuse: Creative Commons (CC-BY)

Citation for and links to publications that cite or use the data: 

Orduña-Malea, E. & Bautista-Puig, N. (in press). Research assessment under debate: disentangling the interest around the DORA Declaration on Twitter. Scientometrics.
https://doi.org/10.1007/s11192-023-04872-6

Links/relationships to previous or related data sets: N/A
Links to other publicly accessible locations of the data: N/A

--------------------
DATA & FILE OVERVIEW
--------------------

File list: 

dataset.zip
-> data: this folder includes the CSV files with raw data from Twitter.
-> data/tweets.csv: includes the IDs of the tweets collected (variables: tweet_id, author_id, Subset).
-> data/hashtags.csv: includes the hashtags extracted from the tweets, together with their categories.
-> code: this folder includes Python scripts.
-> code/twitter.py: a Python script for collecting tweets from Twitter through the Academic Twitter API (this API is currently deprecated).
-> supplement.pdf: includes 11 annexes with extra material.
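
The file list above can be checked against an unzipped copy of dataset.zip. The following Python snippet is a minimal sketch, assuming that the archive unpacks to exactly the relative paths listed in this README; the function name missing_files and the demo directory are hypothetical and are not part of the dataset.

```python
import tempfile
from pathlib import Path

# Relative paths taken from the file list in this README.
EXPECTED_FILES = [
    "data/tweets.csv",
    "data/hashtags.csv",
    "code/twitter.py",
    "supplement.pdf",
]


def missing_files(root):
    """Return the expected files that are not present under root."""
    root = Path(root)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# Demo on a temporary directory that contains only one of the files.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "data").mkdir()
    (Path(tmp) / "data" / "tweets.csv").touch()
    demo_missing = missing_files(tmp)

print(demo_missing)
```

To check a real copy, call missing_files() with the folder where dataset.zip was extracted; an empty list means that every file named in this README was found.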


Relationship between files: this dataset provides data related to a study analyzing the online debate about DORA on Twitter. The Python script was used to collect data from the Academic Twitter API (currently unavailable). The tweets.csv file includes all tweet IDs collected. Finally, the supplementary material includes specific analyses that complement the study.

Type of version of the dataset: raw data

Total size: dataset (3.26 MB); code (3.55 KB); data (8.66 KB); supplement (2.40 MB).

--------------------------
METHODOLOGICAL INFORMATION
--------------------------

Description of methods used for collection/generation of data: The Academic Twitter API was used to collect all tweets published by the DORA official account, all tweets mentioning DORA's official Twitter account, and tweets including a DORA-related hashtag.
Methods for processing the data: Natural Language Processing (NLP) was used to identify the topics of each tweet, using the CorTexT Manager tool (more specifically, its lexical extraction and mapping tool). For sentiment analysis, we used the sentiment analysis feature of the same tool, which employs the Python library TextBlob.
Software- or Instrument-specific information needed to interpret the data, including software and hardware version numbers: CorTexT Manager for data extraction and analysis; R v.4.2.0 (R Core Team, 2022) with the packages tidytext (Silge & Robinson, 2016), dplyr (Wickham et al., 2022), stringr (Wickham, 2022) and stopwords (Benoit, Muhr, & Watanabe, 2021) for data cleaning and graphic visualization; and Gephi (v. 0.10.1) for the visualization of the co-occurrence map.
Standards and calibration information, if appropriate: N/A
Environmental/experimental conditions: N/A
Describe any quality-assurance procedures performed on the data: N/A

--------------------------
DATA-SPECIFIC INFORMATION
--------------------------

tweets.csv
Number of variables: 3
Number of cases/rows: 27717
Variable list: tweet_id, author_id, Subset
Missing data codes: not applicable.
Specialized formats or other abbreviations used: none.
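
As an illustration of how tweets.csv might be read, the following Python snippet counts the rows per value of the Subset variable. It is a minimal sketch, assuming a comma-delimited file whose header row contains exactly the variable names listed above; the function name and the sample rows (IDs and Subset labels) are made up for the example and do not come from the dataset.

```python
import csv
import io
from collections import Counter


def count_tweets_by_subset(csv_file):
    """Count rows per value of the 'Subset' column.

    csv_file is an open text file (or file-like object) with the
    header row: tweet_id,author_id,Subset
    """
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        # csv.DictReader returns every field as a string, so tweet_id
        # and author_id keep all their digits (no numeric conversion).
        counts[row["Subset"]] += 1
    return counts


# Hypothetical sample; the real file has 27717 rows.
sample = io.StringIO(
    "tweet_id,author_id,Subset\n"
    "1000000000000000001,111,A\n"
    "1000000000000000002,222,B\n"
    "1000000000000000003,111,A\n"
)
subset_counts = count_tweets_by_subset(sample)
print(subset_counts)
```

With the real file, the same function could be called as count_tweets_by_subset(open("data/tweets.csv", newline="", encoding="utf-8")); the UTF-8 encoding is an assumption. Keeping the IDs as text is deliberate: tweet IDs are 64-bit integers and may lose digits if a program converts them to floating-point numbers.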

hashtags.csv
Number of variables: 2
Number of cases/rows: 1554
Variable list: hashtags, categories
Missing data codes: not applicable.
Specialized formats or other abbreviations used: none.
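
As an illustration of how hashtags.csv might be used, the following Python snippet groups the hashtags by category. It is a minimal sketch, assuming a comma-delimited file with the header row hashtags,categories and one category per row; the function name, the sample hashtags and the category labels are hypothetical.

```python
import csv
import io
from collections import defaultdict


def hashtags_by_category(csv_file):
    """Group the 'hashtags' values by their 'categories' value."""
    reader = csv.DictReader(csv_file)
    grouped = defaultdict(list)
    for row in reader:
        grouped[row["categories"]].append(row["hashtags"])
    return dict(grouped)


# Hypothetical sample; the real file has 1554 rows.
sample = io.StringIO(
    "hashtags,categories\n"
    "#exampletag1,category_x\n"
    "#exampletag2,category_y\n"
    "#exampletag3,category_x\n"
)
grouped = hashtags_by_category(sample)
for category, tags in grouped.items():
    print(category, len(tags), tags)
```

If a row of the real file assigns more than one category to a hashtag, the categories value would need to be split before grouping; this README does not state whether that case occurs.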