Moros Daval, Yael (Universitat Politècnica de València, 2023-09-19)
[EN] Large language models can be used for a wide range of tasks. The performance on each task instance depends on the specific characteristics of the question (e.g., knowledge or reasoning required) but also on its ...
Sánchez García, Pablo (Universitat Politècnica de València, 2024-09-11)
[EN] AI systems are usually evaluated with a variety of benchmarks to determine their performance for specific tasks, using a single metric which provides a simplistic image of their capabilities. However, this procedure ...
Martínez-Plumed, Fernando; Hernández-Orallo, José (Institute of Electrical and Electronics Engineers (IEEE), 2020-06)
[EN] With the purpose of better analyzing the result of artificial intelligence (AI) benchmarks, we present two indicators on the side of the AI problems, difficulty and discrimination, and two indicators on the side of ...
Hernández-Orallo, José (Springer Verlag (Germany), 2016-08-19)
The evaluation of artificial intelligence systems and components is crucial for the progress of the discipline. In this paper we describe and critically assess the different ways AI systems are evaluated, and the role ...
This supplementary material serves as the technical appendix to the paper "When AI Difficulty is Easy: The Explanatory Power of Predicting IRT Difficulty" (Martínez-Plumed et al. 2022), published in The Thirty-Sixth AAAI ...
Jiang Chen, Ke-Xin (Universitat Politècnica de València, 2024-09-17)
[CA] The field of artificial intelligence has led to the development of advanced large language models with impressive linguistic abilities. However, it is still unclear to what extent these ...
Zhou, Lexin (Universitat Politècnica de València, 2023-06-20)
[EN] Pretrained artificial intelligence models are made more human-like and human-aligned by scaling them up in resources (e.g., by increasing compute, training data and parameter size) and shaping them up with human ...