Martínez-Plumed, F.; Hernández-Orallo, J. (2020). Dual Indicators to Analyse AI Benchmarks: Difficulty, Discrimination, Ability and Generality. IEEE Transactions on Games. 12(2):121-131. https://doi.org/10.1109/TG.2018.2883773
Please use this identifier to cite or link to this item: http://hdl.handle.net/10251/169021
Title:
|
Dual Indicators to Analyse AI Benchmarks: Difficulty, Discrimination, Ability and Generality
|
Author:
|
Martínez-Plumed, Fernando
Hernández-Orallo, José
|
UPV Unit:
|
Universitat Politècnica de València. Departamento de Sistemas Informáticos y Computación - Departament de Sistemes Informàtics i Computació
|
Issued date:
|
|
Abstract:
|
[EN] To better analyze the results of artificial intelligence (AI) benchmarks, we present two indicators on the side of the AI problems, difficulty and discrimination, and two indicators on the side of the AI systems, ability and generality. The first three are adapted from psychometric models in item response theory (IRT), whereas generality is defined as a new metric that evaluates whether an agent is consistently good at easy problems and bad at difficult ones. We illustrate how these key indicators give us more insight into the results of two popular benchmarks in AI, the Arcade Learning Environment (Atari 2600 games) and the General Video Game AI competition, and we include some guidelines to estimate and interpret these indicators for other AI benchmarks and competitions.
|
Subjects:
|
Artificial intelligence, Games, Benchmark testing, Task analysis, Adaptation models, Guidelines, Indexes, Artificial intelligence (AI) benchmarks, AI evaluation, Generality, Item response theory (IRT)
|
Copyright:
|
All rights reserved
|
Source:
|
IEEE Transactions on Games (ISSN: 2475-1502)
|
DOI:
|
10.1109/TG.2018.2883773
|
Publisher:
|
Institute of Electrical and Electronics Engineers (IEEE)
|
Publisher version:
|
https://doi.org/10.1109/TG.2018.2883773
|
Project ID:
|
info:eu-repo/grantAgreement/INCIBE//INCIBEI-2015-27345/
info:eu-repo/grantAgreement/EC//CT-EX2018D335821-101/EU//HUMAINT/
info:eu-repo/grantAgreement/UPV//SP20180210/
info:eu-repo/grantAgreement/MECD//PRX17%2F00467/
info:eu-repo/grantAgreement/GVA//BEST%2F2017%2F045/
info:eu-repo/grantAgreement/FLI//RFP2-152/
info:eu-repo/grantAgreement/UPV//PAID-06-18/
info:eu-repo/grantAgreement/AFOSR//FA9550-17-1-0287/
info:eu-repo/grantAgreement/MINECO//TIN2015-69175-C4-1-R/ES/SOLUCIONES EFECTIVAS BASADAS EN LA LOGICA/
info:eu-repo/grantAgreement/GVA//PROMETEOII%2F2015%2F013/ES/SmartLogic: Logic Technologies for Software Security and Performance/
|
Acknowledgments:
|
This work was supported by the U.S. Air Force Office of Scientific Research under Award FA9550-17-1-0287; in part by the EU (FEDER) and the Spanish MINECO under Grant TIN2015-69175-C4-1-R; and in part by the Generalitat Valenciana PROMETEOII/2015/013. The work of F. Martínez-Plumed was supported by INCIBE (Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad), the European Commission, JRC's Centre for Advanced Studies, HUMAINT project (Expert Contract CT-EX2018D335821-101), and UPV PAID-06-18 Ref. SP20180210. The work of J. Hernández-Orallo was supported in part by Salvador de Madariaga grant (PRX17/00467) from the Spanish MECD, in part by the BEST Grant (BEST/2017/045) from the GVA for research stays at the CFI, and in part by the FLI grant RFP2-152.
|
Type:
|
Article
|