We report on a series of new platforms and events dealing with AI evaluation that may change the way in which AI systems are compared and their progress is measured. The introduction of a more diverse and challenging set of tasks in these platforms can feed AI research in the years to come, shaping the notion of success and the directions of the field. However, the playground of tasks and challenges presented there may misdirect the field unless it is given a meaningful structure and systematic guidelines for its organization and use. Anticipating this issue, we also report on several initiatives and workshops that focus on analyzing the similarity and dependencies between tasks, their difficulty, what capabilities they really measure and, ultimately, on elaborating new concepts and tools that can arrange tasks and benchmarks into a meaningful taxonomy.
Through the integration of more and better techniques, more computing power, and the use of more diverse and massive sources of data, AI systems are becoming more flexible and adaptable, but also more complex and unpredictable. There is thus an increasing need for a better assessment of their capacities and limitations, as well as growing concern about their safety (Amodei et al. 2016). Theoretical approaches might provide important insights, but only through experimentation and evaluation tools will we achieve a more accurate assessment of how an actual system operates over a series of tasks or environments.
Several AI experimentation and evaluation platforms have recently appeared, setting a new cosmos of AI environments. These platforms facilitate the creation of a wide range of tasks for evaluating and training a host of algorithms. Their interfaces usually follow the reinforcement learning (RL) paradigm, where interaction takes place through incremental observations, actions, and rewards. This is a very general setting, and seemingly any task can be framed within it.
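To make this interaction model concrete, the following minimal sketch illustrates the observation-action-reward loop that such platform interfaces typically expose. It is an illustration only: the names ToyEnvironment, RandomAgent, reset, step, and act are hypothetical stand-ins rather than the API of any particular platform, although real platforms offer analogous methods.

import random

class ToyEnvironment:
    """Hypothetical episodic environment: guess a hidden integer in [0, 9]."""
    def __init__(self):
        self.target = None
        self.steps_left = 0

    def reset(self):
        # Begin a new episode and return the first observation.
        self.target = random.randint(0, 9)
        self.steps_left = 10
        return {"hint": "start"}

    def step(self, action):
        # Apply the agent's action and return (observation, reward, done).
        self.steps_left -= 1
        if action == self.target:
            return {"hint": "correct"}, 1.0, True
        hint = "higher" if action < self.target else "lower"
        return {"hint": hint}, 0.0, self.steps_left == 0

class RandomAgent:
    """Baseline agent that ignores observations and acts at random."""
    def act(self, observation):
        return random.randint(0, 9)

env, agent = ToyEnvironment(), RandomAgent()
for episode in range(3):
    obs, done, total_reward = env.reset(), False, 0.0
    while not done:
        obs, reward, done = env.step(agent.act(obs))
        total_reward += reward
    print(f"episode {episode}: return = {total_reward}")

Because any task that can be cast as such a loop fits the same interface, a single evaluation harness can, in principle, run very different agents on very different environments and compare the rewards they accumulate.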
These platforms are different from the Turing test, and from other more traditional AI evaluation benchmarks proposed to replace it, as summarized by an AAAI 2015 workshop[1] and a recent special issue of AI Magazine.[2] In fact, some of these platforms can integrate any task, and hence in principle they supersede many existing AI benchmarks (Hernández-Orallo 2016) in their aim to test general problem-solving ability.
This topic has also attracted mainstream attention....