Abstract
Artificial intelligence (AI) has achieved remarkable success in sequential decision-making. However, evaluating its neural agents remains challenging, as current methods often rely solely on interpreting training curves, overlooking key statistical factors. Existing tools that allow a formal evaluation also require white-box formal models, making them impractical for most AI benchmarks based on the black-box Gymnasium interface. We introduce PyDSMC, a lightweight and easy-to-use Python tool for statistical model checking of neural agents on arbitrary Gymnasium environments. PyDSMC automates the selection of statistical methods to compute confidence intervals, supporting both convergence-based and resource-limited evaluation settings. We empirically demonstrate the importance of rigorous agent evaluation and showcase PyDSMC’s ability to judge and report an AI agent’s performance more reliably.
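The two evaluation settings named in the abstract can be illustrated with a minimal sketch. It does not use PyDSMC's API, which the abstract does not describe. All function names are hypothetical, the episode function is a stand-in for rolling out an agent in a Gymnasium environment, and Hoeffding's inequality is only one possible choice of statistical method, assumed here because it needs nothing beyond known bounds on the measured property.

```python
import math
import random
from typing import Callable, Tuple


def hoeffding_half_width(n: int, delta: float, low: float, high: float) -> float:
    """Half-width of a two-sided (1 - delta) Hoeffding interval after n samples."""
    return (high - low) * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


def episodes_needed(eps: float, delta: float, low: float, high: float) -> int:
    """Smallest n whose Hoeffding half-width is at most eps (convergence-based)."""
    return math.ceil((high - low) ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


def evaluate(
    run_episode: Callable[[], float], n: int, delta: float, low: float, high: float
) -> Tuple[float, float, float]:
    """Run n independent episodes and return (mean, lower bound, upper bound)."""
    mean = sum(run_episode() for _ in range(n)) / n
    hw = hoeffding_half_width(n, delta, low, high)
    # Clip to the known value range; the true mean cannot lie outside it.
    return mean, max(low, mean - hw), min(high, mean + hw)


# Hypothetical stand-in for one agent rollout: 1.0 if the goal is reached, else 0.0.
rng = random.Random(0)


def run_episode() -> float:
    return 1.0 if rng.random() < 0.8 else 0.0


# Convergence-based: choose the episode count from a target precision.
n_conv = episodes_needed(eps=0.05, delta=0.05, low=0.0, high=1.0)
mean_conv, lo_conv, hi_conv = evaluate(run_episode, n_conv, 0.05, 0.0, 1.0)

# Resource-limited: the episode budget is fixed and the interval width follows from it.
n_budget = 100
mean_budget, lo_budget, hi_budget = evaluate(run_episode, n_budget, 0.05, 0.0, 1.0)

print(f"convergence-based: n={n_conv}, mean={mean_conv:.3f}, CI=[{lo_conv:.3f}, {hi_conv:.3f}]")
print(f"resource-limited:  n={n_budget}, mean={mean_budget:.3f}, CI=[{lo_budget:.3f}, {hi_budget:.3f}]")
```

Because the Hoeffding half-width depends only on the number of episodes and not on the observed values, the convergence-based episode count can be fixed before any episode is run. Methods that adapt to the observed variance typically need fewer episodes but require more care when used with a stopping rule.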
| Original language | English |
|---|---|
| Title of host publication | Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems |
| Subtitle of host publication | Second International Joint Conference, QEST+FORMATS 2025, Proceedings |
| Editors | Pavithra Prabhakar, Andrea Vandin |
| Place of Publication | Cham |
| Publisher | Springer |
| Pages | 134-156 |
| Number of pages | 23 |
| Edition | 1 |
| ISBN (Electronic) | 978-3-032-05792-1 |
| ISBN (Print) | 978-3-032-05791-4 |
| Publication status | Published - 2026 |
| Event | 2nd International Joint Conference on Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems, QEST+FORMATS 2025; Aarhus University, Aarhus, Denmark; Duration: 26 Aug 2025 → 28 Aug 2025; Conference number: 2; https://www.qest.org/qest-formats-2025/ |
Publication series
| Name | Lecture Notes in Computer Science |
|---|---|
| Publisher | Springer |
| Volume | 16143 LNCS |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Conference
| Conference | 2nd International Joint Conference on Quantitative Evaluation of Systems and Formal Modeling and Analysis of Timed Systems, QEST+FORMATS 2025 |
|---|---|
| Abbreviated title | QEST+FORMATS 2025 |
| Country/Territory | Denmark |
| City | Aarhus |
| Period | 26/08/25 → 28/08/25 |
| Other | QEST - International Conference on Quantitative Evaluation of SysTems; FORMATS - International Conference on Formal Modeling and Analysis of Timed Systems. |
Keywords
- This work was part of the MISSION (Models in Space Systems: Integration, Operation, and Networking) project, funded by the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Actions grant number 101008233.
- 2026 OA procedure
Title
PyDSMC: Statistical Model Checking for Neural Agents Using the Gymnasium Interface