TY - UNPB
T1 - Intrinsic fluctuations of reinforcement learning promote cooperation
AU - Meylahn, Janusz
AU - Barfuss, Wolfram
N1 - Preprint working paper
∗ Both authors contributed equally to this work.
1 University of Tübingen, GER
2 Dutch Institute of Emergent Phenomena, University of Amsterdam, NL
3 Department of Applied Mathematics, University of Twente, NL
PY - 2022/9/1
Y1 - 2022/9/1
AB - In this work, we ask and answer the question of what makes classical reinforcement learning cooperative. Cooperating in social dilemma situations is vital for animals, humans, and machines. While evolutionary theory has revealed a range of mechanisms promoting cooperation, the conditions under which agents learn to cooperate are contested. Here, we demonstrate which individual elements of the multi-agent learning setting lead to cooperation, and how. Specifically, we consider the widely used temporal-difference reinforcement learning algorithm with epsilon-greedy exploration in the classic environment of an iterated Prisoner's dilemma with one-period memory. Each of the two learning agents learns a strategy that conditions its next action choice on both agents' action choices of the previous round. We find that, in addition to a high regard for future rewards, a low exploration rate, and a small learning rate, it is primarily the intrinsic stochastic fluctuations of the reinforcement learning process that double the final rate of cooperation to up to 80%. Thus, inherent noise is not a necessary evil of the iterative learning process; it is a critical asset for the learning of cooperation. However, we also point out the trade-off between a high likelihood of cooperative behavior and achieving it in a reasonable amount of time. Our findings are relevant for purposefully designing cooperative algorithms and regulating undesired collusive effects.
U2 - 10.48550/arXiv.2209.01013
DO - 10.48550/arXiv.2209.01013
M3 - Preprint
SP - 1
EP - 9
BT - Intrinsic fluctuations of reinforcement learning promote cooperation
PB - arXiv.org
ER -