Journal Papers

Solving Two-Person Zero-Sum Stochastic Games With Incomplete Information Using Learning Automata With Artificial Barriers

Anis Yazidi | Daniel Silvestre | B. John Oommen
Abstract:
Learning automata (LA) with artificially absorbing barriers were a completely new horizon of research in the 1980s (Oommen, 1986). These new machines yielded properties that were previously unknown. More recently, absorbing barriers have been introduced in continuous estimator algorithms so that the proofs could follow a martingale property, as opposed to monotonicity (Zhang et al., 2014), (Zhang et al., 2015). However, applications of LA with artificial barriers are almost nonexistent. In that regard, this article is pioneering in that it provides effective and accurate solutions to an extremely complex application domain, namely that of solving two-person zero-sum stochastic games that are provided with incomplete information. LA have previously been used (Sastry et al., 1994) to design algorithms capable of converging to the game’s Nash equilibrium under limited information. Those algorithms focused on the case where the saddle point of the game exists in a pure strategy. However, the majority of the LA algorithms used for games are absorbing in the probability simplex space, and thus they converge to an exclusive choice of a single action. These LA are therefore unable to converge to mixed Nash equilibria when the game possesses no saddle point in pure strategies. The pioneering contribution of this article is an LA solution that is able to converge to an optimal mixed Nash equilibrium even when the game has no saddle point in pure strategies. The scheme, being of the linear reward-inaction ($L_{R-I}$) paradigm, is, in and of itself, absorbing. However, by incorporating artificial barriers, we prevent it from getting “stuck”, i.e., absorbed, in pure strategies. Unlike the linear reward-ε-penalty ($L_{R-\epsilon P}$) scheme proposed by Lakshmivarahan and Narendra almost four decades ago, our new scheme achieves the same goal with much less parameter tuning and in a more elegant manner.
This article includes the nontrivial proofs of the theoretical results characterizing our scheme and also contains experimental verification that confirms our theoretical findings.
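The abstract's central mechanism can be illustrated with a minimal Python sketch of two-action $L_{R-I}$ automata whose probabilities are kept away from the absorbing corners. This is not the authors' algorithm. The barrier rule is an assumption made here for illustration: the usual $L_{R-I}$ targets 1 and 0 are replaced by `p_max` and `p_min = 1 - p_max`. The 2x2 reward-probability game and all names (`lri_barrier_update`, `play`, `theta`, `p_max`) are likewise hypothetical.

```python
import random


def lri_barrier_update(p, chose_first, rewarded, theta, p_max):
    """One L_R-I step for a two-action automaton.

    p is the probability of choosing the first action. ASSUMPTION: the
    artificial barriers are modelled by moving p toward p_max (or toward
    p_min = 1 - p_max) on a reward, instead of toward 1 (or 0) as in the
    plain, absorbing L_R-I rule. On a penalty p is left unchanged
    ("inaction"). Since each step is a convex combination of p and a
    target inside [p_min, p_max], p never leaves that interval.
    """
    if not rewarded:
        return p
    p_min = 1.0 - p_max
    target = p_max if chose_first else p_min
    return p + theta * (target - p)


def play(reward_probs, steps=200_000, theta=0.001, p_max=0.99, seed=0):
    """Let two barrier-L_R-I automata play a hypothetical 2x2 zero-sum game.

    reward_probs[i][j] is the probability that the row player is rewarded
    when row plays i and column plays j; the column player is rewarded
    exactly when the row player is not. Neither player sees the matrix,
    only its own reward signal (incomplete information). Returns the
    time-averaged (p, q) over the second half of the run.
    """
    rng = random.Random(seed)
    p = q = 0.5  # both automata start at the uniform strategy
    sum_p = sum_q = 0.0
    for t in range(steps):
        i = 0 if rng.random() < p else 1  # row player's action
        j = 0 if rng.random() < q else 1  # column player's action
        row_rewarded = rng.random() < reward_probs[i][j]
        p = lri_barrier_update(p, i == 0, row_rewarded, theta, p_max)
        q = lri_barrier_update(q, j == 0, not row_rewarded, theta, p_max)
        if t >= steps // 2:
            sum_p += p
            sum_q += q
    n = steps - steps // 2
    return sum_p / n, sum_q / n
```

For example, `play([[0.8, 0.2], [0.3, 0.7]])` runs the automata on a game with no pure saddle point: the row player's maximin over pure rows is 0.3 while the column player's minimax is 0.7. Solving the indifference conditions gives the mixed equilibrium p* = 0.4, q* = 0.5, and the returned averages can be compared against it. Keeping `p_max` strictly below 1 is what stops the update from being absorbed at a pure strategy. A plain $L_{R-I}$ automaton corresponds to `p_max = 1.0`.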
DOI: 10.1109/TNNLS.2021.3099095

IEEE Transactions on Neural Networks and Learning Systems (Early Access), pp. 1-12