Chaoxu Mu | Jiangwen Peng | Yufei Tang
1 School of Electrical and Information Engineering, Tianjin University, Tianjin, China
2 Department of Computer, Electrical Engineering and Computer Science, Florida Atlantic University, USA
Abstract A generalized-policy-iteration-based solution to a class of discrete-time multi-player non-zero-sum games with control constraints is proposed. Based on initial admissible control policies, the iterative value function of each player converges approximately to the optimum, which is attained by the iterative control policies satisfying the Nash equilibrium. Afterwards, a stability analysis shows that the iterative control policies stabilize the system and minimize the performance index function of each player. Meanwhile, neural networks are implemented to approximate the iterative control policies and value functions under the impact of control constraints. Finally, two numerical simulations of discrete-time two-player non-zero-sum games, for a linear and a non-linear system respectively, illustrate the effectiveness of the proposed scheme.
Reinforcement learning, one of the popular research fields in machine learning, is devoted to constructing an interactive bridge that lets an agent interact with the environment to achieve a goal. This interactive bridge relies on the mutual influence between the agent and the environment through a series of operations: the agent acts based on the reward supplied by the environment, and the environment states are changed by the action [1-4]. Relating reinforcement learning to optimal control, Werbos advocated adaptive dynamic programming (ADP) for the first time [5]. Different from dynamic programming (DP), the traditional optimal control solution, ADP solves the optimal control problem forward-in-time rather than backwards, avoiding the difficulty brought by the curse of dimensionality in the traditional method for high-dimensional and complex systems. The actor-critic structure is the basic structure of the ADP algorithm, where the actor chooses the control policy for the system based on the perceived environment and the critic improves the control policy by evaluating the system performance. In this way, an agent interacts with the environment and converges to an approximately optimal solution of the optimal control problem through continuous iteration [6-8].
The iterative methods in the ADP algorithm are generally divided into the policy iteration (PI) and value iteration (VI) algorithms. The VI algorithm consists of two parts, value update and policy improvement, while the PI algorithm consists of policy evaluation and policy improvement. For the VI algorithm, the convergence speed is low because the iterative control policies may not be admissible during the iteration process. By contrast, all control policies in the policy iteration procedure are admissible, at the cost of higher computational complexity [9,10]. In order to balance the properties of the PI and VI algorithms, the generalized policy iteration (GPI) algorithm was first put forward in [5]. Both the PI algorithm and the VI algorithm are special forms of the GPI algorithm [11]. Even though the idea of the GPI algorithm was put forward many years ago, a proof of stability was not given until [11]. With the actor-critic structure and iterative methods, ADP has become an efficient and simple approximate optimal control method for high-dimensional and complex systems through the effort of many researchers, covering topics such as multi-agent optimal control [12,13], zero-sum games (ZS) [14,15] and non-zero-sum games (NZS) [10,16].
The NZS game is a complex and difficult optimal control problem with multiple players, that is, multiple controllers. Different from other optimal control problems, which only need to guarantee the stability of the system and the minimum of the performance index function for each player, it also needs to reach the Nash equilibrium formed by a sequence of control inputs. Most of the discussion focuses on how to solve the coupled Hamilton-Jacobi equations for the Nash equilibrium solution. Many researchers have paid attention to continuous-time NZS games. An online ADP method based on policy iteration for the continuous-time multi-player NZS game was presented in [17]. Afterwards, a near-optimal control scheme was put forward for continuous-time non-linear systems in [18], using only a critic network rather than an actor-critic dual network. More recent studies present an off-policy learning method and an experience replay method for continuous-time systems in [19] and [20], respectively. In contrast, there is little research on the discrete-time multi-player NZS game except [10]. Meanwhile, control constraints are widespread in control systems and degrade system stability and performance, as in [21-23]. Therefore, there is a need for a unified treatment of the discrete-time multi-player NZS game solved by the ADP algorithm, which motivates our research drawing on the idea of the GPI algorithm while considering control constraints.
The contributions of this paper are as follows. In this paper, the GPI algorithm is used for the first time to solve discrete-time multi-player NZS games with control constraints. Benefiting from the idea of the GPI algorithm, a unified solution to the discrete-time multi-player NZS games is obtained compared with the traditional PI and VI algorithms. In addition, the proofs of convergence and optimality for the control-constrained multi-player NZS games are provided in detail. Next, simulations of a linear system and a non-linear system are provided to show the effectiveness of the proposed scheme. Finally, based on statistical techniques, it is shown for the first time that the GPI algorithm incurs slightly less energy loss and requires considerably less runtime than the traditional PI and VI algorithms.
The rest of the article is organized as follows. In Section 2, the background of discrete-time control-constrained multi-player NZS games is presented in detail. In Section 3, the GPI algorithm for the discrete-time control-constrained multi-player NZS game is developed, its stability is demonstrated, and the actor-critic neural-network structure is shown. In Section 4, two numerical examples illustrate the effectiveness of the proposed scheme. In Section 5, a brief conclusion is provided.
Consider a class of discrete-time multi-player NZS games with control constraints described by
which is deemed a Nash equilibrium solution.
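For orientation, a commonly used formulation of this class of games is the affine multi-player system with bounded inputs together with the standard Nash-equilibrium inequality, sketched in LaTeX below; the symbols f, g_q, ū_q and J_q are illustrative assumptions rather than the notation of the numbered equations.

% A representative (assumed) formulation of the control-constrained
% multi-player game; not a restatement of the numbered equations above.
\begin{align}
  x_{k+1} &= f(x_k) + \sum_{q=1}^{N} g_q(x_k)\,u_q(x_k),
             \qquad \lVert u_q(x_k)\rVert_\infty \le \bar{u}_q, \\
  J_q\big(x_0,u_1^*,\dots,u_q^*,\dots,u_N^*\big)
          &\le J_q\big(x_0,u_1^*,\dots,u_q,\dots,u_N^*\big),
             \qquad q = 1,\dots,N,
\end{align}

that is, no player can reduce its own cost by a unilateral deviation from the equilibrium policies.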
Let the value function of the qth player be
Hence, the multi-player NZS game can be expressed as
In order to satisfy Equation (9) and the control constraints [22], Equation (6) can be expressed as
which is known as the coupled NZS game Bellman optimality equation [10].
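For reference, a standard form of such a coupled Bellman optimality equation for the qth player is sketched below, where U_q denotes the (possibly nonquadratic) utility that encodes the control constraint of the qth player; this is an assumed generic form rather than a restatement of Equations (6)-(9).

% Assumed generic form of the coupled NZS Bellman optimality equation
\begin{align}
  V_q^*(x_k) &= \min_{u_q}\Big[U_q\big(x_k,u_1^*(x_k),\dots,u_q(x_k),\dots,u_N^*(x_k)\big)
                + V_q^*(x_{k+1})\Big], \\
  u_q^*(x_k) &= \arg\min_{u_q}\Big[U_q\big(x_k,u_1^*(x_k),\dots,u_q(x_k),\dots,u_N^*(x_k)\big)
                + V_q^*(x_{k+1})\Big],
\end{align}

with x_{k+1} generated by the system dynamics under the joint control input.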
There are two iterations in the GPI algorithm, which is different from the traditional policy iteration and value iteration algorithms. Two iteration indices i and j_i related to the two iterations are introduced, which increase from 1 and 0, respectively.
Then, the two iterations in the GPI algorithm can be expressed as follows:
1) Policy improvement
2) Policy evaluation
where j_i = 0, 1, …, N_i and N_i denotes an arbitrary non-negative integer. In summary, the overall process of the GPI algorithm is presented in Figure 1.
It should be noted that the PI and VI algorithms are special forms of the GPI algorithm. When every element of the iteration sequence {N_1, N_2, …, N_i} is set to 0, the algorithm reduces to the VI algorithm. Similarly, if N_i → ∞, the algorithm can be regarded as the PI algorithm [11].
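To make the interplay of the two iteration indices concrete, the following Python sketch outlines the GPI loop described above; the operators improve and evaluate, the dictionary-based representation of value functions and policies, and the function name gpi are illustrative assumptions rather than part of the original algorithm statement.

from typing import Callable, Dict, List

def gpi(V: Dict[int, object], pi: Dict[int, object], players: List[int],
        N_schedule: List[int], improve: Callable, evaluate: Callable):
    # V[q] and pi[q] hold the iterative value function and control policy of
    # player q.  improve(q, V, pi) returns the greedily improved policy of
    # player q; evaluate(q, V, pi) applies one sweep of the coupled Bellman
    # backup to player q's value function under the fixed joint policy.
    for i, N_i in enumerate(N_schedule, start=1):   # outer index i = 1, 2, ...
        # 1) Policy improvement for every player.
        for q in players:
            pi[q] = improve(q, V, pi)
        # 2) Policy evaluation: j_i = 0, 1, ..., N_i sweeps under fixed policies.
        for j_i in range(N_i + 1):
            for q in players:
                V[q] = evaluate(q, V, pi)
    return V, pi

Setting every N_i to 0 reproduces the value-iteration behaviour, while letting N_i grow without bound reproduces policy iteration, in line with the remark above.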
Proof. This proof is divided into two parts.
Part 1: Equation (18) will be proved by mathematical induction as follows:
Firstly, let i = 1. According to Equations (13)-(17), for all x_k ∈ Ω_x,
Then, for j_1 = 0, there is
Assume that Equation (18) holds for j_1 = l_1 - 1, l_1 = 1, 2, …, N_1. Then, for j_1 = l_1 and for all x_k ∈ Ω_x,
Therefore, Equation (18) holds for i = 1 and j_1 = 0, 1, …, N_1.
Assume that Equation (18) holds for i = r, r = 1, 2, …, that is
According to Equations (16) and (17), for all x_k ∈ Ω_x, there is
Then, for j_{r+1} = 0, there is
Therefore, Equation (18) holds for i = r + 1 and j_{r+1} = 0, 1, …, N_{r+1}. In summary, Equation (18) holds for i = 1, 2, … and j_i = 0, 1, …, N_i.
Part 2: Equation (19) will be proved by mathematical induction as follows.
Firstly, for i = 1, let 0 ≤ j_1 ≤ N_1 and 0 ≤ j_{r+1} ≤ N_{r+1}. According to Equations (20)-(27), for all x_k ∈ Ω_x,
Therefore, Equation (19) can be deduced for i = 1, 2, …. The proof is completed.
Lemma 1 Let the iterative value function of the qth player be defined as in Equation (17). Then, for i = 1, 2, …, the iterative value function of the qth player is a monotonically non-increasing and convergent sequence.
Lemma 2 If a monotonically non-increasing sequence {a_n}, n = 1, 2, …, contains a convergent subsequence, then the sequence {a_n} converges to the same limit as its subsequence.
Proof. Define a sequence of the iterative value functions as
Therefore, the proposition stated in Theorem 2 becomes the following statement:
According to Equations (17) and (18), there is
In this subsection, a neural-network-based approach is presented to approximate the iterative value function and the iterative control policy of each player, respectively, which are illustrated as
where W_q denotes the weight matrix of the qth player and φ_q(·) denotes the activation function of the qth player.
Firstly, the critic network is constructed to approximate the iterative value function of the qth player, which is denoted as
where l = 0, 1, 2, … is the iteration index of the following gradient-descent method, the weight vector is that of the critic network, and the activation function, mapping R^n → R^s, is that of the critic neural network, with s the number of neurons in the hidden layer. Define the approximation error of the critic network for the qth player as
Then, define the objective function to be minimized in the critic network as
Hence, the gradient-based weight update law [24,25] of the critic network is derived as
where l = 0, 1, 2, … is the iteration index of the following gradient-descent method, the weight vector is that of the actor network, and the activation function, mapping R^n → R^s, is that of the actor neural network, with s the number of neurons in the hidden layer. Define the approximation error of the actor network for the qth player as
Then, define the objective function to be minimized in the actor network as
Hence, the gradient-based weight update law [24,25] of the actor network is derived as follows, where the learning rate of the actor network for the qth player is positive.
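To make the update laws concrete, a minimal Python sketch of one gradient-descent step for the critic and actor weights of a single player is given below; the variable names, array shapes and the way the target value and target control are supplied are assumptions for illustration, since the exact symbols are defined in the numbered equations.

import numpy as np

def critic_update(W_c, phi_c, x_k, target_value, alpha_c):
    # Approximation V_hat(x_k) = W_c^T phi_c(x_k) and its error against the
    # target produced by the GPI policy-evaluation step.
    e_c = float(W_c @ phi_c(x_k)) - target_value
    # One gradient-descent step on E_c = 0.5 * e_c**2 with learning rate alpha_c.
    return W_c - alpha_c * e_c * phi_c(x_k)

def actor_update(W_a, phi_a, x_k, target_control, alpha_a):
    # Approximation u_hat(x_k) = W_a^T phi_a(x_k) and its error against the
    # constraint-satisfying target control from the policy-improvement step.
    e_a = W_a.T @ phi_a(x_k) - target_control
    # One gradient-descent step on E_a = 0.5 * e_a^T e_a with learning rate alpha_a.
    return W_a - alpha_a * np.outer(phi_a(x_k), e_a)

Here W_c is an s-dimensional weight vector, W_a is an s-by-m weight matrix (m control inputs), and phi_c, phi_a are the fixed activation functions of the hidden layer.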
In this section, two simulation examples are given to show the performance of the proposed scheme, which solves discrete-time control-constrained multi-player NZS games.
Consider the following control-constrained two-player linear NZS game, based on [26,27], of the form
The performance index function is defined as in Equation (2), where the weight matrices in the utility function for the two players are expressed as
Firstly, define the state space as Ω_x = {x_k | -0.5 ≤ x_{1,k} ≤ 0.5, -0.5 ≤ x_{2,k} ≤ 0.5}. States are sampled in Ω_x at intervals of 0.1 to apply the GPI algorithm and obtain the optimal control policies. Four neural networks are utilized to approximate the iterative value functions and control policies, whose activation functions are
in the actor networks and
in the critic networks, respectively. In addition, the initial control policies chosen for the two players should yield admissible control inputs. The iteration sequence {N_1, N_2, …, N_i} for each player can be chosen randomly. Let the initial state be x_0 = [0.5, 0.5, 0.5, 0.5]^T, the iteration index i = 50 and the actuator saturation |ν_1| ≤ 0.2, |ν_2| ≤ 0.2. Then, the actor and critic networks of the two players are trained under the learning rates 0.65, 0.05, 0.8 and 0.75, respectively. The trajectories of the system states and the control inputs are shown in Figure 2. Trajectories of the performance index function for the system and each player under the initial control policies and the obtained control policies are presented in Figure 3. The trajectories of the actor and critic network weights for the two players are shown in Figure 4, where the differently coloured lines represent each term of the actor and critic network weights.
From the above results, it can be seen that the system states converge to equilibrium after around 550 time steps. It can also be observed that the obtained control inputs of the two players achieve convergence under the control constraints, as shown in Figure 2. Compared with the initial control policies, the performance index functions of the system and of each player under the obtained control policies are significantly reduced, as shown in Figure 3. Meanwhile, the actor and critic network weights approach equilibrium after about 30 iteration steps.
Consider the following control-constrained two-player non-linear NZS game, based on [9,28], of the form
The performance index function is defined as in Equation (2), where the weight matrices in the utility function for the two players are expressed as
FIGURE 2 Trajectories of the system states and the control inputs. (a) System states. (b) Control inputs
FIGURE 3 Trajectories of the performance index function under initial control policies and obtained control policies. (a) System. (b) First player. (c) Second player
Firstly, define the state space as Ω_x = {x_k | -1 ≤ x_{1,k} ≤ 1, -1 ≤ x_{2,k} ≤ 1}. States are sampled in Ω_x at intervals of 0.02 to apply the GPI algorithm and obtain the optimal control policies. Four neural networks are utilized to approximate the iterative value functions and control policies, whose activation functions are
in the actor networks and
FIGURE 4 Trajectories of the actor network weights and the critic network weights. (a) First player, (b) second player, (c) first player, (d) second player
in the critic networks, respectively. In addition, the initial control policies chosen for the two players should yield admissible control inputs. The iteration sequence {N_1, N_2, …, N_i} for each player can be chosen randomly. Let the initial state be x_0 = [0.5, 0.5]^T, the iteration index i = 150 and the actuator saturation |ν_1| ≤ 0.3, |ν_2| ≤ 0.15. Then, the actor and critic networks of the two players are trained under the learning rates 0.05, 0.05, 0.15 and 0.2, respectively. The trajectories of the system states and the control inputs are shown in Figure 5. Trajectories of the performance index function for the system and each player under the initial control policies and the obtained control policies are presented in Figure 6. The trajectories of the actor and critic network weights for the two players are shown in Figure 7, where the differently coloured lines represent each term of the actor and critic network weights. As shown in Figure 8, the proposed scheme is tested 100 times with the GPI, PI and VI algorithms respectively under the same conditions, where the parameter N_i is set to 0 for the VI algorithm and to 50 for the PI algorithm. The error of each test is presented as a shaded region in the colour of the corresponding algorithm. Meanwhile, the runtime results of the tests are presented in Figure 9. Finally, concrete numerical analyses of the runtime results and the system performance index function results are given in Tables 1 and 2.
FIGURE 5 Trajectories of the system states and the control inputs. (a) System states, (b) control inputs
FIGURE 6 Trajectories of the performance index function under initial control policies and obtained control policies. (a) System, (b) first player, (c) second player
From the above results, it is clear that the system states converge to equilibrium after around 700 time steps. It can also be observed that the obtained control inputs of the two players achieve convergence under the control constraints, as shown in Figure 5. Compared with the initial control policies, the performance index functions of the system and of each player under the obtained control policies are significantly reduced, as shown in Figure 6. Meanwhile, the actor and critic network weights approach equilibrium after about 100 iteration steps. Moreover, it can be seen from Figure 8 that, on the whole, the GPI algorithm is similar to the traditional PI and VI algorithms in terms of the system performance index function. On closer comparison, slightly less energy loss is required by the GPI algorithm, whether the mean value or the minimum value is considered. Although the GPI algorithm is only slightly better than the PI algorithm in terms of the system performance index function, it is much better in terms of runtime. As shown in Figure 9, over the 100 tests, the runtime results of the GPI algorithm are more concentrated than those of the traditional PI and VI algorithms. The runtime results of the GPI algorithm are also significantly smaller than those of the traditional PI and VI algorithms, whether the mean, the minimum or the maximum value is considered. From Tables 1 and 2, the runtime is reduced by more than 25% by the GPI algorithm compared with the PI algorithm, whose system performance index function results are suboptimal. Additionally, the standard deviation of the runtime results of the GPI algorithm is smaller than that of the PI algorithm by around 40%, and smaller than that of the VI algorithm by around 60%, which reflects that the stability and reliability of the GPI algorithm are improved by up to 40% compared with the PI algorithm. Therefore, the GPI algorithm is more effective than the PI and VI algorithms for discrete-time control-constrained multi-player NZS games.
FIGURE 7 Trajectories of the actor network weights and the critic network weights. (a) First player, (b) second player, (c) first player, (d) second player
FIGURE 8 Trajectories of the system performance index function by using GPI,PI and VI algorithms
FIGURE 9 Trajectories of the runtime results by using GPI,PI and VI algorithms
TABLE 1 System performance index function results
TABLE 2 Runtime results
It should be noted that, for the sake of clarity, a mark point is placed every 20 time steps in Figures 2 and 5, and every five iteration steps in Figures 4 and 7. In Tables 1 and 2, the index 'Mean' denotes the average of the results, the index 'Std' denotes the standard deviation of the results, and the index 'Trimmean' denotes the average of the results after removing the ten largest and ten smallest values.
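As a concrete reading of the 'Trimmean' index, a minimal sketch of the computation described above is given below; the function name and the use of NumPy are illustrative assumptions.

import numpy as np

def trimmean_drop_ten(results):
    # Average after discarding the ten largest and ten smallest recorded
    # results (e.g. over the 100 test runs reported in Tables 1 and 2).
    r = np.sort(np.asarray(results, dtype=float))
    return float(r[10:-10].mean())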
In this article, a generalized-policy-iteration-based ADP approach for discrete-time control-constrained multi-player NZS games is presented. The stability analysis of the proposed scheme is illustrated in detail. Meanwhile, a neural-network-based implementation is developed to approximate the iterative control policies and value functions. Finally, the proposed scheme is employed in linear and non-linear simulation examples, respectively, to show its performance.