Yao Zhang · Chaoxu Mu · Yong Zhang · Yanghe Feng
Abstract Owing to its extensive applications in many fields, the synchronization problem has been widely investigated in multi-agent systems. Synchronization is a pivotal issue for multi-agent systems: under the designed control policy, the output or the state of each agent should become consistent with that of the leader. The purpose of this paper is to investigate a heuristic dynamic programming (HDP)-based learning tracking control for discrete-time multi-agent systems to achieve synchronization while considering disturbances in the systems. Since the coupled Hamilton–Jacobi–Bellman equation is difficult to solve analytically, an improved HDP learning control algorithm is proposed to realize synchronization between the leader and all following agents, implemented by an action-critic neural network structure. By introducing an auxiliary action network, the action and critic neural networks are utilized to learn the optimal control policy and the cost function, respectively. Finally, two numerical examples and a practical application to mobile robots are presented to demonstrate the control performance of the HDP-based learning control algorithm.
Keywords Multi-agent systems · Heuristic dynamic programming (HDP) · Learning control · Neural network · Synchronization
Owing to the rapid development of artificial intelligence technology, multi-agent systems have gradually become an attractive topic of intense discussion among researchers in recent years [1–3]. Multi-agent system control has been widely studied in both theoretical and practical research, such as formation control [4], consensus control [5] and flocking [6]. Among these topics, many scholars focus in particular on consensus control because of its wide application in engineering, for instance, formation control of unmanned aerial vehicles, cooperative control of undersea robots and attitude control of satellites. The consensus problem requires that the states of the leader be tracked by all following agents through local coupling among agents. For multi-agent systems, since the behavior of each agent is jointly determined by its neighbors and itself, a coupled Hamilton–Jacobi–Bellman (HJB) equation is established. Therefore, the key to consensus control is to find the solution of the coupled HJB equation. However, because of the partial differential terms, it is difficult to obtain the solution of the HJB equation directly. Hence, many effective algorithms have been developed to solve this problem.
Recently, reinforcement learning (RL) has made remarkable advances in the field of artificial intelligence [7–9]. The learning process can be roughly divided into two steps. First, the system reward is constructed through interaction with the environment. Second, the optimal control policy is obtained using the feedback mechanism [10,11]. Adaptive dynamic programming (ADP) is an important branch of RL, and its prominent role is to effectively approximate the optimal solution of the HJB equation [12–14]. Theoretical research on neural networks has further promoted the development of the ADP method [15–17]. The ADP method usually consists of two processes: offline iteration [18] and online implementation [19,20]. It mainly includes three basic types: heuristic dynamic programming (HDP), dual heuristic programming (DHP), and globalized dual heuristic programming (GDHP). Recently, the ADP method has been widely used in the consensus control of multi-agent systems. In [21], an ADP technique was used to find the optimal controllers for continuous-time linear systems with a single agent rather than multiple agents. In [22], an online multi-agent formulation of team games was developed to solve synchronization control by combining cooperative control, RL and game theory. In [23], an optimal coordinated control scheme for the multi-agent consensus problem based on a fuzzy ADP algorithm was proposed, combining game theory, the generalized fuzzy hyperbolic model and the ADP method. However, in the above studies, the existence of disturbances in the systems was not considered. Specifically, if disturbances are present, the control performance of these methods may be degraded.
In practical applications, due to the complexity and variability of the environment, the control of multi-agent systems is often affected by various disturbances, such as modeling uncertainties caused by system models that cannot be determined exactly, model parameter perturbations, and external disturbances caused by factors such as wind, noise and temperature. The existence of these disturbances is detrimental to the stability of the systems and ultimately makes the control objectives difficult to achieve. Therefore, in modern control theory, how to deal with disturbances in systems is an important problem [24–26]. In [27], asymptotic stability of the disturbed system was achieved by devising a decentralized optimal control policy through the cyclic small-gain theorem. Lin [28] adopted a projection method to deal with disturbances. The idea in [28] was further applied to nonlinear systems with unmatched disturbances using the ADP method in [29]. Furthermore, in the research on multi-agent consensus control, studying systems with disturbances is even more critical. Cao et al. [30] proposed a distributed extended state observer whose ultimate purpose is to achieve consensus of multi-agent systems with identical linear dynamics and unknown external disturbances. In [31], a disturbance observer was designed to study sliding mode control of second-order multi-agent systems under mismatched uncertainties. In [32], the accurate optimization solution of multi-agent systems with uncertain external parameters was obtained by estimating unknown frequencies and rejecting bounded disturbances. In [33], a distributed optimization controller was proposed to eliminate bounded disturbances composed of a set of sinusoidal signals with known frequencies.
The above methods have certain limitations in their scope of application, as they mainly target continuous-time multi-agent systems. For discrete-time multi-agent systems, which have a wider range of applications, few studies have addressed the consensus problem with disturbances using the HDP method; meanwhile, disturbed multi-agent systems are of practical significance as control objects, and the HDP algorithm has many unique advantages. Therefore, a novel learning control scheme for discrete-time multi-agent systems with disturbances is proposed in this paper. The ultimate goal is to make all following agents synchronize with the leader under the communication graph. The contributions of this work are as follows: (1) A learning control method, which is essentially an approximate optimal control scheme, is formulated for the discrete-time disturbed multi-agent system, and the optimal control policy is learned from partial neighborhood communication. (2) An improved HDP algorithm is developed to obtain the optimal control by estimating both the iterative control policy and the cost function with neural networks. (3) A theoretical guarantee of learning control for the discrete-time disturbed multi-agent system is presented. (4) The proposed HDP algorithm is compared with the LQR method in terms of rapidity and accuracy, and its superiority is demonstrated by the simulation results.
The paper is organized as follows. In Sect. 2, the preliminaries of the discrete-time multi-agent system with disturbances are established. In Sect. 3, the HDP algorithm design, the action-critic neural network implementation and the stability analysis are presented. Section 4 substantiates the validity of the above method with two numerical examples and one practical application. The paper is summarized in Sect. 5.
With a communication graph F, the studied discrete-time multi-agent system containing N agents is generally described as follows:
Remark 1 The leader should generate a divergent or sinusoidal reference trajectory, so all eigenvalues of A should lie outside or on the boundary of the unit disk. The reason is that the command trajectory eventually converges if A is stable; therefore, it is more meaningful to design control policies for an unstable A.
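As a quick numerical check of Remark 1, the following sketch verifies that a candidate leader matrix has no eigenvalue strictly inside the unit circle, i.e., that the reference trajectory it generates does not decay; the matrix A used here is an illustrative marginally stable example, not the one from the paper.

```python
import numpy as np

# Illustrative leader matrix: eigenvalues are +-1j (magnitude 1), so the
# leader produces a sustained sinusoidal reference rather than a decaying one.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

eigvals = np.linalg.eigvals(A)
print("eigenvalue magnitudes:", np.abs(eigvals))
# Remark 1 holds when no eigenvalue lies strictly inside the unit disk.
assert np.all(np.abs(eigvals) >= 1.0 - 1e-12), "A would generate a decaying reference"
```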
Since the process of exploring the optimal control policy involves only the agent itself and its corresponding neighboring agents, the synchronization problem can be depicted as the state of agent i converging to the state of the leader, for any agent i. Then the partial neighborhood error for each agent i is defined as
where η(k) is the global synchronization error vector and η(k) ∈ ℝ^{nN}. (L + B) is nonsingular under the conditions that the graph contains a spanning tree and at least one agent is connected with the leader directly.
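To make the local structure of this error concrete, a minimal sketch follows, assuming the standard definition εi(k) = Σ_{j∈Ni} aij (xi(k) − xj(k)) + bi (xi(k) − x0(k)); the function and variable names are illustrative rather than the paper's notation.

```python
import numpy as np

def local_error(i, x, x0, a, b):
    """Partial neighborhood synchronization error of agent i.

    x  : (N, n) array of follower states x_j(k)
    x0 : (n,) leader state
    a  : (N, N) edge weights a_ij (a[i, j] > 0 if agent i receives info from j)
    b  : (N,) pinning gains b_i
    Assumes eps_i = sum_j a_ij (x_i - x_j) + b_i (x_i - x_0).
    """
    eps = b[i] * (x[i] - x0)
    for j in range(x.shape[0]):
        eps = eps + a[i, j] * (x[i] - x[j])
    return eps
```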
Lemma 1 If (L + B) is nonsingular, the global synchronization error η(k) is given by
where λmin(L + B) is the minimum singular value of (L + B).
According to Lemma 1, if the global tracking error ε(k) converges to zero, the global synchronization error η(k) converges to zero as well. Thus the system achieves synchronization if the global tracking error becomes sufficiently small.
The partial neighborhood error is derived as
Obviously, the disturbance is contained in the partial neighborhood error, which implies that deviations may exist among agents during information communication. In the following, the consensus control of the disturbed multi-agent system is studied.
For the disturbed multi-agent system (1), we decompose the disturbance Di ci(k) into the sum of matched and unmatched components by projecting Di ci(k) onto the range of the matrix Bi(k). Thus, it can be derived that
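One common way to carry out this projection numerically is with the Moore–Penrose pseudoinverse of Bi, as sketched below; the matrices used are illustrative and not taken from the paper.

```python
import numpy as np

def split_disturbance(B_i, d_i):
    """Split d_i = D_i c_i(k) into matched + unmatched parts w.r.t. range(B_i).

    The matched part lies in the range of B_i (and can be compensated through
    the control channel); the unmatched part is the orthogonal residual.
    """
    B_pinv = np.linalg.pinv(B_i)            # Moore-Penrose pseudoinverse
    matched = B_i @ (B_pinv @ d_i)          # projection onto range(B_i)
    unmatched = d_i - matched               # component outside range(B_i)
    return matched, unmatched

# Illustrative values only
B_i = np.array([[0.0], [1.0]])
d_i = np.array([0.3, -0.1])
matched, unmatched = split_disturbance(B_i, d_i)   # -> [0, -0.1] and [0.3, 0]
```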
The information about neighboring agents is required to design the control input of each agent i, so the control policies of the neighboring agents of agent i are described as
Theorem 1 Suppose that a spanning tree is contained in the graph, let the cost function satisfy (19) and the optimal control policy satisfy (20). Then the partial neighborhood error εi(k) is asymptotically stable under Lemma 1 and the goal of synchronization can be achieved.
Proof First, define the difference of Ji(εi(k)) and its gradient as follows:
Lemma 2 According to the Hamiltonian equation (18), the local performance index satisfies the following discrete-time Hamilton–Jacobi equation:
Based on (18), (22) and (23), (25) can further be written as
An action-critic neural network is developed to implement the learning control with the HDP algorithm for the disturbed multi-agent system.
The critic network is designed to approximate the cost function for each agent i, and the action network is designed to approximate the control policy. Note that ri(k) is not the real control policy of the disturbed system, but an auxiliary control policy that helps to approximate the optimal control policy ui(k). The outputs of the critic, action and auxiliary action networks are, respectively, expressed as
Next, the approximation error of the critic network is expressed as Ec(k), and the objective function of the approximation is denoted as
Then the weight updating processes of action and auxiliary action networks are given as follows:
where 0 < ηu < 1 and 0 < ηr < 1 are the learning rates of the action and auxiliary action networks, respectively.
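As a rough illustration of how these gradient-descent updates can be organized, the sketch below uses linear-in-weight approximators in place of the paper's multilayer networks; the discount factor gamma, the feature vectors and the gradient dJ_du (obtained by backpropagating the critic output through the network) are stated as assumptions.

```python
import numpy as np

def critic_update(Wc, phi_k, phi_k1, utility, gamma, eta_c):
    """One step on E_c = 0.5 * e_c**2, with
    e_c = J_hat(k) - (U(k) + gamma * J_hat(k+1)) and J_hat = Wc . phi."""
    e_c = Wc @ phi_k - (utility + gamma * (Wc @ phi_k1))
    return Wc - eta_c * e_c * phi_k        # gradient descent on the critic weights

def action_update(Wa, psi_k, dJ_du, eta):
    """Action (or auxiliary action) network step: descend the critic's
    estimated cost with respect to the network output u = Wa . psi,
    chained back to the weights Wa.  The same form is used with eta_u
    for the action network and eta_r for the auxiliary action network."""
    return Wa - eta * np.outer(dJ_du, psi_k)
```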
The algorithm procedure under HDP structure implemented by neural networks is carried out based on the following steps:
The framework of approximate optimal tracking control with the HDP structure is shown in Fig. 1. In essence, the learning control of disturbed multi-agent systems is an optimal control problem whose ultimate goal is to minimize the cost function. By transforming the uncertainty, an auxiliary system related to the disturbed multi-agent system is constructed to solve the control problem. By introducing the auxiliary action network, the critic and action neural networks are utilized to learn the cost function and the control policy, respectively. It is worth noting that, unlike in general neural network structures, the auxiliary control policy ri(k) is not the real control policy; it only helps to obtain the actual optimal control policy ui(k).
Fig.1 Learning-based control with the HDP structure for disturbed multi-agent system
In this section,three typical simulation examples are investigated.
A four-agent system is studied and the directed communication graph is given in Fig. 2. The model of the four-agent system with the disturbance is chosen as
Fig.2 Network structure with four agents
and the leader is modeled by
where the system matrices are chosen as follows:
The disturbance is
The pinning gains are b1 = b4 = 0, b2 = b3 = 1. The edge weights are selected as a12 = 0.8, a23 = 0.6, a43 = 0.5. Choose the performance index weight matrices Q11 = Q22 = Q33 = Q44 = I2×2, R11 = R22 = R33 = R44 = 1, R13 = R21 = R24 = R32 = R34 = R41 = R43 = 0, R12 = R14 = R23 = R31 = R42 = 1, Y11 = Y22 = Y33 = Y44 = 1, Y13 = Y21 = Y32 = Y24 = Y34 = Y41 = 0, Y12 = Y14 = Y23 = Y31 = Y42 = Y43 = 1. The initial states of the leader and each agent are both randomly set from [0, 1]. Choose the learning rates as ηc = ηu = ηr = 0.5. The maximal number of steps N is selected as 1500, which is large enough to keep all agents synchronized with the leader.
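For reference, the sketch below assembles the Laplacian and pinning matrices of this example from the edge weights and pinning gains above and confirms that (L + B) is nonsingular; the edge-direction convention (aij meaning agent i receives information from agent j) is an assumption, since Fig. 2 fixes the actual directions.

```python
import numpy as np

# Edge weights and pinning gains of Example 1 (indices shifted to 0-based).
a = np.zeros((4, 4))
a[0, 1], a[1, 2], a[3, 2] = 0.8, 0.6, 0.5      # a12, a23, a43
B = np.diag([0.0, 1.0, 1.0, 0.0])              # b1 = b4 = 0, b2 = b3 = 1

L = np.diag(a.sum(axis=1)) - a                 # graph Laplacian
print("det(L + B) =", np.linalg.det(L + B))    # nonzero, so (L + B) is nonsingular
```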
The dynamics of the agents and the leader are presented in Fig. 3. Figure 4 shows the phase plane plot of the system. From the figures, we can see that all agents track the leader accurately. Figure 5 reflects the consensus control policies.
Fig.3 Agents states versus iteration steps
Fig.4 Phase plane plot
Fig.5 Consensus control policies of four-agent system
Next, a three-agent system is further studied and the directed communication graph is given in Fig. 6.
Fig.6 Network structure with three agents
The system matrices are chosen as follows:
Choose the disturbance as follows:
where θi is the unknown parameter of the system. In the training process, the unknown parameter θi = [θ1, θ2]^T is selected with θ1, θ2 ∈ [−10, 10].
Fig.7 Tracking performance of agent 1
We choose the pinning gains b1 = 1, b2 = b3 = 0. The edge weights are selected as a12 = 0.8, a23 = 0.6, a31 = 0.8. The performance index weight matrices are chosen as Q11 = Q22 = Q33 = I2×2, R11 = R22 = R33 = 1, R13 = R21 = R32 = 0, R12 = R23 = R31 = 1, Y11 = Y22 = Y33 = 1, Y13 = Y21 = Y32 = 0, Y12 = Y23 = Y31 = 1. The initial states of the leader and each agent are both randomly obtained from [0, 1]. Set the learning rates as ηc = 1 and ηu = ηr = 0.2, respectively.
For the purpose of comparing performance, both the HDP algorithm and the LQR method are adopted to obtain control policies, and the simulation results are shown as follows: the performance of each agent tracking the leader and the tracking error dynamics are shown in Figs. 7, 8, 9, 10, 11 and 12, respectively. The control policies ui(k) are presented in Fig. 13.
Fig.8 Tracking errors of agent 1
Fig.9 Tracking performance of agent 2
Fig.10 Tracking errors of agent 2
Fig.11 Tracking performance of agent 3
Fig.12 Tracking errors of agent 3
Fig.13 Control policies of three-agent system
To further illustrate the control performance of the algorithm, the root mean square error, absolute mean error and iteration steps for the LQR method and the HDP algorithm are listed in Table 1, so as to compare the performance indicators of the two algorithms in terms of rapidity and accuracy in a clearer and more rigorous way.
Table 1 Performance comparison between HDP algorithm and LQR method
The comparison results show that both the LQR method and the HDP algorithm can achieve synchronization between all agents and the leader. However, on the one hand, the root mean square error and absolute mean error of the HDP control method are smaller than those of the LQR method, which shows that the HDP algorithm is more accurate in synchronization. On the other hand, with the convergence accuracy defined as 10^{-4}, the number of steps needed for the error to converge to this range is reported. All agents can track the leader after about 700 iteration steps under the LQR method, while synchronization is achieved after about 300 iteration steps under the HDP algorithm. This indicates that the HDP method has an advantage in convergence speed over the LQR method when considering disturbances.
On the whole,after comparing three performance indicators,we can conclude that the control policy derived by the proposed HDP algorithm has better tracking performance than the LQR method in terms of accuracy and rapidity.
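For completeness, here is a small sketch of how the accuracy and rapidity indicators reported in Table 1 can be computed from a recorded error trajectory; the exact averaging over agents and state components used in the paper is assumed.

```python
import numpy as np

def tracking_metrics(errors):
    """Root-mean-square error and absolute mean error of a tracking-error record.

    errors: array of shape (K,) or (K, n) holding the synchronization error
    at each iteration step (per agent or stacked over agents).
    """
    e = np.asarray(errors, dtype=float).ravel()
    return np.sqrt(np.mean(e ** 2)), np.mean(np.abs(e))

def convergence_step(errors, tol=1e-4):
    """First iteration step after which the error stays within the 1e-4
    convergence accuracy used for the rapidity comparison."""
    e = np.abs(np.asarray(errors, dtype=float))
    below = e < tol if e.ndim == 1 else np.all(e < tol, axis=tuple(range(1, e.ndim)))
    for k in range(len(below)):
        if below[k:].all():
            return k
    return len(below)                      # never settled within tolerance
```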
In addition, parameter uncertainty is considered in the multi-agent system to show the effectiveness of the proposed control scheme.
Consider a linear multi-agent system composed of N agents, in which the ith agent's dynamics can be expressed as
where ΔA is a real matrix function representing time-varying parameter uncertainty in the multi-agent system. The uncertainty results from model linearization and is usually assumed to be of the form:
where Da and Ea are known real constant matrices that characterize how the uncertain parameters in Fa enter the nominal matrix A, and Fa is an unknown real time-varying matrix with Lebesgue measurable elements satisfying
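A hedged sketch of how one admissible perturbation ΔA = Da Fa Ea can be sampled under the usual norm bound Fa^T Fa ≤ I follows; the shapes and values of Da and Ea here are placeholders, not the paper's matrices.

```python
import numpy as np

def sample_uncertainty(Da, Ea, rng=np.random.default_rng(0)):
    """Draw one admissible Delta_A = Da @ Fa @ Ea with ||Fa||_2 <= 1,
    i.e. Fa^T Fa <= I (the usual norm-bounded uncertainty condition)."""
    Fa = rng.uniform(-1.0, 1.0, size=(Da.shape[1], Ea.shape[0]))
    s = np.linalg.norm(Fa, 2)              # largest singular value of Fa
    if s > 1.0:
        Fa /= s                            # rescale so the bound holds
    return Da @ Fa @ Ea

# Placeholder structure matrices (illustrative only)
Da = np.array([[0.1], [0.0]])
Ea = np.array([[1.0, 0.0]])
Delta_A = sample_uncertainty(Da, Ea)
```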
The relevant system matrices are given as follows:
Other parameter settings are the same as in Sect. 4.2.
The corresponding tracking curves of the system under the given parameters are plotted in Fig. 14. It can be easily observed that the states of all agents are exactly synchronized with the leader. Moreover, the corresponding tracking error dynamics are given in Fig. 15. In addition, we also study the tracking performance of this method and the LQR method with uncertain parameters under the same conditions. The tracking errors are shown in Fig. 16. It can be clearly concluded that, under the premise of uncertain parameters, the tracking of the multi-agent system using the HDP algorithm is still better than that of the LQR method in terms of rapidity.
Fig.14 Agents states versus iteration steps
Fig.15 Tracking errors under HDP algorithm
Fig.16 Tracking errors under LQR method
To verify the validity of the above theoretical results in practical application scenarios, an applied multi-agent system is considered, which consists of three mobile robots and one leader robot.
The robots move in the one-dimensional Euclidean space and the purpose is to achieve synchronization of both state and velocity eventually.
Three follower robots are divided into two subsystems. The first subsystem is stated as follows:
where xi(k) ∈ ℝ, vi(k) ∈ ℝ and ui(k) ∈ ℝ are the state, velocity and control policy of robot i at time instant kT. m = 0.9963, T1 = 0.0498 and T2 = 0.8 are the coefficient of the state and the sampling intervals, respectively. ζ1 = −0.2492 and ζ2 = 0.9888 are the designed parameters. Di ci denotes the disturbance, which is described as follows:
where the unknown parameter τi is chosen as τi ∈ [−10, 10]. The other parameters and initializations are the same as in Example 2. The dynamics of the leader robot are stated as follows:
The communication structure of the three mobile robots and one leader robot is described in Fig. 17. The state and velocity responses of the three follower mobile robots and the leader robot are shown in Figs. 18 and 19, respectively. The state and velocity errors of the three follower mobile robots with respect to the leader robot are depicted in Figs. 20 and 21, respectively.
Fig.17 Communication structure of three follower mobile robots and one leader robot
Fig.18 State responses of three robots and one leader
In this paper, the goal of learning tracking control is to achieve synchronization between the leader and all following agents. The tracking control of the multi-agent system with disturbances is transformed into the tracking control of its nominal system. The HDP algorithm is applied and implemented by neural networks, and the stability of the action-critic neural network structure is presented. Finally, three representative simulations are investigated to demonstrate the correctness and superiority of the HDP-based learning tracking control strategy. The improved HDP algorithm proposed in this paper shows good performance in terms of tracking speed and tracking effect when disturbances are considered. At the same time, in practical applications, the algorithm can also achieve synchronization well. Meaningful future work includes extensions to nonlinear multi-agent systems and applications in actual systems.
Fig.19 Velocity responses of three robots and one leader
Fig.20 State errors for three robots and one leader
Fig.21 Velocity errors for three robots and one leader
Acknowledgements This work was supported by the Tianjin Natural Science Foundation under Grant 20JCYBJC00880, the Beijing Key Laboratory Open Fund of Long-Life Technology of Precise Rotation and Transmission Mechanisms, and the Guangdong Provincial Key Laboratory of Intelligent Decision and Cooperative Control.