Yao Zhang · Chaoxu Mu · Yong Zhang · Yanghe Feng
Abstract Owing to its extensive applications in many fields, the synchronization problem has been widely investigated in multi-agent systems. Synchronization is a pivotal issue for multi-agent systems: under the designed control policy, the output or the state of each agent should become consistent with that of the leader. The purpose of this paper is to investigate a heuristic dynamic programming (HDP)-based learning tracking control for discrete-time multi-agent systems to achieve synchronization while considering disturbances in the systems. Since the coupled Hamilton–Jacobi–Bellman equation is difficult to solve analytically, an improved HDP learning control algorithm is proposed to realize synchronization between the leader and all following agents, implemented by an action-critic neural network structure. By introducing an auxiliary action network, the action and critic neural networks are utilized to learn the optimal control policy and the cost function, respectively. Finally, two numerical examples and a practical application to mobile robots are presented to demonstrate the control performance of the HDP-based learning control algorithm.
Keywords Multi-agent systems · Heuristic dynamic programming (HDP) · Learning control · Neural network · Synchronization
Owing to the rapid development of artificial intelligence technology, multi-agent systems have gradually become an attractive topic of intense discussion among researchers in recent years [1–3]. Multi-agent system control has been widely studied in both theoretical and practical research, such as formation control [4], consensus control [5] and flocking [6]. Among these topics, many scholars focus in particular on consensus control because of its wide application in engineering, for instance, formation control of unmanned aerial vehicles, cooperative control of undersea robots and attitude control of satellites. The consensus problem requires that the states of the leader be tracked by all following agents through local coupling among agents. For multi-agent systems, since the behavior of each agent is jointly determined by its neighbors and itself, a coupled Hamilton–Jacobi–Bellman (HJB) equation is established. Therefore, the key to consensus control is to find the solution of the coupled HJB equation. However, because of the partial differential terms, it is difficult to obtain the solution of the HJB equation directly. Hence, many effective algorithms have been developed to solve this problem.
Recently, reinforcement learning (RL) has made remarkable advances in the field of artificial intelligence [7–9]. The learning process can be roughly divided into two steps. First, the system reward is constructed through interaction with the environment. Second, the optimal control policy is obtained using the feedback mechanism [10,11]. Adaptive dynamic programming (ADP) is an important branch of RL, and its prominent role is to effectively approximate the optimal solution of the HJB equation [12–14]. Theoretical research on neural networks has further promoted the development of the ADP method [15–17]. The ADP method usually consists of two processes: offline iteration [18] and online implementation [19,20]. It mainly includes three basic types: heuristic dynamic programming (HDP), dual heuristic programming (DHP), and globalized dual heuristic programming (GDHP). Recently, the ADP method has been widely used in the consensus control of multi-agent systems. In [21], an ADP technique was used to find the optimal controllers for continuous-time linear systems with a single agent rather than multiple agents. In [22], an online multi-agent formulation of team games was developed to solve synchronization control by combining cooperative control, RL and game theory. In [23], an optimal coordinated control scheme for the multi-agent consensus problem based on a fuzzy ADP algorithm was proposed, combining game theory, the generalized fuzzy hyperbolic model and the ADP method. However, in the above studies, the existence of disturbances in the systems was not considered. Specifically, if disturbances are present, the control performance of these methods may be degraded.
In practical applications, due to the complexity and variability of the environment, the control of multi-agent systems is often affected by various disturbances, such as modeling uncertainties caused by system models that cannot be determined exactly, model parameter perturbations, and external disturbances caused by factors such as wind, noise and temperature. The existence of these disturbances is detrimental to the stability of the systems and ultimately makes the control objectives difficult to achieve. Therefore, in modern control theory, how to deal with disturbances in systems is an important problem [24–26]. In [27], asymptotic stability of the disturbed system was achieved by devising a decentralized optimal control policy through the cyclic small-gain theorem. Lin [28] adopted a projection method to deal with disturbances. The idea in [28] was further applied to nonlinear systems with unmatched disturbances using the ADP method in [29]. Furthermore, in the research on multi-agent consensus control, studying systems with disturbances is even more critical. Cao et al. [30] proposed a distributed extended state observer whose ultimate purpose is to achieve consensus of multi-agent systems with identical linear dynamics and unknown external disturbances. In [31], a disturbance observer was designed to study sliding mode control of second-order multi-agent systems under mismatched uncertainties. In [32], the accurate optimization solution of multi-agent systems with uncertain external parameters was obtained by estimating unknown frequencies and rejecting bounded disturbances. In [33], a distributed optimization controller was proposed to eliminate bounded disturbances composed of a set of sinusoidal signals with known frequencies.
The above methods have certain limitations in their scope of application, as they mainly target continuous-time multi-agent systems. For discrete-time multi-agent systems, which have a wider range of applications, few studies have addressed the consensus problem with disturbances using the HDP method; meanwhile, disturbed multi-agent systems are of practical significance as control objects, and the HDP algorithm has many unique advantages. Therefore, a novel learning control scheme for discrete-time multi-agent systems with disturbances is proposed in this paper. The ultimate goal is to make all following agents synchronize with the leader under the communication graph. The contributions of this work are as follows: (1) A learning control method, which is essentially an approximate optimal control scheme, is formulated for the discrete-time disturbed multi-agent system, and the optimal control policy is learned from partial neighborhood communication. (2) An improved HDP algorithm is developed to obtain the optimal control by estimating both the iterative control policy and the cost function with neural networks. (3) A theoretical guarantee of learning control for the discrete-time disturbed multi-agent system is presented. (4) The proposed HDP algorithm is compared with the LQR method in terms of rapidity and accuracy, and its superiority is demonstrated by the simulation results.
The paper is organized as follows. In Sect. 2, the preliminaries of the discrete-time multi-agent system with disturbances are established. In Sect. 3, the HDP algorithm design, the action-critic neural network implementation and the stability analysis are presented. Section 4 substantiates the validity of the above method with two numerical examples and one practical application. The paper is summarized in Sect. 5.
With a communication graph F, the studied discrete-time multi-agent system containing N agents is generally described as follows:
Remark 1 The leader should generate a divergent or sinusoidal reference trajectory, so all eigenvalues of A should lie outside or on the boundary of the unit disk. The reason is that the command trajectory eventually converges if A is stable; therefore, it is more meaningful to design control policies for an unstable A.
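As a quick numerical check of Remark 1, the following sketch verifies that a candidate leader matrix has no eigenvalue strictly inside the unit circle, i.e., that the reference trajectory it generates does not decay; the matrix A used here is an illustrative marginally stable example, not the one from the paper.

```python
import numpy as np

# Illustrative leader matrix: eigenvalues are +-1j (magnitude 1), so the
# leader produces a sustained sinusoidal reference rather than a decaying one.
A = np.array([[0.0, 1.0],
              [-1.0, 0.0]])

eigvals = np.linalg.eigvals(A)
print("eigenvalue magnitudes:", np.abs(eigvals))
# Remark 1 holds when no eigenvalue lies strictly inside the unit disk.
assert np.all(np.abs(eigvals) >= 1.0 - 1e-12), "A would generate a decaying reference"
```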
Since the process of exploring the optimal control policy involves only the agent itself and its corresponding neighboring agents, the synchronization problem can be depicted as the state of agent i converging to the state of the leader, for any agent i. Then the partial neighborhood error for each agent i is defined as
where η(k) is the global synchronization error vector and η(k) ∈ ℝ^{nN}. (L + B) is nonsingular under the conditions that the graph contains a spanning tree and at least one agent is connected with the leader directly.
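To make the local structure of this error concrete, a minimal sketch follows, assuming the standard definition εi(k) = Σ_{j∈Ni} aij (xi(k) − xj(k)) + bi (xi(k) − x0(k)); the function and variable names are illustrative rather than the paper's notation.

```python
import numpy as np

def local_error(i, x, x0, a, b):
    """Partial neighborhood synchronization error of agent i.

    x  : (N, n) array of follower states x_j(k)
    x0 : (n,) leader state
    a  : (N, N) edge weights a_ij (a[i, j] > 0 if agent i receives info from j)
    b  : (N,) pinning gains b_i
    Assumes eps_i = sum_j a_ij (x_i - x_j) + b_i (x_i - x_0).
    """
    eps = b[i] * (x[i] - x0)
    for j in range(x.shape[0]):
        eps = eps + a[i, j] * (x[i] - x[j])
    return eps
```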
Lemma 1 If (L + B) is nonsingular, the global synchronization error η(k) is given by
where λmin(L + B) is the minimum singular value of (L + B).
According to Lemma 1, if the global tracking error ε(k) converges to zero, the global synchronization error η(k) converges to zero as well. Thus the system achieves synchronization if the global tracking error becomes sufficiently small.
The partial neighborhood error is derived as
Obviously, the disturbance is contained in the partial neighborhood error, which implies that deviations may exist among agents during information communication. In the following, the consensus control of the disturbed multi-agent system is studied.
For the disturbed multi-agent system (1), we decompose the disturbance Di ci(k) into the sum of matched and unmatched components by projecting Di ci(k) onto the range of the matrix Bi(k). Thus, it can be derived that
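One common way to carry out this projection numerically is with the Moore–Penrose pseudoinverse of Bi, as sketched below; the matrices used are illustrative and not taken from the paper.

```python
import numpy as np

def split_disturbance(B_i, d_i):
    """Split d_i = D_i c_i(k) into matched + unmatched parts w.r.t. range(B_i).

    The matched part lies in the range of B_i (and can be compensated through
    the control channel); the unmatched part is the orthogonal residual.
    """
    B_pinv = np.linalg.pinv(B_i)            # Moore-Penrose pseudoinverse
    matched = B_i @ (B_pinv @ d_i)          # projection onto range(B_i)
    unmatched = d_i - matched               # component outside range(B_i)
    return matched, unmatched

# Illustrative values only
B_i = np.array([[0.0], [1.0]])
d_i = np.array([0.3, -0.1])
matched, unmatched = split_disturbance(B_i, d_i)   # -> [0, -0.1] and [0.3, 0]
```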
The information about neighboring agents is required to design the control input of each agent i, so the control policies of the neighboring agents of agent i are described as
Theorem 1 Suppose that a spanning tree is contained in the graph, let the cost function satisfy (19) and the optimal control policy satisfy (20). Then the partial neighborhood error εi(k) is asymptotically stable under Lemma 1 and the goal of synchronization can be achieved.
Proof First, define the difference of Ji(εi(k)) and its gradient as follows:
Lemma 2 According to the Hamiltonian equation (18), the local performance index satisfies the following discrete-time Hamilton–Jacobi equation:
Based on (18), (22) and (23), (25) can further be written as
An action-critic neural network is developed to implement the learning control with the HDP algorithm for the disturbed multi-agent system.
The critic network is designed to approximate the cost function for each agent i, and the action network is designed to approximate the control policy. Note that ri(k) is not the real control policy of the disturbed system, but an auxiliary control policy that helps to approximate the optimal control policy ui(k). The outputs of the critic, action and auxiliary action networks are, respectively, expressed as
Next, the approximation error of the critic network is expressed as Ec(k), and the objective function of the approximation is denoted as
Then the weight updating processes of action and auxiliary action networks are given as follows:
where 0 < ηu < 1 and 0 < ηr < 1 are the learning rates of the action and auxiliary action networks, respectively.
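As a rough illustration of how these gradient-descent updates can be organized, the sketch below uses linear-in-weight approximators in place of the paper's multilayer networks; the discount factor gamma, the feature vectors and the gradient dJ_du (obtained by backpropagating the critic output through the network) are stated as assumptions.

```python
import numpy as np

def critic_update(Wc, phi_k, phi_k1, utility, gamma, eta_c):
    """One step on E_c = 0.5 * e_c**2, with
    e_c = J_hat(k) - (U(k) + gamma * J_hat(k+1)) and J_hat = Wc . phi."""
    e_c = Wc @ phi_k - (utility + gamma * (Wc @ phi_k1))
    return Wc - eta_c * e_c * phi_k        # gradient descent on the critic weights

def action_update(Wa, psi_k, dJ_du, eta):
    """Action (or auxiliary action) network step: descend the critic's
    estimated cost with respect to the network output u = Wa . psi,
    chained back to the weights Wa.  The same form is used with eta_u
    for the action network and eta_r for the auxiliary action network."""
    return Wa - eta * np.outer(dJ_du, psi_k)
```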
The algorithm procedure under HDP structure implemented by neural networks is carried out based on the following steps:
The framework of approximate optimal tracking control with the HDP structure is shown in Fig. 1. In essence, the learning control of disturbed multi-agent systems is an optimal control problem whose ultimate goal is to minimize the cost function. By transforming the uncertainty, an auxiliary system related to the disturbed multi-agent system is constructed to solve the control problem. By introducing the auxiliary action network, the critic and action neural networks are utilized to learn the cost function and the control policy, respectively. It is worth noting that, unlike in general neural network structures, the auxiliary control policy ri(k) is not the real control policy; it only helps to obtain the actual optimal control policy ui(k).
Fig.1 Learning-based control with the HDP structure for disturbed multi-agent system
In this section,three typical simulation examples are investigated.
A four-agent system is studied and the directed communication graph is given in Fig. 2. The model of the four-agent system with the disturbance is chosen as
Fig.2 Network structure with four agents
and the leader is modeled by
where the system matrices are chosen as follows:
The disturbance is
The pinning gains are b1 = b4 = 0, b2 = b3 = 1. The edge weights are selected as a12 = 0.8, a23 = 0.6, a43 = 0.5. Choose the performance index weight matrices Q11 = Q22 = Q33 = Q44 = I2×2, R11 = R22 = R33 = R44 = 1, R13 = R21 = R24 = R32 = R34 = R41 = R43 = 0, R12 = R14 = R23 = R31 = R42 = 1, Y11 = Y22 = Y33 = Y44 = 1, Y13 = Y21 = Y32 = Y24 = Y34 = Y41 = 0, Y12 = Y14 = Y23 = Y31 = Y42 = Y43 = 1. The initial states of the leader and each agent are both randomly set from [0, 1]. Choose the learning rates as ηc = ηu = ηr = 0.5. The maximal number of steps N is selected as 1500, which is large enough to keep all agents synchronized with the leader.
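For reference, the sketch below assembles the Laplacian and pinning matrices of this example from the edge weights and pinning gains above and confirms that (L + B) is nonsingular; the edge-direction convention (aij meaning agent i receives information from agent j) is an assumption, since Fig. 2 fixes the actual directions.

```python
import numpy as np

# Edge weights and pinning gains of Example 1 (indices shifted to 0-based).
a = np.zeros((4, 4))
a[0, 1], a[1, 2], a[3, 2] = 0.8, 0.6, 0.5      # a12, a23, a43
B = np.diag([0.0, 1.0, 1.0, 0.0])              # b1 = b4 = 0, b2 = b3 = 1

L = np.diag(a.sum(axis=1)) - a                 # graph Laplacian
print("det(L + B) =", np.linalg.det(L + B))    # nonzero, so (L + B) is nonsingular
```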
The dynamics of the agents and the leader are presented in Fig. 3. Figure 4 shows the phase plane plot of the system. From the figures, we can see that all agents track the leader accurately. Figure 5 reflects the consensus control policies.
Fig.3 Agents states versus iteration steps
Fig.4 Phase plane plot
Fig.5 Consensus control policies of four-agent system
Next, a three-agent system is further studied and the directed communication graph is given in Fig. 6.
Fig.6 Network structure with three agents
The system matrices are chosen as follows:
Choose the disturbance as follows:
where θi is the unknown parameter of the system. In the training process, the unknown parameter θi = [θ1, θ2]^T is selected with θ1, θ2 ∈ [−10, 10].
Fig.7 Tracking performance of agent 1
We choose the pinning gains b1 = 1, b2 = b3 = 0. The edge weights are selected as a12 = 0.8, a23 = 0.6, a31 = 0.8. The performance index weight matrices are chosen as Q11 = Q22 = Q33 = I2×2, R11 = R22 = R33 = 1, R13 = R21 = R32 = 0, R12 = R23 = R31 = 1, Y11 = Y22 = Y33 = 1, Y13 = Y21 = Y32 = 0, Y12 = Y23 = Y31 = 1. The initial states of the leader and each agent are both randomly obtained from [0, 1]. Set the learning rates as ηc = 1 and ηu = ηr = 0.2, respectively.
For the purpose of comparing performance, both the HDP algorithm and the LQR method are adopted to obtain control policies, and the simulation results are shown as follows: the performance of each agent tracking the leader and the tracking error dynamics are shown in Figs. 7, 8, 9, 10, 11 and 12, respectively. The control policies ui(k) are presented in Fig. 13.
Fig.8 Tracking errors of agent 1
Fig.9 Tracking performance of agent 2
Fig.10 Tracking errors of agent 2
Fig.11 Tracking performance of agent 3
Fig.12 Tracking errors of agent 3
Fig.13 Control policies of three-agent system
To further illustrate the control performance of the algorithm, the root mean square error, absolute mean error and iteration steps for the LQR method and the HDP algorithm are listed in Table 1, so as to compare the performance indicators of the two algorithms in terms of rapidity and accuracy in a clearer and more rigorous way.
Table 1 Performance comparison between HDP algorithm and LQR method
The comparison results show that both the LQR method and the HDP algorithm can achieve synchronization between all agents and the leader. However, on the one hand, the root mean square error and absolute mean error of the HDP control method are smaller than those of the LQR method, which shows that the HDP algorithm is more accurate in synchronization. On the other hand, with the convergence accuracy defined as 10^{-4}, the number of steps needed for the error to converge to this range is reported. All agents can track the leader after about 700 iteration steps under the LQR method, while synchronization is achieved after about 300 iteration steps under the HDP algorithm. This indicates that the HDP method has an advantage in convergence speed over the LQR method when considering disturbances.
On the whole,after comparing three performance indicators,we can conclude that the control policy derived by the proposed HDP algorithm has better tracking performance than the LQR method in terms of accuracy and rapidity.
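For completeness, here is a small sketch of how the accuracy and rapidity indicators reported in Table 1 can be computed from a recorded error trajectory; the exact averaging over agents and state components used in the paper is assumed.

```python
import numpy as np

def tracking_metrics(errors):
    """Root-mean-square error and absolute mean error of a tracking-error record.

    errors: array of shape (K,) or (K, n) holding the synchronization error
    at each iteration step (per agent or stacked over agents).
    """
    e = np.asarray(errors, dtype=float).ravel()
    return np.sqrt(np.mean(e ** 2)), np.mean(np.abs(e))

def convergence_step(errors, tol=1e-4):
    """First iteration step after which the error stays within the 1e-4
    convergence accuracy used for the rapidity comparison."""
    e = np.abs(np.asarray(errors, dtype=float))
    below = e < tol if e.ndim == 1 else np.all(e < tol, axis=tuple(range(1, e.ndim)))
    for k in range(len(below)):
        if below[k:].all():
            return k
    return len(below)                      # never settled within tolerance
```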
In addition, parameter uncertainty is considered in the multi-agent system to show the effectiveness of the proposed control scheme.
Consider a linear multi-agent system composed of N agents, in which the ith agent's dynamics can be expressed as
where ΔA is a real matrix function representing time-varying parameter uncertainty in the multi-agent system. The uncertainty results from model linearization and is usually assumed to be of the form:
where Da and Ea are known real constant matrices that characterize how the uncertain parameters in Fa enter the nominal matrix A, and Fa is an unknown real time-varying matrix with Lebesgue measurable elements satisfying
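A hedged sketch of how one admissible perturbation ΔA = Da Fa Ea can be sampled under the usual norm bound Fa^T Fa ≤ I follows; the shapes and values of Da and Ea here are placeholders, not the paper's matrices.

```python
import numpy as np

def sample_uncertainty(Da, Ea, rng=np.random.default_rng(0)):
    """Draw one admissible Delta_A = Da @ Fa @ Ea with ||Fa||_2 <= 1,
    i.e. Fa^T Fa <= I (the usual norm-bounded uncertainty condition)."""
    Fa = rng.uniform(-1.0, 1.0, size=(Da.shape[1], Ea.shape[0]))
    s = np.linalg.norm(Fa, 2)              # largest singular value of Fa
    if s > 1.0:
        Fa /= s                            # rescale so the bound holds
    return Da @ Fa @ Ea

# Placeholder structure matrices (illustrative only)
Da = np.array([[0.1], [0.0]])
Ea = np.array([[1.0, 0.0]])
Delta_A = sample_uncertainty(Da, Ea)
```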
The relevant system matrices are given as follows:
Other parameter settings are the same as in Sect. 4.2.
The corresponding tracking curves of the system under the given parameters are plotted in Fig. 14. It can be easily observed that the states of all agents are exactly synchronized with the leader. Moreover, the corresponding tracking error dynamics are given in Fig. 15. In addition, we also study the tracking performance of this method and the LQR method with uncertain parameters under the same conditions. The tracking errors are shown in Fig. 16. It can be clearly concluded that, under the premise of uncertain parameters, the tracking of the multi-agent system using the HDP algorithm is still better than that of the LQR method in terms of rapidity.
Fig.14 Agents states versus iteration steps
Fig.15 Tracking errors under HDP algorithm
Fig.16 Tracking errors under LQR method
To verify the validity of the above theoretical results in practical application scenarios, an applied multi-agent system is considered, which consists of three mobile robots and one leader robot.
The robots move in the one-dimensional Euclidean space and the purpose is to achieve synchronization of both state and velocity eventually.
Three follower robots are divided into two subsystems. The first subsystem is stated as follows:
where xi(k) ∈ ℝ, vi(k) ∈ ℝ and ui(k) ∈ ℝ are the state, velocity and control policy of robot i at time instant kT. m = 0.9963, T1 = 0.0498 and T2 = 0.8 are the coefficient of the state and the sampling intervals, respectively. ζ1 = −0.2492 and ζ2 = 0.9888 are the designed parameters. Di ci denotes the disturbance, which is described as follows:
where the unknown parameter τi is chosen as τi ∈ [−10, 10]. The other parameters and initializations are the same as in Example 2. The dynamics of the leader robot are stated as follows:
The communication structure of the three mobile robots and one leader robot is described in Fig. 17. The state and velocity responses of the three follower mobile robots and the leader robot are shown in Figs. 18 and 19, respectively. The state and velocity errors of the three follower mobile robots with respect to the leader robot are depicted in Figs. 20 and 21, respectively.
Fig.17 Communication structure of three follower mobile robots and one leader robot
Fig.18 State responses of three robots and one leader
In this paper, the goal of learning tracking control is to achieve synchronization between the leader and all following agents. The tracking control of the multi-agent system with disturbances is transformed into the tracking control of its nominal system. The HDP algorithm is applied and implemented by neural networks, and the stability of the action-critic neural network structure is presented. Finally, three representative simulations are investigated to demonstrate the correctness and superiority of the HDP-based learning tracking control strategy. The improved HDP algorithm proposed in this paper shows good performance in terms of tracking speed and tracking effect when disturbances are considered. At the same time, in practical applications, the algorithm can also achieve synchronization well. Meaningful future work includes extensions to nonlinear multi-agent systems and applications in actual systems.
Fig.19 Velocity responses of three robots and one leader
Fig.20 State errors for three robots and one leader
Fig.21 Velocity errors for three robots and one leader
Acknowledgements This work was supported by the Tianjin Natural Science Foundation under Grant 20JCYBJC00880, the Beijing Key Laboratory Open Fund of Long-Life Technology of Precise Rotation and Transmission Mechanisms, and the Guangdong Provincial Key Laboratory of Intelligent Decision and Cooperative Control.