摘 要:
隨著深度學(xué)習(xí)和強(qiáng)化學(xué)習(xí)而來的人工智能新浪潮,為智能體從感知輸入到行動決策輸出提供了“端到端”解決方案。多智能體學(xué)習(xí)是研究智能博弈對抗的前沿課題,面臨著對抗性環(huán)境、非平穩(wěn)對手、不完全信息和不確定行動等諸多難題與挑戰(zhàn)。本文從博弈論視角入手,首先給出了多智能體學(xué)習(xí)系統(tǒng)組成,進(jìn)行了多智能體學(xué)習(xí)概述,簡要介紹了各類多智能體學(xué)習(xí)研究方法。其次,圍繞多智能體博弈學(xué)習(xí)框架,介紹了多智能體博弈基礎(chǔ)模型及元博弈模型、均衡解概念和博弈動力學(xué),以及學(xué)習(xí)目標(biāo)多樣、環(huán)境(對手)非平穩(wěn)、均衡難解且易變等挑戰(zhàn)。再次,全面梳理了多智能體博弈策略學(xué)習(xí)框架、離線博弈策略學(xué)習(xí)方法和在線博弈策略學(xué)習(xí)方法。最后,從智能體認(rèn)知行為建模與協(xié)同、通用博弈策略學(xué)習(xí)方法和分布式博弈策略學(xué)習(xí)框架共3個方面探討了多智能體學(xué)習(xí)的前沿研究方向。
關(guān)鍵詞:
博弈學(xué)習(xí); 多智能體學(xué)習(xí); 元博弈; 在線無悔學(xué)習(xí)
中圖分類號:
TP 391
文獻(xiàn)標(biāo)志碼: A  DOI: 10.12305/j.issn.1001-506X.2024.05.17
Research progress of multi-agent learning in games
LUO Junren, ZHANG Wanpeng, SU Jiongming, YUAN Weilin, CHEN Jing*
(College of Intelligence Science and Technology, National University of Defense Technology, Changsha 410073, China)
Abstract:
The new wave of artificial intelligence brought about by deep learning and reinforcement learning provides an “end-to-end” solution for agents from perception input to action decision-making output. Multi-agent learning is a frontier subject in the field of intelligent game confrontation, facing many problems and challenges such as adversarial environments, non-stationary opponents, incomplete information and uncertain actions. Starting from the perspective of game theory, this paper first presents the composition of a multi-agent learning system, gives an overview of multi-agent learning, and briefly introduces various multi-agent learning research methods. Secondly, centered on the multi-agent learning framework in games, it introduces the basic multi-agent game models and meta-game models, game solution concepts and game dynamics, as well as challenges such as diverse learning objectives, non-stationary environments (opponents), and equilibria that are hard to compute and prone to shift. Then, the multi-agent game strategy learning framework, offline game strategy learning methods and online game strategy learning methods are comprehensively reviewed. Finally, frontiers of multi-agent learning are discussed from three aspects: agent cognitive behavior modelling and collaboration, general game strategy learning methods, and distributed game strategy learning frameworks.
Keywords:
learning in games; multi-agent learning; meta-game; online no regret learning
0 引 言
人類社會生活中存在著各種不同形式的對抗、競爭和合作,其中對抗一直是人類文明發(fā)展史上最強(qiáng)勁的推動力。正是由于個體與個體、個體與群體、群體與群體之間復(fù)雜的動態(tài)博弈對抗演化,才不斷促進(jìn)人類智能升級換代[1]。人工智能技術(shù)的發(fā)展呈現(xiàn)出計算、感知和認(rèn)知3個階段[2],大數(shù)據(jù)、大算力和智能算法為研究認(rèn)知智能提供了先決條件。從人工智能技術(shù)發(fā)展的角度來看,計算智能主要以科學(xué)運(yùn)算、邏輯處理、統(tǒng)計查詢等形式化規(guī)則化運(yùn)算為核心,能存會算會查找。感知智能主要以圖像理解、語音識別、機(jī)器翻譯為代表,基于深度學(xué)習(xí)模型,能聽會說能看會認(rèn)。認(rèn)知智能主要以理解、推理、思考和決策為代表,強(qiáng)調(diào)認(rèn)知推理和自主學(xué)習(xí)能力,能理解會思考決策。博弈智能作為決策智能的前沿范式,是認(rèn)知智能的高階表現(xiàn)形式,其主要以博弈論為理論支撐,以反事實因果推理、可解釋性決策為表現(xiàn)形式,強(qiáng)調(diào)將其他智能體(隊友及對手)納入己方的決策環(huán)進(jìn)行規(guī)則自學(xué)習(xí)、博弈對抗演化、可解釋性策略推薦等。當(dāng)前,博弈智能已然成為人工智能領(lǐng)域的前沿方向和通用人工智能研究的重要問題。
多智能體系統(tǒng)一般是指由多個獨立的智能體組成的分布式系統(tǒng),每個智能體均受到獨立控制,但需在同一個環(huán)境中與其他智能體交互[3]。Shoham等人[4]將多智能體系統(tǒng)定義為包含多個自治實體的系統(tǒng),這些實體要么有不同的信息,要么有不同的興趣,或兩者兼有。Muller等人[5]對由多智能體系統(tǒng)技術(shù)驅(qū)動的各個領(lǐng)域的152個真實應(yīng)用進(jìn)行了分類總結(jié)和分析。多智能體系統(tǒng)是分布式人工智能的一個重要分支,主要研究智能體之間的交互通信、協(xié)調(diào)合作、沖突消解等方面的內(nèi)容,強(qiáng)調(diào)多個智能體之間的緊密群體合作,而非個體能力的自治和發(fā)揮。智能體之間可能存在對抗、競爭或合作關(guān)系,單個智能體可通過信息交互與友方進(jìn)行協(xié)調(diào)配合,一同對抗敵對智能體。由于每個智能體均能夠自主學(xué)習(xí),多智能體系統(tǒng)通常表現(xiàn)出涌現(xiàn)性能力。
當(dāng)前,多智能體系統(tǒng)模型常用于描述共享環(huán)境下多個具有感知、計算、推理和行動能力的自主個體組成的集合,典型應(yīng)用包括各類機(jī)器博弈、拍賣、在線平臺交易、資源分配(路由包、服務(wù)器分配)、機(jī)器人足球、無線網(wǎng)絡(luò)、多方協(xié)商、多機(jī)器人災(zāi)難救援、自動駕駛和無人集群對抗等。其中,基于機(jī)器博弈(計算機(jī)博弈)的人機(jī)對抗,作為圖靈測試的典型范式[6],是研究人工智能的果蠅[7]。多智能體系統(tǒng)被廣泛用于解決分布式?jīng)Q策優(yōu)化問題,其成功的關(guān)鍵是高效的多智能體學(xué)習(xí)方法。多智能體學(xué)習(xí)主要研究由多個自主個體組成的多智能體系統(tǒng)如何通過學(xué)習(xí)探索、利用經(jīng)驗提升自身性能的過程[8]。如何通過博弈策略學(xué)習(xí)提高多智能體系統(tǒng)的自主推理與決策能力是人工智能和博弈論領(lǐng)域面臨的前沿挑戰(zhàn)。
1 多智能體學(xué)習(xí)簡介
多智能體學(xué)習(xí)是人工智能研究的前沿?zé)狳c。從第三次人工智能浪潮至今,社會各界對多智能體學(xué)習(xí)的相關(guān)研究產(chǎn)生了極大的興趣。多智能體學(xué)習(xí)在人工智能、博弈論、機(jī)器人和心理學(xué)領(lǐng)域得到了廣泛研究。面對參與實體數(shù)量多、狀態(tài)空間規(guī)模大、實時決策高度復(fù)雜等現(xiàn)實問題,多智能體如何建模變得困難,手工設(shè)計的智能體交互行為遷移性比較弱。相反,基于認(rèn)知行為建模的智能體能夠從與環(huán)境及其他智能體的交互經(jīng)驗中學(xué)會有效地提升自身行為。在學(xué)習(xí)過程中,智能體可以學(xué)會與其他智能體進(jìn)行協(xié)調(diào),學(xué)習(xí)如何選擇自身行為,并推斷其他智能體如何選擇行為及其目標(biāo)、計劃和信念等。
本文從博弈論視角分析多智能體學(xué)習(xí),第1節(jié)簡要介紹了多智能體學(xué)習(xí),主要包括多智能體系統(tǒng)組成、多智能體學(xué)習(xí)概述、多智能體學(xué)習(xí)研究分類;第2節(jié)重點介紹了多智能體博弈學(xué)習(xí)框架,包括博弈基礎(chǔ)模型及元博弈模型、博弈解概念及博弈動力學(xué)、多智能體博弈學(xué)習(xí)的挑戰(zhàn);第3節(jié),全面梳理了多智能體博弈策略學(xué)習(xí)方法,重點剖析了策略學(xué)習(xí)框架、離線博弈策略學(xué)習(xí)方法和在線博弈策略學(xué)習(xí)方法;第4節(jié)著重從智能體認(rèn)知行為建模與協(xié)同、通用博弈策略學(xué)習(xí)方法和分布式博弈策略學(xué)習(xí)框架共3個方面展望了多智能體學(xué)習(xí)研究前沿;最后對全文進(jìn)行了總結(jié)。整體架構(gòu)如圖1所示。
近年來,伴隨著深度學(xué)習(xí)(感知領(lǐng)域)和強(qiáng)化學(xué)習(xí)(決策領(lǐng)域)的深度融合發(fā)展,多智能體學(xué)習(xí)方法在機(jī)器博弈領(lǐng)域取得了長足進(jìn)步,如圖2所示,AlphaGo[9]和Muzero[10],DeepStack[11]及德州撲克[12],DeltaDou[13]及斗地主[14],麻將[15],AlphaStar[16]及星際爭霸[17],OpenAI Five[18]及絕悟[19],AlphaWar[20]及戰(zhàn)顱[21],ALPHA[22]及AlphaDogFight空戰(zhàn)[23]等人工智能在各類比賽中獲得較好名次或在人機(jī)對抗比賽中戰(zhàn)勝了人類頂級選手。
1.1 多智能體學(xué)習(xí)系統(tǒng)組成
多智能體學(xué)習(xí)系統(tǒng)共包含四大模塊:環(huán)境、智能體、交互機(jī)制和學(xué)習(xí)方法。當(dāng)前針對多智能體學(xué)習(xí)的相關(guān)研究主要是圍繞這四部分展開的,如圖3所示。
環(huán)境模塊由狀態(tài)空間、動作空間、轉(zhuǎn)換函數(shù)和獎勵函數(shù)構(gòu)成。狀態(tài)空間指定單個智能體在任何給定時間可以處于的一組狀態(tài);動作空間是單個智能體在任何給定時間可用的一組動作;轉(zhuǎn)換函數(shù)(環(huán)境動力學(xué))指定了在每個智能體(或智能體子集)于給定狀態(tài)下執(zhí)行動作后,環(huán)境發(fā)生(可能是隨機(jī)的)改變的方式;獎勵函數(shù)根據(jù)狀態(tài)-行動轉(zhuǎn)換結(jié)果給出獎勵反饋信號。智能體模塊需要定義其與環(huán)境的通信關(guān)系(用于獲取觀測狀態(tài)和輸出指定動作)、智能體之間的行為通信方式、表征環(huán)境狀態(tài)偏好的效用函數(shù)以及選擇行動的策略。學(xué)習(xí)模塊由學(xué)習(xí)實體、學(xué)習(xí)目標(biāo)、學(xué)習(xí)經(jīng)驗數(shù)據(jù)和學(xué)習(xí)更新規(guī)則定義。學(xué)習(xí)實體需要指定是單智能體還是多智能體級別。學(xué)習(xí)目標(biāo)描述了正在學(xué)習(xí)的任務(wù)目標(biāo),通常表現(xiàn)為目標(biāo)或評價函數(shù)。學(xué)習(xí)經(jīng)驗數(shù)據(jù)描述了學(xué)習(xí)實體可以獲得哪些信息作為學(xué)習(xí)的基礎(chǔ)。學(xué)習(xí)更新定義了在學(xué)習(xí)過程中學(xué)習(xí)實體的更新規(guī)則。交互機(jī)制模塊定義了智能體相互交互多長時間、與哪些其他智能體交互,及其對其他智能體的觀察。交互機(jī)制還規(guī)定了任何給定智能體之間交互的頻率(或數(shù)量),及其動作是同時選擇還是順序選擇(動作選擇的定時)。
1.2 多智能體學(xué)習(xí)概述
“多智能體學(xué)習(xí)”需要研究的問題是指導(dǎo)和開展研究的指南。Stone等人[24]在2000年就從機(jī)器學(xué)習(xí)的角度綜述分析了多智能體系統(tǒng),主要考慮智能體是同質(zhì)還是異質(zhì),是否可以通信等4種情形。早期相關(guān)綜述文章[25-29]采用公開辯論的方法分別從不同的角度對多智能體學(xué)習(xí)問題進(jìn)行剖析,總結(jié)出多智能體學(xué)習(xí)的4個明確定義問題:問題描述、分布式人工智能、博弈均衡和智能體建模[26]。Shoham等人[27]從強(qiáng)化學(xué)習(xí)和博弈論視角自省式地提出了“如果多智能體學(xué)習(xí)是答案,那么問題是什么?”由于沒有找到一個單一的答案,他們提出未來人工智能研究主要圍繞4個“主題”展開:計算性、描述性、規(guī)范性、規(guī)定性。其中,規(guī)定性又分為分布式、均衡和智能體,此3項如今正指引著多智能體學(xué)習(xí)的研究。Stone[28]試圖回答Shoham的問題,但看法剛好相反,強(qiáng)調(diào)多智能體學(xué)習(xí)應(yīng)包含博弈論,如何應(yīng)用多智能體學(xué)習(xí)技術(shù)仍然是一個開放問題,而沒有一個標(biāo)準(zhǔn)的答案。Tosic等人[29]在2010年就提出了面向多智能體的強(qiáng)化學(xué)習(xí)、協(xié)同學(xué)習(xí)和元學(xué)習(xí)統(tǒng)一框架。Tuyls等人[8]在2018年分析了多智能體學(xué)習(xí)需要研究的5種方法:面向個人收益的在線強(qiáng)化學(xué)習(xí)、面向社會福利的在線強(qiáng)化學(xué)習(xí)、協(xié)同演化方法、群體智能和自適應(yīng)機(jī)制設(shè)計。Tuyls等人[30]在后續(xù)的研究中指出應(yīng)將群體智能[31]、協(xié)同演化[32]、遷移學(xué)習(xí)[33]、非平穩(wěn)性[34]、智能體建模[35]等納入多智能體學(xué)習(xí)方法框架中研究。多智能體學(xué)習(xí)的主流方法主要包括強(qiáng)化學(xué)習(xí)、演化學(xué)習(xí)和元學(xué)習(xí)等內(nèi)容,如圖4所示。
1.3 多智能體學(xué)習(xí)研究方法分類
根據(jù)對多智能體學(xué)習(xí)問題的分類描述,可以區(qū)分不同的研究視角與方法。Jant等人[36]很早就從合作與競爭兩個角度對多智能體學(xué)習(xí)問題進(jìn)行了區(qū)分。Panait等人[37]對合作型多智能體學(xué)習(xí)方法進(jìn)行了概述:團(tuán)隊學(xué)習(xí),指多智能體以公共的、唯一的學(xué)習(xí)機(jī)制集中學(xué)習(xí)最優(yōu)聯(lián)合策略;并發(fā)學(xué)習(xí),指單個智能體以相同或不同的個體學(xué)習(xí)機(jī)制,并發(fā)學(xué)習(xí)最優(yōu)個體策略。最新研究直接利用多智能體強(qiáng)化學(xué)習(xí)[38-41]方法開展研究。Busoniu等人[38]首次從完全合作、完全競爭和混合3類任務(wù)的角度對多智能體強(qiáng)化學(xué)習(xí)方法進(jìn)行了分類總結(jié)。Hernandez-Leal等人[39]總結(jié)了傳統(tǒng)多智能體系統(tǒng)研究中的經(jīng)典思想(如涌現(xiàn)性行為、學(xué)會通信交流和對手建模)是如何融入深度多智能體強(qiáng)化學(xué)習(xí)領(lǐng)域的,并在此基礎(chǔ)上對深度強(qiáng)化學(xué)習(xí)進(jìn)行了分類。Oroojlooy等人[40]從獨立學(xué)習(xí)器、全可觀評價、值函數(shù)分解、一致性和學(xué)會通信協(xié)調(diào)5個方面對合作多智能體強(qiáng)化學(xué)習(xí)方法進(jìn)行了全面回顧分析。Zhang等人[41]對具有理論收斂性保證和復(fù)雜性分析的多智能體強(qiáng)化學(xué)習(xí)算法進(jìn)行了選擇性分析,并首次對聯(lián)網(wǎng)智能體分散式、平均場博弈和隨機(jī)勢博弈多智能體強(qiáng)化學(xué)習(xí)方法進(jìn)行綜述分析。Gronauer等人[42]從訓(xùn)練范式與執(zhí)行方案、智能體涌現(xiàn)性行為模式和智能體面臨的六大挑戰(zhàn),即環(huán)境非平穩(wěn)、部分可觀、智能體之間的通信、協(xié)調(diào)、可擴(kuò)展性、信度分配,分析了多智能體深度強(qiáng)化學(xué)習(xí)。Du等人[43]從通信學(xué)習(xí)、智能體建模、面向可擴(kuò)展性的分散式訓(xùn)練分散式執(zhí)行及面向部分可觀性的集中式訓(xùn)練分散式訓(xùn)練兩種范式等角度對多智能體深度強(qiáng)化學(xué)習(xí)進(jìn)行了綜述分析。
在國內(nèi),吳軍等人[44]從模型的角度出發(fā),對面向馬爾可夫決策過程(Markov decision process, MDP)的集中式和分散式模型,以及面向馬爾可夫博弈(Markov game, MG)的共同回報隨機(jī)博弈、零和隨機(jī)博弈和一般和隨機(jī)博弈,共5類模型進(jìn)行了分類分析。杜威等人[45]從完全合作、完全競爭和混合型3類任務(wù)分析了多智能體強(qiáng)化學(xué)習(xí)方法。殷昌盛等人[46]對多智能體分層強(qiáng)化學(xué)習(xí)方法做了綜述分析。梁星星等人[47]從全通信集中決策、全通信自主決策和欠通信自主決策3種范式對多智能體深度強(qiáng)化學(xué)習(xí)方法進(jìn)行了綜述分析。孫長銀等人[48]從學(xué)習(xí)算法結(jié)構(gòu)、環(huán)境非靜態(tài)性、部分可觀性、基于學(xué)習(xí)的通信和算法穩(wěn)定性與收斂性共5個方面分析了多智能體強(qiáng)化學(xué)習(xí)需要研究的重點問題。
2 多智能體博弈學(xué)習(xí)框架
博弈論可用于多智能體之間的策略交互建模,近年來,基于博弈論的學(xué)習(xí)方法被廣泛嵌入到多智能體的相關(guān)研究問題中,多智能體博弈學(xué)習(xí)已然成為當(dāng)前一種新的研究范式。Matignon等人[49]僅對合作MG的獨立強(qiáng)化學(xué)習(xí)方法做了綜述分析。Nowe等人[50]從無狀態(tài)博弈、團(tuán)隊MG和一般MG三類場景對多智能體獨立學(xué)習(xí)和聯(lián)合學(xué)習(xí)方法進(jìn)行了分類總結(jié)。Lu等人[51]從強(qiáng)化學(xué)習(xí)和博弈論的整體視角出發(fā)對多智能體博弈的解概念、虛擬自對弈(fictitious self-play, FSP)類方法和反事實后悔值最小化(counterfactual regret minimization, CFR)類方法進(jìn)行了全面綜述分析。Yang等人[52]對同等利益博弈、零和博弈、一般和博弈和平均場博弈中的學(xué)習(xí)方法進(jìn)行了分類總結(jié)。Bloembergen等人[53]利用演化博弈學(xué)習(xí)方法分析了各類多智能體強(qiáng)化學(xué)習(xí)方法的博弈動態(tài),并揭示了演化博弈論和多智能體強(qiáng)化學(xué)習(xí)方法之間的深刻聯(lián)系。另外,Wong等人[54]從多智能體深度強(qiáng)化學(xué)習(xí)面臨的四大挑戰(zhàn)出發(fā),指出未來需要研究類人學(xué)習(xí)的方法。
2.1 多智能體博弈基礎(chǔ)模型及元博弈
2.1.1 多智能體博弈基礎(chǔ)模型
MDP常用于人工智能領(lǐng)域單智能體決策問題的過程建?;跊Q策論的多智能體模型主要有分散式MDP(decentralized MDP, Dec-MDP)及多智能體MDP(multi-agent MDP, MMDP)[55]。其中,Dec-MDP模型中每個智能體獨立擁有關(guān)于環(huán)境狀態(tài)的觀測,并根據(jù)觀測到的局部信息選擇自身動作;MMDP模型不區(qū)分單個智能體可利用的私有和全局狀態(tài)信息,采用集中式選擇行動策略,然后分配給單個智能體去執(zhí)行;分散式部分可觀MDP(decentralized partially observable MDP, Dec-POMDP)關(guān)注不確定性條件(動作和觀測)下多智能體的動作選擇與協(xié)調(diào)。Dec-POMDP模型中智能體的決策是分散式的,每個智能體根據(jù)自身所獲得的局部觀測信息獨立地做出決策。Doshi等[56]提出的交互式POMDP(interactive POMDP, I-POMDP)模型利用遞歸建模方法對其他智能體的行為進(jìn)行顯式建模,綜合利用博弈論與決策論來建模問題。早在20世紀(jì)50年代,由Shapley提出的隨機(jī)博弈[57],通常也稱作MG[58],常被用來描述多智能體學(xué)習(xí)。當(dāng)前的一些研究將決策論與博弈論統(tǒng)合起來,認(rèn)為兩類模型都屬于部分可觀隨機(jī)博弈模型[59]。從博弈論視角來分析,兩大典型博弈模型(隨機(jī)博弈和擴(kuò)展式博弈)如圖5所示。最新的一些研究利用因子可觀隨機(jī)博弈模型來建模擴(kuò)展式博弈[58],探索利用強(qiáng)化學(xué)習(xí)等方法求解擴(kuò)展式博弈。
隨機(jī)博弈模型可分為面向合作的團(tuán)隊博弈模型、面向競爭對抗的零和博弈模型和面向競合(混合)的一般和博弈模型,如圖6所示。其中,團(tuán)隊博弈可廣泛用于對抗環(huán)境下的多智能體的合作交互建模,如即時策略游戲、無人集群對抗、聯(lián)網(wǎng)車輛調(diào)度等;零和博弈和一般和博弈常用于雙方或多方交互建模。擴(kuò)展式博弈包括正則式表示和序貫式表示兩種子類型:正則式表示[4]常用于同時行動交互場景描述,序貫式表示[60]常用于行為策略多階段交互場景描述;此外,回合制博弈[61]常用于雙方交替決策場景。
2.1.2 元博弈模型
元博弈,即博弈的博弈,常用于博弈策略空間分析[62],是研究經(jīng)驗博弈理論分析(empirical game theoretic analysis,EGTA)的基礎(chǔ)模型[63]。目前,已廣泛應(yīng)用于各種可采用模擬器仿真的現(xiàn)實場景:供應(yīng)鏈管理分析、廣告拍賣和能源市場;設(shè)計網(wǎng)絡(luò)路由協(xié)議,公共資源管理;對抗策略選擇、博弈策略動態(tài)分析等。博弈論與元博弈的相關(guān)要素對比如表1所示。
近年來,一些研究對博弈的策略空間幾何形態(tài)進(jìn)行了探索。Jiang等人[64]首次利用組合霍奇理論研究圖上的亥姆霍茲分解。Candogan等人[65]探索了策略博弈的流表示,提出策略博弈主要由勢部分、調(diào)和部分和非策略部分組成。Hwang等人[66]從策略等價的角度研究了正則式博弈的分解方法。Balduzzi等人[67]提出任何一個泛函式博弈(functional-form game, FFG)都可以直和分解為傳遞壓制博弈和循環(huán)壓制博弈。對于對稱(單種群)零和博弈,可以采用舒爾分解、主成分分析、奇異值分解、t分布隨機(jī)鄰域嵌入等方法分析博弈的策略空間形態(tài)結(jié)構(gòu),如圖7所示。圖中展示了40個智能體的策略評估矩陣及其二維嵌入,顏色從紅至綠對應(yīng)歸一化至[-1,1]范圍的平均收益值;完全傳遞壓制博弈的二維嵌入近似一條線,完全循環(huán)壓制博弈的二維嵌入近似一個環(huán)。
Omidshafiei等人[7]利用智能體的對抗數(shù)據(jù),根據(jù)博弈收益,依次繪制響應(yīng)圖、直方圖,得到譜響應(yīng)圖、聚合響應(yīng)圖和收縮響應(yīng)圖,采用圖論對傳遞博弈與循環(huán)博弈進(jìn)行拓?fù)浞治?,繪制智能體的博弈策略特征圖,得出傳遞博弈與循環(huán)博弈特征距離較遠(yuǎn)。Czarnecki等人[68]根據(jù)現(xiàn)實世界中的各類博弈策略的空間分析提出博弈策略空間的陀螺幾何體模型猜想,如圖8所示,縱向表示傳遞壓制維,幾何體頂端為博弈的納什均衡,表征了策略之間的壓制關(guān)系,橫向表示循環(huán)壓制維,表征了策略之間可能存在的首尾嵌套非傳遞性壓制關(guān)系。
關(guān)于如何度量博弈策略的循環(huán)性壓制,即非傳遞性壓制,Czarnecki等人[68]指出可以采用策略集鄰接矩陣A(每個節(jié)點代表一個策略,如果策略i壓制策略j,則A_ij=1),通過計算diag(A^3)可以得到經(jīng)過每個策略、長度為3的循環(huán)壓制環(huán)的個數(shù),但由于節(jié)點可能被重復(fù)訪問,diag(A^p)無法適用于更長的循環(huán)壓制環(huán)。此外,納什聚類方法也可用于分析循環(huán)壓制環(huán)的長度,其中傳遞性壓制強(qiáng)度對應(yīng)策略所處聚類的索引,循環(huán)性壓制程度對應(yīng)聚類類別的大小。Sanjaya等人[69]利用真實的國際象棋比賽數(shù)據(jù)實證分析了人類玩家策略的循環(huán)性壓制。此類結(jié)論表明,只有當(dāng)智能體的策略種群足夠大后,才能克服循環(huán)性壓制并產(chǎn)生相變,學(xué)習(xí)收斂至更強(qiáng)的近似納什均衡策略。
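下面給出一個基于numpy的最小示意代碼(假設(shè)性示例,矩陣取值為虛構(gòu)的玩具數(shù)據(jù),并非文獻(xiàn)[68]的原始實現(xiàn)),演示如何利用diag(A^3)統(tǒng)計長度為3的循環(huán)壓制環(huán):

```python
import numpy as np

# 策略壓制關(guān)系鄰接矩陣A: A[i, j] = 1 表示策略i壓制策略j
# 此處以"石頭-剪刀-布"三個策略構(gòu)成的循環(huán)壓制結(jié)構(gòu)為例(假設(shè)性玩具數(shù)據(jù))
A = np.array([[0, 1, 0],
              [0, 0, 1],
              [1, 0, 0]])

# diag(A^3)[i] 給出從策略i出發(fā)、長度為3并回到i的壓制路徑條數(shù)
A3 = np.linalg.matrix_power(A, 3)
print("各策略所處的3-循環(huán)數(shù):", np.diag(A3))
num_3_cycles = np.trace(A3) // 3   # 每個3-循環(huán)被其3個起點各計一次
print("長度為3的循環(huán)壓制環(huán)個數(shù):", num_3_cycles)
```

如正文所述,當(dāng)循環(huán)長度更大時,上述計數(shù)會把重復(fù)訪問節(jié)點的閉合游走也計入,因此不能直接推廣到更長的循環(huán)壓制環(huán)。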
Tuyls等人[70]證明了元博弈的納什均衡是原始博弈的2ε納什均衡,并利用Hoeffding不等式給出了批處理單獨采樣和均勻采樣兩種情況下的均衡概率收斂的有效樣本需求界。Viqueira等人[71]利用Hoeffding界和Rademacher復(fù)雜性分析了元博弈,得出基于仿真學(xué)習(xí)到的博弈均衡以很高概率是元博弈的近似均衡,同時元博弈的近似均衡是仿真博弈的近似均衡。
2.2 均衡解概念與博弈動力學(xué)
2.2.1 均衡解概念
從博弈論視角分析多智能體學(xué)習(xí)需要對其中的博弈均衡解概念做細(xì)致分析。許多博弈沒有純納什均衡,但一定存在混合納什均衡,如圖9所示。比較而言,相關(guān)均衡容易計算,粗相關(guān)均衡非常容易計算[72]。
由于學(xué)習(xí)場景和目標(biāo)的差別,一些新的均衡解概念也被采納:面向安全攻防博弈的斯坦克爾伯格均衡[73],面向有限理性的量子響應(yīng)均衡[74],面向演化博弈的演化穩(wěn)定策略[53],面向策略空間博弈的元博弈均衡[75],穩(wěn)定對抗干擾的魯棒均衡[76]也稱顫抖手均衡[77],處理非完備信息的貝葉斯均衡[78],處理在線決策的無悔或最小后悔值[79],描述智能體在沒有使其他智能體情況變壞的前提下使得自身策略變好的帕累托最優(yōu)[80],以及面向常和隨機(jī)博弈的馬爾可夫完美均衡[81]等。近年來,一些研究采用團(tuán)隊最大最小均衡[82]來描述零和博弈場景下組隊智能體對抗單個智能體,其本質(zhì)是一類對抗團(tuán)隊博弈[83]模型,可用于解決網(wǎng)絡(luò)阻斷[84]類問題、多人撲克[85]問題和橋牌問題[86]。同樣,一些基于“相關(guān)均衡”[87]解概念的新模型相繼被提出,應(yīng)用于元博弈[88]、擴(kuò)展式博弈[89]、一般和博弈[90]、零和同時行動隨機(jī)博弈[91]等。正是由于均衡解的計算復(fù)雜度比較高,當(dāng)前一些近似均衡的解概念得到了廣泛運(yùn)用,如最佳響應(yīng)[92]和預(yù)言機(jī)[93]等。
2.2.2 博弈動力學(xué)
博弈原本就是描述個體之間的動態(tài)交互過程。對于一般的勢博弈,從任意一個局勢開始,最佳響應(yīng)動力學(xué)可確保收斂到一個純納什均衡[94]。最佳響應(yīng)動力學(xué)過程十分直接,每個智能體可以通過連續(xù)性的單方策略改變來搜索博弈的純策略納什均衡。
最佳響應(yīng)動力學(xué):只要當(dāng)前的局勢s不是一個純納什均衡,任意選擇一個智能體i以及一個對i有利的策略改變s′_i,然后更新局勢為(s′_i, s_{-i})。
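下面以一個簡單的雙資源擁塞博弈(勢博弈)為例,給出最佳響應(yīng)動力學(xué)的示意性實現(xiàn)(假設(shè)性代碼,函數(shù)名與參數(shù)均為說明性設(shè)定),可以觀察到迭代在有限步內(nèi)停止于純策略納什均衡:

```python
import random

# 擁塞博弈(勢博弈)示例: n個智能體在兩條道路間選擇, 代價等于所選道路上的人數(shù)。
def best_response_dynamics(n_agents=9, max_iters=1000, seed=0):
    rng = random.Random(seed)
    choice = [rng.randint(0, 1) for _ in range(n_agents)]   # 初始局勢
    for _ in range(max_iters):
        improved = False
        for i in range(n_agents):
            load = [choice.count(0), choice.count(1)]
            cur, other = choice[i], 1 - choice[i]
            # 換路后的代價: 另一條路現(xiàn)有人數(shù)+1; 不換的代價: 本路人數(shù)
            if load[other] + 1 < load[cur]:
                choice[i] = other            # 單方有利的策略改變
                improved = True
        if not improved:                      # 無人愿意單方偏離, 即純納什均衡
            return choice
    return choice

print(best_response_dynamics())   # 期望輸出接近均衡的5/4分配
```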
最佳響應(yīng)動力學(xué)只能收斂到純策略納什均衡且與勢博弈緊密相關(guān);而在任意有限博弈中,無悔學(xué)習(xí)動力學(xué)可確保收斂到粗相關(guān)均衡[95]。對任意時間點t=1,2,…,T,假定每個智能體i獲得收益向量c^t_i,并使用無悔算法獨立地選擇一個混合策略p^t_i;給定其他智能體的混合策略σ^t_{-i}=∏_{j≠i}p^t_j,則智能體i選擇純策略s_i的期望收益為π^t_i(s_i)=E_{s^t_{-i}~σ^t_{-i}}[π_i(s_i, s^t_{-i})]。
無悔學(xué)習(xí)方法:如果對于任意ε>0,都存在一個充分大的時間域T=T(ε)使得對于在線決策算法M的任意對手,決策者的后悔值最多為ε,將稱方法M為無悔的。
無交換后悔動力學(xué)可確保學(xué)習(xí)收斂至相關(guān)均衡[63]。相關(guān)均衡與無交換后悔動力學(xué)的聯(lián)系,同粗相關(guān)均衡與無悔動力學(xué)的聯(lián)系一樣。
無交換后悔學(xué)習(xí)方法:如果對于任意ε>0,都存在一個充分大的時間域T=T(ε)使得對于在線決策方法M的任意對手,決策者的期望交換后悔值最多為ε,將稱方法M為無交換后悔的。
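作為上述無悔學(xué)習(xí)定義的一個說明,下面給出Hedge(乘性權(quán)重更新)在“石頭剪刀布”零和矩陣博弈中的最小示意實現(xiàn)(假設(shè)性代碼,學(xué)習(xí)率等參數(shù)為任意選?。弘p方各自獨立運(yùn)行無悔算法,其時間平均策略趨近近似均衡(一般和博弈情形下對應(yīng)粗相關(guān)均衡):

```python
import numpy as np

A = np.array([[0, -1, 1],
              [1, 0, -1],
              [-1, 1, 0]], dtype=float)     # 行方收益矩陣(石頭剪刀布)

def hedge(T=5000, eta=0.05):
    wx, wy = np.ones(3), np.ones(3)          # 雙方的權(quán)重向量
    avg_x, avg_y = np.zeros(3), np.zeros(3)
    for _ in range(T):
        x, y = wx / wx.sum(), wy / wy.sum()
        avg_x += x
        avg_y += y
        ux, uy = A @ y, -(A.T @ x)           # 對手策略固定時各純策略的期望收益
        wx *= np.exp(eta * ux); wx /= wx.sum()   # 乘性權(quán)重更新并歸一化
        wy *= np.exp(eta * uy); wy /= wy.sum()
    return avg_x / T, avg_y / T

print(hedge())   # 兩個時間平均策略均應(yīng)接近(1/3, 1/3, 1/3)
```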
對于多智能體之間的動態(tài)交互一般可以采用種群演化博弈理論里的復(fù)制者動態(tài)方程[53]或偏微分方程[96]進(jìn)行描述。Leonardos等人[97]利用突變理論證明了軟Q學(xué)習(xí)在異質(zhì)學(xué)習(xí)智能體的加權(quán)勢博弈中總能收斂到量子響應(yīng)均衡。
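下面給出復(fù)制者動態(tài)方程的一個離散化數(shù)值模擬示例(示意性代碼,收益矩陣取經(jīng)典“鷹-鴿”博弈的一組假設(shè)參數(shù)),可以看到種群份額收斂到演化穩(wěn)定策略:

```python
import numpy as np

# 復(fù)制者動態(tài): x_i的增長率正比于其適應(yīng)度與種群平均適應(yīng)度之差
A = np.array([[-1.0, 4.0],     # 鷹對(鷹, 鴿)的收益, 取V=4, C=6
              [0.0, 2.0]])     # 鴿對(鷹, 鴿)的收益

x = np.array([0.9, 0.1])       # 初始種群分布(鷹占90%)
dt = 0.01
for _ in range(20000):
    fitness = A @ x
    avg_fit = x @ fitness
    x = x + dt * x * (fitness - avg_fit)   # 歐拉法離散化復(fù)制者動態(tài)方程
    x = np.clip(x, 0.0, None)
    x /= x.sum()
print(x)   # 鷹-鴿博弈的演化穩(wěn)定策略約為(2/3, 1/3)
```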
2.3 多智能體博弈學(xué)習(xí)的挑戰(zhàn)
2.3.1 學(xué)習(xí)目標(biāo)多樣
學(xué)習(xí)目標(biāo)支配著多智能體學(xué)習(xí)的整個過程,為學(xué)習(xí)方法的評估提供了依據(jù)。Powers等人[98]在2004年將多智能體學(xué)習(xí)的學(xué)習(xí)目標(biāo)歸類為:理性、收斂性、安全性、一致性、相容性、目標(biāo)最優(yōu)性等。Busoniu等人[38]將學(xué)習(xí)的目標(biāo)歸納為兩大類:穩(wěn)定性(收斂性、均衡學(xué)習(xí)、可預(yù)測、對手無關(guān)性)和適應(yīng)性(理性、無悔、目標(biāo)最優(yōu)性、安全性、對手察覺)。Digiovanni等人[99]將帕累托有效性也看作是多智能體學(xué)習(xí)目標(biāo)。多智能體學(xué)習(xí)目標(biāo)如表2所示,穩(wěn)定性表征了學(xué)習(xí)到一個平穩(wěn)策略的能力,收斂到某個均衡解,可學(xué)習(xí)近似模型用于預(yù)測推理,學(xué)習(xí)到的平穩(wěn)策略與對手無關(guān);適應(yīng)性表征了智能體能夠根據(jù)所處環(huán)境,感知對手狀態(tài),理性分析對手模型,做出最佳響應(yīng),在線博弈時可以學(xué)習(xí)一個回報不差于平穩(wěn)策略的無悔響應(yīng);目標(biāo)最優(yōu)、相容性與帕累托有效性、安全性表征了其他智能體可能采用固定策略、自對弈學(xué)習(xí)方法時,當(dāng)前智能體仍能適應(yīng)變化的對手,達(dá)到目標(biāo)最優(yōu)的適應(yīng)性要求。
2.3.2 環(huán)境(對手)非平穩(wěn)
多智能體學(xué)習(xí)過程中,環(huán)境狀態(tài)和獎勵都是由所有智能體的動作共同決定的;各智能體的策略都根據(jù)獎勵同時優(yōu)化;每個智能體只能控制自身策略?;谶@3個特點,非平穩(wěn)性成為影響多智能體學(xué)習(xí)求解最優(yōu)聯(lián)合策略的阻礙,并發(fā)學(xué)習(xí)的非平穩(wěn)性包括策略非平穩(wěn)性和個體策略學(xué)習(xí)環(huán)境非平穩(wěn)性。當(dāng)某個智能體根據(jù)其他智能體的策略調(diào)整自身策略以求達(dá)到更好的協(xié)作效果時,其他智能體也相應(yīng)地為了適應(yīng)該智能體的策略調(diào)整了自己的策略,這就導(dǎo)致該智能體調(diào)整策略的依據(jù)已經(jīng)“過時”,從而無法達(dá)到良好的協(xié)調(diào)效果。從優(yōu)化的角度看,其他智能體策略的非平穩(wěn)性導(dǎo)致智能體自身策略的優(yōu)化目標(biāo)是動態(tài)的,從而造成各智能體策略相互適應(yīng)的滯后性。非平穩(wěn)性作為多智能體問題面臨的最大挑戰(zhàn),如圖10所示,當(dāng)前的處理方法主要有五大類:無視[109],假設(shè)環(huán)境(對手)是平穩(wěn)的;遺忘[110],采用無模型方法,忘記過去的信息同時更新最新的觀測;標(biāo)定對手模型[111],針對預(yù)定義對手進(jìn)行己方策略優(yōu)化;學(xué)習(xí)對手模型的方法[112],采用基于模型的學(xué)習(xí)方法學(xué)習(xí)對手行動策略;基于心智理論的遞歸推理方法[113],智能體采用認(rèn)知層次理論遞歸推理雙方策略。
面對有限理性或欺騙型對手,對手建模(也稱智能體建模)已然成為智能體博弈對抗時必須擁有的能力[114],同集中式訓(xùn)練分散式執(zhí)行、元學(xué)習(xí)、多智能體通信建模為非平穩(wěn)問題的處理提供了技術(shù)支撐[115]。
2.3.3 均衡難解且易變
由于狀態(tài)空間和智能體數(shù)量的增加,多智能體學(xué)習(xí)問題的計算復(fù)雜度比較大。計算兩人(局中人常用于博弈模型描述,智能體常用于學(xué)習(xí)類模型描述,本文部分語境中兩者等價)零和博弈的納什均衡解是多項式時間內(nèi)可解問題[94],兩人一般和博弈的納什均衡解是有向圖的多項式奇偶性論據(jù)(polynomial parity argument on directed graphs, PPAD)難問題[116],納什均衡的存在性判定問題是非確定性多項式(non-deterministic polynomial, NP)時間計算難問題[117],隨機(jī)博弈的純策略納什均衡存在性判定問題是多項式空間(polynomial space, PSPACE)難問題[118]。多人博弈更是面臨“納什均衡存在性”“計算復(fù)雜度高”“均衡選擇難”等挑戰(zhàn)。
對于多智能體場景,如果每個智能體獨立計算納什均衡策略,那么策略組合可能并不是整體的納什均衡,且個別智能體可能具有多個均衡策略、偏離動機(jī)等。檸檬水站位博弈[119]如圖11所示,每個智能體需要在圓環(huán)中找到一個站位,使自己與其他所有智能體的距離總和最大(見圖11(a)),所有智能體沿環(huán)均勻分布就是納什均衡,由于這種分布有無限多種方式實現(xiàn),因此納什均衡的個數(shù)無限多,原問題變成了“均衡選擇”問題,但如果每個智能體都獨立計算各自的納什均衡策略,那么組合策略可能并非整體的納什均衡策略(見圖11(b))。
正是由于多維目標(biāo)、非平穩(wěn)環(huán)境、大規(guī)模狀態(tài)行為空間、不完全信息與不確定性因素等影響,高度復(fù)雜的多智能體學(xué)習(xí)問題面臨諸多挑戰(zhàn),已然十分難以求解。
3 多智能體博弈學(xué)習(xí)方法
根據(jù)多智能體博弈對抗的場景(離線和在線)的不同,可以將多智能體博弈策略學(xué)習(xí)方法分為離線學(xué)習(xí)預(yù)訓(xùn)練/藍(lán)圖策略的方法與在線學(xué)習(xí)適變/反制策略的方法等。
3.1 離線場景博弈策略學(xué)習(xí)方法
3.1.1 隨機(jī)博弈策略學(xué)習(xí)方法
當(dāng)前,直接面向博弈均衡的學(xué)習(xí)方法主要為一類基于值函數(shù)的策略學(xué)習(xí)。根據(jù)博弈類型(合作博弈、零和博弈及一般和博弈)的不同,均衡學(xué)習(xí)方法主要分為三大類,如表3所示。其中,Team Q[106]是一種直接學(xué)習(xí)聯(lián)合策略的方法;Distributed Q[120]采用樂觀單調(diào)更新本地策略,可收斂到最優(yōu)聯(lián)合策略;JAL(joint action learner)[121]方法通過將強(qiáng)化學(xué)習(xí)與均衡學(xué)習(xí)方法相結(jié)合來學(xué)習(xí)自己的行動與其他智能體的行動值函數(shù);OAL(optimal adaptive learning)方法[122]是一種最優(yōu)自適應(yīng)學(xué)習(xí)方法,通過構(gòu)建弱非循環(huán)博弈來學(xué)習(xí)博弈結(jié)構(gòu),消除所有次優(yōu)聯(lián)合動作,被證明可以收斂至最優(yōu)聯(lián)合策略;Decentralized Q[123]是一類基于OAL的方法,被證明可漸近收斂至最優(yōu)聯(lián)合策略。Minimax Q方法[106]應(yīng)用于兩人零和隨機(jī)博弈。Nash Q方法[124]將Minimax Q方法從零和博弈擴(kuò)展到多人一般和博弈;相關(guān)均衡(correlated equilibrium, CE)Q方法[125]是一類圍繞相關(guān)均衡的多智能體Q學(xué)習(xí)方法;Asymmetric Q[126]是一類圍繞斯坦克爾伯格均衡的多智能體Q學(xué)習(xí)方法;敵或友Q(friend-or-foe Q, FFQ)學(xué)習(xí)方法[127]將其他所有智能體分為兩組,一組為朋友,可幫助一起最大化獎勵回報,另一組為敵人,試圖降低獎勵回報;贏或快學(xué)(win or learn fast, WoLF)方法[100]通過設(shè)置有利和不利兩種情況下的策略更新步長學(xué)習(xí)最優(yōu)策略。此外,這類方法還有無窮小梯度上升(infinitesimal gradient ascent, IGA)[128]、廣義IGA(generalized IGA, GIGA)[129]、適應(yīng)平衡或均衡(adapt when everybody is stationary otherwise move to equilibrium, AWESOME)[130]等。
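以表3中的Minimax Q方法為例,其核心是在每個狀態(tài)上通過線性規(guī)劃求解極大極小混合策略與狀態(tài)值,下面給出該步驟的示意性實現(xiàn)(假設(shè)性代碼,使用scipy.optimize.linprog,函數(shù)名與矩陣取值均為說明性設(shè)定):

```python
import numpy as np
from scipy.optimize import linprog

# 給定某狀態(tài)下的Q(s, a, o)收益矩陣, 求己方的極大極小混合策略與狀態(tài)值V(s)
def solve_maximin(Q):
    n_a, n_o = Q.shape
    # 決策變量: [pi_1..pi_na, v]; 目標(biāo): 最大化v, 即最小化-v
    c = np.zeros(n_a + 1); c[-1] = -1.0
    # 約束: 對任意對手動作o, sum_a pi_a*Q[a,o] >= v  <=>  -Q^T pi + v <= 0
    A_ub = np.hstack([-Q.T, np.ones((n_o, 1))])
    b_ub = np.zeros(n_o)
    A_eq = np.hstack([np.ones((1, n_a)), np.zeros((1, 1))])   # 概率歸一化
    b_eq = np.array([1.0])
    bounds = [(0, None)] * n_a + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n_a], res.x[-1]      # (混合策略, 狀態(tài)值)

pi, v = solve_maximin(np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]]))
print(pi, v)   # 石頭剪刀布: 策略約為均勻分布, 狀態(tài)值約為0
```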
當(dāng)前,多智能體強(qiáng)化學(xué)習(xí)方法得到了廣泛研究,但此類方法的學(xué)習(xí)目標(biāo)是博弈最佳響應(yīng)。研究人員陸續(xù)采用獨立學(xué)習(xí)、聯(lián)合學(xué)習(xí)、集中式訓(xùn)練分散式執(zhí)行、利用協(xié)作圖等多種方法設(shè)計多智能體強(qiáng)化學(xué)習(xí)方法。本文根據(jù)訓(xùn)練和執(zhí)行方式,將多智能體強(qiáng)化學(xué)習(xí)方法分為四類:完全分散式、完全集中式、集中式訓(xùn)練分散式執(zhí)行和聯(lián)網(wǎng)分散式訓(xùn)練,如表4所示。
對于完全分散式學(xué)習(xí)方法,研究者們在獨立Q學(xué)習(xí)方法的基礎(chǔ)上對價值函數(shù)更新方式進(jìn)行了改進(jìn)。Distributed Q學(xué)習(xí)方法[131]將智能體的個體動作價值函數(shù)視為聯(lián)合動作價值函數(shù)的樂觀映射,設(shè)置價值函數(shù)只有在智能體與環(huán)境和其他智能體的交互使對應(yīng)動作的價值函數(shù)增大時才更新。而Hysteretic Q學(xué)習(xí)方法[132]通過啟發(fā)式信息區(qū)分“獎勵”和“懲罰”兩種情況,分別設(shè)置兩個差別較大的學(xué)習(xí)率克服隨機(jī)變化的環(huán)境狀態(tài)和多最優(yōu)聯(lián)合策略情況。頻率最大Q值(frequency maximum Q, FMQ)方法[133]引入最大獎勵頻率這一啟發(fā)信息,使智能體在進(jìn)行動作選擇時傾向曾經(jīng)導(dǎo)致最大獎勵的動作,鼓勵智能體的個體策略在探索時傾向曾經(jīng)頻繁獲得最大獎勵的策略,從而提高與其他智能體策略協(xié)調(diào)的可能性。Lenient式多智能體強(qiáng)化學(xué)習(xí)方法[134]采用忽略低回報行為的寬容式學(xué)習(xí)方法。Distributed Lenient Q[135]采用分布式的方法組織Lenient值函數(shù)的學(xué)習(xí)。
對于完全集中式學(xué)習(xí)方法,通信網(wǎng)絡(luò)(communication network, CommNet)方法[136]是一種基于中心化的多智能體協(xié)同決策方法,所有的智能體模塊網(wǎng)絡(luò)會進(jìn)行參數(shù)共享,獎勵通過平均的方式分配給每個智能體。該方法接收所有智能體的局部觀察作為輸入,然后輸出所有智能體的決策,因此輸入數(shù)據(jù)維度過大會給方法訓(xùn)練造成困難。雙向協(xié)調(diào)網(wǎng)絡(luò)(bidirectionally coordinated network, BiCNet)方法[137]通過一個基于雙向循環(huán)神經(jīng)網(wǎng)絡(luò)的確定性行動者-評論家(actor-critic, AC)結(jié)構(gòu)來學(xué)習(xí)多智能體之間的通信協(xié)議,在無監(jiān)督情況下,可以學(xué)習(xí)各種類型的高級協(xié)調(diào)策略。集中式訓(xùn)練分散式執(zhí)行為解決多智能體問題提供了一種比較通用的框架。反事實多智能體(counterfactual multi-agent, COMA)方法[138]旨在解決Dec-POMDP問題中的多智能體信度分配問題,即在合作環(huán)境中,聯(lián)合動作通常只會產(chǎn)生全局性的收益,這使得每個智能體很難推斷出自己對團(tuán)隊成功的貢獻(xiàn)。該方法采用反事實思維,使用一個反事實基線,將單個智能體的行為邊際化,同時保持其他智能體的行為固定,COMA基于AC實現(xiàn)了集中訓(xùn)練分散執(zhí)行,適用于合作型任務(wù)。多智能體深度確定性策略梯度(multi-agent deep deterministic policy gradient, MADDPG)方法[139]是DDPG方法為適應(yīng)多智能體環(huán)境所做的改進(jìn),最核心的部分就是每個智能體擁有自己獨立的AC網(wǎng)絡(luò)和獨立的回報函數(shù),critic部分能夠獲取其他所有智能體的動作信息,進(jìn)行中心化訓(xùn)練和非中心化執(zhí)行,即在訓(xùn)練的時候,引入可以觀察全局的critic來指導(dǎo)訓(xùn)練,而測試階段便不再有任何通信交流,只使用有局部觀測的actor采取行動。因此,MADDPG方法可以同時解決協(xié)作環(huán)境、競爭環(huán)境以及混合環(huán)境下的多智能體問題。多智能體軟Q學(xué)習(xí)(multi-agent soft Q learning, MASQL)[140]方法利用最大熵構(gòu)造軟值函數(shù)來解決多智能體環(huán)境中廣泛出現(xiàn)的“相對過泛化”引起的最優(yōu)動作遮蔽問題。
此外,值分解網(wǎng)絡(luò)(value-decomposition networks, VDN)[141]、Q混合(Q mix, QMIX)[142]、多智能體變分探索(multi-agent variational exploration, MAVEN)[143]、Q變換(Q transformation, QTRAN)[144]等方法采用值函數(shù)分解的思想,按照智能體對環(huán)境的聯(lián)合回報的貢獻(xiàn)大小分解全局Q函數(shù),很好地解決了信度分配問題,但是現(xiàn)有分解機(jī)制缺乏普適性。VDN方法基于深度循環(huán)Q網(wǎng)絡(luò)(deep recurrent Q-network, DRQN)提出了值分解網(wǎng)絡(luò)架構(gòu),中心化地訓(xùn)練一個由所有智能體局部Q網(wǎng)絡(luò)加和得到的聯(lián)合Q網(wǎng)絡(luò),訓(xùn)練完畢后每個智能體擁有只基于自身局部觀察的Q網(wǎng)絡(luò),可以實現(xiàn)去中心化執(zhí)行。該方法解耦了智能體之間復(fù)雜的關(guān)系,還解決了由于部分可觀察導(dǎo)致的偽收益和懶惰智能體問題。由于VDN求解聯(lián)合價值函數(shù)時只是通過對單智能體的價值函數(shù)簡單求和得到,使得學(xué)到的局部Q值函數(shù)表達(dá)能力有限,無法表征智能體之間更復(fù)雜的相互關(guān)系,QMIX對從單智能體價值函數(shù)到團(tuán)隊價值函數(shù)之間的映射關(guān)系進(jìn)行了改進(jìn),在映射的過程中將原來的線性映射換為非線性映射,并通過超網(wǎng)絡(luò)的引入將額外狀態(tài)信息加入到映射過程,提高了模型性能。MAVEN采用了增加互信息變分探索的方法,通過引入一個面向?qū)哟慰刂频碾[層空間來混合基于值和基于策略的學(xué)習(xí)方法。QTRAN提出了一種更加泛化的值分解方法,從而成功分解任何可分解的任務(wù),但是對于無法分解的協(xié)作任務(wù)的問題并未涉及。Q行列式點過程(Q determinantal point process, Q-DPP)[145]方法采用行列式點過程方法度量多樣性,加速策略探索。多智能體近端策略優(yōu)化(multi-agent proximal policy optimization, MAPPO)[146]方法直接采用多個PPO算法和廣義優(yōu)勢估計、觀測和層歸一化、梯度和值函數(shù)裁剪等實踐技巧,在多類合作場景中表現(xiàn)較好。Shapley Q學(xué)習(xí)方法[147]采用合作博弈理論建模、利用Shapley值來引導(dǎo)值函數(shù)分析,為信度分配提供了可解釋方案。
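以VDN為例,值函數(shù)分解的核心是“聯(lián)合Q值等于各智能體局部Q值之和”,下面給出一個基于PyTorch的最小示意實現(xiàn)(假設(shè)性代碼,網(wǎng)絡(luò)結(jié)構(gòu)與維度均為說明性設(shè)定,省略了訓(xùn)練循環(huán)):

```python
import torch
import torch.nn as nn

class AgentQ(nn.Module):
    """單個智能體基于局部觀測的Q網(wǎng)絡(luò)(執(zhí)行階段各自貪婪選動作)。"""
    def __init__(self, obs_dim, n_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, n_actions))
    def forward(self, obs):
        return self.net(obs)

class VDN(nn.Module):
    """聯(lián)合Q值 = 各智能體局部Q值之和, 訓(xùn)練時集中優(yōu)化聯(lián)合Q。"""
    def __init__(self, n_agents, obs_dim, n_actions):
        super().__init__()
        self.agents = nn.ModuleList(AgentQ(obs_dim, n_actions) for _ in range(n_agents))
    def forward(self, obs_batch, actions):
        # obs_batch: [batch, n_agents, obs_dim], actions: [batch, n_agents]
        q_locals = [q(obs_batch[:, i]).gather(1, actions[:, i:i+1])
                    for i, q in enumerate(self.agents)]
        return torch.cat(q_locals, dim=1).sum(dim=1, keepdim=True)   # 聯(lián)合Q值

vdn = VDN(n_agents=3, obs_dim=8, n_actions=5)
obs = torch.randn(4, 3, 8)
acts = torch.randint(0, 5, (4, 3))
print(vdn(obs, acts).shape)   # torch.Size([4, 1])
```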
聯(lián)網(wǎng)分散式訓(xùn)練方法是一類利用時變通信網(wǎng)絡(luò)的學(xué)習(xí)方法,其決策過程可建模成時空MDP,智能體位于時變通信網(wǎng)絡(luò)的節(jié)點上。每個智能體基于其本地觀測和連接的臨近智能體提供的信息來學(xué)習(xí)分散的控制策略,智能體會得到局部獎勵。擬合Q迭代(fitted Q iteration, FQI)[148]方法采用神經(jīng)擬合Q值函數(shù),分布式非精確梯度(distributed inexact gradient, DIGing)[149]方法是一種基于時變圖拓?fù)涞姆植际絻?yōu)化方法,多智能體AC(multiagent AC, MAAC)[150]方法是基于AC算法提出來的,每個智能體都有自己獨立的actor網(wǎng)絡(luò)和critic網(wǎng)絡(luò),每個智能體都可以獨立決策并接收局部獎勵,同時在網(wǎng)絡(luò)上與臨近智能體交換信息以得到最佳的全網(wǎng)絡(luò)平均回報,該方法提供了收斂性的保證。多智能體帶來的維數(shù)詛咒和解概念難計算等問題使得此類方法的設(shè)計很具挑戰(zhàn)性,擴(kuò)展AC(scalable AC, SAC)[151]方法是一種可擴(kuò)展的AC方法,可以學(xué)習(xí)一種近似最優(yōu)的局部策略來優(yōu)化平均獎勵,其復(fù)雜性隨局部智能體(而不是整個網(wǎng)絡(luò))的狀態(tài)-行動空間大小而變化。神經(jīng)通信(neural communication, NeurComm)[152]是一種可分解通信協(xié)議,可以自適應(yīng)地共享系統(tǒng)狀態(tài)和智能體行為的信息,該算法的提出是為了減少學(xué)習(xí)中的信息損失和解決非平穩(wěn)性問題,為設(shè)計自適應(yīng)和高效的通信學(xué)習(xí)方法提供了支撐。近似多智能體擬合Q迭代(approximate multiagent fitted Q iteration, AMAFQI)[153]是一種多智能體批強(qiáng)化學(xué)習(xí)的有效逼近方法,其提出的迭代策略搜索對集中式標(biāo)準(zhǔn)Q函數(shù)的多個近似產(chǎn)生貪婪策略。
圍繞聯(lián)網(wǎng)條件下合作性或競爭性多智能體強(qiáng)化學(xué)習(xí)問題,Zhang等[154]提出了利用值函數(shù)近似的分散式擬合Q迭代方法,合作場景中聯(lián)網(wǎng)智能體團(tuán)隊以最大化所有智能體獲得的累積折扣獎勵的全局平均值為目標(biāo),對抗場景中兩個聯(lián)網(wǎng)團(tuán)隊以零和博弈的納什均衡為目標(biāo)。
3.1.2 擴(kuò)展式博弈策略學(xué)習(xí)方法
對于完美信息的擴(kuò)展式博弈可以通過線性規(guī)劃等組合優(yōu)化方法來求解。近年來,由于計算博弈論在非完美信息博弈領(lǐng)域取得的突破,基于后悔值的方法得到廣泛關(guān)注。當(dāng)前,面向納什均衡、相關(guān)均衡、粗相關(guān)均衡、擴(kuò)展形式相關(guān)均衡的相關(guān)求解方法如表5所示。其中,面向兩人零和博弈的組合優(yōu)化方法主要有線性規(guī)劃(linear programming, LP)[155]、過大間隙技術(shù)(excessive gap technique, EGT)[156]、鏡像近似(mirror prox, MP)[157]、投影次梯度下降(projected subgradient descent, PSD)[158]、可利用性下降(exploitability descent, ED)[159]等方法,后悔值最小化方法主要有后悔值匹配[160]、CFR[161]、Hedge[162]、乘性權(quán)重更新(multiplicative weight update, MWU)[163]、Hart后悔值匹配[164]等方法。面向兩人一般和博弈的組合優(yōu)化方法主要有Lemke-Howson[165]、支撐集枚舉混合整數(shù)線性規(guī)劃(support enumeration mixed-integer linear programming, SEMILP)[166]、混合方法[167]、列生成[168]和線性規(guī)劃方法[169],后悔值最小化方法主要有縮放延拓后悔最小化(scaled extension regret minimizer, SERM)[170]。面向多人一般和博弈的組合優(yōu)化方法主要有列生成方法[168]、反希望橢球法(ellipsoid against hope, EAH)[171],后悔值最小化方法主要有后悔值測試方法、基于采樣的CFR法(CFR-S)[172]和基于聯(lián)合策略重構(gòu)的CFR法(CFR-Jr)[172]等?;诤蠡谥档姆椒?其收斂速度一般為O(T^{-1/2}),一些研究借助在線凸優(yōu)化技術(shù)將收斂速度提升到O(T^{-3/4}),這類優(yōu)化方法(特別是一些加速一階優(yōu)化方法)理論上可以比后悔值方法更快收斂,但實際應(yīng)用中效果并不理想。
在求解大規(guī)模非完全信息兩人零和擴(kuò)展博弈問題中,算法博弈論方法與深度強(qiáng)化學(xué)習(xí)方法成效顯著,形成了以Pluribus、DeepStack等為代表的高水平德州撲克人工智能,在人機(jī)對抗中超越人類職業(yè)選手水平。其中,CFR類方法通過計算累計后悔值并依據(jù)后悔值匹配方法更新策略。深度強(qiáng)化學(xué)習(xí)類方法通過學(xué)習(xí)信息集上的值函數(shù)來更新博弈策略并收斂于近似納什均衡。近年來,一些研究利用Blackwell可達(dá)性理論[175],構(gòu)建起了在線凸優(yōu)化類方法與后悔值類方法之間的橋梁,F(xiàn)arina等人[176]證明了后悔值最小化(RM)及其變體RM+分別與跟隨正則化領(lǐng)先者和在線鏡像下降等價,收斂速度為O(T^{-1/2})。此外,一些研究表明后悔值與強(qiáng)化學(xué)習(xí)中的優(yōu)勢函數(shù)等價[177],現(xiàn)有強(qiáng)化學(xué)習(xí)方法通過引入“后悔值”概念,或者后悔值匹配更新方法,形成不同強(qiáng)化學(xué)習(xí)類方法,在提高收斂速率的同時,使得CFR方法的泛化性更強(qiáng)。三大類方法的緊密聯(lián)系為求解大規(guī)模兩人零和非完美信息博弈提供了新方向和新思路。非完美信息博弈求解方法主要有表格式、采樣類、函數(shù)近似和神經(jīng)網(wǎng)絡(luò)等CFR類方法,優(yōu)化方法和強(qiáng)化學(xué)習(xí)類方法,如表6所示。
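CFR類方法在單個信息集上的核心更新即后悔值匹配,下面以“石頭剪刀布”矩陣博弈為例給出其示意性實現(xiàn)(假設(shè)性代碼,采用采樣式對局,迭代次數(shù)等參數(shù)為任意選?。?

```python
import numpy as np

A = np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]], dtype=float)   # 行方收益

def regret_matching(T=20000, rng=np.random.default_rng(0)):
    regret = [np.zeros(3), np.zeros(3)]          # 雙方的累積后悔值
    strategy_sum = [np.zeros(3), np.zeros(3)]    # 用于計算平均策略
    payoff = [A, -A.T]                           # 雙方各自視角的收益矩陣
    for _ in range(T):
        sigma = []
        for p in range(2):
            pos = np.maximum(regret[p], 0.0)     # 按正后悔值比例選擇策略
            sigma.append(pos / pos.sum() if pos.sum() > 0 else np.ones(3) / 3)
            strategy_sum[p] += sigma[p]
        a = [rng.choice(3, p=sigma[p]) for p in range(2)]
        for p in range(2):
            u = payoff[p][:, a[1 - p]]           # 對手動作固定時各純策略的收益
            regret[p] += u - u[a[p]]             # 累積后悔值更新
    return [s / s.sum() for s in strategy_sum]

print(regret_matching())   # 雙方平均策略均接近均勻分布
```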
基礎(chǔ)的表格類CFR方法受限于后悔值和平均策略的存儲空間,只能求解狀態(tài)空間約為10^14的博弈問題。CFR與抽象、剪枝、采樣、函數(shù)近似、神經(jīng)網(wǎng)絡(luò)估計等方法結(jié)合,衍生出一系列CFR類方法,試圖從加速收斂、減少內(nèi)存占用、縮減博弈樹規(guī)模等方面入手,為快速求解近似納什均衡解提供有效支撐。采樣類CFR方法中蒙特卡羅采樣是主流方法,蒙特卡羅CFR(Monte Carlo CFR, MCCFR)通過構(gòu)建生成式對手,大幅降低迭代時間、加快收斂速度。此外,并行計算、小批次、方差約減等技術(shù)被用于約束累積方差,如圖12所示,各類方法的采樣方式呈現(xiàn)出不同形態(tài)。
函數(shù)近似與神經(jīng)網(wǎng)絡(luò)類CFR方法主要采用擬合的方法估計反事實后悔值、累積后悔值,求解當(dāng)前策略或平均策略,相較于表格類方法泛化性更強(qiáng)。優(yōu)化方法有效利用了數(shù)學(xué)優(yōu)化類工具,將非完美信息博弈問題構(gòu)建成雙線性鞍點問題,充分利用離線生成函數(shù)、在線凸優(yōu)化方法、梯度估計與策略探索等方法,在小規(guī)模博弈上收斂速度快,但難以適應(yīng)大規(guī)模博弈的求解,應(yīng)用場景受限。傳統(tǒng)的強(qiáng)化學(xué)習(xí)方法主要是利用自對弈的方式生成對戰(zhàn)經(jīng)驗數(shù)據(jù)集,進(jìn)而學(xué)習(xí)魯棒的應(yīng)對策略,新型的強(qiáng)化學(xué)習(xí)方法將后悔值及可利用性作為強(qiáng)化學(xué)習(xí)的目標(biāo)函數(shù),面向大型博弈空間,由于策略空間的非傳遞性屬性和對手適變的非平穩(wěn)策略,兩類方法均面臨探索與利用難題。當(dāng)前,以CFR為代表的算法博弈論方法已經(jīng)取得了突破,優(yōu)化方法及強(qiáng)化學(xué)習(xí)方法的融合為設(shè)計更具泛化能力的方法提供了可能。
對于多人博弈,一類對抗團(tuán)隊博弈[209]模型得到了廣泛研究,其中團(tuán)隊最大最小均衡(team-maxmin equilibrium, TME)描述了一個擁有相同效用的團(tuán)隊與一個對手博弈對抗的解概念。針對智能體之間有無通信、有無事先通信、可否事中通信等情形,近年來的一些研究探索了相關(guān)解概念,如相關(guān)TME(correlated TME, CTME)、帶協(xié)同設(shè)備的TME (TME with coordination device, TMECor)、帶通信設(shè)備的TME (TME with communication device, TMECom),以及相關(guān)的均衡求解方法,如增量策略生成[210],其本質(zhì)是一類雙重預(yù)言機(jī)(double oracle, DO)方法,如表7所示,Zhang結(jié)合網(wǎng)絡(luò)阻斷應(yīng)用場景設(shè)計了多種對抗團(tuán)隊博弈求解方法[119]。此外,團(tuán)隊-對手博弈模型也被用來建模多對一的博弈情形。
3.1.3 元博弈種群策略學(xué)習(xí)方法
對于多智能體博弈策略均衡學(xué)習(xí)問題,近年來一些通用的框架相繼被提出,其中關(guān)于元博弈理論的學(xué)習(xí)框架為多智能體博弈策略的學(xué)習(xí)提供了指引。由于問題的復(fù)雜性,多智能體博弈策略學(xué)習(xí)表現(xiàn)出如下特點:基礎(chǔ)策略可以通過強(qiáng)化學(xué)習(xí)等方法很快生成,而較優(yōu)策略則依靠在已生成的策略池中緩慢迭代產(chǎn)生。當(dāng)前由強(qiáng)化學(xué)習(xí)支撐的策略快速生成“內(nèi)環(huán)學(xué)習(xí)器”和演化博弈理論支撐的種群策略緩慢迭代“外環(huán)學(xué)習(xí)器”組合成的“快與慢”雙環(huán)優(yōu)化方法[214],為多智能體博弈策略學(xué)習(xí)提供了基本參考框架。Lanctot等人[215]提出了面向多智能體強(qiáng)化學(xué)習(xí)的策略空間響應(yīng)預(yù)言機(jī)(policy space response oracle, PSRO)統(tǒng)一博弈學(xué)習(xí)框架,成功將雙重預(yù)言機(jī)這類迭代式增量策略生成方法擴(kuò)展為元博弈種群策略學(xué)習(xí)方法,其過程本質(zhì)上由“挑戰(zhàn)對手”和“響應(yīng)對手”兩個步驟組成。為了應(yīng)對一般和博弈,Muller等人[216]提出了基于α-排名和PSRO的通用學(xué)習(xí)方法框架。Sun等人[217]面向競爭式自對弈多智能體強(qiáng)化學(xué)習(xí),提出了分布式聯(lián)賽學(xué)習(xí)框架TLeague,可基于云服務(wù)架構(gòu)組織多智能體博弈策略學(xué)習(xí)。Zhou等人[218]基于種群多智能體強(qiáng)化學(xué)習(xí)提出了融合策略評估的多智能體庫(multi-agent library, MALib)并行學(xué)習(xí)框架。當(dāng)前多智能體博弈策略學(xué)習(xí)主要是通過算法驅(qū)動仿真器快速生成博弈對抗樣本,如圖13所示,得到收益張量M,元博弈求解器計算策略組合分布,進(jìn)而輔助挑戰(zhàn)下一輪對戰(zhàn)對手(末輪單個、最強(qiáng)k個、均勻采樣等),預(yù)言機(jī)主要負(fù)責(zé)生成最佳響應(yīng),為智能體的策略空間增加新策略。
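下面給出PSRO/雙重預(yù)言機(jī)在兩人零和矩陣博弈上的最小示意實現(xiàn)(假設(shè)性代碼:元博弈求解器用線性規(guī)劃求納什均衡,預(yù)言機(jī)用全局精確最佳響應(yīng)代替實際框架中的強(qiáng)化學(xué)習(xí)訓(xùn)練,函數(shù)名均為說明性設(shè)定):

```python
import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(M):
    """求行方在收益矩陣M上的納什混合策略(線性規(guī)劃, 充當(dāng)元博弈求解器)。"""
    n, m = M.shape
    c = np.zeros(n + 1); c[-1] = -1.0
    A_ub = np.hstack([-M.T, np.ones((m, 1))]); b_ub = np.zeros(m)
    A_eq = np.hstack([np.ones((1, n)), np.zeros((1, 1))]); b_eq = [1.0]
    res = linprog(c, A_ub, b_ub, A_eq, b_eq,
                  bounds=[(0, None)] * n + [(None, None)])
    return res.x[:n]

def psro(payoff, iters=10):
    G = np.asarray(payoff, dtype=float)             # 完整博弈(僅供預(yù)言機(jī)使用)
    rows, cols = [0], [0]                           # 雙方初始種群各含一個策略
    for _ in range(iters):
        M = G[np.ix_(rows, cols)]                   # 經(jīng)驗元博弈收益矩陣
        p = solve_zero_sum(M)                       # 行方元策略
        q = solve_zero_sum(-M.T)                    # 列方元策略
        br_row = int(np.argmax(G[:, cols] @ q))     # 行方最佳響應(yīng)預(yù)言機(jī)
        br_col = int(np.argmin(p @ G[rows, :]))     # 列方最佳響應(yīng)預(yù)言機(jī)
        if br_row in rows and br_col in cols:
            break                                   # 種群不再擴(kuò)張, 近似收斂
        rows = sorted(set(rows) | {br_row}); cols = sorted(set(cols) | {br_col})
    return rows, cols

print(psro(np.array([[0, -1, 1], [1, 0, -1], [-1, 1, 0]])))
```

實際的PSRO框架中,“最佳響應(yīng)預(yù)言機(jī)”通常由分布式強(qiáng)化學(xué)習(xí)訓(xùn)練得到,此處僅用精確最佳響應(yīng)說明“元博弈求解-響應(yīng)對手-擴(kuò)充種群”的外環(huán)流程。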
(1) 策略評估方法
多智能體博弈對抗過程中,由基礎(chǔ)“內(nèi)環(huán)學(xué)習(xí)器”快速生成的智能體模型池里,各類模型的能力水平各不相同,如何評估其能力用于外層的最優(yōu)博弈策略模型探索可以看作是一個多智能體交互機(jī)制設(shè)計問題,即如何按能力挑選智能體用于“外環(huán)學(xué)習(xí)器”策略探索。當(dāng)前,衡量博弈策略模型絕對能力的評估方法主要有可利用性[219]、方差[220]和保真性[221]等。此外,Cloud等人[220]采用三支分解方法度量智能體的技能、運(yùn)氣與非平穩(wěn)性等。
衡量相對能力的評估方法已經(jīng)成為當(dāng)前的主流[222]。由于博弈策略類型的不同,評估方法的適用也不盡相同。當(dāng)前策略評估方法主要分為面向傳遞性壓制博弈和面向循環(huán)性壓制博弈兩類:面向傳遞性壓制博弈的方法主要有Elo[223]、Glicko[224]和真實技能[225]等;面向循環(huán)性壓制博弈的方法主要有多維Elo(multidimensional Elo, mElo)[75]、納什平均[75]、α-排名[226-227]、響應(yīng)圖上置信界采樣(response graph-upper confidence bound, RG-UCB)[228]、基于信息增益的α排名(α information gain, αIG)[229]、最優(yōu)評估(optimal evaluation, OptEval)[230]等,各類方法相關(guān)特點如表8所示。
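以傳遞性壓制博弈評估中最常用的Elo方法為例,其更新規(guī)則可寫成如下示意代碼(K因子等參數(shù)取常見的假設(shè)值):

```python
# Elo評分更新: 期望勝率由評分差的邏輯斯蒂函數(shù)給出,
# 按實際結(jié)果與期望勝率之差調(diào)整評分(k為更新步長)。
def elo_update(r_a, r_b, score_a, k=32.0):
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

# 示例: 評分1600的智能體戰(zhàn)勝評分1500的智能體(score_a=1表示A勝)
print(elo_update(1600, 1500, 1.0))
```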
此外,最新的一些研究對智能體與任務(wù)的適配度[75]、游戲難度[231]、選手排名[232]、方法的性能[233]、在線評估[234]、大規(guī)模評估[235-236]、團(tuán)隊聚合技能評估[237]等問題展開了探索。通過策略評估,可以掌握種群中對手能力情況及自身能力等級,快速的評估方法可有效加快多樣性策略的探索速度。
(2) 策略提升方法
在“內(nèi)環(huán)學(xué)習(xí)器”完成了智能體博弈策略評估的基礎(chǔ)上,“外環(huán)學(xué)習(xí)器”需要通過與不同的“段位”的智能體進(jìn)行對抗,提升策略水平。傳統(tǒng)自對弈的方法對非傳遞壓制性博弈的策略探索作用不明顯。由于問題的復(fù)雜性,多智能體博弈策略的迭代提升需要一些新的方法模型,特別是需要能滿足策略提升的種群訓(xùn)練方法。博弈策略提升的主要方法有自對弈(self-play, SP)[238]、協(xié)同對弈(co-play, CP)[239]、FSP[240]和種群對弈(population play, PP)[241]等多類方法,如圖14所示。但各類方法的適用有所區(qū)分,研究表明僅當(dāng)策略探索至種群數(shù)量足夠多、多樣性滿足條件后,這類迭代式學(xué)習(xí)過程才能產(chǎn)生相變,傳統(tǒng)的自對弈方法只有當(dāng)策略的“傳遞壓制維”上升到一定段位水平后才可能有作用,否則可能陷入循環(huán)壓制策略輪替生成。
根據(jù)適用范圍分類,可以將方法劃分成自對弈、協(xié)同對弈、虛擬對弈和種群對弈共四大類,如表9所示。
自對弈類方法主要有樸素自對弈方法[242],δ-Uniform自對弈[243]、非對稱自對弈[244]、雙重預(yù)言機(jī)[245]、極小極大后悔魯棒預(yù)言機(jī)[246]、無偏自對弈[247]等。這類方法主要利用與自身的歷史版本對抗生成訓(xùn)練樣本,對樣本的質(zhì)量要求高,適用范圍最小。
虛擬對弈類方法主要有虛擬對弈[248]、虛擬自對弈[203]、廣義虛擬對弈[197]、擴(kuò)展虛擬對弈[249]、平滑虛擬對弈[250]、隨機(jī)虛擬對弈[251]、團(tuán)隊虛擬對弈[252]、神經(jīng)虛擬自對弈[253]、蒙特卡羅神經(jīng)虛擬自對弈[204]、優(yōu)先級虛擬自對弈[16]等。這類方法是自對弈方法的升級版本,由于樣本空間大,通常會與采樣或神經(jīng)網(wǎng)絡(luò)學(xué)習(xí)類方法結(jié)合使用,可用于擴(kuò)展式博弈、團(tuán)隊博弈等場景。其中,AlphaStar采用的聯(lián)賽訓(xùn)練機(jī)制正是優(yōu)先級虛擬自對弈方法,智能體策略集中包含三大類:主智能體、主利用者和聯(lián)盟利用者。此外,星際爭霸在優(yōu)先級虛擬自對弈的基礎(chǔ)上增加了智能體分支模塊,TStarBot-X采用了多樣化聯(lián)賽訓(xùn)練。
協(xié)同對弈方法主要有協(xié)同演化[255]、協(xié)同學(xué)習(xí)[29]等,這類方法主要依賴多個策略協(xié)同演化生成下一世代的優(yōu)化策略。
種群對弈方法主要有種群訓(xùn)練自對弈[254]、雙重預(yù)言機(jī)-經(jīng)驗博弈分析[256]、混合預(yù)言機(jī)/混合對手[257-258]、PSRO[215]、聯(lián)合PSRO[259]、行列式點過程PSRO[254]、管線PSRO[260]、在線PSRO[261]和自主PSRO[262]、任意時間最優(yōu)PSRO[263]、有效PSRO[264]、神經(jīng)種群學(xué)習(xí)[265]等多類方法,這類方法與分布式框架的組合為當(dāng)前絕大部分多智能體博弈問題提供了通用解決方案,其關(guān)鍵在于如何提高探索樣本效率,確??焖俚膬?nèi)環(huán)能有效生成策略樣本,進(jìn)而加快慢外環(huán)的優(yōu)化迭代。
(3) 自主學(xué)習(xí)方法
近年來,一些研究試圖從算法框架與分布式計算框架進(jìn)行創(chuàng)新,借助元學(xué)習(xí)方法,將策略評估與策略提升方法融合起來。Feng等人[262]基于元博弈理論、利用元學(xué)習(xí)方法探索了多樣性感知的自主課程學(xué)習(xí)方法,通過自主發(fā)掘多樣性課程用于難被利用策略的探索。Yang等人[266]指出多樣性自主課程學(xué)習(xí)對現(xiàn)實世界里的多智能體學(xué)習(xí)系統(tǒng)非常關(guān)鍵。Wu等人[267]利用元學(xué)習(xí)方法同時可以生成難被利用和多樣性對手,引導(dǎo)智能體自身策略迭代提升。Leibo等人[268]研究指出自主課程學(xué)習(xí)是研究多智能體智能的可行方法,課程可由外生和內(nèi)生挑戰(zhàn)自主生成。
當(dāng)前自主學(xué)習(xí)類方法需要利用多樣性[269]策略來加速策略空間的探索,其中有質(zhì)多樣性[270]作為一類帕累托框架,因其同時確保了對結(jié)果空間的廣泛覆蓋和有效的回報,為平衡處理“探索與利用”問題提供了目標(biāo)導(dǎo)向。
當(dāng)前對多樣性的研究主要區(qū)分為三大類:行為多樣性[271]、策略多樣性[269]、環(huán)境多樣性[271]。一些研究擬采用矩陣范數(shù)(如L_{1,1}范數(shù)[67]、F范數(shù)和譜范數(shù)[67]、行列式值[254, 272])、有效測度[272]、最大平均差異[273]、占據(jù)測度[269]、期望基數(shù)[254]、凸包擴(kuò)張[269]等衡量多樣性,如表10所示。其中,行為多樣性可引導(dǎo)智能體更傾向于采取多樣化的行動,策略多樣性可引導(dǎo)智能體生成差異化的策略、擴(kuò)大種群規(guī)模、提高探索效率,環(huán)境多樣性可引導(dǎo)智能體適變更多不同的場景,增強(qiáng)智能體的適變能力。
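下面給出一個利用行列式值與DPP期望基數(shù)度量種群多樣性的示意性實現(xiàn)(假設(shè)性代碼:以策略在若干對手上的收益向量構(gòu)造核矩陣,數(shù)據(jù)為虛構(gòu)的玩具示例):

```python
import numpy as np

def diversity(payoff_vectors):
    """核矩陣行列式越大、DPP期望基數(shù)越高, 表示策略的收益特征越不相似。"""
    F = np.asarray(payoff_vectors, dtype=float)    # [n_strategies, n_opponents]
    L = F @ F.T                                    # 相似度核矩陣
    det_value = np.linalg.det(L + 1e-6 * np.eye(len(F)))     # 行列式值度量
    eigvals = np.linalg.eigvalsh(L)
    expected_card = np.sum(eigvals / (eigvals + 1.0))        # DPP期望基數(shù)
    return det_value, expected_card

similar = [[1, 0, 1], [1, 0.1, 1], [0.9, 0, 1]]    # 彼此相近的策略
diverse = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]        # 彼此差異大的策略
print(diversity(similar))
print(diversity(diverse))
```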
3.2 在線場景博弈策略學(xué)習(xí)方法
由離線學(xué)習(xí)得到的博弈策略通常被稱作藍(lán)圖策略。在線對抗過程中,可完全依托離線藍(lán)圖策略進(jìn)行在線微調(diào),如即時策略游戲中依據(jù)情境元博弈選擇對抗策略[275]。棋牌類游戲中可以用兩種方式生成己方策略:一是從悲觀視角出發(fā)的博弈最優(yōu)方式,即采用離線藍(lán)圖策略進(jìn)行對抗;二是從樂觀視角出發(fā)的剝削式對弈方式,即在線發(fā)掘?qū)κ挚赡艿娜觞c,以最大化己方收益的方式利用對手。正是由于難以應(yīng)對非平穩(wěn)對手的策略動態(tài)切換[276]、故意隱藏或欺騙,在線博弈過程中通常需要及時根據(jù)對手表現(xiàn)和所處情境進(jìn)行適應(yīng)性調(diào)整,其本質(zhì)是一個對手意圖識別與反制策略生成[275]問題。當(dāng)前在線博弈策略學(xué)習(xí)的研究主要包括學(xué)會控制后悔值[277]、對手建模與利用[35]、智能體匹配及協(xié)作[278]。
3.2.1 在線優(yōu)化與無悔學(xué)習(xí)
在線決策過程的建模方法主要有在線MDP[279]、對抗MDP[280]、未知部分可觀MDP[281]、未知MG[282]等。在線優(yōu)化與無悔學(xué)習(xí)方法的融合是在線博弈策略學(xué)習(xí)的重點研究方向,其中“無悔”是指隨著交互時長T趨近無窮大,累積后悔值呈亞線性增長,即平均后悔值以O(shè)(T^{-1/2})的速率遞減。傳統(tǒng)的無悔學(xué)習(xí)方法主要依賴Hedge[162]和MWU[163]等,近來的一些研究利用在線凸優(yōu)化方法設(shè)計了基于跟隨正則化領(lǐng)先者[176]和在線鏡像下降[200]等樂觀后悔最小化算法。
Dinh等人[261]利用Hedge方法和策略支撐集數(shù)量約束,證明了在線動態(tài)后悔值的有界性。Kash等人[283]將無悔學(xué)習(xí)與Q值函數(shù)結(jié)合設(shè)計了一種局部無悔學(xué)習(xí)方法,無需考慮智能體的完美回憶條件仍可收斂。Lin[284]和Lee[285]等人對無悔學(xué)習(xí)的有限時間末輪迭代收斂問題展開了研究,通過附加正則化項的樂觀后悔值最小化方法收斂速度更快。Daskalakis等人[286]研究了幾類面向一般和博弈的近似最優(yōu)無悔學(xué)習(xí)方法的后悔界。此外,事后理性[287]作為一個與后悔值等效的可替代學(xué)習(xí)目標(biāo),可用于引導(dǎo)在線學(xué)習(xí)與其他智能體關(guān)聯(lián)的最佳策略。
3.2.2 對手建模與利用方法
通過對手建??梢院侠淼仡A(yù)測對手的行動、發(fā)掘隊手的弱點以備利用。當(dāng)前,對手建模方法主要分兩大類:與博弈領(lǐng)域知識關(guān)聯(lián)比較密切的顯式建模方法和面向策略的隱式建模方法。面向在線策略學(xué)習(xí)的對手利用方法主要有以下三大類。
(1) 對手判別式適變方法
Li[288]提出利用模式識別樹顯式地構(gòu)建對手模型,估計對手策略與贏率進(jìn)而生成己方反制策略。Ganzfried等人[289]設(shè)計了機(jī)會發(fā)掘方法,試圖利用對手暴露的弱點。Davis等人[290]通過估計對手信息,構(gòu)建限定性條件,加快約束策略生成。
(2) 對手近似式學(xué)習(xí)方法
Wu等人[267]利用元學(xué)習(xí)生成難被剝削對手和多樣性對手模型池來指引在線博弈策略學(xué)習(xí)。Kim等人[291]利用對手建模與元學(xué)習(xí)設(shè)計了面向多智能體的元策略優(yōu)化方法。Foerster等人[107]設(shè)計的對手察覺學(xué)習(xí)方法是一類考慮將對手納入己方策略學(xué)習(xí)過程中的學(xué)習(xí)方法。Silva等人[292]提出的在線自對弈課程方法通過在線構(gòu)建對抗課程引導(dǎo)博弈策略學(xué)習(xí)。
(3) 對手生成式搜索方法
Ganzfried等人[289]提出基于狄利克雷先驗對手模型,利用貝葉斯優(yōu)化模型獲得對手模型的后驗分布,輔助利用對手的反制策略生成。Sustr等人[293]提出利用基于信息集蒙特卡羅采樣的蒙特卡羅重解法生成反制策略。Brown等人[294]提出在對手建模時要平衡安全與可利用性,基于安全嵌套有限深度搜索的方法可以生成安全對手利用的反制策略。Tian[295]提出利用狄利克雷先驗, 基于餐館過程在博弈策略空間中生成安全利用對手的反制策略。
3.2.3 角色匹配與臨機(jī)協(xié)調(diào)
多智能體博弈通常是在多角色協(xié)調(diào)配合下完成的,同類角色可執(zhí)行相似的任務(wù),各類智能體之間的臨機(jī)協(xié)調(diào)是博弈對抗制勝的關(guān)鍵。Wang等人[296]設(shè)計了面向多類角色的多智能體強(qiáng)化學(xué)習(xí)框架,通過構(gòu)建一個隨機(jī)角色嵌入空間,可以學(xué)習(xí)特定角色、動態(tài)角色和可分辨角色,相近角色的單元完成相似任務(wù),加快空間劃分與環(huán)境高效探索。Gong等人[297]利用角色(英雄及玩家)向量化方法分析了英雄之間的高階交互情況,并以圖嵌入的方式分析了協(xié)同與壓制關(guān)系,研究了多智能體匹配在線規(guī)劃問題。
臨機(jī)組隊可以看作是一個機(jī)制設(shè)計問題[278]。Hu等人[298]提出了智能體首次合作的零樣本協(xié)調(diào)問題,利用其對弈[299]方法(即基于學(xué)習(xí)的人工智能組隊方法)為無預(yù)先溝通的多智能體協(xié)調(diào)學(xué)習(xí)提供了有效支撐。此外,人與人工智能組隊作為臨機(jī)組隊問題的子問題,要求人工智能在不需要預(yù)先協(xié)調(diào)下可與人在線協(xié)同。Lucero等人[300]利用StarCraft平臺研究了如何利用人機(jī)組隊和可解釋人工智能技術(shù)幫助玩家理解系統(tǒng)推薦的行動。Waytowich等人[301]研究了如何運(yùn)用自然語言指令驅(qū)動智能體學(xué)習(xí),基于語言指令與狀態(tài)的互嵌入模型實現(xiàn)了人在環(huán)路強(qiáng)化學(xué)習(xí)方法的設(shè)計。Siu等人[302]利用一類合作博弈平臺Hanabi評估了各類人與人工智能組隊方法的效果。
4 多智能體博弈學(xué)習(xí)前沿展望
4.1 智能體認(rèn)知行為建模與協(xié)同
4.1.1 多模態(tài)行為建模
構(gòu)建智能體的認(rèn)知行為模型為一般性問題提供求解方法,是獲得通用人工智能的一種探索。各類認(rèn)知行為模型框架[303]為智能體獲取知識提供了接口。對抗環(huán)境下,智能體的認(rèn)知能力主要包含博弈推理與反制策略生成[195]、對抗推理與對抗規(guī)劃[304]。認(rèn)知行為建模可為分析對手思維過程、決策行動的動態(tài)演化、欺騙與反欺騙等認(rèn)知對抗問題提供支撐。智能體行為的多模態(tài)屬性[305],如合作場景下行為的“解釋性、明確性、透明性和預(yù)測性”,對抗場景下行為的“欺騙性、混淆性、含糊性、隱私性和安全性”,均是欺騙性和可解釋性認(rèn)知行為建模的重要研究內(nèi)容,相關(guān)技術(shù)可應(yīng)用于智能人機(jī)交互、機(jī)器推理、協(xié)同規(guī)劃、具人類意識智能系統(tǒng)等領(lǐng)域問題的求解。
4.1.2 對手推理與適變
傳統(tǒng)的對手建模方法一般會假設(shè)對手策略平穩(wěn)不變、固定策略動態(tài)切換等簡單情形,但對手建模仍面臨對手策略非平穩(wěn)、風(fēng)格驟變、對抗學(xué)習(xí)、有限理性、有限記憶、欺騙與詐唬等挑戰(zhàn)。當(dāng)前,具對手意識的學(xué)習(xí)[107]、基于心智理論(認(rèn)知層次理論)的遞歸推理[105]和基于策略蒸餾與修正信念的貝葉斯策略重用[306]等方法將對手推理模板嵌入對手建模流程中,可有效應(yīng)對非平穩(wěn)對手。此外,在線博弈對抗過程中,公共知識與完全理性等條件均可能無法滿足,對手缺點的暴露強(qiáng)化了智能體偏離均衡解的動機(jī),基于納什均衡解采用安全適變策略可有效剝削對手[294]且不易被發(fā)覺[307]。
4.1.3 人在環(huán)路協(xié)同
“人機(jī)對抗”是當(dāng)前檢驗人工智能(AI)的主流評測方式,而“人機(jī)協(xié)同”是人機(jī)混合智能的主要研究內(nèi)容。人與AI的協(xié)同可區(qū)分為人在環(huán)路內(nèi)、人在環(huán)路上和人在環(huán)路外共3種模式,其中人在環(huán)路上(人可參與干預(yù),也可旁觀監(jiān)督)的相關(guān)研究是當(dāng)前的研究重點,特別是基于自然語言指令的相關(guān)研究為人機(jī)交互提供了更為自然的交互方式[301]。此外,圍繞“人(博弈局中人)—機(jī)(機(jī)器人工智能)—環(huán)(博弈對抗環(huán)境)”協(xié)同演化的相關(guān)研究表明,人機(jī)協(xié)同面臨著應(yīng)用悖論:人機(jī)組隊后的能力將遠(yuǎn)超人類或機(jī)器,但過度依賴人工智能將使人類技能退化;盲目樂觀地應(yīng)用、忽視缺陷和漏洞,可能在對抗中因被欺騙而導(dǎo)致決策錯誤;推薦的行動方案可能受到質(zhì)疑,在某些人道主義應(yīng)用場景中還可能面臨倫理挑戰(zhàn)。
4.2 通用博弈策略學(xué)習(xí)方法
4.2.1 大規(guī)模智能體學(xué)習(xí)方法
當(dāng)前多智能體博弈的相關(guān)研究正向多智能體集群對抗、異構(gòu)集群協(xié)同等高復(fù)雜現(xiàn)實及通用博弈場景聚焦。隨著智能體數(shù)量規(guī)模的增加,行動和狀態(tài)空間將呈指數(shù)級增長,在很大程度上限制了多智能體學(xué)習(xí)方法的可擴(kuò)展性。傳統(tǒng)的博弈抽象[308]、狀態(tài)及行動抽象[309]方法雖然可以對問題空間做有效約減,但問題的復(fù)雜度依然很高,在智能體數(shù)目N≥2時,納什均衡通常很難計算,多人博弈均衡解存在性和求解依然充滿挑戰(zhàn)。Yang等人[310]根據(jù)平均場思想提出的平均場Q學(xué)習(xí)和平均場AC方法,為解決大規(guī)模智能體學(xué)習(xí)問題提供了參考。
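平均場方法的核心是用鄰域平均動作近似聯(lián)合動作,下面給出平均場Q學(xué)習(xí)單步更新的簡化示意(假設(shè)性代碼,采用貪婪回溯代替原方法中基于玻爾茲曼策略的期望回溯,函數(shù)與參數(shù)均為說明性設(shè)定):

```python
import numpy as np

def mean_field_q_update(Q, s, a_i, a_bar, r, s_next, a_bar_next,
                        alpha=0.1, gamma=0.95):
    """Q: 形如 Q[s][a_i, a_bar] 的查表式價值函數(shù), a_bar為離散化的鄰域平均動作。
    將聯(lián)合動作近似為(自身動作, 平均動作), 使聯(lián)合動作空間由指數(shù)級壓縮為線性級。"""
    v_next = np.max(Q[s_next][:, a_bar_next])      # 鄰域平均動作固定下的貪婪值
    td_target = r + gamma * v_next
    Q[s][a_i, a_bar] += alpha * (td_target - Q[s][a_i, a_bar])
    return Q

# 用法示例: 2個狀態(tài)、3個動作、平均動作離散為3檔(均為假設(shè)的玩具設(shè)定)
Q = {s: np.zeros((3, 3)) for s in range(2)}
Q = mean_field_q_update(Q, s=0, a_i=1, a_bar=2, r=1.0, s_next=1, a_bar_next=0)
print(Q[0])
```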
4.2.2 雙層優(yōu)化自對弈方法
博弈策略學(xué)習(xí)的范式正從傳統(tǒng)的“高質(zhì)量樣本模仿學(xué)習(xí)+分布式強(qiáng)化學(xué)習(xí)”向“無先驗知識+端到端競爭式自對弈學(xué)習(xí)”轉(zhuǎn)變。此前,Muller等人[216]提出的α-Rank和PSRO學(xué)習(xí)方法是一類元博弈種群策略學(xué)習(xí)通用框架方法。Leibo等人[268]從“問題的問題”視角提出了面向多智能體的“自主課程學(xué)習(xí)”方法。傳統(tǒng)的強(qiáng)化學(xué)習(xí)和算法博弈論方法是多智能體博弈策略學(xué)習(xí)的通用基礎(chǔ)學(xué)習(xí)器;在基于“快與慢”理念的雙層優(yōu)化類方法[311]中,元學(xué)習(xí)[262]、自主課程學(xué)習(xí)[268]、元演化學(xué)習(xí)[312]、支持并行分布式計算的無導(dǎo)數(shù)演化策略學(xué)習(xí)方法[313]、面向連續(xù)博弈的策略梯度優(yōu)化方法[314]、面向非平穩(wěn)環(huán)境的持續(xù)學(xué)習(xí)方法[315]、由易到難的自步學(xué)習(xí)方法[316]等為自主策略探索學(xué)習(xí)算法的設(shè)計提供了指引。
4.2.3 知識與數(shù)據(jù)融合方法
基于常識知識與領(lǐng)域?qū)<一驅(qū)I(yè)人類玩家經(jīng)驗的知識驅(qū)動型智能體策略具有較強(qiáng)的可解釋性,而基于大樣本采樣和神經(jīng)網(wǎng)絡(luò)學(xué)習(xí)的數(shù)據(jù)驅(qū)動型智能體策略通常具有很強(qiáng)的泛化性。相關(guān)研究從加性融合與主從融合[317]、知識牽引與數(shù)據(jù)數(shù)據(jù)驅(qū)動[318]、層次化協(xié)同與組件化協(xié)同[319]等角度進(jìn)行了探索。此外,張馭龍等人[320]面向任務(wù)級兵棋提出了多智能體策略協(xié)同演進(jìn)框架,打通人類專家與智能算法之間的知識循環(huán)。
4.2.4 離線預(yù)訓(xùn)練與在線微調(diào)方法
基于海量數(shù)據(jù)樣本的大型預(yù)訓(xùn)練模型是通用人工智能的一種探索。相對于基于藍(lán)圖策略的在線探索方法,基于離線預(yù)訓(xùn)練模型的在線微調(diào)方法有著更廣泛的應(yīng)用前景。近來,基于序貫決策Transformer[321]的離線[322]與在線[323]學(xué)習(xí)方法將注意力機(jī)制與強(qiáng)化學(xué)習(xí)方法融合,為大型預(yù)訓(xùn)練模型生成提供了思路,來自DeepMind的Mathieu等人[324]設(shè)計了面向星際爭霸的超大型離線強(qiáng)化學(xué)習(xí)模型。
4.3 分布式博弈策略學(xué)習(xí)框架
4.3.1 多智能體博弈基準(zhǔn)環(huán)境
當(dāng)前,大多數(shù)博弈對抗平臺采用了游戲設(shè)計的思想,將玩家的參與度作為設(shè)計目標(biāo),通常會為了游戲的平衡性,將對抗多方的能力水平設(shè)計成相對均衡狀態(tài)(如星際爭霸中3個種族之間的相對均衡),這類環(huán)境可看成是近似對稱類環(huán)境。Hernandez等人[62]利用元博弈研究了競爭性多玩家游戲的自平衡問題。當(dāng)前,相關(guān)研究提出了星際爭霸多智能體挑戰(zhàn)(StarCraft multi-agent challenge, SMAC)[325]、OpenSpiel[326]等基準(zhǔn)環(huán)境,以及PettingZoo[327]、MAVA[328]等集成環(huán)境。兵棋推演作為一類典型的非對稱部分可觀異步多智能體協(xié)同對抗環(huán)境[329],紅藍(lán)雙方通常能力差異明顯,模擬真實環(huán)境的隨機(jī)性使得決策風(fēng)險高[317],可以作為多智能體博弈學(xué)習(xí)的基準(zhǔn)測試環(huán)境。
4.3.2 分布式強(qiáng)化學(xué)習(xí)框架
由于學(xué)習(xí)類方法本質(zhì)上采用了試錯機(jī)制,需要并行采樣大量多樣化樣本以提升訓(xùn)練質(zhì)量,因而依賴強(qiáng)大的計算資源?;趩l(fā)式聯(lián)賽訓(xùn)練的AlphaStar,需要訓(xùn)練多個種群才能有效引導(dǎo)策略提升、算法收斂;基于博弈分解的Pluribus,其藍(lán)圖策略的離線訓(xùn)練需要依靠超級計算機(jī)集群。當(dāng)前的一些研究提出利用Ray[330]、可擴(kuò)展高效深度強(qiáng)化學(xué)習(xí)(scalable efficient deep reinforcement learning, SEED)[331]、Flatland[332]等分布式強(qiáng)化學(xué)習(xí)框架。
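下面給出利用Ray進(jìn)行并行自對弈采樣的最小示意(假設(shè)性代碼,僅用隨機(jī)數(shù)代替真實的博弈環(huán)境交互,以說明“遠(yuǎn)程工作者并行采樣、學(xué)習(xí)端匯總”的數(shù)據(jù)流):

```python
import random
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def rollout_worker(worker_id, episode_len=100, seed=None):
    # 此處用隨機(jī)數(shù)代替真實的對局與軌跡采集, 僅示意并行數(shù)據(jù)流
    rng = random.Random(seed)
    trajectory = [(t, rng.random()) for t in range(episode_len)]
    total_return = sum(r for _, r in trajectory)
    return worker_id, total_return

futures = [rollout_worker.remote(i, seed=i) for i in range(8)]   # 并行啟動8個對局
results = ray.get(futures)            # 阻塞等待所有并行對局完成
print(sorted(results))
ray.shutdown()
```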
4.3.3 元博弈種群策略學(xué)習(xí)框架
元博弈種群策略學(xué)習(xí)框架的設(shè)計需要將種群策略演化機(jī)制設(shè)計與分布式計算平臺資源調(diào)度協(xié)同考慮。當(dāng)前絕大多數(shù)機(jī)器博弈人工智能的實現(xiàn)均需要依靠強(qiáng)大的分布式算力支撐?;谠┺牡姆N群演化自主學(xué)習(xí)方法與分布式學(xué)習(xí)框架的結(jié)合可用于構(gòu)建通用的博弈策略學(xué)習(xí)框架。當(dāng)前,基于競爭式自對弈的TLeague[333]和整體設(shè)計了策略評估的MALib[334]等為種群策略學(xué)習(xí)提供了分布式并行學(xué)習(xí)框架支撐。
5 結(jié)束語
本文從博弈論視角,分析了多智能體學(xué)習(xí)。首先,簡要介紹了多智能體學(xué)習(xí),主要包括多智能體學(xué)習(xí)系統(tǒng)組成、概述、研究方法分類。其次,重點介紹了多智能體博弈學(xué)習(xí)框架,包括基礎(chǔ)模型和元博弈模型、博弈解概念及博弈動力學(xué),多智能體博弈學(xué)習(xí)面臨的挑戰(zhàn)。圍繞多智能體博弈策略學(xué)習(xí)方法,重點剖析了策略學(xué)習(xí)框架、離線博弈策略學(xué)習(xí)方法和在線博弈策略學(xué)習(xí)方法?;谑崂淼亩嘀悄荏w博弈學(xué)習(xí)方法,指出下一步工作可以著重從“智能體認(rèn)知行為建模、通用博弈策略學(xué)習(xí)方法、分布式策略學(xué)習(xí)框架”等三方面開展多智能體博弈學(xué)習(xí)前沿相關(guān)工作研究。
參考文獻(xiàn)
[1] 黃凱奇, 興軍亮, 張俊格, 等. 人機(jī)對抗智能技術(shù)[J]. 中國科學(xué): 信息科學(xué), 2020, 50(4): 540-550.
HUANG K Q, XING J L, ZHANG J G, et al. Intelligent technologies of human-computer gaming[J]. Scientia Sinica Informationis, 2020, 50(4): 540-550.
[2] 譚鐵牛. 人工智能: 用AI技術(shù)打造智能化未來[M]. 北京: 中國科學(xué)技術(shù)出版社, 2019.
TAN T N. Artificial intelligence: building an intelligent future with AI technologies[M]. Beijing: China Science and Technology Press, 2019.
WOOLDRIDGE M. An introduction to multiagent systems[M]. Florida: John Wiley & Sons, 2009.
[4] SHOHAM Y, LEYTON-BROWN K. Multiagent systems-algorithmic, game-theoretic, and logical foundations[M]. New York: Cambridge University Press, 2009.
MULLER J P, FISCHER K. Application impact of multi-agent systems and technologies: a survey[M]∥SHEHORY O, STURM A. Agent-oriented software engineering. Heidelberg: Springer, 2014: 27-53.
[6] TURING A M. Computing machinery and intelligence[M]. Berlin: Springer, 2009.
[7] OMIDSHAFIEI S, TUYLS K, CZARNECKI W M, et al. Navigating the landscape of multiplayer games[J]. Nature Communications, 2020, 11(1): 5603.
[8] TUYLS K, STONE P. Multiagent learning paradigms[C]∥Proc.of the European Conference on Multi-Agent Systems and Agreement Technologies, 2017: 3-21.
[9] SILVER D, SCHRITTWIESER J, SIMONYAN K, et al. Mastering the game of Go without human knowledge[J]. Nature, 2017, 550(7676): 354-359.
[10] SCHRITTWIESER J, ANTONOGLOU I, HUBERT T, et al. Mastering Atari, Go, Chess and Shogi by planning with a learned model[J]. Nature, 2020, 588(7839): 604-609.
[11] MORAVCIK M, SCHMID M, BURCH N, et al. DeepStack: expert-level artificial intelligence in heads-up no-limit poker[J]. Science, 2017, 356(6337): 508-513.
[12] BROWN N, SANDHOLM T. Superhuman AI for multiplayer poker[J]. Science, 2019, 365(6456): 885-890.
[13] JIANG Q Q, LI K Z, DU B Y, et al. DeltaDou: expert-level Doudizhu AI through self-play[C]∥Proc.of the 28th International Joint Conference on Artificial Intelligence, 2019: 1265-1271.
[14] ZHAO D C, XIE J R, MA W Y, et al. DouZero: mastering Doudizhu with self-play deep reinforcement learning[C]∥Proc.of the 38th International Conference on Machine Learning, 2021: 12333-12344.
[15] LI J J, KOYAMADA S, YE Q W, et al. Suphx: mastering mahjong with deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2003.13590.
[16] VINYALS O, BABUSCHKIN I, CZARNECKI W M, et al. Grandmaster level in StarCraft II using multi-agent reinforcement learning[J]. Nature, 2019, 575(7782): 350-354.
[17] WANG X J, SONG J X, QI P H, et al. SCC: an efficient deep reinforcement learning agent mastering the game of StarCraft II[C]∥Proc.of the 38th International Conference on Machine Learning, 2021, 139: 10905-10915.
[18] BERNER C, BROCKMAN G, CHAN B, et al. Dota 2 with large scale deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1912.06680.
[19] YE D H, CHEN G B, ZHAO P L, et al. Supervised learning achieves human-level performance in MOBA games: a case study of honor of kings[J]. IEEE Trans.on Neural Networks and Learning Systems, 2022, 33(3): 908-918.
[20] 中國科學(xué)院自動化研究所. 人機(jī)對抗智能技術(shù)[EB/OL]. [2021-08-01]. http:∥turingai.ia.ac.cn/.
Institute of Automation, Chinese Academy of Science. Intelligent technologies of human-computer gaming[EB/OL]. [2021-08-01]. http:∥turingai.ia.ac.cn/.
[21] 凡寧, 朱夢瑩, 張強(qiáng). 遠(yuǎn)超阿爾法狗?“戰(zhàn)顱”成戰(zhàn)場輔助決策“最強(qiáng)大腦”[EB/OL]. [2021-08-01]. http:∥digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/html/2021-04/19/content_466128.htm?div=-1.
FAN N, ZHU M Y, ZHANG Q. Way ahead of Alpha Go? “War brain” becomes the “strongest brain” for battlefield decision-making[EB/OL]. [2021-08-01]. http:∥digitalpaper.stdaily.com/http_www.kjrb.com/kjrb/html/2021-04/19/content_466128.htm?div=-1.
[22] ERNEST N. Genetic fuzzy trees for intelligent control of unmanned combat aerial vehicles[D]. Cincinnati: University of Cincinnati, 2015.
[23] CLIFF D. Collaborative air combat autonomy program makes strides[J]. Microwave Journal, 2021, 64(5): 43-44.
[24] STONE P, VELOSO M. Multiagent systems: a survey from a machine learning perspective[J]. Autonomous Robots, 2000, 8(3): 345-383.
[25] GORDON G J. Agendas for multi-agent learning[J]. Artificial Intelligence, 2007, 171(7): 392-401.
[26] SHOHAM Y, POWERS R, GRENAGER T. Multi-agent reinforcement learning: a critical survey[R]. San Francisco: Stanford University, 2003.
[27] SHOHAM Y, POWERS R, GRENAGER T. If multi-agent learning is the answer, what is the question?[J]. Artificial Intelligence, 2006, 171(7): 365-377.
[28] STONE P. Multiagent learning is not the answer. It is the question[J]. Artificial Intelligence, 2007, 171(7): 402-405.
[29] TOSIC P, VILALTA R. A unified framework for reinforcement learning, co-learning and meta-learning how to coordinate in collaborative multi-agent systems[J]. Procedia Computer Science, 2010, 1(1): 2217-2226.
[30] TUYLS K, WEISS G. Multiagent learning: basics, challenges, and prospects[J]. AI Magazine, 2012, 33(3): 41-52.
KENNEDY J. Swarm intelligence[M]∥Handbook of nature-inspired and innovative computing. Boston: Springer, 2006: 187-219.
[32] TUYLS K, PARSONS S. What evolutionary game theory tells us about multiagent learning[J]. Artificial Intelligence, 2007, 171(7): 406-416.
[33] SILVA F, COSTA A. Transfer learning for multiagent reinforcement learning systems[C]∥Proc.of the 25th International Joint Conference on Artificial Intelligence, 2016: 3982-3983.
[34] HERNANDEZ-LEAL P, KAISERS M, BAARSLAG T, et al. A survey of learning in multiagent environments: dealing with non-stationarity[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1707.09183v1.
[35] ALBRECHT S V, STONE P. Autonomous agents modelling other agents: a comprehensive survey and open problems[J]. Artificial Intelligence, 2018, 258: 66-95.
[36] JANT H P, TUYLS K, PANAIT L, et al. An overview of cooperative and competitive multiagent learning[C]∥Proc.of the International Workshop on Learning and Adaption in Multi-Agent Systems, 2005.
[37] PANAIT L, LUKE S. Cooperative multi-agent learning: the state of the art[J]. Autonomous Agents and Multi-Agent Systems, 2005, 11(3): 387-434.
BUSONIU L, BABUSKA R, SCHUTTER B D. A comprehensive survey of multiagent reinforcement learning[J]. IEEE Trans.on Systems, Man & Cybernetics: Part C, 2008, 38(2): 156-172.
[39] HERNANDEZ-LEAL P, KARTAL B, TAYLOR M E. A survey and critique of multiagent deep reinforcement learning[J]. Autonomous Agents and Multi-Agent Systems, 2019, 33(6): 750-797.
[40] OROOJLOOY A, HAJINEZHAD D. A review of cooperative multi-agent deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1908.03963.
[41] ZHANG K Q, YANG Z R, BAAR T. Multi-agent reinforcement learning: a selective overview of theories and algorithms[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1911.10635.
[42] GRONAUER S, DIEPOLD K. Multi-agent deep reinforcement learning: a survey[J]. Artificial Intelligence Review, 2022, 55(2): 895-943.
[43] DU W, DING S F. A survey on multi-agent deep reinforcement learning: from the perspective of challenges and applications[J]. Artificial Intelligence Review, 2021, 54(5): 3215-3238.
[44] 吳軍, 徐昕, 王健, 等. 面向多機(jī)器人系統(tǒng)的增強(qiáng)學(xué)習(xí)研究進(jìn)展綜述[J]. 控制與決策, 2011, 26(11): 1601-1610.
WU J, XU X, WANG J, et al. Recent advances of reinforcement learning in multi-robot systems: a survey[J]. Control and Decision, 2011, 26(11): 1601-1610.
[45] 杜威, 丁世飛. 多智能體強(qiáng)化學(xué)習(xí)綜述[J]. 計算機(jī)科學(xué), 2019, 46(8): 1-8.
DU W, DING S F. Overview on multi-agent reinforcement learning[J]. Computer Science, 2019, 46(8): 1-8.
[46] 殷昌盛, 楊若鵬, 朱巍, 等. 多智能體分層強(qiáng)化學(xué)習(xí)綜述[J]. 智能系統(tǒng)學(xué)報, 2020, 15(4): 646-655.
YIN C S, YANG R P, ZHU W, et al. A survey on multi-agent hierarchical reinforcement learning[J]. CAAI Transactions on Intelligent Systems, 2020, 15(4): 646-655.
[47] 梁星星, 馮旸赫, 馬揚(yáng), 等. 多Agent深度強(qiáng)化學(xué)習(xí)綜述[J]. 自動化學(xué)報, 2020, 46(12): 2537-2557.
LIANG X X, FENG Y H, MA Y, et al. Deep multi-agent reinforcement learning: a survey[J]. Acta Automatica Sinica, 2020, 46(12): 2537-2557.
[48] 孫長銀, 穆朝絮. 多智能體深度強(qiáng)化學(xué)習(xí)的若干關(guān)鍵科學(xué)問題[J]. 自動化學(xué)報, 2020, 46(7): 1301-1312.
SUN C Y, MU C X. Important scientific problems of multi-agent deep reinforcement learning[J]. Acta Automatica Sinica, 2020, 46(7): 1301-1312.
[49] MATIGNON L, LAURENT G J, LE F P. Independent reinforcement learners in cooperative Markov games: a survey regarding coordination problems[J]. The Knowledge Engineering Review, 2012, 27(1): 1-31.
[50] NOWE A, VRANCX P, HAUWERE Y M D. Game theory and multi-agent reinforcement learning[M]. Berlin: Springer,2012.
[51] LU Y L, YAN K. Algorithms in multi-agent systems: a holistic perspective from reinforcement learning and game theory[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2001.06487.
[52] YANG Y D, WANG J. An overview of multi-agent reinforcement learning from game theoretical perspective[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2011.00583v3.
BLOEMBERGEN D, TUYLS K, HENNES D, et al. Evolutionary dynamics of multi-agent learning: a survey[J]. Artificial Intelligence, 2015, 53(1): 659-697.
[54] WONG A, BACK T, ANNA V, et al. Multiagent deep reinforcement learning: challenges and directions towards human-like approaches[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.15691.
[55] OLIEHOEK F A, AMATO C. A concise introduction to decentralized POMDPs[M]. Berlin: Springer, 2016.
[56] DOSHI P, ZENG Y F, CHEN Q Y. Graphical models for interactive POMDPs: representations and solutions[J]. Autonomous Agents and Multi-Agent Systems, 2009, 18(3): 376-386.
[57] SHAPLEY L S. Stochastic games[J]. National Academy of Sciences of the United States of America, 1953, 39(10): 1095-1100.
[58] LITTMAN M L. Markov games as a framework for multi-agent reinforcement learning[C]∥Proc.of the 11th International Conference on International Conference on Machine Learning, 1994: 157-163.
[59] KOVAIK V, SCHMID M, BURCH N, et al. Rethinking formal models of partially observable multiagent decision making[J]. Artificial Intelligence, 2022, 303: 103645.
[60] LOCKHART E, LANCTOT M, PEROLAT J, et al. Computing approximate equilibria in sequential adversarial games by exploitability descent[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1903.05614.
[61] CUI Q, YANG L F. Minimax sample complexity for turn-based stochastic game[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2011.14267.
[62] HERNANDEZ D, GBADAMOSI C, GOODMAN J, et al. Metagame autobalancing for competitive multiplayer games[C]∥Proc.of the IEEE Conference on Games, 2020: 275-282.
[63] WELLMAN M P. Methods for empirical game-theoretic analysis[C]∥Proc.of the 21st National Conference on Artificial Intelligence, 2006: 1552-1555.
[64] JIANG X, LIM L H, YAO Y, et al. Statistical ranking and combinatorial Hodge theory[J]. Mathematical Programming, 2011, 127(1): 203-244.
[65] CANDOGAN O, MENACHE I, OZDAGLAR A, et al. Flows and decompositions of games: harmonic and potential games[J]. Mathematics of Operations Research, 2011, 36(3): 474-503.
[66] HWANG S H, REY-BELLET L. Strategic decompositions of normal form games: zero-sum games and potential games[J]. Games and Economic Behavior, 2020, 122: 370-390.
[67] BALDUZZI D, GARNELO M, BACHRACH Y, et al. Open-ended learning in symmetric zero-sum games[C]∥Proc.of the International Conference on Machine Learning, 2019: 434-443.
[68] CZARNECKI W M, GIDEL G, TRACEY B, et al. Real world games look like spinning tops[C]∥Proc.of the 34th International Conference on Neural Information Processing Systems, 2020: 17443-17454.
[69] SANJAYA R, WANG J, YANG Y D. Measuring the non-transitivity in chess [EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2110.11737.
[70] TUYLS K, PEROLAT J, LANCTOT M, et al. Bounds and dynamics for empirical game theoretic analysis[J]. Autonomous Agents and Multi-Agent Systems, 2020, 34(1): 7.
[71] VIQUEIRA E A, GREENWALD A, COUSINS C, et al. Learning simulation-based games from data[C]∥Proc.of the 18th International Conference on Autonomous Agents and Multi Agent Systems, 2019: 1778-1780.
[72] ROUGHGARDEN T. Twenty lectures on algorithmic game theory[M]. New York: Cambridge University Press, 2016.
[73] BLUM A, HAGHTALAB N, HAJIAGHAYI M T, et al. Computing Stackelberg equilibria of large general-sum games[C]∥Proc.of the International Symposium on Algorithmic Game Theory, 2019: 168-182.
[74] MILEC D, CERNY J, LISY V, et al. Complexity and algorithms for exploiting quantal opponents in large two-player games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5575-5583.
[75] BALDUZZI D, TUYLS K, PEROLAT J, et al. Re-evaluating evaluation[C]∥Proc.of the 32nd International Conference on Neural Information Processing Systems, 2018: 3272-3283.
[76] LI S H, WU Y, CUI X Y, et al. Robust multi-agent reinforcement learning via minimax deep deterministic policy gradient[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 4213-4220.
[77] YABU Y, YOKOO M, IWASAKI A. Multiagent planning with trembling-hand perfect equilibrium in multiagent POMDPs[C]∥Proc.of the Pacific Rim International Conference on Multi-Agents, 2017: 13-24.
[78] GHOROGHI A. Multi-games and Bayesian Nash equilibriums[D]. London: University of London, 2015.
[79] XU X, ZHAO Q. Distributed no-regret learning in multi-agent systems: challenges and recent developments[J]. IEEE Signal Processing Magazine, 2020, 37(3):84-91.
SUN Y, WEI X, YAO Z H, et al. Analysis of network attack and defense strategies based on Pareto optimum[J]. Electronics, 2018, 7(3): 36.
[81] DENG X T, LI N Y, MGUNI D, et al. On the complexity of computing Markov perfect equilibrium in general-sum stochastic games[EB/OL]. [2021-11-01]. http:∥arxiv.org/abs/2109.01795.
[82] BASILICO N, CELLI A, GATTI N, et al. Computing the team-maxmin equilibrium in single-team single-adversary team games[J]. Intelligenza Artificiale, 2017, 11(1): 67-79.
[83] CELLI A, GATTI N. Computational results for extensive-form adversarial team games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1711.06930.
[84] ZHANG Y Z, AN B. Computing team-maxmin equilibria in zero-sum multiplayer extensive-form games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2020: 2318-2325.
[85] LI S X, ZHANG Y Z, WANG X R, et al. CFR-MIX: solving imperfect information extensive-form games with combinatorial action space[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2105.08440.
[86] PROBO G. Multi-team games in adversarial settings: ex-ante coordination and independent team members algorithms[D]. Milano: Politecnico Di Milano, 2019.
[87] ORTIZ L E, SCHAPIRE R E, KAKADE S M. Maximum entropy correlated equilibria[C]∥Proc.of the 11th International Conference on Artificial Intelligence and Statistics, 2007: 347-354.
[88] GEMP I, SAVANI R, LANCTOT M, et al. Sample-based approximation of Nash in large many-player games via gradient descent[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.01285.
[89] FARINA G, BIANCHI T, SANDHOLM T. Coarse correlation in extensive-form games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2020: 1934-1941.
[90] FARINA G, CELLI A, MARCHESI A, et al. Simple uncoupled no-regret learning dynamics for extensive-form correlated equilibrium[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2104.01520.
[91] XIE Q M, CHEN Y D, WANG Z R, et al. Learning zero-sum simultaneous-move Markov games using function approximation and correlated equilibrium[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2002.07066.
[92] HUANG S J, YI P. Distributed best response dynamics for Nash equilibrium seeking in potential games[J]. Control Theory and Technology, 2020, 18(3): 324-332.
[93] BOSANSKY B, KIEKINTVELD C, LISY V, et al. An exact double-oracle algorithm for zero-sum extensive-form games with imperfect information[J]. Journal of Artificial Intelligence Research, 2014, 51(1): 829-866.
HEINRICH T, JANG Y J, MUNGO C. Best-response dynamics, playing sequences, and convergence to equilibrium in random games[J]. International Journal of Game Theory, 2023, 52: 703-735.
[95] FARINA G, CELLI A, MARCHESI A, et al. Simple uncoupled no-regret learning dynamics for extensive-form correlated equilibrium[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2104.01520.
[96] HU S Y, LEUNG C W, LEUNG H F, et al. The evolutionary dynamics of independent learning agents in population games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.16068.
[97] LEONARDOS S, PILIOURAS G. Exploration-exploitation in multi-agent learning: catastrophe theory meets game theory[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 11263-11271.
[98] POWERS R, SHOHAM Y. New criteria and a new algorithm for learning in multi-agent systems[C]∥Proc.of the 17th International Conference on Neural Information Processing Systems, 2004: 1089-1096.
[99] DIGIOVANNI A, ZELL E C. Survey of self-play in reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2107.02850.
[100] BOWLING M. Multiagent learning in the presence of agents with limitations[D]. Pittsburgh: Carnegie Mellon University, 2003.
[101] BOWLING M H, VELOSO M M. Multi-agent learning using a variable learning rate[J]. Artificial Intelligence, 2002, 136(2): 215-250.
[102] BOWLING M. Convergence and no-regret in multiagent learning[C]∥Proc.of the 17th International Conference on Neural Information Processing Systems, 2004: 209-216.
[103] KAPETANAKIS S, KUDENKO D. Reinforcement learning of coordination in heterogeneous cooperative multi-agent systems[C]∥Proc.of the 3rd International Joint Conference on Autonomous Agents and Multiagent Systems, 2004: 1258-1259.
[104] DAI Z X, CHEN Y Z, LOW K H, et al. R2-B2: recursive reasoning-based Bayesian optimization for no-regret learning in games[C]∥Proc.of the International Conference on Machine Learning, 2020: 2291-2301.
[105] FREEMAN R, PENNOCK D M, PODIMATA C, et al. No-regret and incentive-compatible online learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2002.08837.
[106] LITTMAN M L. Value-function reinforcement learning in Markov games[J]. Journal of Cognitive Systems Research, 2001, 2(1): 55-66.
[107] FOERSTER J N, CHEN R Y, AL-SHEDIVAT M, et al. Learning with opponent-learning awareness[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1709.04326.
[108] RDULESCU R, VERSTRAETEN T, ZHANG Y, et al. Opponent learning awareness and modelling in multi-objective normal form games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2011.07290.
[109] RONEN I B, MOSHE T. R-MAX: a general polynomial time algorithm for near-optimal reinforcement learning[J]. Journal of Machine Learning Research, 2002, 3(10): 213-231.
[110] HIMABINDU L, ECE K, RICH C, et al. Identifying unknown unknowns in the open world: representations and policies for guided exploration[C]∥Proc.of the 31st AAAI Conference on Artificial Intelligence, 2017: 2124-2132.
[111] HERNANDEZ-LEAL P, KAISERS M. Learning against sequential opponents in repeated stochastic games[C]∥Proc.of the 3rd Multi-Disciplinary Conference on Reinforcement Learning and Decision Making, 2017.
[112] HERNANDEZ-LEAL P, ZHAN Y, TAYLOR M E, et al. Efficiently detecting switches against non-stationary opponents[J]. Autonomous Agents and Multi-Agent Systems, 2017, 31(4): 767-789.
[113] VON DER OSTEN F B, KIRLEY M, MILLER T. The minds of many: opponent modelling in a stochastic game[C]∥Proc.of the 26th International Joint Conference on Artificial Intelligence, 2017: 3845-3851.
[114] BAKKES S, SPRONCK P, HERIK H. Opponent modelling for case-based adaptive game AI[J]. Entertainment Computing, 2010, 1(1): 27-37.
[115] PAPOUDAKIS G, CHRISTIANOS F, RAHMAN A, et al. Dealing with non-stationarity in multi-agent deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1906.04737.
[116] DASKALAKIS C, GOLDBERG P W, PAPADIMITRIOU C H. The complexity of computing a Nash equilibrium[J]. SIAM Journal on Computing, 2009, 39(1):195-259.
[117] CONITZER V, SANDHOLM T. Complexity results about Nash equilibria[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/cs/0205074.
[118] CONITZER V, SANDHOLM T. New complexity results about Nash equilibria[J]. Games and Economic Behavior, 2008, 63(2): 621-641.
[119] ZHANG Y Z. Computing team-maxmin equilibria in zero-sum multiplayer games[D]. Singapore: Nanyang Technological University, 2020.
[120] LAUER M, RIEDMILLER M. An algorithm for distributed reinforcement learning in cooperative multi-agent systems[C]∥Proc.of the 17th International Conference on Machine Learning, 2000: 535-542.
[121] CLAUS C, BOUTILIER C. The dynamics of reinforcement learning in cooperative multiagent system[C]∥Proc.of the 15th National/10th Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence, 1998: 746-752.
[122] WANG X F, SANDHOLM T. Reinforcement learning to play an optimal Nash equilibrium in team Markov games[C]∥Proc.of the 15th International Conference on Neural Information Processing Systems, 2002: 1603-1610.
[123] ARSLAN G, YUKSEL S. Decentralized q-learning for stochastic teams and games[J]. IEEE Trans.on Automatic Control, 2016, 62(4): 1545-1558.
[124] HU J L, WELLMAN M P. Nash Q-learning for general-sum stochastic games[J]. Journal of Machine Learning Research, 2003, 4(11): 1039-1069.
[125] GREENWALD A, HALL L, SERRANO R. Correlated-q learning[C]∥Proc.of the 20th International Conference on Machine Learning, 2003: 242-249.
[126] KONONEN V. Asymmetric multi-agent reinforcement learning[J]. Web Intelligence and Agent Systems, 2004, 2(2): 105-121.
[127] LITTMAN M L. Friend-or-foe q-learning in general-sum games[C]∥Proc.of the 18th International Conference on Machine Learning, 2001: 322-328.
[128] SINGH S, KEARNS M, MANSOUR Y. Nash convergence of gradient dynamics in iterated general-sum games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1301.3892.
[129] ZINKEVICH M. Online convex programming and generalized infinitesimal gradient ascent[C]∥Proc.of the 20th International Conference on Machine Learning, 2003: 928-935.
[130] CONITZER V, SANDHOLM T. AWESOME: a general multiagent learning algorithm that converges in self-play and learns a best response against stationary opponents[J]. Machine Learning, 2007, 67: 23-43.
[131] TAN M. Multi-agent reinforcement learning: independent vs. cooperative agents[C]∥Proc.of the 10th International Conference on Machine Learning, 1993: 330-337.
[132] MATIGNON L, LAURENT G, LE FORT-PIAT N. Hysteretic Q-learning: an algorithm for decentralized reinforcement learning in cooperative multi-agent teams[C]∥Proc.of the IEEE/RSJ International Conference on Intelligent Robots and Systems, 2007: 64-69.
[133] MATIGNON L, LAURENT G, LE FORT-PIAT N. A study of FMQ heuristic in cooperative multi-agent games[C]∥Proc.of the 7th International Conference on Autonomous Agents and Multiagent Systems, 2008: 77-91.
[134] WEI E, LUKE S. Lenient learning in independent-learner stochastic cooperative games[J]. Journal of Machine Learning Research, 2016, 17(1): 2914-2955.
[135] PALMER G. Independent learning approaches: overcoming multi-agent learning pathologies in team-games[D]. Liverpool: University of Liverpool, 2020.
[136] SUKHBAATAR S, FERGUS R. Learning multiagent communication with backpropagation[C]∥Proc.of the 30th International Conference on Neural Information Processing Systems, 2016: 2244-2252.
[137] PENG P, WEN Y, YANG Y D, et al. Multiagent bidirectionally-coordinated nets: emergence of human-level coordination in learning to play StarCraft combat games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1703.10069.
[138] FOERSTER J N, FARQUHAR G, AFOURAS T, et al. Counterfactual multi-agent policy gradients[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2018: 2974-2982.
[139] LOWE R, WU Y, TAMAR A, et al. Multi-agent actor-critic for mixed cooperative-competitive environments[C]∥Proc.of the 31st International Conference on Neural Information Processing Systems, 2017: 6382-6393.
[140] WEI E, WICKE D, FREELAN D, et al. Multiagent soft q-learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1804.09817.
[141] SUNEHAG P, LEVER G, GRUSLYS A, et al. Value-decomposition networks for cooperative multi-agent learning based on team reward[C]∥Proc.of the 17th International Conference on Autonomous Agents and Multi-Agent Systems, 2018: 2085-2087.
[142] RASHID T, SAMVELYAN M, WITT C S, et al. Qmix: monotonic value function factorisation for deep multi-agent reinforcement learning[C]∥Proc.of the International Conference on Machine Learning, 2018: 4292-4301.
[143] MAHAJAN A, RASHID T, SAMVELYAN M, et al. MAVEN: multi-agent variational exploration[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1910.07483.
[144] SON K, KIM D, KANG W J, et al. Qtran: learning to factorize with transformation for cooperative multi-agent reinforcement learning[C]∥Proc.of the International Conference on Machine Learning, 2019: 5887-5896.
[145] YANG Y D, WEN Y, CHEN L H, et al. Multi-agent determinantal q-learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.01482.
[146] YU C, VELU A, VINITSKY E, et al. The surprising effectiveness of MAPPO in cooperative, multi-agent games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2103.01955.
[147] WANG J H, ZHANG Y, KIM T K, et al. Shapley q-value: a local reward approach to solve global reward games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2020: 7285-7292.
[148] RIEDMILLER M. Neural fitted Q iteration-first experiences with a data efficient neural reinforcement learning method[C]∥Proc.of the European Conference on Machine Learning, 2005: 317-328.
[149] NEDIC A, OLSHEVSKY A, SHI W. Achieving geometric convergence for distributed optimization over time-varying graphs[J]. SIAM Journal on Optimization, 2017, 27(4): 2597-2633.
[150] ZHANG K Q, YANG Z R, LIU H, et al. Fully decentralized multi-agent reinforcement learning with networked agents[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1802.08757.
[151] QU G N, LIN Y H, WIERMAN A, et al. Scalable multi-agent reinforcement learning for networked systems with average reward[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.06626.
[152] CHU T, CHINCHALI S, KATTI S. Multi-agent reinforcement learning for networked system control[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2004.01339.
[153] LESAGE-LANDRY A, CALLAWAY D S. Approximate multi-agent fitted q iteration[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2104.09343.
[154] ZHANG K Q, YANG Z R, LIU H, et al. Finite-sample analysis for decentralized batch multi-agent reinforcement learning with networked agents[J]. IEEE Trans.on Automatic Control, 2021, 66(12): 5925-5940.
[155] SANDHOLM T, GILPIN A, CONITZER V. Mixed-integer programming methods for finding Nash equilibria[C]∥Proc.of the 20th National Conference on Artificial Intelligence, 2005: 495-501.
[156] NESTEROV Y. Excessive gap technique in nonsmooth convex minimization[J]. SIAM Journal on Optimization, 2005, 16(1): 235-249.
[157] SUN Z F, NAKHAI M R. An online mirror-prox optimization approach to proactive resource allocation in MEC[C]∥Proc.of the IEEE International Conference on Communications, 2020.
[158] BECK A, TEBOULLE M. Mirror descent and nonlinear projected subgradient methods for convex optimization[J]. Operations Research Letters, 2003, 31(3): 167-175.
[159] LOCKHART E, LANCTOT M, PEROLAT J, et al. Computing approximate equilibria in sequential adversarial games by exploitability descent[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1903.05614.
[160] LAN S. Geometrical regret matching: a new dynamics to Nash equilibrium[J]. AIP Advances, 2020, 10(6): 065033.
[161] VON S B, FORGES F. Extensive-form correlated equilibrium: definition and computational complexity[J]. Mathematics of Operations Research, 2008, 33(4): 1002-1022.
[162] CESA-BIANCHI N, LUGOSI G. Prediction, learning, and games[M]. Cambridge: Cambridge University Press, 2006.
[163] FREUND Y, SCHAPIRE R E. Adaptive game playing using multiplicative weights[J]. Games and Economic Behavior, 1999, 29(1/2): 79-103.
[164] HART S, MAS-COLELL A. A general class of adaptive strategies[J]. Journal of Economic Theory, 2001, 98(1): 26-54.
[165] LEMKE C E, HOWSON J T. Equilibrium points of bimatrix games[J]. Journal of the Society for Industrial and Applied Mathematics, 1964, 12 (2): 413-423.
[166] PORTER R, NUDELMAN E, SHOHAM Y. Simple search methods for finding a Nash equilibrium[J]. Games and Economic Behavior, 2008, 63(2): 642-662.
[167] CEPPI S, GATTI N, PATRINI G, et al. Local search techniques for computing equilibria in two-player general-sum strategic form games[C]∥Proc.of the 9th International Conference on Autonomous Agents and Multiagent Systems, 2010: 1469-1470.
[168] CELLI A, CONIGLIO S, GATTI N. Computing optimal ex ante correlated equilibria in two-player sequential games[C]∥Proc.of the 18th International Conference on Autonomous Agents and Multiagent Systems, 2019: 909-917.
[169] VON S B, FORGES F. Extensive-form correlated equilibrium: definition and computational complexity[J]. Mathematics of Operations Research, 2008, 33(4): 1002-1022.
[170] FARINA G, LING C K, FANG F, et al. Efficient regret minimization algorithm for extensive-form correlated equilibrium[C]∥Proc.of the 33rd International Conference on Neural Information Processing Systems, 2019: 5186-5196.
[171] PAPADIMITRIOU C H, ROUGHGARDEN T. Computing correlated equilibria in multi-player games[J]. Journal of the ACM, 2008, 55(3): 14.
[172] CELLI A, MARCHESI A, BIANCHI T, et al. Learning to correlate in multi-player general-sum sequential games[C]∥Proc.of the 33rd International Conference on Neural Information Processing Systems, 2019: 13076-13086.
[173] JIANG A X, LEYTON-BROWN K. Polynomial-time computation of exact correlated equilibrium in compact games[J]. Games and Economic Behavior, 2015, 91: 119-126.
[174] FOSTER D P, YOUNG H P. Regret testing: learning to play Nash equilibrium without knowing you have an opponent[J]. Theoretical Economics, 2006, 1(3): 341-367.
[175] ABERNETHY J, BARTLETT P L, HAZAN E. Blackwell approachability and no-regret learning are equivalent[C]∥Proc.of the 24th Annual Conference on Learning Theory, 2011: 27-46.
[176] FARINA G, KROER C, SANDHOLM T. Faster game solving via predictive Blackwell approachability: connecting regret matching and mirror descent[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5363-5371.
[177] SRINIVASAN S, LANCTOT M, ZAMBALDI V, et al. Actor-critic policy optimization in partially observable multiagent environments[C]∥Proc.of the 32nd International Conference on Neural Information Processing Systems, 2018: 3426-3439.
[178] ZINKEVICH M, JOHANSON M, BOWLING M, et al, Regret minimization in games with incomplete information[C]∥Proc.of the 20th International Conference on Neural Information Processing Systems, 2007: 1729-1736.
[179] BOWLING M, BURCH N, JOHANSON M, et al. Heads-up limit hold’em poker is solved[J]. Science, 2015, 347(6218): 145-149.
[180] BROWN N, LERER A, GROSS S, et al. Deep counterfactual regret minimization[C]∥Proc.of the International Conference on Machine Learning, 2019: 793-802.
[181] BROWN N, SANDHOLM T. Solving imperfect-information games via discounted regret minimization[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 1829-1836.
[182] LI H L, WANG X, QI S H, et al. Solving imperfect-information games via exponential counterfactual regret minimization[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2008.02679.
[183] LANCTOT M, WAUGH K, ZINKEVICH M, et al. Monte Carlo sampling for regret minimization in extensive games[C]∥Proc.of the 22nd International Conference on Neural Information Processing Systems, 2009: 1078-1086.
[184] LI H, HU K L, ZHANG S H, et al. Double neural counterfactual regret minimization[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1812.10607.
[185] JACKSON E G. Targeted CFR[C]∥Proc.of the 31st AAAI Conference on Artificial Intelligence, 2017.
[186] SCHMID M, BURCH N, LANCTOT M, et al. Variance reduction in Monte Carlo counterfactual regret minimization (VR-MCCFR) for extensive form games using baselines[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 2157-2164.
[187] ZHOU Y C, REN T Z, LI J L, et al. Lazy-CFR: a fast regret minimization algorithm for extensive games with imperfect information[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1810.04433.
[188] WAUGH K, MORRILL D, BAGNELL J A, et al. Solving games with functional regret estimation[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2015: 2138-2144.
[189] D’ORAZIO R, MORRILL D, WRIGHT J R, et al. Alternative function approximation parameterizations for solving games: an analysis of f-regression counterfactual regret minimization[C]∥Proc.of the 19th International Conference on Autonomous Agents and Multiagent Systems, 2020: 339-347.
[190] PILIOURAS G, ROWLAND M, OMIDSHAFIEI S, et al. Evolutionary dynamics and Φ-regret minimization in games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.14668v1.
[191] STEINBERGER E. Single deep counterfactual regret minimization[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1901.07621.
[192] LI H L, WANG X, GUO Z Y, et al. RLCFR: minimize counterfactual regret with neural networks[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2105.12328.
[193] LI H L, WANG X, JIA F W, et al. RLCFR: minimize counterfactual regret by deep reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2009.06373.
[194] LIU W M, LI B, TOGELIUS J. Model-free neural counterfactual regret minimization with bootstrap learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2012.01870.
[195] SCHMID M, MORAVCIK M, BURCH N, et al. Player of games[EB/OL]. [2021-12-30]. http:∥arxiv.org/abs/2112.03178.
[196] KROER C, WAUGH K, KILINC-KARZAN F, et al. Faster first-order methods for extensive-form game solving[C]∥Proc.of the 16th ACM Conference on Economics and Computation, 2015: 817-834.
[197] LESLIE D S, COLLINS E J. Generalised weakened fictitious play[J]. Games and Economic Behavior, 2006, 56(2): 285-298.
[198] KROER C, WAUGH K, KILINC-KARZAN F, et al. Faster algorithms for extensive-form game solving via improved smoothing functions[J]. Mathematical Programming, 2020, 179(1): 385-417.
[199] FARINA G, KROER C, SANDHOLM T. Optimistic regret minimization for extensive-form games via dilated distance-generating functions[C]∥Proc.of the 33rd International Conference on Neural Information Processing Systems, 2019: 5221-5231.
[200] LIU W M, JIANG H C, LI B, et al. Equivalence analysis between counterfactual regret minimization and online mirror descent[EB/OL]. [2021-12-11]. http:∥arxiv.org/abs/2110.04961.
[201] PEROLAT J, MUNOS R, LESPIAU J B, et al. From Poincaré recurrence to convergence in imperfect information games: finding equilibrium via regularization[C]∥Proc.of the International Conference on Machine Learning, 2021: 8525-8535.
[202] MUNOS R, PEROLAT J, LESPIAU J B, et al. Fast computation of Nash equilibria in imperfect information games[C]∥Proc.of the International Conference on Machine Learning, 2020: 7119-7129.
[203] KAWAMURA K, MIZUKAMI N, TSURUOKA Y. Neural fictitious self-play in imperfect information games with many players[C]∥Proc.of the Workshop on Computer Games, 2017: 61-74.
[204] ZHANG L, CHEN Y X, WANG W, et al. A Monte Carlo neural fictitious self-play approach to approximate Nash equilibrium in imperfect-information dynamic games[J]. Frontiers of Computer Science, 2021, 15(5): 155334.
[205] STEINBERGER E, LERER A, BROWN N. DREAM: deep regret minimization with advantage baselines and model-free learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.10410.
[206] BROWN N, BAKHTIN A, LERER A, et al. Combining deep reinforcement learning and search for imperfect-information games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2007.13544.
[207] GRUSLYS A, LANCTOT M, MUNOS R, et al. The advantage regret-matching actor-critic[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2008.12234.
[208] CHEN Y X, ZHANG L, LI S J, et al. Optimize neural fictitious self-play in regret minimization thinking[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2104.10845.
[209] SONZOGNI S. Depth-limited approaches in adversarial team games[D]. Milano: Politecnico Di Milano, 2019.
[210] ZHANG Y Z, AN B. Converging to team maxmin equilibria in zero-sum multiplayer games[C]∥Proc.of the International Conference on Machine Learning, 2020: 11033-11043.
[211] ZHANG Y Z, AN B, LONG T T, et al. Optimal escape interdiction on transportation networks[C]∥Proc.of the 26th International Joint Conference on Artificial Intelligence, 2017: 3936-3944.
[212] ZHANG Y Z, AN B. Computing ex ante coordinated team-maxmin equilibria in zero-sum multiplayer extensive-form games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5813-5821.
[213] ZHANG Y Z, GUO Q Y, AN B, et al. Optimal interdiction of urban criminals with the aid of real-time information[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 1262-1269.
[214] BOTVINICK M, RITTER S, WANG J X, et al. Reinforcement learning, fast and slow[J]. Trends in Cognitive Sciences, 2019, 23(5): 408-422.
[215] LANCTOT M, ZAMBALDI V, GRUSLYS A, et al. A unified game-theoretic approach to multiagent reinforcement learning[C]∥Proc.of the 31st International Conference on Neural Information Processing Systems, 2017: 4193-4206.
[216] MULLER P, OMIDSHAFIEI S, ROWLAND M, et al. A generalized training approach for multiagent learning[C]∥Proc.of the 8th International Conference on Learning Representations, 2020.
[217] SUN P, XIONG J C, HAN L, et al. TLeague: a framework for competitive self-play based distributed multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2011.12895.
[218] ZHOU M, WAN Z Y, WANG H J, et al. MALib: a parallel framework for population-based multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.07551.
[219] LISY V, BOWLING M. Equilibrium approximation quality of current no-limit poker bots[C]∥Proc.of the 31st AAAI Conference on Artificial Intelligence, 2017.
[220] CLOUD A, LABER E. Variance decompositions for extensive-form games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2009.04834.
[221] SUSTR M, SCHMID M, MORAVCIK M. Sound algorithms in imperfect information games[C]∥Proc.of the 20th International Conference on Autonomous Agents and Multiagent Systems, 2021: 1674-1676.
[222] BREANNA M. Comparing Elo, Glicko, IRT, and Bayesian IRT statistical models for educational and gaming data[D]. Fayetteville: University of Arkansas, 2019.
[223] PANKIEWICZ M, BATOR M. Elo rating algorithm for the purpose of measuring task difficulty in online learning environments[J]. E-Mentor, 2019, 82(5): 43-51.
[224] GLICKMAN M E. The glicko system[M]. Boston: Boston University, 1995.
[225] HERBRICH R, MINKA T, GRAEPEL T. TrueSkill™: a Bayesian skill rating system[C]∥Proc.of the 19th International Conference on Neural Information Processing Systems, 2006: 569-576.
[226] OMIDSHAFIEI S, PAPADIMITRIOU C, PILIOURAS G, et al. α-Rank: multi-agent evaluation by evolution[J]. Scientific Reports, 2019, 9(1): 9937.
[227] YANG Y D, TUTUNOV R, SAKULWONGTANA P, et al. αα-Rank: practically scaling α-rank through stochastic optimisation[C]∥Proc.of the 19th International Conference on Autonomous Agents and Multiagent Systems, 2020: 1575-1583.
[228] ROWLAND M, OMIDSHAFIEI S, TUYLS K, et al. Multiagent evaluation under incomplete information[C]∥Proc.of the 33rd International Conference on Neural Information Processing Systems, 2019: 12291-12303.
[229] RASHID T, ZHANG C, CIOSEK K, et al. Estimating α-rank by maximizing information gain[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5673-5681.
[230] DU Y L, YAN X, CHEN X, et al. Estimating α-rank from a few entries with low rank matrix completion[C]∥Proc.of the International Conference on Machine Learning, 2021: 2870-2879.
[231] ROOHI S, GUCKELSBERGER C, RELAS A, et al. Predicting game engagement and difficulty using AI players[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2107.12061.
[232] O'BRIEN J D, GLEESON J P. A complex networks approach to ranking professional Snooker players[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2010.08395.
[233] JORDAN S M, CHANDAK Y, COHEN D, et al. Evaluating the performance of reinforcement learning algorithms[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.16958.
[234] DEHPANAH A, GHORI M F, GEMMELL J, et al. The evaluation of rating systems in online free-for-all games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2006.16958.
[235] LEIBO J Z, DUENEZ-GUZMAN E, VEZHNEVETS A S, et al. Scalable evaluation of multi-agent reinforcement learning with melting pot[C]∥Proc.of the International Conference on Machine Learning, 2021: 6187-6199.
[236] EBTEKAR A, LIU P. Elo-MMR: a rating system for massive multiplayer competitions[C]∥Proc.of the Web Conference, 2021: 1772-1784.
[237] DEHPANAH A, GHORI M F, GEMMELL J, et al. Evaluating team skill aggregation in online competitive games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.11397.
[238] HERNANDEZ D, DENAMGANAI K, DEVLIN S, et al. A comparison of self-play algorithms under a generalized framework[J]. IEEE Trans.on Games, 2021, 14(2): 221-231.
[239] LEIGH R, SCHONFELD J, LOUIS S J. Using coevolution to understand and validate game balance in continuous games[C]∥Proc.of the 10th Annual Conference on Genetic and Evolutionary Computation, 2008: 1563-1570.
[240] SAYIN M O, PARISE F, OZDAGLAR A. Fictitious play in zero-sum stochastic games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2010.04223.
[241] JADERBERG M, CZARNECKI W M, DUNNING I, et al. Human-level performance in 3D multiplayer games with population-based reinforcement learning[J]. Science, 2019, 364(6443): 859-865.
[242] SAMUEL A L. Some studies in machine learning using the game of checkers[J]. IBM Journal of Research and Development, 2000, 44(1/2): 206-226.
[243] BANSAL T, PACHOCKI J, SIDOR S, et al. Emergent complexity via multi-agent competition[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1710.03748.
[244] SUKHBAATAR S, LIN Z, KOSTRIKOV I, et al. Intrinsic motivation and automatic curricula via asymmetric self-play[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1703.05407.
[245] ADAM L, HORCIK R, KASL T, et al. Double oracle algorithm for computing equilibria in continuous games[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5070-5077.
[246] WANG Y Z, MA Q R, WELLMAN M P. Evaluating strategy exploration in empirical game-theoretic analysis[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2105.10423.
[247] OHSAWA S. Unbiased self-play[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.03007.
[248] HENDON E, JACOBSEN H J, SLOTH B. Fictitious play in extensive form games[J]. Games and Economic Behavior, 1996, 15(2): 177-202.
[249] HEINRICH J, LANCTOT M, SILVER D. Fictitious self-play in extensive-form games[C]∥Proc.of the International Conference on Machine Learning, 2015: 805-813.
[250] LIU B Y, YANG Z R, WANG Z R. Policy optimization in zero-sum Markov games: fictitious self-play provably attains Nash equilibria[EB/OL]. [2021-08-01]. https:∥openreview.net/forum?id=c3MWGN_cTf.
[251] HOFBAUER J, SANDHOLM W H. On the global convergence of stochastic fictitious play[J]. Econometrica, 2002, 70(6): 2265-2294.
[252] FARINA G, CELLI A, GATTI N, et al. Ex ante coordination and collusion in zero-sum multi-player extensive-form games[C]∥Proc.of the 32nd International Conference on Neural Information Processing Systems, 2018: 9661-9671.
[253] HEINRICH J. Deep reinforcement learning from self-play in imperfect-information games[D]. London: University College London, 2016.
[254] NIEVES N P, YANG Y, SLUMBERS O, et al. Modelling behavioural diversity for learning in open-ended games[C]∥Proc.of the International Conference on Machine Learning, 2021: 8514-8524.
[255] KLIJN D, EIBEN A E. A coevolutionary approach to deep multi-agent reinforcement learning[C]∥Proc.of the Genetic and Evolutionary Computation Conference, 2021.
[256] WRIGHT M, WANG Y, WELLMAN M P. Iterated deep reinforcement learning in games: history-aware training for improved stability[C]∥Proc.of the ACM Conference on Economics and Computation, 2019: 617-636.
[257] SMITH M O, ANTHONY T, WANG Y, et al. Learning to play against any mixture of opponents[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2009.14180.
[258] SMITH M O, ANTHONY T, WELLMAN M P. Iterative empirical game solving via single policy best response[C]∥Proc.of the International Conference on Learning Representations, 2020.
[259] MARRIS L, MULLER P, LANCTOT M, et al. Multi-agent training beyond zero-sum with correlated equilibrium meta-solvers[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.09435.
[260] MCALEER S, LANIER J, FOX R, et al. Pipeline PSRO: a scalable approach for finding approximate Nash equilibria in large games[C]∥Proc.of the 34th International Conference on Neural Information Processing Systems, 2020, 33: 20238-20248.
[261] DINH L C, YANG Y, TIAN Z, et al. Online double oracle[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2103.07780.
[262] FENG X D, SLUMBERS O, YANG Y D, et al. Discovering multi-agent auto-curricula in two-player zero-sum games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.02745.
[263] MCALEER S, WANG K, LANCTOT M, et al. Anytime optimal PSRO for two-player zero-sum games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2201.07700.
[264] ZHOU M, CHEN J X, WEN Y, et al. Efficient policy space response oracles[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2202.00633.
[265] LIU S Q, MARRIS L, HENNES D, et al. NeuPL: neural population learning[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2202.07415.
[266] YANG Y D, LUO J, WEN Y, et al. Diverse auto-curriculum is critical for successful real-world multiagent learning systems[C]∥Proc.of the 20th International Conference on Autonomous Agents and Multiagent Systems, 2021: 51-56.
[267] WU Z, LI K, ZHAO E M, et al. L2E: learning to exploit your opponent[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2102.09381.
[268] LEIBO J Z, HUGHES E, LANCTOT M, et al. Autocurricula and the emergence of innovation from social interaction: a manifesto for multi-agent intelligence research[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/1903.00742.
[269] LIU X Y, JIA H T, WEN Y, et al. Unifying behavioral and response diversity for open-ended learning in zero-sum games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.04958.
[270] MOURET J B. Evolving the behavior of machines: from micro to macroevolution[J]. Iscience, 2020, 23(11): 101731.
[271] MCKEE K R, LEIBO J Z, BEATTIE C, et al. Quantifying environment and population diversity in multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2102.08370.
[272] PACCHIANO A, PARKER-HOLDER J, CHOROMANSKI K M, et al. Effective diversity in population-based reinforcement learning[C]∥Proc.of the 34th International Conference on Neural Information Processing Systems, 2020: 18050-18062.
[273] MASOOD M A, DOSHI-VELEZ F. Diversity-inducing policy gradient: using maximum mean discrepancy to find a set of diverse policies[C]∥Proc.of the 28th International Joint Conference on Artificial Intelligence, 2019: 5923-5929.
[274] GARNELO M, CZARNECKI W M, LIU S, et al. Pick your battles: interaction graphs as population-level objectives for strategic diversity[C]∥Proc.of the 20th International Conference on Autonomous Agents and Multi-Agent Systems, 2021: 1501-1503.
[275] TAVARES A, AZPURUA H, SANTOS A, et al. Rock, paper, StarCraft: strategy selection in real-time strategy games[C]∥Proc.of the 12th AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, 2016: 93-99.
[276] HERNANDEZ-LEAL P, MUNOZ DE COTE E, SUCAR L E. A framework for learning and planning against switching strategies in repeated games[J]. Connection Science, 2014, 26(2): 103-122.
[277] FEI Y J, YANG Z R, WANG Z R, et al. Dynamic regret of policy optimization in non-stationary environments[C]∥Proc.of the 31st International Conference on Neural Information Processing Systems, 2020: 6743-6754.
[278] WRIGHT M, VOROBEYCHIK Y. Mechanism design for team formation[C]∥Proc.of the 29th AAAI Conference on Artificial Intelligence, 2015: 1050-1056.
[279] AUER P, JAKSCH T, ORTNER R, et al. Near-optimal regret bounds for reinforcement learning[C]∥Proc.of the 21st International Conference on Neural Information Processing Systems, 2008: 89-96.
[280] HE J F, ZHOU D R, GU Q Q, et al. Nearly optimal regret for learning adversarial MDPs with linear function approximation[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2102.08940.
[281] JAFARNIA-JAHROMI M, JAIN R, NAYYAR A. Online learning for unknown partially observable MDPs[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2102.12661.
[282] TIAN Y, WANG Y H, YU T C, et al. Online learning in unknown Markov games[C]∥Proc.of the International Conference on Machine Learning, 2021: 10279-10288.
[283] KASH I A, SULLINS M, HOFMANN K. Combining no-regret and q-learning[C]∥Proc.of the 19th International Conference on Autonomous Agents and Multi-Agent Systems, 2020: 593-601.
[284] LIN T Y, ZHOU Z Y, MERTIKOPOULOS P, et al. Finite-time last-iterate convergence for multi-agent learning in games[C]∥Proc.of the International Conference on Machine Learning, 2020: 6161-6171.
[285] LEE C W, KROER C, LUO H P. Last-iterate convergence in extensive-form games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.14326.
[286] DASKALAKIS C, FISHELSON M, GOLOWICH N. Near-optimal no-regret learning in general games[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2108.06924.
[287] MORRILL D, D’ORAZIO R, SARFATI R, et al. Hindsight and sequential rationality of correlated play[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2021: 5584-5594.
[288] LI X. Opponent modeling and exploitation in poker using evolved recurrent neural networks[D]. Austin: University of Texas at Austin, 2018.
[289] GANZFRIED S. Computing strong game-theoretic strategies and exploiting suboptimal opponents in large games[D]. Pittsburgh: Carnegie Mellon University, 2015.
[290] DAVIS T, WAUGH K, BOWLING M. Solving large extensive-form games with strategy constraints[C]∥Proc.of the AAAI Conference on Artificial Intelligence, 2019: 1861-1868.
[291] KIM D K, LIU M, RIEMER M, et al. A policy gradient algorithm for learning to learn in multiagent reinforcement learning[C]∥Proc.of the International Conference on Machine Learning, 2021: 5541-5550.
[292] SILVA F, COSTA A, STONE P. Building self-play curricula online by playing with expert agents in adversarial games[C]∥Proc.of the 8th Brazilian Conference on Intelligent Systems, 2019: 479-484.
[293] SUSTR M, KOVARIK V, LISY V. Monte Carlo continual resolving for online strategy computation in imperfect information games[C]∥Proc.of the 18th International Conference on Autonomous Agents and Multi-Agent Systems, 2019: 224-232.
[294] BROWN N, SANDHOLM T. Safe and nested subgame solving for imperfect-information games[C]∥Proc.of the 31st International Conference on Neural Information Processing Systems, 2017: 689-699.
[295] TIAN Z. Opponent modelling in multi-agent systems[D]. London: University College London, 2021.
[296] WANG T H, DONG H, LESSER V, et al. ROMA: multi-agent reinforcement learning with emergent roles[C]∥Proc.of the International Conference on Machine Learning, 2020: 9876-9886.
[297] GONG L X, FENG X C, YE D Z, et al. OptMatch: optimized matchmaking via modeling the high-order interactions on the arena[C]∥Proc.of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2020: 2300-2310.
[298] HU H Y, LERER A, PEYSAKHOVICH A, et al. “Other-play” for zero-shot coordination[C]∥Proc.of the International Conference on Machine Learning, 2020: 4399-4410.
[299] TREUTLEIN J, DENNIS M, OESTERHELD C, et al. A new formalism, method and open issues for zero-shot coordination[C]∥Proc.of the International Conference on Machine Learning, 2021: 10413-10423.
[300] LUCERO C, IZUMIGAWA C, FREDERIKSEN K, et al. Human-autonomy teaming and explainable AI capabilities in RTS games[C]∥Proc.of the International Conference on Human-Computer Interaction, 2020: 161-171.
[301] WAYTOWICH N, BARTON S L, LAWHERN V, et al. Grounding natural language commands to StarCraft II game states for narration-guided reinforcement learning[J]. Artificial Intelligence and Machine Learning for Multi-Domain Operations Applications, 2019, 11006: 267-276.
[302] SIU H C, PENA J D, CHANG K C, et al. Evaluation of human-AI teams for learned and rule-based agents in Hanabi[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2107.07630.
[303] KOTSERUBA I, TSOTSOS J K. 40 years of cognitive architectures: core cognitive abilities and practical applications[J]. Artificial Intelligence Review, 2020, 53(1): 17-94.
[304] KOTT A. Adversarial reasoning: computational approaches to reading the opponent’s mind[M]. Boca Raton: Chapman & Hall/CRC, 2006.
[305] KULKARNI A. Synthesis of interpretable and obfuscatory behaviors in human-aware AI systems[D]. Tempe: Arizona State University, 2020.
[306] ZHENG Y, HAO J Y, ZHANG Z Z, et al. Efficient policy detecting and reusing for non-stationarity in Markov games[J]. Autonomous Agents and Multi-Agent Systems, 2021, 35(1): 1-29.
[307] SHEN M, HOW J P. Safe adaptation in multiagent competition[EB/OL]. [2022-03-12]. http:∥arxiv.org/abs/2203.07562.
[308] HAWKIN J. Automated abstraction of large action spaces in imperfect information extensive-form games[D]. Edmonton: University of Alberta, 2014.
[309] ABEL D. A theory of abstraction in reinforcement learning[D]. Providence: Brown University, 2020.
[310] YANG Y D, LUO R, LI M N, et al. Mean field multi-agent reinforcement learning[C]∥Proc.of the International Conference on Machine Learning, 2018: 5571-5580.
[311] JI K Y. Bilevel optimization for machine learning: algorithm design and convergence analysis[D]. Columbus: Ohio State University, 2020.
[312] BOSSENS D M, TARAPORE D. Quality-diversity meta-evolution: customising behaviour spaces to a meta-objective[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2109.03918v1.
[313] MAJID A Y, SAAYBI S, RIETBERGEN T, et al. Deep reinforcement learning versus evolution strategies: a comparative survey[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2110.01411.
[314] RAMPONI G. Challenges and opportunities in multi-agent reinforcement learning[D]. Milano: Politecnico Di Milano, 2021.
[315] KHETARPAL K, RIEMER M, RISH I, et al. Towards continual reinforcement learning: a review and perspectives[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2012.13490.
[316] MENG D Y, ZHAO Q, JIANG L. A theoretical understanding of self-paced learning[J]. Information Sciences, 2017, 414: 319-328.
[317] 尹奇躍, 趙美靜, 倪晚成, 等. 兵棋推演的智能決策技術(shù)與挑戰(zhàn)[J]. 自動化學(xué)報, 2021, 47(5): 913-928.
YIN Q Y, ZHAO M J, NI W C, et al. Intelligent decision making technology and challenge of wargame[J]. Acta Automatica Sinica, 2021, 47(5): 913-928.
[318] 程愷, 陳剛, 余曉晗, 等. 知識牽引與數(shù)據(jù)驅(qū)動的兵棋AI設(shè)計及關(guān)鍵技術(shù)[J]. 系統(tǒng)工程與電子技術(shù), 2021, 43(10): 2911-2917.
CHENG K, CHEN G, YU X H, et al. Knowledge traction and data-driven wargame AI design and key technologies[J]. Systems Engineering and Electronics, 2021, 43(10): 2911-2917.
[319] 蒲志強(qiáng), 易建強(qiáng), 劉振, 等. 知識和數(shù)據(jù)協(xié)同驅(qū)動的群體智能決策方法研究綜述[J]. 自動化學(xué)報, 2022, 48(3): 627-643.
PU Z Q, YI J Q, LIU Z, et al. Knowledge-based and data-driven integrating methodologies for collective intelligence decision making: a survey[J]. Acta Automatica Sinica, 2022, 48(3): 627-643.
[320] 張馭龍, 范長俊, 馮旸赫, 等. 任務(wù)級兵棋智能決策技術(shù)框架設(shè)計與關(guān)鍵問題分析[J]. 指揮與控制學(xué)報, 2024, 10(1): 19-25.
ZHANG Y L, FAN C J, FENG Y H, et al. Technical framework design and key issues analysis in task-level wargame intelligent decision making[J]. Journal of Command and Control, 2024, 10(1): 19-25.
[321] CHEN L L, LU K, RAJESWARAN A, et al. Decision transformer: reinforcement learning via sequence modeling[C]∥Proc.of the 35th Conference on Neural Information Processing Systems, 2021: 15084-15097.
[322] MENG L H, WEN M N, YANG Y D, et al. Offline pre-trained multi-agent decision transformer: one big sequence model conquers all StarCraft II tasks[EB/OL]. [2022-01-01]. http:∥arxiv.org/abs/2112.02845.
[323] ZHENG Q Q, ZHANG A, GROVER A. Online decision transformer [EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2202.05607.
[324] MATHIEU M, OZAIR S, SRINIVASAN S, et al. StarCraft II unplugged: large scale offline reinforcement learning[C]∥Proc.of the Deep RL Workshop NeurIPS 2021, 2021.
[325] SAMVELYAN M, RASHID T, SCHROEDER D W C, et al. The StarCraft multi-agent challenge[C]∥Proc.of the 18th International Conference on Autonomous Agents and Multi-agent Systems, 2019: 2186-2188.
[326] LANCTOT M, LOCKHART E, LESPIAU J B, et al. OpenSpiel: a framework for reinforcement learning in games[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/1908.09453.
[327] TERRY J K, BLACK B, GRAMMEL N, et al. PettingZoo: gym for multi-agent reinforcement learning[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2009.14471.
[328] PRETORIUS A, TESSERA K, SMIT A P, et al. MAVA: a research framework for distributed multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2107.01460.
[329] YAO M, YIN Q Y, YANG J, et al. The partially observable asynchronous multi-agent cooperation challenge[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2112.03809.
[330] MORITZ P, NISHIHARA R, WANG S, et al. Ray: a distributed framework for emerging AI applications[C]∥Proc.of the 13th USENIX Symposium on Operating Systems Design and Implementation, 2018: 561-577.
[331] ESPEHOLT L, MARINIER R, STANCZYK P, et al. SEED RL: scalable and efficient deep-RL with accelerated central inference[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/1910.06591.
[332] MOHANTY S, NYGREN E, LAURENT F, et al. Flatland-RL: multi-agent reinforcement learning on trains[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2012.05893.
[333] SUN P, XIONG J C, HAN L, et al. Tleague: a framework for competitive self-play based distributed multi-agent reinforcement learning[EB/OL]. [2022-03-01]. http:∥arxiv.org/abs/2011.12895.
[334] ZHOU M, WAN Z Y, WANG H J, et al. MALib: a parallel framework for population-based multi-agent reinforcement learning[EB/OL]. [2021-08-01]. http:∥arxiv.org/abs/2106.07551.
Author biographies
LUO Junren (1989—), male, Ph.D. candidate. His main research interests are multi-agent learning and intelligent games.
ZHANG Wanpeng (1981—), male, research fellow, Ph.D. His main research interests are big data intelligence and intelligent evolution.
SU Jiongming (1984—), male, associate research fellow, Ph.D. His main research interests are explainable artificial intelligence and intelligent games.
YUAN Weilin (1994—), male, Ph.D. candidate. His main research interests are intelligent games and multi-agent reinforcement learning.
CHEN Jing (1972—), male, professor, Ph.D. His main research interests are cognitive game decision-making and distributed intelligence.