Pseudo-convergent Q-Learning by Competitive Pricebots
Jeffrey O. Kephart (kephart@us.ibm.com), Gerald J. Tesauro (tesauro@watson.ibm.com)
IBM Thomas J. Watson Research Center, 30 Saw Mill River Rd., Hawthorne, NY 10532 USA

Abstract

We study novel aspects of multi-agent Q-learning in a model market in which two identical, competing "pricebots" strategically price a commodity. Two fundamentally different solutions are observed: an exact, stationary solution with zero Bellman error, consisting of symmetric policies, and a non-stationary, broken-symmetry pseudo-solution with small but non-zero Bellman error. This "pseudo-convergent" asymmetric solution has no analog in ordinary Q-learning. We calculate analytically the form of both solutions, and map out numerically the conditions under which each occurs. We suggest that this observed behavior will also be found more generally in other studies of multi-agent Q-learning, and discuss implications and directions for future research.

1. Introduction

Within the next few years, we expect electronic commerce to be an important multi-agent domain in which reinforcement learning will find numerous applications. One such application is automated dynamic pricing by software agents (Greenwald & Kephart, 1999). Suppose that each seller agent individually attempts to maximize profits through judicious setting of prices and other product parameters. Even if the seller agents do not communicate with one another directly, market forces may strongly couple their actions, resulting in a highly dynamic multi-agent system. Since decision making in markets and economies benefits greatly from one's ability to forecast economic trends and opponents' strategies, reinforcement learning is likely to be an essential component of decision making by economically-motivated software agents.

Unfortunately, from a theoretical perspective, the issue of what happens when multiple interacting agents simultaneously adapt, using RL or other approaches, is largely an open question. This stands in contrast to the case of single-agent RL: in stationary Markov Decision Problems, a solid theoretical understanding has been provided by research on algorithms such as Dynamic Programming and Q-learning. Various theorems establish that global convergence to a unique optimal value function and policy will always be obtained. However, these theorems do not apply in the multi-agent case, as adapting agents provide effectively non-stationary environments for other agents.

Some progress has been made in analyzing certain special-case multi-agent problems. For example, cooperative teams of agents sharing a common goal or utility function have been studied in (Stone & Veloso, 1999), among others. The purely competitive case of zero-sum utility functions has been studied in (Littman, 1994), where an algorithm called "minimax-Q" was proposed for two-player zero-sum games, and shown to converge to the optimal value functions and policies for both players. Simultaneous Q-learning by two players in the Iterated Prisoner's Dilemma game was studied empirically in (Sandholm & Crites, 1995), who found that the learning procedure frequently converged to stationary solutions. An important first step in analyzing Q-learning for arbitrary-sum two-player games was recently taken in (Hu & Wellman, 1998). This algorithm assumes that the players follow Nash equilibrium policies. Issues remaining to be addressed include the "equilibrium coordination" problem (i.e. how the agents choose from amongst multiple Nash equilibria) and verification that the policies implied by the learned Q-functions are consistent with the initially assumed Nash policies.

In our previous work (Tesauro & Kephart, 1999), we studied simultaneous Q-learning by two price-setting agents in three simple models, showing that in all cases Q-learning by one or both of the agents raised their profits substantially from what they would obtain using myopic best-response pricing. Simultaneous convergence of both sellers to stationary policies was found in some but not all cases, depending
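The setting described above can be illustrated with a minimal sketch of simultaneous Q-learning by two identical price-setting agents. The price grid, profit function, and learning parameters below are illustrative assumptions, not the paper's actual model; each agent conditions on its opponent's last posted price, and the key point is that each agent's environment is non-stationary because the other agent is learning at the same time.

```python
import random

# Illustrative, simplified setting (not the paper's exact model):
# two identical sellers repeatedly pick a price from a small grid;
# the cheaper seller captures the whole market, ties split it.
PRICES = [0.6, 0.8, 1.0]           # hypothetical price grid
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1  # learning rate, discount, exploration

def profit(own, other):
    """Profit when posting price `own` against opponent price `other`."""
    if own < other:
        return own       # undercut: capture the whole market
    if own == other:
        return own / 2   # tie: split the market
    return 0.0           # undercut by opponent: sell nothing

def simultaneous_q_learning(steps=50_000, seed=0):
    rng = random.Random(seed)
    # Each agent's "state" is the opponent's last posted price.
    q = [{s: {a: 0.0 for a in PRICES} for s in PRICES} for _ in range(2)]
    state = [PRICES[0], PRICES[0]]
    for _ in range(steps):
        # Both agents choose prices simultaneously (epsilon-greedy).
        acts = []
        for i in range(2):
            if rng.random() < EPS:
                acts.append(rng.choice(PRICES))
            else:
                acts.append(max(q[i][state[i]], key=q[i][state[i]].get))
        # Standard tabular Q-update for each agent; the opponent's
        # simultaneous learning makes each environment non-stationary.
        for i in range(2):
            r = profit(acts[i], acts[1 - i])
            s, s2 = state[i], acts[1 - i]   # next state: opponent's new price
            best_next = max(q[i][s2].values())
            q[i][s][acts[i]] += ALPHA * (r + GAMMA * best_next - q[i][s][acts[i]])
        state = [acts[1], acts[0]]
    return q
```

In this sketch, inspecting the greedy policies implied by the two learned Q-tables after training is the analog of asking whether the pair of policies has converged to a stationary (symmetric) solution or keeps cycling, which is the distinction the paper analyzes.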
[1] Michael L. Littman. Markov Games as a Framework for Multi-Agent Reinforcement Learning, 1994, ICML.
[2] Tuomas Sandholm, Robert H. Crites. On Multiagent Q-Learning in a Semi-Competitive Domain, 1995, Adaption and Learning in Multi-Agent Systems.
[3] Junling Hu, Michael P. Wellman. Multiagent Reinforcement Learning: Theoretical Framework and an Algorithm, 1998, ICML.
[4] Amy R. Greenwald, Jeffrey O. Kephart. Shopbots and Pricebots, 1999, IJCAI.