Pseudo-convergent Q-Learning by Competitive Pricebots

Jeffrey O. Kephart    kephart@us.ibm.com
Gerald J. Tesauro     tesauro@watson.ibm.com
IBM Thomas J. Watson Research Center, 30 Saw Mill River Rd., Hawthorne, NY 10532 USA

Abstract

We study novel aspects of multi-agent Q-learning in a model market in which two identical, competing "pricebots" strategically price a commodity. Two fundamentally different solutions are observed: an exact, stationary solution with zero Bellman error, consisting of symmetric policies, and a non-stationary, broken-symmetry pseudo-solution with small but non-zero Bellman error. This "pseudo-convergent" asymmetric solution has no analog in ordinary Q-learning. We calculate analytically the form of both solutions, and map out numerically the conditions under which each occurs. We suggest that this observed behavior will also be found more generally in other studies of multi-agent Q-learning, and discuss implications and directions for future research.

1. Introduction

Within the next few years, we expect electronic commerce to be an important multi-agent domain in which reinforcement learning will find numerous applications. One such application is automated dynamic pricing by software agents (Greenwald & Kephart, 1999). Suppose that each seller agent individually attempts to maximize profits through judicious setting of prices and other product parameters. Even if the seller agents do not communicate with one another directly, market forces may strongly couple their actions, resulting in a highly dynamic multi-agent system. Since decision making in markets and economies benefits greatly from one's ability to forecast economic trends and opponents' strategies, reinforcement learning is likely to be an essential component of decision making by economically motivated software agents.

Unfortunately, from a theoretical perspective, the issue of what happens when multiple interacting agents simultaneously adapt, using RL or other approaches, is largely an open question. This stands in contrast to the single-agent case: in stationary Markov Decision Problems, a solid theoretical understanding has been provided by research on algorithms such as Dynamic Programming and Q-learning (the standard update rule is recalled at the end of this section). Various theorems establish that global convergence to a unique optimal value function and policy will always be obtained. However, these theorems do not apply in the multi-agent case, because adapting agents present effectively non-stationary environments to one another.

Some progress has been made in analyzing certain special-case multi-agent problems. For example, cooperative teams of agents sharing a common goal or utility function have been studied in (Stone & Veloso, 1999), among others. The purely competitive case of zero-sum utility functions has been studied in (Littman, 1994), where an algorithm called "minimax-Q" was proposed for two-player zero-sum games and shown to converge to the optimal value functions and policies for both players. Simultaneous Q-learning by two players in the Iterated Prisoner's Dilemma game was studied empirically in (Sandholm & Crites, 1995), who found that the learning procedure frequently converged to stationary solutions. An important first step in analyzing Q-learning for arbitrary-sum two-player games was recently taken in (Hu & Wellman, 1998); this algorithm assumes that the players follow Nash equilibrium policies. Issues remaining to be addressed include the "equilibrium coordination" problem (i.e., how the agents choose from among multiple Nash equilibria) and verification that the policies implied by the learned Q-functions are consistent with the initially assumed Nash policies.

In our previous work (Tesauro & Kephart, 1999), we studied simultaneous Q-learning by two price-setting agents in three simple models, showing that in all cases Q-learning by one or both of the agents raised their profits substantially above what they would obtain using myopic best-response pricing (a schematic two-agent simulation loop in this spirit is sketched below). Simultaneous convergence of both sellers to stationary policies was found in some but not all cases, depending
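For reference, the single-agent update underlying the convergence guarantees cited above is the standard tabular Q-learning rule (standard material, not specific to this paper):

$$ Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right], $$

where $\alpha$ is the learning rate and $\gamma$ the discount parameter. The bracketed quantity is the Bellman error; the "zero Bellman error" solution described in the abstract is one in which this term vanishes at every state-action pair under both agents' policies.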
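To make the multi-agent non-stationarity concrete, here is a minimal sketch of two pricebots Q-learning simultaneously on a discrete price grid. All model details below (the price grid, the winner-take-all profit rule, the use of the opponent's last price as the state, and the parameter values) are illustrative assumptions for the sketch, not the market model analyzed in this paper.

```python
import random
from collections import defaultdict

PRICES = [i / 10 for i in range(1, 11)]   # discrete price grid (assumed)
ALPHA, GAMMA, EPS = 0.1, 0.5, 0.1         # learning parameters (assumed)

def profit(my_price, other_price):
    """Assumed winner-take-all market: the cheaper seller gets the sale."""
    if my_price < other_price:
        return my_price
    if my_price > other_price:
        return 0.0
    return my_price / 2                    # split the market on a tie

def choose(Q, state):
    """Epsilon-greedy price selection over the grid."""
    if random.random() < EPS:
        return random.choice(PRICES)
    return max(PRICES, key=lambda p: Q[(state, p)])

Q1, Q2 = defaultdict(float), defaultdict(float)
p1, p2 = random.choice(PRICES), random.choice(PRICES)

for t in range(100_000):
    # Each agent conditions on the opponent's last posted price (assumed state).
    a1, a2 = choose(Q1, p2), choose(Q2, p1)
    r1, r2 = profit(a1, a2), profit(a2, a1)
    # Simultaneous tabular Q-learning updates: each agent's "environment"
    # includes the other learner, so neither faces a stationary MDP.
    for Q, s, a, r, s_next in ((Q1, p2, a1, r1, a2), (Q2, p1, a2, r2, a1)):
        best_next = max(Q[(s_next, p)] for p in PRICES)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])
    p1, p2 = a1, a2
```

Because each agent's transition and reward structure depends on the other's evolving policy, neither update inherits the fixed-point guarantees of the single-agent rule above, and a loop of this kind may or may not settle into stationary policies.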