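Sharing: LLM-Powered Autonomous Agents

Table of Contents

Agent System Overview
Component One: Planning
1. Task Decomposition
2. Self-Reflection
Component Two: Memory
1. Types of Memory
2. Maximum Inner Product Search (MIPS)
Component Three: Tool Use
Case Studies
Challenges
Citation
References

This article is translated from Lilian Weng's blog post; thanks to her for sharing.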

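Agent System Overview

In an LLM-powered autonomous agent system, the LLM serves as the agent's brain and is complemented by several key components:

Planning
- Subgoal decomposition: the agent breaks a large task into smaller, manageable subgoals so that complex tasks can be handled efficiently
- Reflection and refinement: the agent can criticize and reflect on its past actions, learn from its mistakes, and refine future steps, which improves the quality of the final result
Memory
- Short-term memory: I treat all in-context learning (see Prompt Engineering) as the model using its short-term memory to learn
- Long-term memory: this gives the agent the ability to retain and recall information, typically through an external vector store with fast retrieval
Tool use
- The agent learns to call external APIs for information that is missing from the model weights (which are usually hard to change after pretraining), including current information, code execution capability, and access to proprietary information sources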

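Component One: Planning

A complex task usually involves many steps. The agent needs to know what those steps are and plan ahead.

Task Decomposition

Chain of thought (CoT; Wei et al., 2022) has become a standard prompting technique for improving model performance on complex tasks. The model is instructed to "think step by step", spending more computation time to decompose a hard task into smaller, simpler steps. CoT turns a big task into multiple manageable ones and makes the model's "thinking" process easier to interpret.

Tree of Thoughts (Yao et al., 2023) extends CoT by exploring multiple reasoning possibilities at each step. It first decomposes the problem into multiple thought steps and then generates multiple thoughts per step, which yields a tree structure. The tree can be searched breadth-first or depth-first, with each state evaluated by a classifier or by majority vote.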
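
The breadth-first variant of this search can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `propose` and `score` are hypothetical stand-ins for the LLM calls that generate and evaluate thoughts, and here they are plain functions over a toy puzzle (pick three numbers from 1 to 3 whose sum is 7).

```python
from typing import Callable, List, Tuple

State = Tuple[int, ...]  # a partial solution, i.e. the thoughts chosen so far


def tree_of_thoughts_bfs(
    root: State,
    propose: Callable[[State], List[State]],
    score: Callable[[State], float],
    depth: int,
    beam_width: int,
) -> State:
    """Breadth-first search that keeps the best `beam_width` states per level."""
    frontier = [root]
    for _ in range(depth):
        # Expand every state in the frontier into its candidate next thoughts.
        candidates = [child for state in frontier for child in propose(state)]
        if not candidates:
            break
        # Evaluate each candidate and keep only the most promising ones.
        candidates.sort(key=score, reverse=True)
        frontier = candidates[:beam_width]
    return max(frontier, key=score)


TARGET = 7  # hypothetical toy goal


def propose(state: State) -> List[State]:
    # Stand-in for "ask the LLM for several next thoughts".
    return [state + (n,) for n in (1, 2, 3)]


def score(state: State) -> float:
    # Stand-in for a classifier or majority vote over states.
    return -abs(TARGET - sum(state))


best = tree_of_thoughts_bfs((), propose, score, depth=3, beam_width=2)
print(best, sum(best))
```

A depth-first variant would instead follow one branch to the end and backtrack when the evaluator rejects a state.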

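Task decomposition can be done in the following ways:

1. By prompting the LLM with simple prompts such as "Steps for XYZ.\n1." or "What are the subgoals for achieving XYZ?"
2. By using task-specific instructions, e.g. "Write a story outline." for writing a novel
3. With human inputs

A quite different approach, LLM+P (Liu et al., 2023), relies on an external classical planner for long-horizon planning. It uses the Planning Domain Definition Language (PDDL) as an intermediate interface to describe the planning problem. The steps are:

1. The LLM translates the problem into the PDDL format
2. A classical planner is asked to generate a PDDL plan based on an existing "Domain PDDL"
3. The PDDL plan is translated back into natural language

Essentially, the planning step is outsourced to an external tool. The approach assumes that a domain-specific PDDL and a suitable planner are available, which is common in certain robotics setups but hard to satisfy in many other domains.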

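Self-Reflection

Self-reflection is a vital process that lets autonomous agents improve iteratively by refining past action decisions and correcting previous mistakes. It plays a crucial role in real-world tasks, where trial and error are inevitable.

ReAct (Yao et al., 2023) integrates reasoning and acting within the LLM by extending the action space to a combination of (1) a task-specific discrete action space and (2) the language space. The former lets the LLM interact with the environment (e.g. through the Wikipedia search API), while the latter prompts the LLM to generate reasoning traces in natural language.

ReAct provides an explicit prompt template for the LLM, roughly in the following format:

Thought: ...
Action: ...
Observation: ...
... (Repeated many times)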
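
The loop behind this template can be sketched as below. It is only an illustration under assumptions: `scripted_llm` is a hypothetical stand-in for a real model call, the `name[argument]` action syntax and the `finish` action are conventions chosen for this sketch, and the `lookup` tool is a toy dictionary rather than a real search API.

```python
from typing import Callable, Dict


def react_loop(
    llm: Callable[[str], str],
    tools: Dict[str, Callable[[str], str]],
    question: str,
    max_steps: int = 5,
) -> str:
    """Alternate Thought/Action/Observation until the model emits finish[...]."""
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript)  # expected: "Thought: ...\nAction: name[arg]"
        transcript += step + "\n"
        actions = [ln for ln in step.splitlines() if ln.startswith("Action:")]
        if not actions:
            break  # malformed output: there is no action to execute
        name, _, arg = actions[-1][len("Action:"):].strip().partition("[")
        arg = arg.rstrip("]")
        if name == "finish":
            return arg
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        # The observation is appended so the next model call can reason over it.
        transcript += f"Observation: {observation}\n"
    return ""


def scripted_llm(transcript: str) -> str:
    # Hypothetical model: looks something up first, then answers.
    if "Observation:" not in transcript:
        return "Thought: I should look up the capital.\nAction: lookup[France]"
    return "Thought: The observation answers the question.\nAction: finish[Paris]"


tools = {"lookup": lambda country: {"France": "Paris"}.get(country, "unknown")}
answer = react_loop(scripted_llm, tools, "What is the capital of France?")
print(answer)
```

Dropping the Thought line from each step gives the Act-only setup that ReAct is compared against.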

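Figure 2: Examples of reasoning trajectories for knowledge-intensive tasks (e.g. HotpotQA, FEVER) and decision-making tasks (e.g. AlfWorld Env, WebShop). (Image source: Yao et al., 2023)

In both the knowledge-intensive and the decision-making experiments, ReAct outperforms the Act-only baseline, in which the Thought: ... step is removed.

Reflexion (Shinn & Labash, 2023) is a framework that equips agents with dynamic memory and self-reflection capabilities to improve their reasoning skills. Reflexion has a standard reinforcement learning (RL) setup: the reward model provides a simple binary reward, and the action space follows ReAct, i.e. the task-specific action space is combined with the language space so that complex reasoning steps become possible. After each action $a_t$, the agent computes a heuristic $h_t$ and, depending on the self-reflection results, may decide to reset the environment and start a new trial.

Figure 3: Illustration of the Reflexion framework. (Image source: Shinn & Labash, 2023)

The heuristic function determines when a trajectory is inefficient or contains hallucination and should therefore be stopped. Inefficient planning refers to trajectories that take too long without succeeding. Hallucination is defined as encountering a sequence of consecutive identical actions that lead to the same observation in the environment.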
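
A heuristic of this kind could look like the sketch below. The thresholds `max_steps` and `max_repeats` and the `(action, observation)` trajectory representation are assumptions made for illustration; the text above does not specify them.

```python
from typing import List, Tuple

Step = Tuple[str, str]  # (action, observation)


def should_reset(
    trajectory: List[Step],
    max_steps: int = 30,   # assumed budget before a plan counts as inefficient
    max_repeats: int = 3,  # assumed number of identical steps that signals hallucination
) -> bool:
    """Return True when the trajectory looks inefficient or hallucinated."""
    # Inefficient planning: the trajectory has run too long without success.
    if len(trajectory) > max_steps:
        return True
    # Hallucination: the same action keeps producing the same observation.
    if len(trajectory) >= max_repeats:
        tail = trajectory[-max_repeats:]
        if all(step == tail[0] for step in tail):
            return True
    return False


print(should_reset([("open door", "the door is locked")] * 3))
print(should_reset([("look", "a room"), ("go north", "a hall")]))
```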

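Self-reflection is created by showing the LLM two-shot examples, each of which is a pair of (failed trajectory, ideal reflection for guiding future changes in the plan). The reflections are then added to the agent's working memory, up to three at a time, to be used as context for querying the LLM.

Figure 4: Experiments on AlfWorld Env and HotpotQA. In AlfWorld, hallucination is a more common failure than inefficient planning. (Image source: Shinn & Labash, 2023)

Chain of Hindsight (CoH; Liu et al., 2023) encourages the model to improve on its own outputs by explicitly presenting it with a sequence of past outputs, each annotated with feedback. The human feedback data is a collection $D_h = \{(x, y_i , r_i , z_i)\}_{i=1}^n$, where $x$ is the prompt, each $y_i$ is a model completion, $r_i$ is the human rating of $y_i$, and $z_i$ is the corresponding hindsight feedback provided by a human. Assume the feedback tuples are ranked by reward, $r_n \geq r_{n-1} \geq \dots \geq r_1$. The process is supervised fine-tuning: the input is a sequence of the form $\tau_h = (x, z_i, y_i, z_j, y_j, \dots, z_n, y_n)$ with $1 \le i \le j \le n$, and the model is fine-tuned to predict only $y_n$ conditioned on the sequence prefix, so that it learns to self-reflect on the feedback sequence and produce better output. At test time, the model can optionally go through multiple rounds of human feedback.

To avoid overfitting, CoH adds a regularization term that maximizes the log-likelihood of the pretraining dataset. To prevent shortcutting and copying (feedback sequences contain many common words), roughly 0% - 5% of the tokens are randomly masked during training.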
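
The construction of one training example and the random masking can be sketched as follows. This is an illustration under assumptions rather than the paper's code: segments are kept as plain strings instead of token ids, the `<mask>` placeholder is hypothetical, and all feedback tuples are used (the formulation above also allows a subsequence).

```python
import random
from typing import List, Tuple

Feedback = Tuple[str, float, str]  # (completion y_i, rating r_i, hindsight feedback z_i)


def build_coh_example(prompt: str, feedback: List[Feedback]) -> Tuple[List[str], str]:
    """Return (prefix segments, target), with the best-rated completion as target."""
    if not feedback:
        raise ValueError("at least one feedback tuple is required")
    ranked = sorted(feedback, key=lambda item: item[1])  # ascending reward r_1..r_n
    *earlier, best = ranked
    prefix = [prompt]
    for completion, _, hint in earlier:
        prefix += [hint, completion]  # ..., z_i, y_i, ...
    prefix.append(best[2])            # z_n precedes the target y_n
    return prefix, best[0]


def mask_tokens(tokens: List[str], rate: float, rng: random.Random,
                mask: str = "<mask>") -> List[str]:
    """Replace each token with `mask` with probability `rate`."""
    return [mask if rng.random() < rate else tok for tok in tokens]


prefix, target = build_coh_example(
    "Summarize:",
    [("good summary", 0.9, "A good summary is:"), ("bad summary", 0.1, "A bad summary is:")],
)
rng = random.Random(0)
rate = rng.uniform(0.0, 0.05)  # masking rate drawn from the 0% - 5% range
masked = mask_tokens(" ".join(prefix).split(), rate, rng)
print(prefix, "->", target)
```

Only the target segment contributes to the loss; the prefix, masked or not, serves as conditioning context.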

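The training dataset in their experiments is a combination of WebGPT comparisons, summarization from human feedback, and the human preference dataset.

Figure 5: After fine-tuning with CoH, the model can follow instructions to produce progressively better outputs. (Image source: Liu et al., 2023)

The idea of CoH is to present a history of sequentially improved outputs in context and train the model to produce an even better result. Algorithm Distillation (AD; Laskin et al., 2023) applies the same idea to cross-episode trajectories in reinforcement learning tasks, where an algorithm is encapsulated in a long history-conditioned policy. Consider an agent that interacts with the environment many times and gets a little better in each episode. AD concatenates this learning history and feeds it to the model, with the expectation that the model then produces a better result than in previous trials. The focus is on learning the process of RL rather than on training a task-specific policy.

The paper hypothesizes that any algorithm that generates a set of learning histories can be distilled into a neural network by behavioral cloning. The history data is generated by a set of source policies, each trained for a specific task. During training, each RL run samples a random task, and the resulting multi-episode trajectory is used for training, so that the learned policy is task-agnostic.

In practice, the model's context window is limited, so each episode must be short enough for a multi-episode history to fit. A context of 2-4 episodes is necessary to learn a near-optimal in-context RL algorithm, because in-context RL requires a sufficiently long context.

AD is compared with three baselines: ED (expert distillation, behavioral cloning with expert trajectories instead of learning history), the source policy (used to generate trajectories for distillation by UCB), and RL^2 (Duan et al., 2017; used as an upper bound since it is an online RL algorithm). Although AD uses only offline RL, it demonstrates in-context RL with performance close to RL^2, and it learns faster than the other baselines. When given only part of the training history, AD still improves faster than ED.

Figure 7: Comparison of AD, ED, the source policy, and RL^2 on environments that require memory and exploration. Only binary reward is assigned. The source policies are trained with A3C for the "dark" environments and with DQN for watermaze. (Image source: Laskin et al., 2023)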

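Component Two: Memory

(Big thanks to ChatGPT for helping me draft this section. I learned a lot about the human brain and data structures for fast MIPS from my conversations with ChatGPT.)

Types of Memory

Memory can be defined as the processes used to acquire, store, retain, and later retrieve information. There are several types of memory in the human brain:

Sensory memory: the earliest stage of memory, which retains impressions of sensory information (visual, auditory, etc.) after the original stimulus has ended. Sensory memory lasts only a few seconds. Its subcategories include iconic memory (visual), echoic memory (auditory), and haptic memory (touch)
Short-term memory (STM) or working memory: it stores the information we are currently aware of and need for complex cognitive tasks such as learning and reasoning. Short-term memory is believed to hold about 7 items at a time and to last for about 20-30 seconds
Long-term memory (LTM): long-term memory can store information for a remarkably long time, from a few days to decades, with an essentially unlimited capacity. There are two subtypes of LTM:
1. Explicit / declarative memory: memory of facts and events that can be consciously recalled, including episodic memory (events and experiences) and semantic memory (facts and concepts)
2. Implicit / procedural memory: this type of memory is unconscious and involves skills and routines that are performed automatically, such as riding a bike or typing on a keyboard

We can roughly consider the following mappings:

Sensory memory corresponds to learning embedding representations of raw inputs, including text, images, and other modalities
Short-term memory corresponds to in-context learning. It is short and finite because it is restricted by the context window length of the Transformer
Long-term memory corresponds to the external vector store that the agent can access at any time during a query

Maximum Inner Product Search (MIPS)

External memory can alleviate the restriction of a finite attention span. A standard practice is to save the embedding representations of information in a vector store database that supports fast maximum inner product search (MIPS). To optimize retrieval speed, the common choice is an approximate nearest neighbors (ANN) algorithm that returns the approximate top k nearest neighbors, trading a small loss in accuracy for a huge speedup.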
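
For reference, exact MIPS is just a linear scan over all stored vectors, which is the baseline that ANN algorithms approximate. The sketch below uses only the standard library; the function names and the toy vectors are illustrative assumptions.

```python
import heapq
from typing import List, Sequence, Tuple


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def mips_top_k(
    query: Sequence[float], vectors: List[Sequence[float]], k: int
) -> List[Tuple[float, int]]:
    """Exact top-k by inner product; returns (score, index) pairs, best first."""
    scored = ((inner_product(query, v), i) for i, v in enumerate(vectors))
    return heapq.nlargest(k, scored)


vectors = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, -1.0]]
top = mips_top_k([1.0, 1.0], vectors, k=2)
print(top)
```

The scan costs time proportional to the number of stored vectors for every query, which is why large stores switch to approximate indexes.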

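A few common choices of ANN algorithms for fast MIPS:

Locality-Sensitive Hashing (LSH): it introduces a hashing function that maps similar input items to the same bucket with high probability, where the number of buckets is much smaller than the number of inputs
ANNOY (Approximate Nearest Neighbors Oh Yeah): the core data structure is a set of random projection trees, i.e. binary trees in which each non-leaf node represents a hyperplane that splits the input space in half and each leaf stores one data point. The trees are built independently and at random, so to some extent they mimic a hashing function. An ANNOY search runs in all the trees, iteratively descending into the half closest to the query, and then aggregates the results. The idea is closely related to the KD tree but is much more scalable
HNSW (Hierarchical Navigable Small World): it is inspired by small-world networks, in which most nodes can be reached from any other node within a small number of steps, as in the "six degrees of separation" of social networks. HNSW builds hierarchical layers of these small-world graphs, where the bottom layer contains the actual data points and the layers in the middle create shortcuts to speed up search. A search starts from a random node in the top layer and navigates towards the target. When it cannot get any closer, it moves down to the next layer until it reaches the bottom one. Each move in the upper layers can cover a large distance in the data space, and each move in the bottom layer refines the search result
FAISS (Facebook AI Similarity Search): it operates on the assumption that distances between nodes in high-dimensional space follow a Gaussian distribution, so the data points should form clusters. FAISS applies vector quantization by partitioning the vector space into clusters and then refining the quantization within each cluster. A search first looks for cluster candidates with coarse quantization and then looks further into each cluster with finer quantization
ScaNN (Scalable Nearest Neighbors): its main innovation is anisotropic vector quantization. It quantizes a data point $x_i$ to $\tilde{x}_i$ such that the inner product $\langle q, \tilde{x}_i \rangle$ stays as close as possible to the original $\langle q, x_i \rangle$, instead of picking the closest quantization centroid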
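
To make the hashing idea from the LSH item concrete, here is a minimal sketch using sign random projections. Note the assumption: this particular hash family targets angular similarity, which is related to but not the same as inner product, and production MIPS systems apply further transformations that are omitted here. All names are illustrative.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def make_hyperplanes(dim: int, n_bits: int, seed: int = 0) -> List[List[float]]:
    """Draw `n_bits` random hyperplane normals from a standard Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def lsh_signature(vector: Sequence[float],
                  hyperplanes: List[List[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        int(sum(h * x for h, x in zip(plane, vector)) >= 0) for plane in hyperplanes
    )


def build_index(vectors: List[Sequence[float]],
                hyperplanes: List[List[float]]) -> Dict[Tuple[int, ...], List[int]]:
    """Group vector indices by signature; each signature is one bucket."""
    buckets: Dict[Tuple[int, ...], List[int]] = defaultdict(list)
    for i, v in enumerate(vectors):
        buckets[lsh_signature(v, hyperplanes)].append(i)
    return buckets


planes = make_hyperplanes(dim=3, n_bits=8)
v = [1.0, 2.0, -0.5]
scaled = [3 * x for x in v]  # same direction, so it lands in the same bucket
index = build_index([v, scaled], planes)
print(index[lsh_signature(v, planes)])
```

At query time only the vectors in the query's bucket are compared exactly, which is where the speedup comes from.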

Figure 9: Comparison of MIPS algorithms, measured in recall@10. (Image source: Google Blog, 2020)

See more MIPS algorithms and performance comparisons at ann-benchmarks.com.

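Component Three: Tool Use

Tool use is a remarkable and distinguishing characteristic of human beings. We create, modify, and utilize external objects to do things that go beyond our physical and cognitive limits. Equipping LLMs with external tools can significantly extend the model's capabilities.

Figure 10: A picture of a sea otter using a rock to crack open a seashell while floating in the water. Some other animals can use tools too, but the complexity is not comparable with humans. (Image source: Animals using tools)

MRKL (Modular Reasoning, Knowledge and Language; Karpas et al., 2022) is a neuro-symbolic architecture for autonomous agents. A MRKL system contains a collection of "expert" modules, and a general-purpose LLM works as a router that sends each query to the most suitable expert module. These modules can be neural (e.g. deep learning models) or symbolic (e.g. a math calculator, a currency converter, a weather API).

They ran an experiment on fine-tuning an LLM to call a calculator, using arithmetic as a test case. The experiment showed that verbal math problems were harder to solve than explicitly stated math problems, because the LLM (the 7B Jurassic1-large model) failed to extract the right arguments for basic arithmetic reliably. The results highlight that when external symbolic tools work reliably, knowing when and how to use them is crucial, and that is determined by the LLM's capability.

Both TALM (Tool Augmented Language Models; Parisi et al., 2022) and Toolformer (Schick et al., 2023) fine-tune an LM to learn to use external tools. The dataset is expanded based on whether a newly added API call annotation improves the quality of the model output. See the "External APIs" section of Prompt Engineering for more details.

ChatGPT Plugins and OpenAI API function calling are good examples of LLMs augmented with external tools in practice. The tool APIs can be provided by other developers (as with Plugins) or self-defined (as with function calls).

HuggingGPT (Shen et al., 2023) is a framework that uses ChatGPT as a task planner: ChatGPT analyzes the descriptions of the models available on the HuggingFace platform, selects models accordingly, and summarizes a response based on the execution results.

Figure 11: Illustration of how HuggingGPT works. (Image source: Shen et al., 2023)

The system comprises four stages: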

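Task planning: the LLM works as the brain and parses the user request into multiple tasks. Each task has four associated attributes: task type, ID, dependencies, and arguments. They use few-shot examples to guide the LLM in task parsing and planning. Instruction: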

The AI assistant can parse user input to several tasks: [{"task": task, "id", task_id, "dep": dependency_task_ids, "args": {"text": text, "image": URL, "audio": URL, "video": URL}}]. The "dep" field denotes the id of the previous task which generates a new resource that the current task relies on. A special tag "-task_id" refers to the generated text image, audio and video in the dependency task with id as task_id. The task MUST be selected from the following options: {{ Available Task List }}. There is a logical relationship between tasks, please note their order. If the user input can't be parsed, you need to reply empty JSON. Here are several cases for your reference: {{ Demonstrations }}. The chat history is recorded as {{ Chat History }}. From this chat history, you can find the path of the user-mentioned resources for your task planning.

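Model selection: the LLM distributes the tasks to expert models, with the request framed as a multiple-choice question in which the LLM picks a model from a presented list. Because of the limited context length, filtering by task type is necessary. Instruction: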

Given the user request and the call command, the AI assistant helps the user to select a suitable model from a list of models to process the user request. The AI assistant merely outputs the model id of the most appropriate model. The output must be in a strict JSON format: "id": "id", "reason": "your detail reason for the choice". We have a list of models for you to choose from {{ Candidate Models }}. Please select one model from the list.

Task execution: expert models execute the specific tasks and log the results. Instruction:

With the input and the inference results, the AI assistant needs to describe the process and results. The previous stages can be formed as - User Input: {{ User Input }}, Task Planning: {{ Tasks }}, Model Selection: {{ Model Assignment }}, Task Execution: {{ Predictions }}. You must first answer the user's request in a straightforward manner. Then describe the task process and show your analysis and model inference results to the user in the first person. If inference results contain a file path, must tell the user the complete file path.

Response generation: the LLM receives the execution results and provides a summarized result to the user
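
Between task planning and task execution, the planned tasks have to be run in an order that respects their "dep" fields. A minimal sketch of that step is shown below, assuming the planner's reply is valid JSON, that an empty "dep" list means no dependency, and that the task names are merely illustrative; resource substitution between tasks is left out. It relies on `graphlib`, which is in the standard library from Python 3.9.

```python
import json
from graphlib import TopologicalSorter
from typing import List


def order_tasks(raw_plan: str) -> List[dict]:
    """Parse the planner's JSON reply and return tasks with dependencies first."""
    tasks = json.loads(raw_plan)  # an empty reply ("[]" or "{}") yields no tasks
    by_id = {task["id"]: task for task in tasks}
    # TopologicalSorter expects a mapping of node -> set of predecessors.
    graph = {task["id"]: set(task["dep"]) for task in tasks}
    # static_order() raises graphlib.CycleError if the plan is circular.
    return [by_id[task_id] for task_id in TopologicalSorter(graph).static_order()]


raw_plan = """[
  {"task": "text-to-speech", "id": 2, "dep": [0, 1], "args": {}},
  {"task": "image-to-text", "id": 0, "dep": [], "args": {"image": "example.jpg"}},
  {"task": "translation", "id": 1, "dep": [0], "args": {}}
]"""
ordered = order_tasks(raw_plan)
print([task["task"] for task in ordered])
```

Tasks that share no dependency path could also be dispatched in parallel; the sketch keeps a single sequential order for simplicity.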

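Several challenges need to be solved before HuggingGPT can be put into real-world use:

Efficiency improvement: both the LLM inference rounds and the interactions with other models slow down the process
It relies on a very long context window to communicate over complicated task content
The stability of LLM outputs and of external model services needs to improve

API-Bank (Li et al., 2023) is a benchmark for evaluating the performance of tool-augmented LLMs. It contains 53 commonly used API tools, a complete tool-augmented LLM workflow, and 264 annotated dialogues that involve 568 API calls. The selection of APIs is quite diverse, including search engines, a calculator, calendar queries, smart home control, schedule management, health data management, account authentication workflows, and more. Because there are so many APIs, the LLM first calls an API search engine to find the right API and then uses the corresponding documentation to make the call.

Figure 12: Pseudo code of how an LLM makes an API call in API-Bank. (Image source: Li et al., 2023)

In the API-Bank workflow, the LLM needs to make a series of decisions, and at each step we can evaluate how accurate that decision is. The decisions include:

Whether an API call is needed
Identifying the right API to call: if the result is not good enough, the LLM needs to iteratively modify the API inputs (e.g. deciding the search keywords for the search engine API)
Responding based on the API results: if the results are not satisfactory, the model can choose to refine and call the API again

The benchmark evaluates the agent's tool use capabilities at three levels:

Level 1 evaluates the ability to call the API. Given an API's description, the model needs to determine whether to call the API, call it correctly, and respond properly to what the API returns
Level 2 examines the ability to retrieve the API. The model needs to search for APIs that may solve the user's requirement and learn how to use them by reading the documentation
Level 3 assesses the ability to plan beyond retrieving and calling. Given an unclear user request (e.g. scheduling meetings, or booking flights, hotels, and restaurants for a trip), the model may have to make multiple API calls to solve it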

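Case Studies

Scientific Discovery Agent

ChemCrow (Bran et al., 2023) is a domain-specific agent in which the LLM is augmented with 13 expert-designed tools to accomplish tasks across organic synthesis, drug discovery, and materials design. The workflow, implemented in LangChain, reflects what was described above for ReAct and MRKL and combines CoT reasoning with tools relevant to the tasks:

The LLM is given detailed descriptions of the expert tools' inputs and outputs, including tool names and descriptions of their utility
It is then instructed to answer the user's prompt using the provided tools. The instruction tells the model to follow the ReAct format: Thought, Action, Action Input, Observation

One interesting observation is that, while the LLM-based evaluation concluded that GPT-4 and ChemCrow perform nearly equivalently, human expert evaluations of the completeness and chemical correctness of the solutions showed that ChemCrow outperforms GPT-4 by a large margin. This points to a potential problem with using an LLM to evaluate its own performance in domains that require deep expertise: lacking that expertise, the LLM may not know its own flaws and thus cannot judge the correctness of task results well.

Boiko et al. (2023) also looked into LLM agents for scientific discovery, covering the autonomous design, planning, and execution of complex scientific experiments. The agent can use tools to browse the Internet, read documentation, execute code, call robotics experimentation APIs, and leverage other LLMs.

For example, when asked to develop a novel anticancer drug, the model came up with the following steps:

1. Inquired about current trends in anticancer drug discovery
2. Selected a target
3. Requested a scaffold targeting these compounds
4. Once the compound was identified, attempted its synthesis

They also discussed the risks, especially with illicit drugs and bioweapons. They built a test set containing a list of known chemical weapon agents and asked the agent to synthesize them. 4 out of 11 requests (36%) were accepted and produced a synthesis solution, and the agent attempted to consult documentation to execute the procedure. 7 out of 11 were rejected; among these 7 rejected cases, 5 were rejected after a web search, while 2 were rejected based on the prompt alone.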

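Generative Agents Simulation

Generative Agents (GA; Park et al., 2023) is a fun experiment in which 25 virtual characters, each controlled by an LLM-powered agent, live and interact in a sandbox environment inspired by The Sims. Generative agents create believable simulacra of human behavior for interactive applications.

The design of generative agents combines the LLM with memory, planning, and reflection mechanisms, so that agents can behave conditioned on past experience and interact with other agents.

Memory stream: a long-term memory module (external database) that records a comprehensive list of the agent's experiences in natural language
- Each element is an observation, an event directly provided by the agent. Inter-agent communication can trigger new natural language statements
Retrieval model: surfaces the context that informs the agent's behavior, according to relevance, recency, and importance
- Recency: recent events have higher scores
- Importance: distinguishes mundane from core memories, obtained by asking the LM directly
- Relevance: based on how related the memory is to the current situation or query
Reflection mechanism: synthesizes memories into higher-level inferences over time and guides the agent's future behavior. These are higher-level summaries of past events (<- note that this is a bit different from the self-reflection above)
- Prompt the LM with the 100 most recent observations to generate the 3 most salient high-level questions given a set of observations/statements, then ask the LM to answer those questions
Planning and reacting: translate the reflections and the environment information into actions
- Planning essentially aims to optimize believability at the moment vs. over time
- Prompt template: {Intro of an agent X}. Here is X's plan today in broad strokes: 1)
- Relationships between agents and observations of one agent by another are all taken into consideration for planning and reacting
- Environment information is presented in a tree structure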
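
The retrieval model described in the list above can be sketched as a scoring function over memories. The specifics here are assumptions made for illustration: an equal-weight sum of the three signals, exponential decay for recency with a made-up `decay` constant, cosine similarity for relevance, and an importance value in [0, 1] that would in practice come from asking the LM.

```python
import math
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Memory:
    text: str
    created_at: float           # timestamp in hours (assumed unit)
    importance: float           # 0..1, e.g. obtained by asking the LM to rate it
    embedding: Sequence[float]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def retrieval_score(memory: Memory, query: Sequence[float], now: float,
                    decay: float = 0.99) -> float:
    """Sum of recency, importance, and relevance (equal weights assumed)."""
    recency = decay ** (now - memory.created_at)  # newer memories score higher
    relevance = cosine(memory.embedding, query)
    return recency + memory.importance + relevance


def retrieve(memories: List[Memory], query: Sequence[float], now: float,
             k: int) -> List[Memory]:
    ranked = sorted(memories, key=lambda m: retrieval_score(m, query, now),
                    reverse=True)
    return ranked[:k]


memories = [
    Memory("talked with a neighbor about the party", 9.0, 0.2, [1.0, 0.0]),
    Memory("graduated from college", 0.0, 0.9, [0.0, 1.0]),
]
top = retrieve(memories, query=[1.0, 0.0], now=10.0, k=1)
print(top[0].text)
```

The retrieved memories would then be placed into the prompt as context for the agent's next action.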

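This fun simulation results in emergent social behavior, such as information diffusion, relationship memory (e.g. two agents continuing a conversation topic), and coordination of social events (e.g. hosting a party and inviting others).

Proof-of-Concept Examples

AutoGPT has drawn a lot of attention to the possibility of setting up autonomous agents with an LLM as the main controller. It has quite a lot of reliability issues given its natural language interface, but it is nevertheless a cool proof-of-concept demo. A lot of the code in AutoGPT is about format parsing.

Here is the system message used by AutoGPT, where {{...}} are user inputs: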

You are {{ai-name}}, {{user-provided AI bot description}}.
Your decisions must always be made independently without seeking user assistance. Play to your strengths as an LLM and pursue simple strategies with no legal complications.

GOALS:

1. {{user-provided goal 1}}
2. {{user-provided goal 2}}
3. ...
4. ...
5. ...

Constraints:
1. ~4000 word limit for short term memory. Your short term memory is short, so immediately save important information to files.
2. If you are unsure how you previously did something or want to recall past events, thinking about similar events will help you remember.
3. No user assistance
4. Exclusively use the commands listed in double quotes e.g. "command name"
5. Use subprocesses for commands that will not terminate within a few minutes

Commands:
1. Google Search: "google", args: "input": "<search>"
2. Browse Website: "browse_website", args: "url": "<url>", "question": "<what_you_want_to_find_on_website>"
3. Start GPT Agent: "start_agent", args: "name": "<name>", "task": "<short_task_desc>", "prompt": "<prompt>"
4. Message GPT Agent: "message_agent", args: "key": "<key>", "message": "<message>"
5. List GPT Agents: "list_agents", args:
6. Delete GPT Agent: "delete_agent", args: "key": "<key>"
7. Clone Repository: "clone_repository", args: "repository_url": "<url>", "clone_path": "<directory>"
8. Write to file: "write_to_file", args: "file": "<file>", "text": "<text>"
9. Read file: "read_file", args: "file": "<file>"
10. Append to file: "append_to_file", args: "file": "<file>", "text": "<text>"
11. Delete file: "delete_file", args: "file": "<file>"
12. Search Files: "search_files", args: "directory": "<directory>"
13. Analyze Code: "analyze_code", args: "code": "<full_code_string>"
14. Get Improved Code: "improve_code", args: "suggestions": "<list_of_suggestions>", "code": "<full_code_string>"
15. Write Tests: "write_tests", args: "code": "<full_code_string>", "focus": "<list_of_focus_areas>"
16. Execute Python File: "execute_python_file", args: "file": "<file>"
17. Generate Image: "generate_image", args: "prompt": "<prompt>"
18. Send Tweet: "send_tweet", args: "text": "<text>"
19. Do Nothing: "do_nothing", args:
20. Task Complete (Shutdown): "task_complete", args: "reason": "<reason>"

Resources:
1. Internet access for searches and information gathering.
2. Long Term memory management.
3. GPT-3.5 powered Agents for delegation of simple tasks.
4. File output.

Performance Evaluation:
1. Continuously review and analyze your actions to ensure you are performing to the best of your abilities.
2. Constructively self-criticize your big-picture behavior constantly.
3. Reflect on past decisions and strategies to refine your approach.
4. Every command has a cost, so be smart and efficient. Aim to complete tasks in the least number of steps.

You should only respond in JSON format as described below
Response Format:
{
    "thoughts": {
        "text": "thought",
        "reasoning": "reasoning",
        "plan": "- short bulleted\n- list that conveys\n- long-term plan",
        "criticism": "constructive self-criticism",
        "speak": "thoughts summary to say to user"
    },
    "command": {
        "name": "command name",
        "args": {
            "arg name": "value"
        }
    }
}
Ensure the response can be parsed by Python json.loads
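
Since the reply has to survive `json.loads`, the receiving side typically validates it before executing anything. The sketch below illustrates that kind of format parsing; it is not AutoGPT's actual code, and `ALLOWED_COMMANDS` is only a subset of the command names listed in the prompt above.

```python
import json
from typing import Dict, Set, Tuple

# A subset of the command names from the prompt above, for illustration only.
ALLOWED_COMMANDS: Set[str] = {
    "google", "browse_website", "write_to_file", "read_file",
    "do_nothing", "task_complete",
}


def parse_response(raw: str, allowed: Set[str] = ALLOWED_COMMANDS) -> Tuple[str, Dict]:
    """Validate a model reply and return (command name, args)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"response is not valid JSON: {exc}") from exc
    command = data.get("command") if isinstance(data, dict) else None
    if not isinstance(command, dict):
        raise ValueError("missing 'command' object")
    name = command.get("name")
    args = command.get("args", {})
    if not isinstance(name, str) or name not in allowed:
        raise ValueError(f"unknown command: {name!r}")
    if not isinstance(args, dict):
        raise ValueError("'args' must be an object")
    return name, args


reply = '{"thoughts": {"text": "search first"}, "command": {"name": "google", "args": {"input": "LLM agents"}}}'
name, args = parse_response(reply)
print(name, args)
```

A caller would usually catch the `ValueError` and send the error text back to the model so that it can retry with a corrected reply.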

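GPT-Engineer is another project that creates a whole repository of code from a task specified in natural language. GPT-Engineer is instructed to think over a list of smaller components to build and to ask the user for clarification where needed.

Here is a sample conversation for task clarification that GPT-Engineer sends to the OpenAI ChatCompletion endpoint. The user inputs are wrapped in {{user input text}}.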

[
  {
    "role": "system",
    "content": "You will read instructions and not carry them out, only seek to clarify them.\nSpecifically you will first summarise a list of super short bullets of areas that need clarification.\nThen you will pick one clarifying question, and wait for an answer from the user.\n"
  },
  {
    "role": "user",
    "content": "We are writing {{a Super Mario game in python. MVC components split in separate files. Keyboard control.}}\n"
  },
  {
    "role": "assistant",
    "content": "Summary of areas that need clarification:\n1. Specifics of the Super Mario game (e.g. level design, characters, gameplay mechanics)\n2. Details about the MVC components (e.g. which components are in each file)\n3. Keyboard control implementation (e.g. which keys to use, how to handle input)\n\nClarifying question:\nCan you provide more details about the Super Mario game, such as level design, characters, and gameplay mechanics?"
  },
  {
    "role": "user",
    "content": "{{There are 10 levels in total. The main character is a plumber named Mario, who can walk and jump. It is a classical platform game just like Super Mario. The main character moves from left to right, trying to get to the destination, where there are many obstacles and attacks from enemies in the process.}}\n\nIs anything else unclear? If yes, only answer in the form:\n{remaining unclear areas} remaining questions.\n{Next question}\nIf everything is sufficiently clear, only answer \"Nothing more to clarify.\"."
  },
  {
    "role": "assistant",
    "content": "Remaining unclear areas: 2 remaining questions.\nCan you provide more information about how the MVC components are split into separate files?"
  },
  {
    "role": "user",
    "content": "{{Make your own assumptions and state them explicitly before starting}}"
  }
]

After these clarifications, the agent switches to code writing mode with a different system message. The system message is:

You will get instructions for code to write. You will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code. Make sure that every detail of the architecture is, in the end, implemented as code.
Think step by step and reason yourself to the right decisions to make sure we get it right. You will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.

Then you will output the content of each file including ALL code. Each file must strictly follow a markdown code block format, where the following tokens must be replaced such that FILENAME is the lowercase file name including the file extension, LANG is the markup code block language for the code’s language, and CODE is the code:

FILENAME
```LANG
CODE
```
You will start with the “entrypoint” file, then go to the ones that are imported by that file, and so on. Please note that the code should be fully functional. No placeholders.

Follow a language and framework appropriate best practice file naming convention. Make sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other. Ensure to implement all code, if you are unsure, write a plausible implementation. Include module dependency or package manager dependency definition file. Before you finish, double check that all parts of the architecture is present in the files.

Useful to know: You almost always put different classes in different files. For Python, you always create an appropriate requirements.txt file. For NodeJS, you always create an appropriate package.json file. You always add a comment briefly describing the purpose of the function definition. You try to add comments explaining very complex bits of logic. You always follow the best practices for the requested languages in terms of describing the code written as a defined package/project.

Python toolbelt preferences:
- pytest
- dataclasses

Conversation sample:

[
  {
    "role": "system",
    "content": "You will get instructions for code to write.\nYou will write a very long answer. Make sure that every detail of the architecture is, in the end, implemented as code.\nMake sure that every detail of the architecture is, in the end, implemented as code.\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nYou will first lay out the names of the core classes, functions, methods that will be necessary, as well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nPlease note that the code should be fully functional. No placeholders.\n\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. Make sure that code in different files are compatible with each other.\nEnsure to implement all code, if you are unsure, write a plausible implementation.\nInclude module dependency or package manager dependency definition file.\nBefore you finish, double check that all parts of the architecture is present in the files.\n\nUseful to know:\nYou almost always put different classes in different files.\nFor Python, you always create an appropriate requirements.txt file.\nFor NodeJS, you always create an appropriate package.json file.\nYou always add a comment briefly describing the purpose of the function definition.\nYou try to add comments explaining very complex bits of logic.\nYou always follow the best practices for the requested languages in terms of describing the code written as a defined\npackage/project.\n\n\nPython toolbelt preferences:\n- pytest\n- dataclasses\n"
  },
 #  … same conversation as earlier, ended with "Make your own assumptions and state them explicitly before starting".
  {
    "role": "assistant",
    "content": "Assumptions:\n1. Model: The model will contain the game's data, such as level information, character states, and enemy positions.\n2. View: The view will handle the game's visuals, including rendering the game objects, backgrounds, and updating the display.\n3. Controller: The controller will manage user input, such as keyboard controls, and update the model accordingly.\n\nWith these assumptions in mind, is there any specific aspect of the keyboard control implementation you would like me to clarify?"
  },
  {
    "role": "user",
    "content": "Please now remember the steps:\n\nThink step by step and reason yourself to the right decisions to make sure we get it right.\nFirst lay out the names of the core classes, functions, methods that will be necessary, As well as a quick comment on their purpose.\n\nThen you will output the content of each file including ALL code.\nEach file must strictly follow a markdown code block format, where the following tokens must be replaced such that\nFILENAME is the lowercase file name including the file extension,\nLANG is the markup code block language for the code's language, and CODE is the code:\n\nFILENAME\n```LANG\nCODE\n```\n\nPlease note that the code should be fully functional. No placeholders.\n\nYou will start with the \"entrypoint\" file, then go to the ones that are imported by that file, and so on.\nFollow a language and framework appropriate best practice file naming convention.\nMake sure that files contain all imports, types etc. The code should be fully functional. Make sure that code in different files are compatible with each other.\nBefore you finish, double check that all parts of the architecture is present in the files.\n"
  }
]

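Challenges

After going through the key ideas and demos of building LLM-centered agents, I see a few common limitations:

Finite context length: the restricted context capacity limits the inclusion of historical information, detailed instructions, API call context, and responses. The system design has to work within this limited communication bandwidth, whereas mechanisms such as self-reflection, which learn from past mistakes, would benefit a lot from longer or infinite context windows. Although vector stores and retrieval provide access to a larger knowledge pool, their representation power is not as strong as full attention
Challenges in long-term planning and task decomposition: planning over a lengthy history and effectively exploring the solution space remain challenging. LLMs struggle to adjust plans when faced with unexpected errors, which makes them far less robust than humans, who learn from trial and error
Reliability of the natural language interface: current agent systems rely on natural language as the interface between LLMs and external components such as memory and tools. However, the reliability of model outputs is questionable, as LLMs may make formatting errors and occasionally exhibit rebellious behavior (e.g. refusing to follow an instruction). Consequently, much of the agent demo code focuses on parsing model output

Citation

Cited as:

Weng, Lilian. (Jun 2023). "LLM-powered Autonomous Agents". Lil’Log. https://lilianweng.github.io/posts/2023-06-23-agent/.

Or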

@article{weng2023prompt,
  title   = "LLM-powered Autonomous Agents",
  author  = "Weng, Lilian",
  journal = "lilianweng.github.io",
  year    = "2023",
  month   = "Jun",
  url     = "https://lilianweng.github.io/posts/2023-06-23-agent/"
}

References

[1] Wei et al. "Chain of thought prompting elicits reasoning in large language models." NeurIPS 2022.

[2] Yao et al. "Tree of Thoughts: Deliberate Problem Solving with Large Language Models." arXiv preprint arXiv:2305.10601 (2023).

[3] Liu et al. "Chain of Hindsight Aligns Language Models with Feedback." arXiv preprint arXiv:2302.02676 (2023).

[4] Liu et al. "LLM+P: Empowering Large Language Models with Optimal Planning Proficiency." arXiv preprint arXiv:2304.11477 (2023).

[5] Yao et al. "ReAct: Synergizing reasoning and acting in language models." ICLR 2023.

[6] Google Blog. "Announcing ScaNN: Efficient Vector Similarity Search." July 28, 2020.

[7] https://chat.openai.com/share/46ff149e-a4c7-4dd7-a800-fc4a642ea389

[8] Shinn & Labash. "Reflexion: an autonomous agent with dynamic memory and self-reflection." arXiv preprint arXiv:2303.11366 (2023).

[9] Laskin et al. "In-context Reinforcement Learning with Algorithm Distillation." ICLR 2023.

[10] Karpas et al. "MRKL Systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning." arXiv preprint arXiv:2205.00445 (2022).

[11] Weaviate Blog. "Why is Vector Search so fast?" Sep 13, 2022.

[12] Li et al. "API-Bank: A Benchmark for Tool-Augmented LLMs." arXiv preprint arXiv:2304.08244 (2023).

[13] Shen et al. "HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace." arXiv preprint arXiv:2303.17580 (2023).

[14] Bran et al. "ChemCrow: Augmenting large-language models with chemistry tools." arXiv preprint arXiv:2304.05376 (2023).

[15] Boiko et al. "Emergent autonomous scientific research capabilities of large language models." arXiv preprint arXiv:2304.05332 (2023).

[16] Joon Sung Park, et al. "Generative Agents: Interactive Simulacra of Human Behavior." arXiv preprint arXiv:2304.03442 (2023).

[17] AutoGPT. https://github.com/Significant-Gravitas/Auto-GPT

[18] GPT-Engineer. https://github.com/AntonOsika/gpt-engineer
