Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence

A better understanding of the legal analysis abilities of Large Language Models (LLMs) can contribute to improving the efficiency of legal services, governing artificial intelligence, and leveraging LLMs to identify inconsistencies in law. This paper explores LLM capabilities in applying tax law. We choose this area of law because its structure allows us to set up automated validation pipelines across thousands of examples, it requires logical reasoning and maths skills, and it enables us to test LLM capabilities in a manner relevant to the real-world economic lives of citizens and companies. Our experiments demonstrate emerging legal understanding capabilities, with improved performance in each subsequent OpenAI model release. We experiment with retrieving and utilising the relevant legal authority to assess the impact of providing additional legal context to LLMs. Few-shot prompting, in which examples of question-answer pairs are presented alongside the target question, is also found to significantly enhance the performance of the most advanced model, GPT-4. The findings indicate that LLMs, particularly when combined with prompting enhancements and the correct legal texts, can perform at high levels of accuracy, though not yet at the level of an expert tax lawyer. As LLMs continue to advance, their ability to reason about law autonomously could have significant implications for the legal profession and AI governance.
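The combined setup described above — retrieved legal authority plus few-shot question-answer exemplars — can be sketched as a simple prompt-assembly step. This is a minimal illustrative sketch, assuming a plain-text prompt layout; the function name, section labels, and ordering are hypothetical and not the paper's exact format.

```python
def build_prompt(statute_text, examples, question):
    """Assemble a few-shot prompt for a tax-law question.

    statute_text: retrieved legal authority (e.g. a Code section) to ground the answer.
    examples: list of (question, answer) pairs serving as few-shot exemplars.
    question: the target question, left open for the model to complete.
    """
    parts = ["Relevant legal authority:\n" + statute_text.strip()]
    for q, a in examples:
        parts.append(f"Q: {q}\nA: {a}")
    # Leave the final answer slot empty for the model to fill in.
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)


prompt = build_prompt(
    "Sec. X. Gross income means all income from whatever source derived.",
    [("Is a salary part of gross income?", "Yes.")],
    "Are lottery winnings part of gross income?",
)
```

The resulting string would then be sent to the model of choice; the paper's pipeline additionally validates the returned answers automatically across thousands of such examples.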
