The Future of AI: Beyond Completion Models to Systematic Innovation

An abstract summary of architecture patterns for using LLMs on the business side

Sungho Park

April 21, 2024
Transcript

  1. 2024. 01. 01 WRTN. All Rights Reserved. The Future of AI:

    Beyond Completion Models to Systematic Innovation. Agenda: 01 Intro, 02 Previous works, 03 Systematic solutions, 04 Outro
  2. Self-introduction - Sungho Park - ML Engineer @ wrtn technologies

    - Main work - LLM-based Autonomous Agent Design and Evaluation - Python Backend Service Engineering - Interests - Open-source Contribution - Model Serving - Embodied Agents
  3. - I'd like to share my thoughts on structuring productizable AI

    from a system-architecture perspective, and I hope many related discussions follow in the Korean community. - I feel my skills fall short of this audience, but I'd like to share the short insights I've collected. - Even I don't have a clear answer on this topic, but I've organized the facts worth noting from academic and community research so far. - Let's step away from the hectic recent trends for a moment and start with a brief recap from the beginning. About Presentation
  4. Background history Evolution of Language Models - NLU (2011 ~): Word2Vec

    - NLG (2015 ~): Seq2Seq, Transformers - PLM (2018 ~): BERT - LLM (2020 ~): GPT-3 - Mega Scale (~ 2021): GPT, PaLM - Token Efficiency (2022 ~): Chinchilla - RL and Instruction Tuning (2022 ~): InstructGPT, Alpaca - Better Reasoning (2022 ~): CoT, ReAct, Tree of Thoughts - Lots of Training Methods (2023 ~)... See Kihyun Kim's talk at 2023 LangCon
  5. Background History - Before higher-performance LLMs (~ 2022) -

    AI used as RecSys (shopping), classification (finance), OCR, ... - AI: a small component of the system, in a secondary role - After higher-performance LLMs (2023 ~) - AI: the service domain itself, rich platform features, enhanced RecSys and classification - AI: a core component of the system.
  6. - Before higher-performance LLMs (~ 2022) - AI used

    as RecSys (shopping), classification (finance), OCR, ... - AI: a small component of the system, in a secondary role - After higher-performance LLMs (2023 ~) - AI: the service domain itself, rich platform features, enhanced RecSys and classification - AI: a core component of the system. - Three perspectives of AI for business success - Performance and efficiency of the model: - High barrier: resources, researchers, engineers, cost, ... - Serving the model - High barrier: resources, researchers, engineers, cost, ... - Utilizing the model: this presentation's target - Low barrier: just junior engineers and API cost. LLMs guarantee the lower bound of service performance. Background History
  7. Perspective of LLMs in the service - Completion tool,

    black-box system: - Core component of the chat interface, query classification, ... - Focus on broad capabilities like fluency and domain-knowledge understanding - Service quality depends on the performance of the LLM - Cooperative system component - Core component of the chat interface, query classification, ... - Focus on specific capabilities like decision-making and code generation - Service quality depends on the system architecture and the performance of the LLM. Current practice of LLM usage: a mixture of both.
  8. Discussion in the AI Community Current practice of LLM usage: a mixture

    of both - LLM as a completion method - AI community: tries to make better LLMs. - 👍 Decreasing cost of LLM APIs: OpenAI, Claude, Together.ai, ... - 👍 Better LLM APIs: OpenAI, Claude, Together.ai, ... - 👍 Better public LLMs: Korean continually-trained Llama2, Yi, Qwen, ... - LLM as a cooperative system component - AI community: tries to make a better system architecture for LLM usage. - 👍 Modularized AI constructs for easy creation: Langchain, LlamaIndex, ... - 👍 Well-crafted research and projects: TaskWeaver, AutoGPT, ... - ❓ Building a productizable AI architecture for the business field: the enterprise's role - Easy to serve and maintain, less cost/latency, multi-modal I/O - Langchain and LlamaIndex are libraries; they don't provide an architecture - Maybe agent frameworks: TaskWeaver, AutoGPT?.. ➡ Hard to use in various scenarios. - Hidden practice
  9. - This includes many points that may seem obvious, given how much

    know-how the industry has accumulated. - I'd like to share several insights while recognizing that the goals companies pursue differ in nature from research (especially the hard work of open-source contributors). - Still, I've organized these thoughts hoping they become a starting point for diverse discussions about AI system architecture in the Korean field. About Presentation
  10. Terminology - Planning: generating a sequence of actions - e.g., 1.

    Eat, 2. Run, 3. Sleep - Reasoning: inferring about a problem - e.g., A is B, B is C, so A is C. - LLM performance: fluency, domain-knowledge understanding, mathematical and logical thinking, ... every ability of the LLM.
  11. ADaPT: As-Needed Decomposition and Planning with Language Models ADaPT (https://arxiv.org/abs/2311.05772)

    - Plan: a sequence of actions - Decomposition: action candidates for each step of the sequence - ADaPT: planning and decomposition with iterative error handling - e.g., in the left figure: Step 1b ➡ Step 2 ➡ Step 3 - Maybe the most natural way to integrate a planner architecture - Tries iterative fail/success scenarios
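ADaPT's core idea — execute directly, and decompose only when execution fails — can be sketched in a few lines. Everything below is a toy stand-in (the real method prompts an LLM for both the executor and the planner; `toy_executor` and `toy_planner` are hypothetical):

```python
# Minimal sketch of ADaPT-style as-needed decomposition: try to execute a
# task; on failure, ask a planner to split it into subtasks and recurse.

def adapt(task, executor, planner, max_depth=3):
    """Try to execute `task`; on failure, decompose and recurse."""
    if executor(task):          # base case: the executor succeeds directly
        return True
    if max_depth == 0:          # give up instead of decomposing forever
        return False
    subtasks = planner(task)    # ask the planner to split the task
    # The composed task succeeds only if every subtask does.
    return all(adapt(sub, executor, planner, max_depth - 1) for sub in subtasks)

# Toy executor: can only handle "atomic" steps.
def toy_executor(task):
    return task.startswith("atomic")

# Toy planner: splits any task into two atomic steps.
def toy_planner(task):
    return [f"atomic:{task}/1", f"atomic:{task}/2"]

print(adapt("make dinner", toy_executor, toy_planner))  # → True
```

The recursion depth bound plays the role of the paper's decomposition budget: a task that still fails at depth 0 is reported as a failure rather than decomposed again.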
  12. Better than ReAct, Reflexion, and no-planning baselines RestGPT: Connecting Large

    Language Models with Real-World RESTful APIs RestGPT evaluation (http://arxiv.org/abs/2306.06624)
  13. VOYAGER: An Open-Ended Embodied Agent with Large Language Models Voyager (http://arxiv.org/abs/2305.16291)

    An LLM-powered embodied lifelong learning agent in Minecraft that - continuously explores the world - acquires diverse skills - makes novel discoveries without human intervention - keeps an ever-growing skill library of executable code for storing and retrieving complex behaviors - uses a new iterative prompting mechanism that incorporates environment feedback, execution errors, and self-verification for program improvement. - The environment (mineflayer-collectblock) is key!
  14. Reasoning with Language Model is Planning with World Model LLM

    Reasoner (http://arxiv.org/abs/2305.14992) Limitations of LLMs - hard to predict the world state - hard to simulate long-term outcomes of actions - hard to iteratively refine existing reasoning steps. Use Monte Carlo Tree Search! - Open-loop planning as in RL (discrete case) - LLM as the action/state generator and the reward function
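The "LLM as action/state generator and reward function" idea plugs straight into a standard MCTS loop. Below is a compressed sketch under that assumption; `propose`, `transition`, and `reward` are toy stand-ins for the three LLM roles, exercised on a trivial string-building problem rather than real reasoning:

```python
import math, random

# Minimal sketch of RAP-style planning: MCTS where an "LLM" proposes actions,
# predicts next states, and scores rewards. All three callables are toy
# stand-ins for LLM calls (the paper prompts one model for each role).

class Node:
    def __init__(self, state, parent=None):
        self.state, self.parent = state, parent
        self.children, self.visits, self.value = [], 0, 0.0

def ucb(node, c=1.4):
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts(root_state, propose, transition, reward, iters=200, depth=3):
    root = Node(root_state)
    for _ in range(iters):
        node = root
        # Selection: descend by UCB until a leaf.
        while node.children:
            node = max(node.children, key=ucb)
        # Expansion: the "LLM" proposes candidate actions from this state.
        if node.visits > 0 and len(node.state) < depth:
            for a in propose(node.state):
                node.children.append(Node(transition(node.state, a), node))
            if node.children:
                node = random.choice(node.children)
        # Evaluation + backpropagation: the "LLM" scores the reached state.
        r = reward(node.state)
        while node:
            node.visits += 1
            node.value += r
            node = node.parent
    best = max(root.children, key=lambda n: n.visits)  # most-visited first step
    return best.state

# Toy problem: build a string; the reward counts 'a' characters, so the
# search should commit to "a" as the first action.
random.seed(0)
plan = mcts("", lambda s: ["a", "b"], lambda s, a: s + a,
            lambda s: s.count("a"))
print(plan)  # → a
```

Swapping the three toy callables for prompted model calls gives the open-loop, discrete-action setup the slide describes.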
  15. - Inject information via prompts. - CoT, ReAct - ADaPT

    - RestGPT - Handling the response as a system component - Expect a response type, like type checking in a programming language - e.g., the system expects a valid JSON response from the LLM, like {"thought": "something..", "action": "something..."} - Trust the LLM's capability and performance - e.g., the LLM can produce valid JSON under regex constraints. - Utilize APIs or Python functions as tools - e.g., the LLM can call the search function. - Papers and open-source projects provide lots of ideas! - Thanks to the researchers and contributors. - (Optional) The code-generation task is a core component. Common Points of Previous Works
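The "expected response type" point above can be made concrete: the system declares a schema and rejects any reply that does not match, exactly like type checking. A minimal sketch (the schema and helper names are hypothetical, not from any specific framework):

```python
import json

# Handling the LLM response as a system component: declare an expected JSON
# shape and validate every reply against it before the rest of the system
# touches it.

EXPECTED_KEYS = {"thought": str, "action": str}

def parse_llm_reply(raw: str) -> dict:
    """Parse an LLM reply and enforce the expected JSON shape."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"not valid JSON: {e}")
    for key, typ in EXPECTED_KEYS.items():
        if not isinstance(data.get(key), typ):
            raise ValueError(f"missing or mistyped field: {key!r}")
    return data

reply = '{"thought": "search first", "action": "web_search"}'
print(parse_llm_reply(reply)["action"])  # → web_search
```

On a `ValueError` the caller can re-prompt the model, which is the iterative error handling the previous slides describe; constrained decoding (e.g., regex constraints) reduces how often that retry path is needed.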
  16. Limitations of Previous Works Let's think about a developer

    sitting in front of a notebook. His goal is to build an impressive chat system, but he finds the limitations of previous works. - Hard to generalize - e.g., in a common conversation service, MCTS isn't easy to adapt. - e.g., much research targets benchmarks to prove the ability of the proposed method. - Evaluation metrics (accuracy) ≠ service metrics (MOU, retention) - Evaluating with benchmarks is necessary, - but system metrics and real effectiveness are different. - Too complicated to maintain - Hard coupling with old modules. - Hard to find a best-practice solution for utilizing the LLM
  17. Difference between the two sections - Breaking the Completion Model Barrier:

    focus on the performance - Systematic Solution: focus on the system to overcome the limitations
  18. - Focus on the system to overcome the limitations -

    Objective definition: set a target use case and overcome or solve it - Use LLMs as components - Systematic features must be considered, e.g., JSON mode, latency, cost, ... Systematic Solution
  19. AutoGPT - Concerns 5 topics - Goals - Constraints

    - Commands (APIs, Python functions) - Resources - Performance evaluation - Generates - text, reasoning, plan, criticism, speak, command - in JSON format - Executions and generations are saved to memory - Iterative running AutoGPT (https://github.com/Significant-Gravitas/AutoGPT), Decoding AutoGPT Understanding
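The loop on this slide — model emits a JSON blob, one command runs, and both are appended to memory for the next prompt — can be sketched as follows. `fake_llm`, `COMMANDS`, and the message shapes are hypothetical stand-ins, not AutoGPT's actual code:

```python
import json

# Hypothetical sketch of an AutoGPT-style loop: each step the "model" returns
# JSON (thoughts + one command), the command executes, and the reply plus its
# result are saved to memory for the next iteration.

COMMANDS = {"echo": lambda arg: f"echoed: {arg}"}  # Commands (APIs / functions)

def fake_llm(memory):
    # A real agent would send goals, constraints, and memory to an LLM here.
    if len(memory) == 0:
        return json.dumps({"thoughts": "try the tool",
                           "command": {"name": "echo", "args": "hello"}})
    return json.dumps({"thoughts": "done",
                       "command": {"name": "finish", "args": ""}})

def run_agent(max_steps=5):
    memory = []                              # executions + generations live here
    for _ in range(max_steps):               # iterative running
        reply = json.loads(fake_llm(memory))
        cmd = reply["command"]
        if cmd["name"] == "finish":
            break
        result = COMMANDS[cmd["name"]](cmd["args"])
        memory.append({"reply": reply, "result": result})
    return memory

print(run_agent()[0]["result"])  # → echoed: hello
```

The `max_steps` cap is the only brake on the loop, which previews the cost/latency criticism on the next slides: without it, the agent iterates until the model decides to `finish`.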
  20. AutoGPT - ✅ Focus on the system to overcome the

    limitations - Good. AutoGPT can run for a few weeks to solve a problem, which was an amazing experience at the time (2023.03). - ✅ Objective definition: set a target use case and overcome or solve it - Good. - ✅ Use LLMs as components - Yes. Typical practice of using LLMs as components in the system.
  21. AutoGPT - ✅ Good for making your own agent - Yes.

    AutoGPT's Forge supports this philosophy. - ❌ Handling AutoGPT's performance - Hard to control answer quality - Iterative reasoning can lead to wrong answers. - ❌ Good for general-purpose use like ChatGPT - Can't handle command errors. - Too expensive. - Too slow. - Hard to change AutoGPT's behavior.
  22. BabyAGI 1. Define the task and make a solution a. solution: can

    be plain text or a tool execution 2. Store the solution 3. Create new tasks and reprioritize the task list 4. Go to 1. BabyAGI (https://github.com/yoheinakajima/babyagi?tab=readme-ov-file), BabyAGI review
  23. BabyAGI - A more refined planning sequence - Otherwise the same

    summary as AutoGPT (focus on the system, etc.)
  24. TaskWeaver 1. Generate a thought and a reply target 2. If the reply

    target is "User", just reply to the user. 3. If the reply target is "CodeInterpreter", the CodeInterpreter tries to solve the problem with tools. 4. Repeat TaskWeaver (https://github.com/microsoft/TaskWeaver), TaskWeaver paper (http://arxiv.org/abs/2311.17541) - A "Role" is the problem solver. - User - CodeInterpreter - The two role objects interact with each other to solve the problem. - Two patterns of tool usage - CodeInterpreter calls the tools - Function calling calls the tools - Simple ReAct
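The reply-target routing in steps 1-3 can be sketched as a small dispatcher. The `planner` below is a hardcoded stand-in for TaskWeaver's LLM-backed planner, and `code_interpreter` for its tool-backed role; neither is the framework's actual API:

```python
# Minimal sketch of TaskWeaver-style routing: the planner emits a thought plus
# a reply target; "User" gets a direct reply, "CodeInterpreter" gets a handoff.

def code_interpreter(step):
    return f"[executed code for: {step}]"     # stand-in for tool-backed solving

def planner(message):
    # 1. Generate a thought and a reply target (a real planner prompts an LLM).
    if "plot" in message:
        return {"thought": "needs code", "send_to": "CodeInterpreter",
                "message": message}
    return {"thought": "plain answer", "send_to": "User", "message": "Hi!"}

def handle(message):
    reply = planner(message)
    if reply["send_to"] == "User":            # 2. reply to the user directly
        return reply["message"]
    return code_interpreter(reply["message"]) # 3. hand off to CodeInterpreter

print(handle("hello"))         # → Hi!
print(handle("plot my data"))  # → [executed code for: plot my data]
```

The design point the slide makes is that the two roles are peers that exchange messages, so step 4 ("Repeat") is just this dispatcher running in a loop with the CodeInterpreter's result fed back to the planner.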
  25. TaskWeaver - ✅ Focus on the system to overcome the

    limitations - Good. TaskWeaver can handle a lot of use cases. - ✅ Objective definition: set a target use case and overcome or solve it - Good. - ✅ Use LLMs as components - Yes. Typical practice of using LLMs as components in the system. - Planner prompt - JSON response schema: init_plan, plan, current_plan_step, send_to, message - Experience module - Forces it not to repeat mistakes in the Code Interpreter - e.g., tool-argument generation, function-usage scenarios - LLMs in plugins (e.g., summary, speech-to-text)
  26. TaskWeaver - ✅ Good for general-purpose use like

    ChatGPT - Yes. It provides a very similar experience to ChatGPT's CodeInterpreter. - It provides SerpAPI and a web_search_plugin, like Web Browsing in ChatGPT - ✅ Good for code-based study - Well-crafted code architecture compared to other Python agent frameworks - ❌ Good to adopt in a business service - Hard to customize the framework - Hard to adapt the framework to an online server. It's only a framework for researching and finding possibilities/limitations. - ❌ Good for making your own agent - No. It's very hard to implement a new planner. - Very hard to modify the architecture. - ❌ Handling TaskWeaver's performance - Same as AutoGPT: iterative reasoning can lead to wrong answers.
  27. Summary of the systematic solutions We want these criteria, but...

    - ✅ Good for making your own agent - ✅ Good for general-purpose use like ChatGPT - ✅ Good to adopt in a business service - ✅ Well-crafted code architecture - ✅ Faster, cheaper, simpler, easy prompt debugging - ✅ Handles performance There is no public solution that meets all of these.
  28. More criteria for the system - ❓ Easy debugging of service

    and prompt results - ❓ Good for evaluating system performance offline - Simple evaluation: just use public benchmarks to approximate service performance - ❓ Good for unit-testing the LLM components - Simple unit tests over a very small dataset: examples from Liner - ❓ Easy to utilize the larger context of tool responses - How do you manage a large crawling response? - ❓ Good for logging service performance online - If a component's response produces a probability, record and log it. - ❓ More domain-specific requirements ... - Understanding domain knowledge. Unit test reference: https://speakerdeck.com/huffon/autonomous-agent-in-production?slide=81
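The "simple unit tests over a very small dataset" criterion can be illustrated in a few lines. `classify_query` below is a hypothetical LLM component; in a real test it would call the model, while a keyword stub keeps this sketch offline and deterministic:

```python
# Sketch of unit-testing one LLM component against a tiny golden dataset.

def classify_query(query: str) -> str:
    """Toy stand-in for an LLM-backed query classifier."""
    return "search" if "find" in query.lower() else "chat"

# Tiny golden dataset: a handful of (input, expected label) pairs.
DATASET = [
    ("Find me flights to Jeju", "search"),
    ("How are you today?", "chat"),
    ("find the latest Llama paper", "search"),
]

def test_classifier():
    failures = [(q, e, classify_query(q))
                for q, e in DATASET if classify_query(q) != e]
    assert not failures, f"misclassified: {failures}"

test_classifier()
print("all", len(DATASET), "cases passed")  # → all 3 cases passed
```

Even a dataset this small catches regressions when a prompt is edited, which is the point of the criterion: fast, cheap checks per component rather than a full benchmark run.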
  29. Can AGI make these concerns obsolete? There are two perspectives.

    - Yes: AGI will fulfill all the criteria better. - No: AGI will just be one component with insane performance. What's the answer? Maybe both?..
  30. About wrtn Services in Korea and Japan Store, Studio, Plugin, Chat Moving

    Toward a Common Goal (bringing AGI to people), Approaching It from Different Angles
  31. wrtn AI Team's Works (findings from the "More criteria for the

    system" slide) - ✅ Easy debugging of service and prompt results - Promptflow: doesn't provide debugging features, but its visualization and modularization of prompt flows lead to easy debugging. ➡ Developers should divide the features into a graph flow. - 🏋 Good for evaluating system performance offline - ✅ Simple evaluation: just use public benchmarks to approximate service performance - Promptflow provides a simple evaluation feature. - 🏋 Better internal benchmark systems - ✅ Good for unit-testing the LLM components - ✅ Simple unit tests over a very small dataset - Promptflow provides flow-driven development, which makes unit testing easy. - ✅ Easy to utilize the larger context of tool responses - ✅ How do you manage a large crawling response? ➡ "wrtn search" was trying to solve this problem and accomplished it. Promptflow isn't the ultimate solution, but it is very useful.
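The "divide the features into a graph flow" advice can be shown with a tiny flow runner. This is not the Promptflow API — just a hypothetical sketch of the shape: each feature is a named node, and recording every node's output is what makes prompt debugging easy:

```python
# Hypothetical sketch of a graph-flow decomposition: named nodes share a
# context, and a per-node trace exposes every intermediate result.

def run_flow(nodes, inputs):
    """Run nodes in order, passing a shared context and tracing every output."""
    context, trace = dict(inputs), []
    for name, fn in nodes:
        context[name] = fn(context)
        trace.append((name, context[name]))     # per-node trace for debugging
    return context, trace

# A three-node search-and-answer flow (node names are illustrative only).
nodes = [
    ("rewrite", lambda c: c["query"].strip().lower()),
    ("search",  lambda c: [f"doc about {c['rewrite']}"]),
    ("answer",  lambda c: f"Based on {len(c['search'])} doc(s): {c['rewrite']}"),
]

context, trace = run_flow(nodes, {"query": "  Wrtn Search  "})
for name, out in trace:          # inspect any intermediate step
    print(name, "→", out)
```

When a prompt misbehaves, the trace pinpoints which node produced the bad intermediate output — the same property the slide credits to Promptflow's visualization and modularization.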
  32. wrtn's Works (findings from the "More criteria for the system" slide)

    - ❓ Good for logging service performance online (from the perspective of an AI service) - It's hard to define a score for an AI service. - How do you define a score for an answer in an online service? - Maybe other engineering and service management are the solution. - e.g., a RAG service can log the similarity score, but that isn't user satisfaction. - ❓ More domain-specific requirements ... - 🏋 How do you define what is meaningful to users, and how will the system capture it? - Areas the company will always need to address and prove
  33. Summary: Systematic solutions for an LLM service 👍 Pros - Common

    and natural methods for building an AI service - Easy to compose complex functional requirements into a service - Easy to overcome LLM performance limits 👎 Cons - Hard to accomplish a good architecture design - Hard to debug and to engineer cooperatively - Complicated metrics and evaluation for both research and service - The whole system's quality depends on tool responses - Depends on too many external services, which makes the service expensive.
  34. Conclusion - Make better tools - Design a clear architecture for

    engineering and LLM components - Always follow new methods and decide whether to adopt them - An LLM library isn't always the solution; it can be a burden. - Easy evaluation and easy scoring for fast prototyping. - But a Korean agent benchmark is still needed.
  35. - Because AI system architecture is the kind of problem solving that

    companies attempt, much of it has been hard to share. - Still, just as the Stable Diffusion craze let many people observe the growth of open-source models, the discovery of use cases, and improvements to generation-tool architectures, I hope the open-source community makes similarly diverse attempts at LLM project architecture, like TaskWeaver and AutoGen. - Personally, since I've benefited from many contributors at home and abroad, I will try to contribute more. - As this talk was lacking, I hope related communities supplement it with further discussions and good perspectives on composing AI systems. Another Conclusion