Upgrade to Pro — share decks privately, control downloads, hide ads and more …

LLMs for code: The potential, prospects, and pr...

LLMs for code: The potential, prospects, and problems

Tushar Sharma

June 05, 2024
Tweet

More Decks by Tushar Sharma

Other Decks in Programming

Transcript

  1. Dr. Tushar Sharma Assistant professor at Dalhousie University, Canada PhD

    • Athens University of Economics and Business, Greece Industry experience • Siemens Research (7 + 2) Books • Refactoring for software design smells Tools/platforms • Designite • QConnect
  2. Software engineering Machine learning • Source code analysis • Software

    quality • Code smell detection and refactoring • Developers’ productivity • Program comprehension • Machine learning for software engineering • Software engineering for machine learning https://web.cs.dal.ca/~tushar/smart/ • Binary symbol reconstruction • Program comprehension for decompiled binaries • Vulnerability analysis for decompiled code Green AI • Sustainable machine learning • Energy hotspots and refactorings • Energy efficient code representation Sponsors and collaborators Dr. Tushar Sharma [email protected] SMART lab, Dalhousie University Tools and platforms
  3. Overview • Entering the Matrix • Visions of Zion •

    The Architect's Blueprint • Red Pills and Blue Pills • I am (We are) the one!
  4. Overview • Entering the Matrix (The basics) • Visions of

    Zion (Prospects) • The Architect's Blueprint (State of the art) • Red Pills and Blue Pills (Challenges) • I am (We are) the one! (Opportunities)
  5. 8 A language model is a probability distribution over words

    or word sequences. A language model assign probabilities for the next token(s), given a token(s) P(xn | xn-1 , xn-2 , …)
  6. 11 We believe that our efforts to consolidate and summarize

    the techniques, resources, and challenges will help the community to understand the state-of-the-art better and to focus their efforts on tackling the identified challenges.
  7. 12 We believe that our efforts to consolidate and summarize

    the techniques, resources, and challenges will help the community to understand the state-of-the-art better and to focus their efforts on tackling the identified challenges.
  8. 13 We believe that our efforts to consolidate and summarize

    the techniques, resources, and challenges will help the community understand the state-of- the-art better and focus their efforts on tackling the identified challenges 0.33 0.33 0.33 Assigning probabilities -> creating language model
  9. 14 We believe that our efforts to consolidate and summarize

    the techniques, resources, and challenges will help the community understand the state-of- the-art better and focus their efforts on tackling the identified challenges 0.33 0.33 0.33 Assigning probabilities -> creating language model
  10. 15 We believe that our efforts to consolidate and summarize

    the techniques, resources, and challenges will help community understand state-of-the-art better and focus their efforts on tackling identified challenges 0.25 0.25 0.25 0.25 0.33 0.33 0.33
  11. 16 We believe that our efforts to consolidate and summarize

    the techniques, resources, and challenges will help community understand state-of-the-art better and focus their efforts on tackling identified challenges 0.25 0.25 0.25 0.25 0.33 0.33 0.33 This arrangement can generate new sentences. However, not all generated sentences are meaningful. So, what we can do?
  12. LLMs 17 • One potential approach is to use many

    more examples to learn the probabilities. • But it might not be as helpful as desired. • A better approach would be to consider more than one token to decide the next token. • N-gram But how many?
  13. LLMs 18 • We use neural network to approximate the

    function for predicting the next token given a context. • However, increasing the number of units and capacity is not enough for a simple neural network to learn language modeling because of its complexity.
  14. 1 Transformer LLMs 19 Attention Next word prediction model 0

    1 0 Attention model learns where to focus to better predict the next token with the help of back propagation -> self-attention Both the models learn simultaneously
  15. LLMs 20 As the scope, target, and expectations from a

    model increases, the complexity also increases. • Stacking GPT-3 has 96 such layers 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0 1 Attention Next word prediction model 0 1 0
  16. LLMs 21 XGLM Cohere BERT 340M GPT-1 117M GPT-2 1.5B

    LANGUAGE MODEL SIZES TO MAR/2023 LifeArchitect.ai/models PaLM PaLM-Coder Minerva Med-PaLM Flan-PaLM U-PaLM Flan-U-PaLM Med-PaLM 2 540B BLOOM BLOOMZ 176B 20B 7.5B 13B Gopher 280B GPT-NeoX-20B MT-NLG 530B 52.4B Plato-XL Macaw 11B 11B Jurassic-1 178B 9.4B 6B GPT-J BlenderBot2.0 LaMDA LaMDA 2 Bard 137B GPT-3 175B ruGPT-3 Parameters AI lab/group Available Closed Chinchilla scale Chinchilla 70B* ⃡ Cedille Fairseq Anthropic-LM 6B Flamingo 80B* 20B * AlexaTM VIMA 200M 6.9B * 13B CM3 VLM-4 mGPT Luminous 200B 10B 13B Gato 1.2B OPT-175B BB3 OPT-IML 175B NLLB 54.5B GLM-130B ChatGLM-6B YaLM 100B 10B NOOR UL2 20B FIM 11B Flan-T5 11B 52B RL-CAI Claude PaLI 17B SeeKeR 2.7B 10B * WeLM Galactica 120B T5 Megatron-11B GPT-4 Undisclosed * MOSS 20B* LLaMA 65B* Kosmos-1 Z-Code++ 710M* 7B Toolformer 6.7B* * 11B Atlas Alpaca 1.6B* Beeswarm/bubble plot, sizes linear to scale. Selected highlights only. *Chinchilla scale means T:P ratio >15:1. https://lifearchitect.ai/chinchilla/ Alan D. Thompson. March 2023. https://lifearchitect.ai/
  17. LLMs 22 Olympus 2T (2024) LARGE LANGUAGE MODEL HIGHLIGHTS (APR/2024)

    LifeArchitect.ai/models Parameters AI lab/group ⃡ Sizes linear to scale. Selected highlights only. All models are available. All models are Chinchilla-aligned (20:1 tokens:parameters) https://lifearchitect.ai/chinchilla/ All 300+ models: https://lifearchitect.ai/models-table/ Alan D. Thompson. 2023-2024. Gemini Ultra 1.0 1.5T Claude 3 Opus 2T Next… (2024) 30B 70B 180B Gemini Pro 180B Nano Mamba 2.8B phi-2 2.7B … XS Pythia 12B Mistral 7B Zephyr 7.3B Gauss StripedHyena 7B Persimmon-8B DeciLM-7B SOLAR 10.7B Gemma 7B PaLM 2 340B Inflection-2.5 Grok-1 314B GPT-4 1.76T MoE ERNIE 4.0 1T ChatGPT gpt-3.5-turbo 20B Large Yuan 2.0 102.6B InternLM 104B Jurassic-2 Falcon 180B DBRX 132B MoE Mistral-medium Command R+ 104B … Small Palmyra 20B C1.2 Retro 48B MPT-30B Command-R 35B Yi-34B Mixtral 8x7B … Medium Command 52B StableLM 65B Eurus 70B Luminous Supreme Llama 3 70B Perplexity 70B Online Qwen-72B DeepSeek 67B
  18. LLMs for code 23 A Survey of Large Language Models

    for Code: Evolution, Benchmarking, and Future Trends
  19. • Foundation models • Number of parameters • Chain of

    thoughts • Fine-tuning • Prompt • Prompt engineering • Zero-shot learning • Few-shot learning • Temperature • Multi-modal Common terms
  20. • Efficiency and productivity • Code generation • Real-time code

    suggestions • Automated documentation • Customization • Applicability for wide range of use-cases • Creativity • Ease of use • Learning Benefits
  21. Current state of LLMs for code research 34 LLMs for

    code Predict/Classify Summarize Selective synthesis Multi-agent processing End-to-end development Complexity Sentiment analysis, vulnerability or code quality issue prediction, … Renaming a method, document generation, … Generate code for a specific problem, … Identify security vulnerabilities in the latest commit and lodge an issue, … Identify and prioritize requirements, code, refactor, test, debug, …
  22. LLMs in SE research 35 Large Language Models for Software

    Engineering: Survey and Open Problems
  23. Impact • Essential part of software development • Code generation

    - Copilot generates 61% of Java code (Feb 2023) • Test suite generation – Cover from DiffBlue • Threatening the status quo • StackOverFlow, or, in general, Google search • New tools and agents leveraging LLMs • Improved effectiveness of automated approaches using LLMs • Effective embeddings, for example 37
  24. LLMs are constantly improving 38 Unifying the Perspectives of NLP

    and Software Engineering: A Survey on Language Models for Code. 2023
  25. Correctness of generated code 42 Is Your Code Generated by

    ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation, 2023
  26. Correctness of generated code 45 Bugs in Large Language Models

    Generated Code: An Empirical Study, 2024
  27. 46 Evaluating the Code Quality of AI-Assisted Code Generation Tools:

    An Empirical Study on GitHub Copilot, Amazon CodeWhisperer, and ChatGPT, Oct 2023 How maintainable is the generated code?
  28. • Copilot is less likely to generate vulnerable code corresponding

    to newer vulnerabilities • Copilot is more prone to generate certain types of vulnerabilities • Using Copilot to fix security bugs is risky, given that Copilot did introduce vulnerabilities in at least a third of the cases we studied. Vulnerabilities in generated code Is GitHub’s Copilot as Bad as Humans at Introducing Vulnerabilities in Code?
  29. Vulnerabilities in generated code 48 A survey identified • 83

    “good” papers that highlight the positive contributions of LLMs to security and privacy. • 54 “bad” papers, in which attackers exploited LLMs to target users, • 144 “ugly” papers, in which authors discovered vulnerabilities within LLMs. A Survey on Large Language Model (LLM) Security and Privacy: The Good, The Bad, and The Ugly, June 2024
  30. Vulnerabilities in generated code 49 • In total 376 of

    these websites (approximately 78%) lacked essential extension checks exposing them to potential malicious file uploads. • Alarmingly, only 1143 (about 45.72%) websites implemented prepared statements, leaving 54.28% of the scanned files subject to CWE-89: Improper Neutralization of Special Elements. • We identified 459 SQL injection, 57 stored XSS, 394 reflected XSS vulnerable parameters in the entire dataset. LLMs in Web Development: Evaluating LLM-Generated PHP Code Unveiling Vulnerabilities and Limitations, 2024 Our findings serve as a strong reminder of the continuous and evolving threat landscape, urging developers and security professionals to remain vigilant and use generative AI with caution.
  31. Can LLMs reason? 50 LLMs cannot always perform analogical reasoning

    and the key influencing factor is the accuracy of self-generated examples rather than their relevance. Relevant or Random: Can LLMs Truly Perform Analogical Reasoning?
  32. Can LLMs reason? 51 Overall, GPT-4 performs best in grasping

    inferential rules. But compared to human performance, there still remains substantial room for improvement across all models, especially in highly compositional, symbolic and structural complex rules. Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs, May 2024
  33. Can LLMs reason? 52 LLMs, especially GPT families, can effectively

    stimulate the reasoning results of logic solvers. Although LLMs demonstrate satisfactory performance on several datasets, the potential drawbacks and limitations of LLMs for logic code simulation should not be underestimated. Can LLMs Reason with Rules? Logic Scaffolding for Stress-Testing and Improving LLMs, May 2024
  34. Technical challenges • Context window • How to assess their

    effectiveness • Comprehensive benchmarks (not limited to accuracy) • Evaluation on realistic problems/samples • Data leakage • Multi-step tasks that require domain-specific direction • Prompt injection • Indistinguishability between instruction and data 53
  35. Societal challenges 54 • Effect of LLMs on climate •

    Bias • Pedagogy • Are these LLMs making students dumb? • Over-reliance and loss of critical thinking
  36. • Lack of accountability • Copyright • Many active lawsuits

    • Level of transparency • Large organizations do not share their sources • Misinformation and disinformation • Privacy concerns Legal challenges
  37. Impact on environment and cost 56 Llama cost of training

    - 2048 A100 GPUs for 23 days - Electricity cost $53K Operational cost: ChatGPT spends $700K per day Google PaLM trained on 6144 TPUs V4 made of two TPU V4 pods Meta AI’s OPT was trained on 992 A100 GPUs https://www.economist.com/technology-quarterly/2020/06/11/the-cost-of-training-machines-is-becoming-a-problem
  38. How green are LLMs GPT3 • 1287 GW-hr • 502

    tons of carbon • 120 years' worth of single-family electricity usage of an American household 57
  39. Opportunities • Better integration with apps, APIs, and bots •

    Domain/context-specific support • Greener models • Reduced inference latency • Enhanced code understanding capabilities • LLMs that understand software design and architecture 59
  40. THANK YOU D r. Tu s h a r S

    h a r m a t u s h a r @ d a l . c a