Understanding Jlama in Quarkus

Mario Fusco

March 12, 2025

Transcript

  1. Jlama - Serve your LLM in pure Java
     ❖ Run LLM inference directly embedded into your Java application
     ❖ Based on the Java Vector API
     ❖ Integrated with Quarkus and LangChain4j
     ❖ Jlama includes out of the box:
        ➢ Support for many different LLM families
        ➢ A command line tool that makes it easy to use
        ➢ A tool for model quantization
        ➢ A pure Java tokenizer
        ➢ Distributed inference
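     To make the "embedded inference" idea concrete, this is roughly what driving Jlama directly from plain Java looks like. The sketch follows the shape of Jlama's published examples; the class and method names (ModelSupport.loadModel, the prompt builder, generate) are recalled from Jlama's documentation and may differ between Jlama versions, and the local model path is an assumption:

     import java.io.File;
     import java.util.UUID;

     import com.github.tjake.jlama.model.AbstractModel;
     import com.github.tjake.jlama.model.ModelSupport;
     import com.github.tjake.jlama.model.functions.Generator;
     import com.github.tjake.jlama.safetensors.DType;
     import com.github.tjake.jlama.safetensors.prompt.PromptContext;

     // Sketch only: API names follow Jlama's published examples and may vary across versions.
     public class JlamaHelloWorld {

         public static void main(String[] args) throws Exception {
             // Assumes the quantized model has already been downloaded into ./models
             File localModelPath = new File("./models/Llama-3.2-1B-Instruct-JQ4");

             // Load the quantized weights; working memory is quantized to I8
             AbstractModel model = ModelSupport.loadModel(localModelPath, DType.F32, DType.I8);

             // Build a chat-style prompt in the format expected by this model family
             PromptContext ctx = model.promptSupport().orElseThrow().builder()
                     .addSystemMessage("You are a helpful assistant who writes short answers.")
                     .addUserMessage("What is the Java Vector API?")
                     .build();

             // Run inference: temperature 0.0, at most 256 new tokens, no streaming callback
             Generator.Response response = model.generate(UUID.randomUUID(), ctx, 0.0f, 256, (token, time) -> {});
             System.out.println(response.responseText);
         }
     }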
  2. … but why Java?
     ❖ Fast development/prototyping → No need to install, configure and interact with any external server.
     ❖ Security → Embedding the model inference in the same JVM instance as the application using it removes the need to talk to the LLM through REST calls, thus preventing leaks of private data.
     ❖ Legacy support → Legacy users still running monolithic applications on EAP can add LLM-based capabilities to those applications without changing their architecture or platform.
     ❖ Monitoring and observability → Statistics on the reliability and speed of the LLM responses can be gathered with the same tools already provided by EAP or Quarkus.
     ❖ Developer experience → Debugging is simplified, allowing Java developers to also navigate and debug the Jlama code if necessary.
     ❖ Distribution → The model itself can be included in the same fat jar as the application using it (even though this is probably advisable only in very specific circumstances).
     ❖ Edge friendliness → A self-contained LLM-capable Java application is a better fit for edge environments than a client/server architecture.
     ❖ Embedding of auxiliary LLMs → Apps using different LLMs, for instance a smaller one to validate the responses of the main, bigger one, can take a hybrid approach and embed the auxiliary LLMs.
     ❖ Similar lifecycle between model and app → Since prompts are very dependent on the model, when the model is updated, even through fine-tuning, the prompt may need to be replaced and the app updated accordingly.
  3. Because we are not data scientists … well … no, seriously … why not Python?
     What we do is integrate existing models
  4. Because we are not data scientists … well … no, seriously … why not Python?
     What we do is integrate existing models into enterprise-grade systems and applications
  5. Because we are not data scientists … well … no, seriously … why not Python?
     What we do is integrate existing models into enterprise-grade systems and applications.
     Do you really want to do
     • Transactions
     • Security
     • Scalability
     • Observability
     • … you name it
     in Python???
  6. Integrating Jlama with Quarkus and LangChain4j

     <!-- Quarkus/LangChain4j/Jlama integration module -->
     <dependency>
         <groupId>io.quarkiverse.langchain4j</groupId>
         <artifactId>quarkus-langchain4j-jlama</artifactId>
         <version>${quarkus.langchain4j.version}</version>
     </dependency>
     <dependency>
         <groupId>com.github.tjake</groupId>
         <artifactId>jlama-core</artifactId>
         <version>${jlama.version}</version>
     </dependency>
     <!-- Jlama native support for a specific operating system (optional) -->
     <dependency>
         <groupId>com.github.tjake</groupId>
         <artifactId>jlama-native</artifactId>
         <version>${jlama.version}</version>
         <classifier>${os.detected.classifier}</classifier>
     </dependency>
  7. Integrating Jlama with Quarkus and LangChain4j
     (same dependencies as the previous slide, plus the Maven plugin configuration below)

     <!-- Enable the Vector API preview feature -->
     <plugin>
         <groupId>${quarkus.platform.group-id}</groupId>
         <artifactId>quarkus-maven-plugin</artifactId>
         <version>${quarkus.platform.version}</version>
         <extensions>true</extensions>
         <configuration>
             <jvmArgs>--enable-preview --enable-native-access=ALL-UNNAMED</jvmArgs>
             <modules>
                 <module>jdk.incubator.vector</module>
             </modules>
         </configuration>
     </plugin>
  8. Integrating Jlama with Quarkus and LangChain4j
     (same dependencies and plugin configuration as the previous slides, plus the model configuration below)

     # application.properties
     # Configure a model from Hugging Face (it will be downloaded automatically on the 1st run)
     quarkus.langchain4j.jlama.chat-model.model-name=tjake/Llama-3.2-1B-Instruct-JQ4
     quarkus.langchain4j.jlama.chat-model.temperature=0
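     With the dependencies, the Vector API flag and the model configured, a minimal AI service is all that is needed to talk to the embedded model. This is a hypothetical hello-world style example (GreetingAiService and GreetingResource are not from the talk), following the same @RegisterAiService pattern shown on the next slide:

     import dev.langchain4j.service.SystemMessage;
     import dev.langchain4j.service.UserMessage;
     import dev.langchain4j.service.V;
     import io.quarkiverse.langchain4j.RegisterAiService;
     import jakarta.inject.Inject;
     import jakarta.ws.rs.GET;
     import jakarta.ws.rs.Path;
     import jakarta.ws.rs.PathParam;
     import jakarta.ws.rs.Produces;
     import jakarta.ws.rs.core.MediaType;

     // Hypothetical example: the interface is implemented at build time by Quarkus LangChain4j
     // and backed by the Jlama chat model declared in application.properties above.
     @RegisterAiService
     interface GreetingAiService {

         @SystemMessage("You are a polite assistant that answers in a single sentence.")
         @UserMessage("Write a short greeting for a person named {name}")
         String greet(@V("name") String name);
     }

     @Path("/greet")
     class GreetingResource {

         @Inject
         GreetingAiService ai;

         @GET
         @Path("/{name}")
         @Produces(MediaType.TEXT_PLAIN)
         public String greet(@PathParam("name") String name) {
             return ai.greet(name);   // the whole round trip stays inside the JVM
         }
     }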
  9. A pure Java LLM-based application: the site summarizer

     @Path("/summarize")
     public class SiteSummarizerResource {

         @Inject
         private SiteSummarizer siteSummarizer;

         @GET
         @Path("/{type}/{topic}")
         @Produces(MediaType.TEXT_PLAIN)
         public Multi<String> read(@PathParam("type") String type, @PathParam("topic") String topic) {
             return siteSummarizer.summarize(SiteType.determineType(type), topic);
         }
     }

     @RegisterAiService
     public interface SummarizerAiService {

         @SystemMessage("""
             You are an assistant that receives the content of a web page
             and sums up the text on that page.
             Add key takeaways to the end of the sum-up.
             """)
         @UserMessage("Here's the text: '{text}'")
         Multi<String> summarize(@V("text") String text);
     }

     @ApplicationScoped
     class SiteSummarizer {

         @Inject
         private SummarizerAiService summarizerAiService;

         public Multi<String> summarize(SiteType siteType, String topic) {
             String html = SiteCrawler.crawl(siteType, topic);
             String content = TextExtractor.extractText(html, 20_000);
             return summarizerAiService.summarize(content);
         }
     }
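     The SiteType, SiteCrawler and TextExtractor helpers used above are not shown in the deck. Purely as an illustration of the shape such a helper could take (everything below is hypothetical, not the talk's actual code), SiteType might look like this:

     // Hypothetical sketch: not the talk's actual code, just a possible shape for the helper.
     public enum SiteType {

         WIKIPEDIA("https://en.wikipedia.org/wiki/"),
         BLOG("https://example.invalid/blog/");        // placeholder base URL

         private final String baseUrl;

         SiteType(String baseUrl) {
             this.baseUrl = baseUrl;
         }

         public String baseUrl() {
             return baseUrl;
         }

         // Maps the {type} path parameter of the REST endpoint to an enum constant
         public static SiteType determineType(String type) {
             return "wiki".equalsIgnoreCase(type) ? WIKIPEDIA : BLOG;
         }
     }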
  10. Jlama Performance Limits (Large Models)
      • Inference is memory bound
      • Inference is 99% matrix multiplication
      • Bigger model == slower inference
      • Humans read at ~5 words/sec
      • Latency of 200 ms/token max is the sweet spot
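      A quick back-of-the-envelope calculation makes the memory-bound point concrete: during decoding, every generated token has to stream essentially all of the model weights through memory once, so the sustained bandwidth needed is roughly model size times tokens per second. A sketch using the ~7 GB model size quoted on the next slide (the numbers are illustrative):

      public class BandwidthEstimate {

          public static void main(String[] args) {
              // Illustrative numbers taken from these slides
              double modelSizeGb = 7.0;     // ~7 GB of Q4 weights (8B-10B parameter model)
              double tokensPerSec = 5.0;    // 200 ms/token == 5 tok/sec, roughly human reading speed

              // Decoding reads (almost) all weights once per generated token,
              // so required sustained bandwidth ~= model size * tokens per second
              double requiredGbPerSec = modelSizeGb * tokensPerSec;

              System.out.printf("Required memory bandwidth: ~%.0f GB/s%n", requiredGbPerSec); // ~35 GB/s
          }
      }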
  11. Jlama Performance Limits (Large Models)
      [Chart: tokens/sec for the same-sized model, ~7 GB (8B-10B parameter model @ Q4), comparing:]
      • Plain Java + Threads
      • Native SIMD C (via FFI)
      • Java Vector API
      Just barely 5 tok/sec
      Too bad Java + GPU doesn't mix 🤔
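      Since inference is almost entirely matrix multiplication, the interesting part of this comparison is how the inner loops are vectorized. As a minimal, self-contained illustration of the technique (not Jlama's actual kernel), here is a SIMD dot product written with jdk.incubator.vector; run it with --add-modules jdk.incubator.vector, matching the plugin configuration shown earlier:

      import jdk.incubator.vector.FloatVector;
      import jdk.incubator.vector.VectorOperators;
      import jdk.incubator.vector.VectorSpecies;

      // Illustration only: a vectorized dot product, the building block of matrix multiplication.
      public class VectorDot {

          private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

          static float dot(float[] a, float[] b) {
              FloatVector acc = FloatVector.zero(SPECIES);
              int i = 0;
              int upperBound = SPECIES.loopBound(a.length);
              for (; i < upperBound; i += SPECIES.length()) {
                  FloatVector va = FloatVector.fromArray(SPECIES, a, i);
                  FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
                  acc = va.fma(vb, acc);               // acc += a[i..] * b[i..], one lane-set at a time
              }
              float sum = acc.reduceLanes(VectorOperators.ADD);
              for (; i < a.length; i++) {              // scalar tail for lengths not divisible by the lane count
                  sum += a[i] * b[i];
              }
              return sum;
          }

          public static void main(String[] args) {
              float[] a = {1f, 2f, 3f, 4f, 5f, 6f, 7f, 8f, 9f};
              float[] b = {9f, 8f, 7f, 6f, 5f, 4f, 3f, 2f, 1f};
              System.out.println(dot(a, b));           // 165.0
          }
      }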
  12. Java + GPU?
      • Project Babylon one day?
        ◦ That will be great! But when… 202X?
      • TornadoVM
        ◦ Requires a bespoke wrapper around the JDK to compile/run applications (with all driver dependencies)
      • What else is a multi-platform, portable runtime for the GPU?
  13. Jlama Performance (GPU Support Coming Soon!)
      [Chart: tokens/sec for the same-sized model, ~7 GB (8B-10B parameter model @ Q4), comparing:]
      • Plain Java + Threads
      • Native SIMD C (via FFI)
      • Java Vector API
      • Java <-> WebGPU
      Google Dawn WebGPU engine + FFI
      Works on Win/Mac/Linux!
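      Both the "Native SIMD C" and the WebGPU paths go through Java's Foreign Function & Memory API. As a generic illustration of that mechanism (not Jlama's actual Dawn bindings), here is a minimal downcall to a standard C library function; it assumes Java 22+ and the --enable-native-access flag from the plugin configuration shown earlier:

      import java.lang.foreign.Arena;
      import java.lang.foreign.FunctionDescriptor;
      import java.lang.foreign.Linker;
      import java.lang.foreign.MemorySegment;
      import java.lang.foreign.ValueLayout;
      import java.lang.invoke.MethodHandle;

      // Illustration only: calling the C library's strlen through the FFM API,
      // the same kind of mechanism used to reach native SIMD code and, soon, WebGPU.
      public class FfiDemo {

          public static void main(String[] args) throws Throwable {
              Linker linker = Linker.nativeLinker();

              // Look up strlen and describe its signature: size_t strlen(const char*)
              MethodHandle strlen = linker.downcallHandle(
                      linker.defaultLookup().find("strlen").orElseThrow(),
                      FunctionDescriptor.of(ValueLayout.JAVA_LONG, ValueLayout.ADDRESS));

              try (Arena arena = Arena.ofConfined()) {
                  MemorySegment cString = arena.allocateFrom("Jlama");   // NUL-terminated off-heap copy
                  long length = (long) strlen.invokeExact(cString);
                  System.out.println(length);                            // 5
              }
          }
      }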