Where are we going with observability? KubeDay Australia
This was a presentation of "Where are we going with observability?" at KubeDay Australia. It covered observability and genai, then both together, both ways. Engaged folks who tolerated my humor well, and met several after. Will love to go again!
We normalize key metrics or correlation fields we care about most • We spend a lot of energy in instrumentation • We collect a lot of logs, but may struggle analyzing impact of them
This is an expert group you can follow for anything from landscape to solutions. The next slides cover basics of GenAI including some wins and challenges
Causal language models are unidirectional: predict words based on previous ones • Large Language Models LLMs can perform functions like chat, code completion, etc. The LLM can answer correctly if geography was a part of the text it was trained on Assistant: Bouvet Island is located in the South Atlantic Ocean. User: Which ocean contains bouvet island?
for something in a database and misspell it → nope • If you mess up syntax for a query → nope The LLM is trained on language, so it can compensate on slightly off input Assistant: Bouvet Island is located in the South Atlantic Ocean. User: Which ocean cntains the bouvet island?
of written languages • LLMs can respond in the userʼs language even if its source is english Assistant: Pulau Bouvet terletak dalam Laut Atlantik Selatan. User: Lautan manakah yang mengandungi pulau bouvet?
• LLMs rely on statistical correlation and may invent things to fill gaps • Mitigate with selecting relevant models, prompt engineering, etc. There is no capital! It is only penguins and researchers! Assistant: The capital of the Bouvet Island is Punta Arenas. User: What is the capital of the Bouvet island?
model and what it knows about. • LLMs have a training date which is often not documented • LLMs are trained differently and may have no knowledge of a specific topic • LLMs might hallucinate when asked about their training date or knowledge cut-off! Should be Stanley Assistant: As of my knowledge cutoff date in September 2023… version 1.21 User: What is the latest version of go? The easiest way to add new information is to pass it to the context of an existing model, this is called RAG.
usage, cost, latency, success rate and model completions • Evaluations - Manual and automated for feedback • Dataset management - For fine tuning and bootstrapping few shot demos https://langtrace.ai
OpenAI is dominant within an otherwise very diverse ecosystem • Expectation of immediate support of features not planned openly • Norms of full request/response collection for debugging chats • Token counts are tied to spend, and both a system and biz metric • Vendors desire common telemetry tools, but they arenʼt there, yet • Accuracy is subjective and depends on the use case
of existing tools • Extracting patterns or system entities out of logs • Configuring or training data feeds • Connecting telemetry data to help desk tickets or source code
them • Logs start before other instrumentation like tracing • Some problems exist like changing log formats (think k8s) • Logs often have stack traces which can be cross-checked with code
reports, and even identify and execute a runbook! Observability AI Assistant There are other assistants in Kibana like Search, Security and Integration!
bidirectional model for Named Entity Recognition • This can be used for precise identification of PII in a redaction pipeline www.elastic.co/observability-labs/blog/pii-ner-regex-assess-redact-part-1
or root cause analysis • Flex with changing log formats or new types of frameworks • Get more value out of your logs, generating configuration like SLOs • Consider more fields than a human can comprehend • Integrate non-traditional facts such as pull requests, issues and code • Elastic Observability Lab -> www.elastic.co/observability-labs • Runtime TAG -> Cloud Native AI Working Group • OpenTelemetry -> Semantic Conventions: LLM SIG Want more?
log lines instead of a question SYSTEM """ You are a log file analyzer. You will receive a set of lines from a log file for some software application, find the errors and other interesting aspects of the logs, and explain them so a new user can understand what they mean. If there are any steps they can do to resolve them, list the steps in your answer. """ –snip– 2023-11-10 07:17:40 /docker-entrypoint.sh: Configuration complete; ready for start up 2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: using the "epoll" event method 2023-11-10 07:17:40 2023/11/10 13:17:40 [notice] 1#1: nginx/1.25.3 –snip– 2023-11-10 07:18:53 2023/11/10 13:18:53 [error] 29#29: *1 open() "/usr/share/nginx/html/ahstat" failed (2: No such file or directory), client: 192.168.65.1, server: localhost, request: "GET /ahstat HTTP/1.1", host: "localhost:8080" –snip–
figure out Here are the logs for your Nginx server from 2023-11-10 13:17:40 to 13:18:53. The first log entry shows that a worker process has started, which is normal and expected behavior. The next three entries show requests made by a client with IP address 192.168.65.1 to the server at localhost:8080 for the root path "/", "/favicon.ico", and "/ahstat". The first request returned a 200 status code, which indicates success. However, the requests for "/favicon.ico" and "/ahstat" both failed with a 404 status code, indicating that those resources were not found on the server. The final log entry shows another request from the same client to "/ahstat", also resulting in a 404 error. There is nothing inherently wrong or unusual about these logs, but it's worth noting that Nginx could not find the "/favicon.ico" and "/ahstat" resources requested by the client, which may indicate a misconfiguration or missing files on your server. If you expect to serve those resources, you should ensure they exist in the correct location (by default, /usr/share/nginx/html) and that Nginx is properly configured to serve them.
to be very precise, particularly in how it represents errors. These rules are typically in a schema. The LLM derives impact and semantics from plain logs, without any preparation, even knowing 404 is not necessarily an error.