Palark
November 24, 2023

Making your Kubernetes-based log collection reliable & durable with Vector

Tech talk by Maksim Nabokikh, Platform Lead @ Palark, presented at KCD (Kubernetes Community Days) Austria 2023 and OSMC (Open Source Monitoring Conference) 2023.

Vector is an Open Source high-performance solution for collecting & processing your observability data. In this talk, Maksim shares our experience using it for log collection in hundreds of Kubernetes clusters.

Find more resources for this talk:
* Blog article: Collecting logs in Kubernetes with Vector: Benefits, architecture, real cases.
* YouTube video.

P.S. Subscribe to the Palark tech blog to get our latest articles on DevOps, SRE, Kubernetes, and more!

Transcript

  1. DISCLAIMER

    During the preparation of this talk, no Kubernetes clusters were hurt. Just kidding: in reality, there were ple-e-e-enty of outages.
  2. ABOUT PALARK

    We offer all-in-one DevOps-as-a-Service and pick the best Open Source projects to fulfill our clients' goals.

    * Figures: 16, 70, 15, 90
    * Labels: Years in Linux, DevOps & Kubernetes; Managed Kubernetes clusters; Awesome engineers; Tech posts at blog.palark.com
  3. PLAN

    1. LOGS IN KUBERNETES: Let’s recall what to collect in Kubernetes
    2. WHAT IS VECTOR: And in which way it is applicable
    3. PRACTICAL USE: Exciting operating (Ops) experience cases
  4. LOGS IN KUBERNETES: POD LOGS

    * The log file path consists of a pod name, container name, and UID
    * The format and location of files depend on the CRI settings
    * Max size and rotation depend on the kubelet settings

    [Diagram: stdout and stderr of pod-1 and pod-2, kubelet, /var/log/pods]

    kubernetes.io/docs/concepts/cluster-administration/logging/
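To make the path layout from this slide concrete, here is a minimal Python sketch that splits a pod log path into its parts. It assumes the layout used by current kubelet versions, `/var/log/pods/<namespace>_<pod-name>_<pod-uid>/<container-name>/<n>.log`; the namespace component and the names `PodLogPath` and `parse_pod_log_path` are not on the slide and are introduced only for illustration.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class PodLogPath(NamedTuple):
    """Parts of a pod log path (hypothetical helper type)."""
    namespace: str
    pod_name: str
    pod_uid: str
    container_name: str
    file_name: str


def parse_pod_log_path(path: str) -> PodLogPath:
    """Split /var/log/pods/<ns>_<pod>_<uid>/<container>/<n>.log into parts.

    Assumption: namespaces, pod names, and UIDs never contain underscores,
    so splitting the directory name on "_" is unambiguous.
    """
    parts = PurePosixPath(path).parts
    if len(parts) != 7 or parts[:4] != ("/", "var", "log", "pods"):
        raise ValueError(f"not a pod log path: {path}")
    pod_dir, container_name, file_name = parts[4:]
    pieces = pod_dir.split("_")
    if len(pieces) != 3:
        raise ValueError(f"unexpected pod directory name: {pod_dir}")
    namespace, pod_name, pod_uid = pieces
    return PodLogPath(namespace, pod_name, pod_uid, container_name, file_name)


parsed = parse_pod_log_path(
    "/var/log/pods/default_pod-1_50fb55c5-d97e-4851-85c6-187465154db6/app/0.log"
)
print(parsed.pod_name, parsed.container_name)
```

A log collector typically derives this metadata from the path first and enriches it with labels from the API afterwards.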
  5. LOGS IN KUBERNETES: NODE SERVICES

    * Files in the /var/log directory (probably)
    * Max size and rotation configured by journald
    * Format can be anything…

    [Diagram: containerd, kubelet, audit logs, syslog]

    kubernetes.io/docs/concepts/cluster-administration/logging/
  6. LOGS IN KUBERNETES: EVENTS

    * Can only be collected from the Kubernetes API
    * Can be collected as either logs, metrics, or traces

        apiVersion: v1
        kind: Event
        count: 1
        metadata:
          name: standard-worker-1.178264e1185b006f
          namespace: default
        reason: RegisteredNode
        firstTimestamp: '2023-09-06T19:08:47Z'
        lastTimestamp: '2023-09-06T19:08:47Z'
        involvedObject:
          apiVersion: v1
          kind: Node
          name: standard-worker-1
          uid: 50fb55c5-d97e-4851-85c6-187465154db6
        message: 'Registered Node standard-worker-1 in Controller'

    kubernetes.io/docs/concepts/cluster-administration/logging/
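As an illustration of the "events as logs" option, the following Python sketch flattens the Event from this slide into a single JSON log line. The output field names (`timestamp`, `object`, and so on) are assumptions chosen for the example, not a schema used by Vector or by the speaker.

```python
import json


def event_to_log_record(event: dict) -> dict:
    """Flatten a Kubernetes Event object into a log-style record.

    The record layout is hypothetical; adapt it to your log storage.
    """
    involved = event.get("involvedObject", {})
    return {
        "timestamp": event.get("lastTimestamp") or event.get("firstTimestamp"),
        "message": event.get("message", ""),
        "reason": event.get("reason"),
        "count": event.get("count", 1),
        "namespace": event.get("metadata", {}).get("namespace"),
        "object": f'{involved.get("kind")}/{involved.get("name")}',
    }


# The Event shown on the slide, as a Python dict.
sample_event = {
    "apiVersion": "v1",
    "kind": "Event",
    "count": 1,
    "metadata": {
        "name": "standard-worker-1.178264e1185b006f",
        "namespace": "default",
    },
    "reason": "RegisteredNode",
    "firstTimestamp": "2023-09-06T19:08:47Z",
    "lastTimestamp": "2023-09-06T19:08:47Z",
    "involvedObject": {
        "apiVersion": "v1",
        "kind": "Node",
        "name": "standard-worker-1",
        "uid": "50fb55c5-d97e-4851-85c6-187465154db6",
    },
    "message": "Registered Node standard-worker-1 in Controller",
}

record = event_to_log_record(sample_event)
log_line = json.dumps(record, sort_keys=True)
print(log_line)
```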
  10. WHAT IS VECTOR

    An open source, efficient tool for building log collecting pipelines (vector.dev)

    * Vendor agnostic
    * You do not need to rewrite Vector in Rust
    * Performance by design and continuous benchmarking
    * Flexible building block
  11. VECTOR’S ARCHITECTURE

    [Diagram: a pipeline with three stages, Collect, Transform, and Send. Labels: File, K8s, Socket; Remap, Filter, Aggregate; "9 in total", "40 in total", "52 in total"; Vector Remap Language (VRL)]
  13. VECTOR REMAP LANGUAGE

        [transforms.filter_severity]
        type = "filter"
        inputs = ["logs"]
        condition = '.severity != "info"'

        [transforms.sanitize_kubernetes_labels]
        type = "remap"
        inputs = ["logs"]
        source = '''
          if exists(.pod_labels."controller-revision-hash") {
            del(.pod_labels."controller-revision-hash")
          }
          if exists(.pod_labels."pod-template-hash") {
            del(.pod_labels."pod-template-hash")
          }
        '''

        [transforms.backslash_multiline]
        type = "reduce"
        inputs = ["logs"]
        group_by = ["file", "stream"]
        merge_strategies."message" = "concat_newline"
        ends_when = '''
          matched, err = match(.message, r'[^\\]$');
          if err != null {
            false;
          } else {
            matched;
          }
        '''
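To show what the `backslash_multiline` transform on this slide does to a stream of messages, here is a rough Python equivalent. It is a sketch under simplifying assumptions: it handles a single file and stream (the `group_by` part is omitted), and it flushes an unfinished group at the end of input, whereas Vector decides that by timeout. The function name `merge_backslash_lines` is made up for the example.

```python
import re
from typing import Iterable, Iterator

# Same idea as the VRL condition: a group ends when the message
# ends with a character that is not a backslash.
ENDS_GROUP = re.compile(r"[^\\]\Z")


def merge_backslash_lines(messages: Iterable[str]) -> Iterator[str]:
    """Join messages that end with a backslash with the messages that follow."""
    pending = []
    for message in messages:
        pending.append(message)
        if ENDS_GROUP.search(message):
            # Mirrors merge_strategies."message" = "concat_newline".
            yield "\n".join(pending)
            pending = []
    if pending:
        # Input ended in the middle of a group; emit what was collected.
        yield "\n".join(pending)


lines = ["first \\", "second \\", "third", "standalone"]
for merged in merge_backslash_lines(lines):
    print(repr(merged))
```

Note that the backslashes stay in the merged message; the transform only decides where one logical record ends.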
  15. LOG COLLECTING TOPOLOGIES

    [Diagram: three log collecting topologies, Distributed, Centralized, and Stream, built from log-shipper, aggregator, queue, and storage nodes]
  23. VECTOR IN KUBERNETES

    github.com/deckhouse/deckhouse/blob/main/modules/460-log-shipper/templates/daemonset.yaml

    [Diagram: log-shipper pod with the Vector, Reloader, and Kube RBAC proxy containers; paths /var/log, /vector-data, and /etc/vector; Node File System]

    * Vector: collects logs
    * Reloader: validates config and reloads
    * Kube RBAC proxy: protects metrics

    Excerpts from the DaemonSet manifest:

        apiVersion: apps/v1
        kind: DaemonSet

        volumes:
        - name: var-log
          hostPath:
            path: /var/log/
        - name: vector-data-dir
          hostPath:
            path: /mnt/vector-data
        - name: localtime
          hostPath:
            path: /etc/localtime

        volumeMounts:
        - name: var-log
          mountPath: /var/log/
          readOnly: true

        terminationGracePeriodSeconds: 120

        shareProcessNamespace: true
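The Reloader role described on this slide (validate the config, then reload) could look roughly like the Python sketch below. This is not the Deckhouse implementation. It assumes the reloader runs `vector validate` on the candidate config and then sends SIGHUP to the Vector process, which is possible across containers only because the pod sets `shareProcessNamespace: true`. The config path, the PID, and the function names are hypothetical.

```python
import os
import signal
import subprocess
from typing import Callable


def validate_with_vector(config_path: str) -> bool:
    """Run `vector validate` on the config; True if it exits with code 0."""
    result = subprocess.run(
        ["vector", "validate", config_path],
        capture_output=True,
    )
    return result.returncode == 0


def reload_if_valid(
    config_path: str,
    vector_pid: int,
    validate: Callable[[str], bool] = validate_with_vector,
    send_signal: Callable[[int, int], None] = os.kill,
) -> bool:
    """Signal Vector to reload only when the new config validates.

    Returns True if the reload signal was sent, False if the config
    was rejected and the running Vector was left untouched.
    """
    if not validate(config_path):
        return False
    # Vector re-reads its configuration when it receives SIGHUP.
    send_signal(vector_pid, signal.SIGHUP)
    return True
```

Validating first means a broken config never reaches the running collector, so a typo cannot stop log collection on every node at once.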
  25. CASE #1: NO SPACE LEFT ON THE DEVICE

        $ lsof -nP | grep '(deleted)'
        vector 6331 root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
        vector 6331 root 44r REG 253,3 10239 33665268 /var/log/.../1.log (deleted)
        vector 6331 6628 vector-wo root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
        vector 6331 6628 vector-wo root 44r REG 253,3 10239 33665268 /var/log/.../1.log (deleted)
        vector 6331 6629 vector-wo root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
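The lsof output above lists the same deleted file once per thread, so a raw line count overstates the problem. The Python sketch below estimates how much disk space is actually pinned by deleted-but-open files by deduplicating on device and inode. It assumes the column order shown on the slide (`REG`, device, size, inode) and that the size column holds a byte count; the function name is made up for the example.

```python
from typing import Dict, Tuple


def deleted_bytes_held(lsof_output: str) -> Tuple[int, int]:
    """Return (unique deleted files, total bytes) from `lsof -nP` text."""
    sizes: Dict[Tuple[str, str], int] = {}
    for line in lsof_output.splitlines():
        if not line.rstrip().endswith("(deleted)"):
            continue
        fields = line.split()
        if "REG" not in fields:
            continue
        i = fields.index("REG")
        if len(fields) <= i + 3:
            continue
        device, size, inode = fields[i + 1], fields[i + 2], fields[i + 3]
        # One entry per (device, inode): threads share file descriptors.
        sizes[(device, inode)] = int(size)
    return len(sizes), sum(sizes.values())


sample = """\
vector 6331 root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
vector 6331 root 44r REG 253,3 10239 33665268 /var/log/.../1.log (deleted)
vector 6331 6628 vector-wo root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
vector 6331 6628 vector-wo root 44r REG 253,3 10239 33665268 /var/log/.../1.log (deleted)
vector 6331 6629 vector-wo root 25r REG 253,3 10602 72738831 /var/log/.../1.log (deleted)
"""
files, total = deleted_bytes_held(sample)
print(files, total)
```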
  29. CASE #1: NO SPACE LEFT ON THE DEVICE

    [Diagram: kubelet, /var/log/pods, and Vector; Loki answers with 429; /var/log/pods/{uid}/1.log (10Mb) and three copies of /var/log/pods/{uid}/1.log marked (DELETED), 50Mb each]
  30. CASE #1: NO SPACE LEFT ON THE DEVICE. HOW TO SOLVE?

    1. Tune buffer settings
        * Blocking (default) or Drop Newest
        * In Memory (default) or Disk buffer
        * Max events: 1000 (default) or 10000
    2. Rule of thumb: let logs leave the node as quickly as possible
    3. If you are brave enough: sysctl -w fs.file-max=1000 (unsafe)

    vector.dev/docs/about/under-the-hood/architecture/buffering-model/
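The trade-off between the two buffer policies can be seen in a toy model. The Python sketch below is not Vector code: it simulates a bounded buffer whose sink accepts nothing (for example, storage that keeps answering 429) and reports what happens to incoming events. The policy names mirror Vector's `when_full` values `block` and `drop_newest`; the function name and the numbers are illustrative.

```python
from collections import deque
from typing import Tuple


def feed_stalled_sink(
    total_events: int, max_events: int, when_full: str
) -> Tuple[int, int, int]:
    """Push events into a bounded buffer while the sink is stalled.

    Returns (buffered, dropped, unread):
      buffered - events sitting in the buffer
      dropped  - events discarded by the drop_newest policy
      unread   - events the source never read because it was blocked
    """
    if when_full not in ("block", "drop_newest"):
        raise ValueError(f"unknown policy: {when_full}")
    buffer = deque()
    dropped = 0
    unread = 0
    for event_id in range(total_events):
        if len(buffer) < max_events:
            buffer.append(event_id)
        elif when_full == "drop_newest":
            dropped += 1
        else:
            # block: backpressure stops the source, so the remaining
            # events stay in the log file, which must be kept open.
            unread = total_events - event_id
            break
    return len(buffer), dropped, unread


print(feed_stalled_sink(25000, 10000, "block"))
print(feed_stalled_sink(25000, 10000, "drop_newest"))
```

With `block`, nothing is lost but the source stops reading, so rotated files stay open and keep occupying disk space. With `drop_newest`, the files are read to the end and can be released, at the cost of losing events.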
  31. CASE #2: PROMETHEUS EXPLODED. HOW TO SOLVE?

    expire_metrics_secs=60

    [Chart: vector_component_errors_total over time. Labels: 3 errors; 4 more errors (7); expiration triggered; empty!; 3 errors]

    This behavior makes the result of the rate PromQL function equal to zero.
  32. CASE #3: KUBERNETES CONTROL PLANE OUTAGE. HOW TO SOLVE?

    1. Cache read (resourceVersion=0)
    2. Limit concurrent requests (Priority and Fairness API)

        apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
        kind: PriorityLevelConfiguration
        metadata:
          name: limit-list-custom
        spec:
          type: Limited
          limited:
            assuredConcurrencyShares: 5
            limitResponse:
              queuing:
                handSize: 4
                queueLengthLimit: 50
                queues: 16
              type: Queue
        ---
        apiVersion: flowcontrol.apiserver.k8s.io/v1beta1
        kind: FlowSchema
        metadata:
          name: limit-list-custom
        spec:
          priorityLevelConfiguration:
            name: limit-list-custom
          distinguisherMethod:
            type: ByUser
          rules:
          - resourceRules:
            - apiGroups: [""]
              clusterScope: true
              namespaces: ["*"]
              resources: ["pods"]
              verbs: ["list", "get"]
            subjects:
            - kind: ServiceAccount
              serviceAccount:
                name: ***
                namespace: ***
  35. CASE #3: KUBERNETES CONTROL PLANE OUTAGE. HOW TO SOLVE?

    1. Cache read (resourceVersion=0)
    2. Limit concurrent requests (Priority and Fairness API)
    3. Use kubelet API instead of Kubernetes: Pods metadata can be fetched by requesting the /pods endpoint
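Options 1 and 3 from this slide come down to which URL a node agent requests. The Python sketch below builds both, as an illustration only. It assumes the agent lists the pods of its own node with a `fieldSelector` (an assumption, not stated on the slide) and that the kubelet listens on its default secure port 10250. Setting `resourceVersion=0` lets the API server answer from its cache instead of reading from etcd. Host names and function names are placeholders.

```python
from urllib.parse import urlencode


def cached_pod_list_url(api_server: str, node_name: str) -> str:
    """URL for listing a node's pods from the API server cache."""
    query = urlencode(
        {
            "fieldSelector": f"spec.nodeName={node_name}",
            # "0" means any cached version is acceptable.
            "resourceVersion": "0",
        }
    )
    return f"{api_server}/api/v1/pods?{query}"


def kubelet_pods_url(node_address: str, port: int = 10250) -> str:
    """URL of the kubelet /pods endpoint on a node."""
    return f"https://{node_address}:{port}/pods"


print(cached_pod_list_url("https://kubernetes.default.svc", "standard-worker-1"))
print(kubelet_pods_url("10.0.0.5"))
```

Asking the kubelet moves the load from the shared control plane to each node, so a mass restart of agents no longer turns into a burst of list requests against the API server.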
  37. CONCLUSION

    1. Great to build platforms
    2. Vector is awesome, seriously, deploy it today
    3. Share practical cases and learn together
  38. THANK YOU! Q&A

    MAKSIM NABOKIKH, Platform Lead

    * OPEN SOURCE TOOLS: github.com/werf, github.com/palark
    * OUR BLOGS AND SOCIAL MEDIA: palark.com, twitter.com/palark_com
    * CONTACT US: @nabokihms, [email protected]