Introduction

When running a task orchestrator like Nomad or Kubernetes, there’s usually a bunch of different instances ( containers, micro-VMs, jails, etc. ) running, more or less ephemerally, across a fleet of servers. By default all logs are local to the nodes actually running the stuff we want to run, making it burdensome to debug, correlate events, alert, etc., especially if a node crashes. That’s why it’s a well-established practice to collect all the logs they emit in a central location, where all of those actions can happen.

One of the most popular log management stacks is the so-called ELK stack ( ElasticSearch for log indexing, Logstash for parsing/transforming, and Kibana for visualisation, with either Filebeat or Fluentd / Fluent Bit for log collection), which has a few drawbacks - most notably heavy resource consumption and licence changes, the latter leading to numerous forks, which will probably result in some chaos/incompatibilities in the future.

A recent-ish contender in that space is Grafana Labs' Loki. It’s a lightweight Prometheus-inspired tool, which can be run as a bunch of microservices for better and easier scale-out, or in monolithic all-in-one mode. In contrast to ElasticSearch, it only indexes labels ( which are user defined ); the logs themselves ( chunks ) are stored as-is, separately. That makes it more flexible and much cheaper with regards to storage and compute.

There’s a variety of storage options - Cassandra (index and chunks), GCS/S3 (chunks), BigTable (index), DynamoDB (index) and the monolithic-only local storage ( which uses BoltDB internally for indexing). The last one is great for getting started/testing/non-huge projects, and the lack of redundancy ( since it writes to the local filesystem) can be offset by the storage provider ( e.g. a CSI plugin, GlusterFS, DRBD, or good old NFS). Visualisation of logs from Loki is done with Grafana ( not really surprising since they’re both made by the same company), via the PromQL-inspired LogQL query language.
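
As a quick taste of LogQL ( the job label and filter string here are purely illustrative ), selecting all logs from a job labelled api that contain "error", and then turning them into a graphable per-second rate, looks like this:

{job="api"} |= "error"

rate({job="api"} |= "error" [5m])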

Having been using Loki for a few months in production, I really like it - hence this article on logs and on using it on and with Nomad.

There are two main ways to get logs from Nomad tasks into Loki ( one of which only works for Docker tasks), and I’ll discuss the pros and cons of each one later on.

How logs on Nomad work

By default Nomad writes task logs to the alloc/logs folder shared by all tasks in an allocation, which is stored locally on the client node running it. That makes accessing logs slightly burdensome, and looks like the following:

# get all current allocations for the job
$ nomad job status random-logger-example 
ID            = random-logger-example
Name          = random-logger-example
Submit Date   = 2021-04-27T23:28:29+02:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
random      0       0         1        0       0         0

Latest Deployment
ID          = beffde05
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
random      1        1       0        0          2021-04-27T23:38:29+02:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
529b074e  a394f493  random      0        run      running  8s ago   3s ago

# get the logs for an allocation
$ nomad alloc logs 529b074e

As soon as there’s more than one allocation, you have to fetch the logs one allocation at a time; if you want to see the logs of an old allocation ( assuming it hasn’t been garbage collected already ), you have to run nomad job status with the -all-allocs flag to include allocations from non-current versions.
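
For example, to include old allocations and to follow the logs of the allocation from above:

# all allocations, including ones from previous job versions
$ nomad job status -all-allocs random-logger-example

# follow the logs of a specific allocation ( add -stderr for the stderr stream )
$ nomad alloc logs -f 529b074e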

Specific log options for the task driver can be configured on multiple levels - per task, per client, or in certain cases, most notably Docker, per task driver.

On the task level one can send all logs to a central syslog server (easily adaptable to fluentd, Splunk, etc. ) like so:

task "syslog" {
      driver = "docker"
      config {
        logging {
          type = "syslog"
          config = {
            syslog-address         = "tcp://my-syslog-server:10514"
            tag                    = "${NOMAD_TASK_NAME}"
          }
        }
      }
}

This is perfectly fine, except that you have to do it on each and every task, which can quickly become burdensome, especially if your logging configuration changes one day. You also need to be on Docker Engine 20.10+, or use a logging driver that supports writing the logs locally as well as to the remote endpoint, otherwise docker logs and nomad alloc logs are unusable because the logs get sent directly to the remote endpoint. To facilitate that, one can use the new Docker driver-wide logging configuration, but then you’re limited only to Docker logging drivers, and, the bigger issue in my opinion, you’re lacking context (unless the stuff you run detects and specifically logs that context).

The, in my humble opinion, better option is to have something that runs automatically on all client nodes, via a system job, and collects all logs - like what everyone does with their Kubernetes clusters via Promtail, Fluentd/Fluent Bit, the Datadog agent, Splunk Connect, etc. The biggest challenges with that are:

  1. context - it’s good to know everything about where the logs are from - which task, namespace, client node, etc.
  2. discovery - how to find all new allocations to ship the logs of?

Recently, there has been some great news concerning those two challenges with regard to Docker-based tasks - this PR on Nomad that adds extra Docker labels with the job, task, task group, node and namespace names, and this PR that enables default logging options for the Docker driver on the Nomad side.

I’ll describe how to collect and ship logs from Nomad, and then briefly go over how to run Loki itself on Nomad.

Log collection

Nomad makes each allocation’s logs available through the API, so ideally our logging agent should be able to connect to it directly. The only mainstream one that does that at the time of writing is:

Filebeat

Filebeat, Elastic’s log collection agent, has an auto-discovery module for Nomad, experimentally via this PR which hasn’t been released in a stable version yet. If one wants to use ElasticSearch, or any of the compatible alternatives or companions ( Logstash, Kafka), that’s great. In theory one can configure Filebeat to output to a file, and then parse that file with another logging agent ( like Vector or Promtail) and ship the logs somewhere else, but IMHO that seems like too much hassle for little benefit.
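
I haven’t run it myself, but based on how Filebeat’s other autodiscover providers are configured, the Nomad one from that PR should look roughly along these lines - the address and hints keys are assumptions on my part, so check the PR before relying on them:

filebeat.autodiscover:
  providers:
    - type: nomad                     # the experimental Nomad autodiscover provider
      address: http://127.0.0.1:4646  # assumption: the local Nomad API on the client node
      hints.enabled: true             # assumption: hints-based config, like the Kubernetes/Docker providers

output.elasticsearch:
  hosts: ["http://elasticsearch.example:9200"]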

Promtail

Grafana provide a log collecting companion to Loki in the form of Promtail, which is lightweight and easy to configure, but much more limited compared to Filebeat, fluent* or Vector in terms of compatible sources. The biggest downside is that for Docker logs it can’t use Docker labels for extra context, and having only the container name and ID is often insufficient.

Vector

Vector is a very fast and lightweight observability agent by Timber, a logging SaaS company, that can collect logs and metrics, transform them, and ship them to a variety of backends. Timber were recently acquired by Datadog, who also have an agent and logging backend, so they obviously saw some value in them. And, very importantly for this story, Vector’s Docker logs source can use Docker labels to enrich the logs' context, and there’s a Loki sink ( backend ). So, let’s see about deploying Vector as a system job to collect logs from Docker.

First, we need to allow Vector to access the Docker daemon and configure Nomad to add extra metadata; the easiest and cleanest way to do that is to pass the Docker socket as a host volume, which needs to be declared at the host (Nomad client) level, and then mounted in the Vector job. We can do that read-only, and we can use ACLs to limit who can mount it, to avoid allowing every job to mount it and do whatever.


plugin "docker" {
  config {  
    # extra Docker labels to be set by Nomad on each Docker container with the appropriate value
    extra_labels = ["job_name", "task_group_name", "task_name", "namespace", "node_name"]
  }
}

client {
  host_volume "docker-sock-ro" {
    path = "/var/run/docker.sock"
    read_only = true
  }
}

host_volume "docker-sock-ro" {
  policy = "read" # read = read-only, write = read/write, deny to deny
}
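
Assuming ACLs are enabled and the policy above is saved as docker-sock-ro.hcl, registering it is a one-liner ( the policy name is arbitrary ):

$ nomad acl policy apply -description "Read-only access to the Docker socket host volume" docker-sock-ro docker-sock-ro.hcl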

Second, actually mount the newly available host volume inside the Vector task:



group "vector" {
    ...
    volume "docker-sock" {
      type = "host"
      source = "docker-sock"
      read_only = true
    }
    task "vector" {
      ...
      volume_mount {
        volume = "docker-sock"
        destination = "/var/run/docker.sock"
        read_only = true
      }
    }
}

Third, tell Vector to collect Docker’s logs and send them to Loki, enriching them with the metadata extracted from the Docker labels:


          data_dir = "alloc/data/vector/"
          [sources.logs]
            type = "docker_logs"
          [sinks.out]
            type = "console"
            inputs = [ "logs" ]
            encoding.codec = "json"
          [sinks.loki]
            type = "loki" 
            inputs = ["logs"] 
            endpoint = "http://loki.example"
            encoding.codec = "json" 
            healthcheck.enabled = true 
            # since . is used by Vector to denote a parent-child relationship, and Nomad's Docker labels contain ".",
            # we need to escape them twice, once for TOML, once for Vector
            labels.job = "{{ label.com\\.hashicorp\\.nomad\\.job_name }}"
            labels.task = "{{ label.com\\.hashicorp\\.nomad\\.task_name }}"
            labels.group = "{{ label.com\\.hashicorp\\.nomad\\.task_group_name }}"
            labels.namespace = "{{ label.com\\.hashicorp\\.nomad\\.namespace }}"
            labels.node = "{{ label.com\\.hashicorp\\.nomad\\.node_name }}"
            # remove fields that have been converted to labels to avoid having them twice
            remove_label_fields = true
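
If you want to sanity-check a configuration like this locally before shipping it in a job, Vector can validate it ( note that by default it may also run the sinks' healthchecks, so the Loki endpoint should be reachable ):

$ vector validate vector.toml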

And that’s it.

Full example Vector job file with all those components and dynamic sourcing of the Loki address from Consul:


job "vector" {
  datacenters = ["dc1"]
  # system job, runs on all nodes
  type = "system" 
  update {
    min_healthy_time = "10s"
    healthy_deadline = "5m"
    progress_deadline = "10m"
    auto_revert = true
  }
  group "vector" {
    count = 1
    restart {
      attempts = 3
      interval = "10m"
      delay = "30s"
      mode = "fail"
    }
    network {
      port "api" {
        to = 8686
      }
    }
    # docker socket volume
    volume "docker-sock" {
      type = "host"
      source = "docker-sock"
      read_only = true
    }
    ephemeral_disk {
      size    = 500
      sticky  = true
    }
    task "vector" {
      driver = "docker"
      config {
        image = "timberio/vector:0.14.X-alpine"
        ports = ["api"]
      }
      # docker socket volume mount
      volume_mount {
        volume = "docker-sock"
        destination = "/var/run/docker.sock"
        read_only = true
      }
      # Vector won't start unless the configured sinks (backends) are healthy
      env {
        VECTOR_CONFIG = "local/vector.toml"
        VECTOR_REQUIRE_HEALTHY = "true"
      }
      # resource limits are a good idea because you don't want your log collection to consume all resources available 
      resources {
        cpu    = 500 # 500 MHz
        memory = 256 # 256MB
      }
      # template with Vector's configuration
      template {
        destination = "local/vector.toml"
        change_mode   = "signal"
        change_signal = "SIGHUP"
        # overriding the delimiters to [[ ]] to avoid conflicts with Vector's native templating, which also uses {{ }}
        left_delimiter = "[["
        right_delimiter = "]]"
        data=<<EOH
          data_dir = "alloc/data/vector/"
          [api]
            enabled = true
            address = "0.0.0.0:8686"
            playground = true
          [sources.logs]
            type = "docker_logs"
          [sinks.out]
            type = "console"
            inputs = [ "logs" ]
            encoding.codec = "json"
          [sinks.loki]
            type = "loki" 
            inputs = ["logs"] 
            endpoint = "http://[[ range service "loki" ]][[ .Address ]]:[[ .Port ]][[ end ]]" 
            encoding.codec = "json" 
            healthcheck.enabled = true 
            # since . is used by Vector to denote a parent-child relationship, and Nomad's Docker labels contain ".",
            # we need to escape them twice, once for TOML, once for Vector
            labels.job = "{{ label.com\\.hashicorp\\.nomad\\.job_name }}"
            labels.task = "{{ label.com\\.hashicorp\\.nomad\\.task_name }}"
            labels.group = "{{ label.com\\.hashicorp\\.nomad\\.task_group_name }}"
            labels.namespace = "{{ label.com\\.hashicorp\\.nomad\\.namespace }}"
            labels.node = "{{ label.com\\.hashicorp\\.nomad\\.node_name }}"
            # remove fields that have been converted to labels to avoid having the field twice
            remove_label_fields = true
        EOH
      }
      service {
        check {
          port     = "api"
          type     = "http"
          path     = "/health"
          interval = "30s"
          timeout  = "5s"
        }
      }
      kill_timeout = "30s"
    }
  }
}
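
Deploying it is the usual affair ( assuming the job file is saved as vector.nomad ):

$ nomad job run vector.nomad

Being a system job, Nomad will place an allocation of it on every eligible client node.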

Overall, that’s a pretty good solution to the problem, with a lightweight and fast logging agent, some context, and everything configured from within Nomad; however, it only works for Docker tasks and requires some (light) configuration on each client node.

nomad_follower (+Promtail)

There’s a project called nomad_follower, which uses Nomad’s API to tail allocation logs. It requires no configuration outside of the Nomad system job running it, and compiles all the logs, with metadata and context, into a file which can then be scraped by a logging agent and sent to whatever logging backend you have. It’s a bit more obscure compared to Vector or Filebeat, and debugging it might require diving into the code ( speaking from experience, it didn’t accept my logfmt-formatted logs because it was looking for a timestamp starting in the first 4 characters of each line (used to deal with multi-line logs), but mine started after 5, so I had to fork and patch it). Nonetheless, coupled with a logging agent it’s a fairly decent solution that works for all task types.

Configuration is done via env variables ( like the file to write all logs to, the Nomad/Consul service tag to filter on, etc.), including Nomad API access/credentials. The logs are stored as JSON, with JSON-formatted logs inside the data field and non-JSON (like logfmt) logs in the message field, and there’s a bunch of metadata alongside them:

  • alloc_id
  • job_name
  • job_meta
  • node_name
  • service_name
  • service_tags
  • task_meta
  • task_name

  group "log-shipping" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }
    ephemeral_disk {
      size = 300
    }
    task "nomad-follower" {
      driver = "docker"
      config {
        image = "sofixa/nomad_follower:latest"
      }
      env {
        VERBOSE    = 4
        LOG_TAG    = "logging"
        LOG_FILE   = "${NOMAD_ALLOC_DIR}/nomad-logs.log"
        # this is the IP of the docker0 interface
        # and Nomad has been explicitly told to listen on it so that Docker tasks can communicate with the API
        NOMAD_ADDR = "http://172.17.0.1:4646" 
        # Nomad ACL token, could be sourced via template from Vault
        NOMAD_TOKEN = "xxxx" 
      }
      # resource limits are a good idea because you don't want your log collection to consume all resources available 
      resources {
        cpu    = 100
        memory = 512
      }
    }
  }

That will result in all allocations running on the same node as this task’s allocation, and whose services carry the logging tag, getting their logs collected and stored in ${NOMAD_ALLOC_DIR}/nomad-logs.log. Now all we need is something to read, parse and send those logs; considering I’d like to use Loki for storage, Promtail seems like the best option, but of course any of the alternatives could do the job just as well.
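
To give an idea of what needs to be parsed, a hypothetical line in that file, based on the fields listed above ( the exact structure nomad_follower produces may differ ), might look something like this:

{"alloc_id":"529b074e","job_name":"random-logger-example","job_meta":{},"node_name":"client-1","service_name":"random-logger","service_tags":["logging"],"task_meta":{},"task_name":"random","message":"time=\"2021-04-27T23:30:00+02:00\" level=info msg=\"a random log line\""}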

Promtail’s configuration is split into scrape configs (collection), pipeline stages that parse/transform/extract labels, and generic settings ( Loki address, ports for the healthcheck, etc.). It has a few ways to scrape logs; the one we need in this case is the static one, which tails a file. We also need to parse the various fields ( via a json pipeline stage) and mark some of them as labels ( so they’re indexed and we can search by them).

To scrape and parse the logs pre-collected by nomad_follower, we need a configuration similar to this one:


server:
  # port for the healthcheck
  http_listen_port: 3000
  grpc_listen_port: 0
positions:
  filename: ${NOMAD_ALLOC_DIR}/positions.yaml
client:
  url: http://loki.example/loki/api/v1/push
scrape_configs:
- job_name: local
  static_configs:
  - targets:
      - localhost
    labels:
      job: nomad
      __path__: "${NOMAD_ALLOC_DIR}/nomad-logs.log"
  pipeline_stages:
    # extract the fields from the JSON logs
    - json:
        expressions:
          alloc_id: alloc_id
          job_name: job_name
          job_meta: job_meta
          node_name: node_name
          service_name: service_name
          service_tags: service_tags
          task_meta: task_meta
          task_name: task_name
          message: message
          data: data
    # the following fields are used as labels and are indexed:
    - labels:
        job_name:
        task_name:
        service_name:
        node_name:
        service_tags:
    # an example regex to extract a field called time from within message ( which holds non-JSON formatted logs;
    # the assumption is that they're in the logfmt format,
    # with a time= field present containing the actual timestamp of the log)
    - regex:
        expression: ".*time=\\\"(?P<timestamp>\\S*)\\\"[ ]"
        source: "message"
    - timestamp:
        source: timestamp
        format: RFC3339

A full example system job file with nomad_follower and Promtail, and a template to dynamically source Loki’s address from Consul:


job "log-shipping" {
  datacenters = ["dc1"]
  type = "system"
  namespace = "logs"
  update {
    max_parallel      = 1
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "10m"
    auto_revert       = false
  }
  group "log-shipping" {
    count = 1
    network {
      port "promtail-healthcheck" {
        to = 3000
      }
    }
    restart {
      attempts = 2
      interval = "30m"
      delay    = "15s"
      mode     = "fail"
    }
    ephemeral_disk {
      size = 300
    }
    task "nomad-forwarder" {
      driver = "docker"
      env {
        VERBOSE    = 4
        LOG_TAG    = "logging"
        LOG_FILE   = "${NOMAD_ALLOC_DIR}/nomad-logs.log"
        # this is the IP of the docker0 interface
        # and Nomad has been explicitly told to listen on it so that Docker tasks can communicate with the API
        NOMAD_ADDR = "http://172.17.0.1:4646"
        # Nomad ACL token, could be sourced via template from Vault
        NOMAD_TOKEN = "xxxx"
      }
      config {
        image = "sofixa/nomad_follower:latest"
      }
      # resource limits are a good idea because you don't want your log collection to consume all resources available
      resources {
        cpu    = 100
        memory = 512
      }
    }
    task "promtail" {
      driver = "docker"
      config {
        image = "grafana/promtail:2.2.1"
        args = [
          "-config.file",
          "local/config.yaml",
          "-print-config-stderr",
        ]
        ports = ["promtail-healthcheck"]
      }
      template {
        data = <<EOH
server:
  http_listen_port: 3000
  grpc_listen_port: 0

positions:
  filename: ${NOMAD_ALLOC_DIR}/positions.yaml

client:
  url: http://{{ range service "loki" }}{{ .Address }}:{{ .Port }}{{ end }}/loki/api/v1/push
scrape_configs:
- job_name: local
  static_configs:
  - targets:
      - localhost
    labels:
      job: nomad
      __path__: "${NOMAD_ALLOC_DIR}/nomad-logs.log"
  pipeline_stages:
    # extract the fields from the JSON logs
    - json:
        expressions:
          alloc_id: alloc_id
          job_name: job_name
          job_meta: job_meta
          node_name: node_name
          service_name: service_name
          service_tags: service_tags
          task_meta: task_meta
          task_name: task_name
          message: message
          data: data
    # the following fields are used as labels and are indexed:
    - labels:
        job_name:
        task_name:
        service_name:
        node_name:
        service_tags:
    # use a regex to extract a field called time from within message ( which holds non-JSON formatted logs;
    # the assumption is that they're in the logfmt format,
    # with a time= field present containing the actual timestamp of the log)
    - regex:
        expression: ".*time=\\\"(?P<timestamp>\\S*)\\\"[ ]"
        source: "message"
    - timestamp:
        source: timestamp
        format: RFC3339
EOH
        destination = "local/config.yaml"
      }
      # resource limits are a good idea because you don't want your log collection to consume all resources available
      resources {
        cpu    = 500
        memory = 512
      }
      service {
        name = "promtail"
        port = "promtail-healthcheck"
        check {
          type     = "http"
          path     = "/ready"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
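
One thing to note - the job above lives in the logs namespace, which has to exist beforehand ( namespaces are part of Nomad OSS since 1.0 ):

$ nomad namespace apply logs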

Running Loki

Note: I personally consider it somewhat of an anti-pattern to run your log aggregation system on the same cluster as the one it’s aggregating logs from. If the cluster explodes, you can’t really access the logs to debug what happened and why. For smaller use cases, or if critical failure isn’t probable, it’s perfectly fine and will probably never cause any issues; past a certain point though, I’d recommend splitting it ( and other similarly scoped tools like monitoring) into a separate management cluster, or using the hosted version, which includes a decent free tier (50GB and 14 days retention when it comes to logs).

Like I mentioned last time, even though one can include YAML configuration from Consul in Nomad jobs via the template stanza, I prefer to have as much as possible in the Nomad file directly for better versioning and rollbackability ( YAML stored in Consul evolves independently of the Nomad job’s lifecycle, so rolling back the latter won’t do anything about the former ).
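
Since the storage will be a host volume, it has to be declared on the client node that will run Loki, just like the Docker socket one earlier - a minimal sketch, with the path being whatever directory you’ve provisioned for it:

client {
  host_volume "loki" {
    path      = "/opt/loki"
    read_only = false
  }
}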

To run a basic, monolithic Loki service with local storage and unlimited retention, a host volume for said storage, Traefik as a reverse proxy (with TLS and a basic auth middleware attached, since Loki doesn’t do auth itself) and Loki’s YAML configuration file embedded, you need something along these lines:

job "loki" {
  datacenters = ["dc1"]
  type        = "service"
  update {
    max_parallel      = 1
    health_check      = "checks"
    min_healthy_time  = "10s"
    healthy_deadline  = "3m"
    progress_deadline = "5m"
  }
  group "loki" {
    count = 1
    restart {
      attempts = 3
      interval = "5m"
      delay    = "25s"
      mode     = "delay"
    }
    network {
      port "loki" {
        to = 3100
      }
    }
    volume "loki" {
      type      = "host"
      read_only = false
      source    = "loki"
    }
    task "loki" {
      driver = "docker"
      config {
        image = "grafana/loki:2.2.1"
        args = [
          "-config.file",
          "local/loki/local-config.yaml",
        ]
        ports = ["loki"]
      }
      volume_mount {
        volume      = "loki"
        destination = "/loki"
        read_only   = false
      }
      template {
        data = <<EOH
auth_enabled: false
server:
  http_listen_port: 3100
ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  # Any chunk not receiving new logs in this time will be flushed
  chunk_idle_period: 1h       
  # All chunks will be flushed when they hit this age, default is 1h
  max_chunk_age: 1h           
  # Loki will attempt to build chunks up to 1.5MB, flushing if chunk_idle_period or max_chunk_age is reached first
  chunk_target_size: 1048576  
  # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  chunk_retain_period: 30s    
  max_transfer_retries: 0     # Chunk transfers disabled
schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h         # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks
compactor:
  working_directory: /tmp/loki/boltdb-shipper-compactor
  shared_store: filesystem
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOH
        destination = "local/loki/local-config.yaml"
      }
      resources {
        cpu    = 512
        memory = 256
      }
      service {
        name = "loki"
        port = "loki"
        check {
          name     = "Loki healthcheck"
          port     = "loki"
          type     = "http"
          path     = "/ready"
          interval = "20s"
          timeout  = "5s"
          check_restart {
            limit           = 3
            grace           = "60s"
            ignore_warnings = false
          }
        }
        tags = [
          "traefik.enable=true",
          "traefik.http.routers.loki.tls=true",
          # the middleware has to be declared somewhere else, we only attach it here
          "traefik.http.routers.loki.middlewares=loki-basicauth@file", 
        ]
      }
    }
  }
}
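
Once Loki is up and Vector or Promtail are shipping logs to it, the last step is adding it as a data source in Grafana and exploring. Thanks to the labels set earlier, queries can be scoped precisely - for example, with the Vector setup, all logs of the random-logger-example job containing "error" ( an illustrative filter string ) would be:

{job="random-logger-example"} |= "error"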

Conclusion

So, which logging agent to use with Nomad? As always, it depends. If you have ElasticSearch or any of the compatible alternatives, Filebeat is probably the best option due to the native support. If you only run Docker-based tasks, Vector collecting the Docker daemon’s logs (with Docker labels for context) is a pretty decent option. If neither, or you want more context ( like custom metadata) than what is available with Vector, nomad_follower is for you.
