Logging on Nomad and log aggregation with Loki
Introduction
When running a task orchestrator like Nomad or Kubernetes, there’s usually a bunch of different instances ( containers, micro-VMs, jails, etc. ) running, more or less ephemerally, across a fleet of servers. By default all logs are local to the nodes actually running the workloads, which makes it burdensome to debug, correlate events, alert, etc., especially if a node crashes. That’s why it’s a well-established practice to collect all the logs they emit in a central location, where all of those actions can happen.
One of the most popular log management stacks is the so-called ELK stack ( ElasticSearch for log indexing, Logstash for parsing/transforming, and Kibana for visualisation, with either Filebeat or Fluentd / Fluent Bit for log collection ), which has a few drawbacks - most notably heavy resource consumption and licence changes, the latter leading to numerous forks, which will probably result in some chaos/incompatibilities in the future.
A recent-ish contender in that space is Grafana Labs’ Loki. It’s a lightweight Prometheus-inspired tool, which can be run as a bunch of microservices for better and easier scale-out, or in monolithic all-in-one mode. In contrast to ElasticSearch, it only indexes labels ( which are user defined ); the logs themselves ( chunks ) are stored as-is, separately. That makes it more flexible and much cheaper with regards to storage and compute.
There’s a variety of storage options - Cassandra (index and chunk), GCS/S3 (chunk), BigTable (chunk), DynamoDB (chunk) and the monolithic-only local storage ( which uses BoltDB internally for indexing ). The last one is great for getting started/testing/non-huge projects, and the lack of redundancy ( since it writes to the local filesystem ) can be offset by the storage provider ( e.g. a CSI plugin, GlusterFS, DRBD, or good old NFS ). Visualisation from Loki is done with Grafana ( not really surprising since they’re both made by the same company ), via LogQL, a PromQL-inspired query language.
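As a quick taste, a LogQL query looks a lot like a PromQL one: a label selector, optionally followed by filters on the log lines themselves. A hypothetical example, using labels similar to the ones we’ll be attaching later in this article:
{job="random-logger-example", namespace="default"} |= "error"
This would return all log lines from that job in that namespace containing the string "error".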
Having been using Loki in production for a few months, I really like it, hence this article on logs and on using it on and with Nomad.
There are two main ways to get logs from Nomad tasks into Loki ( one of which only works for Docker tasks), and I’ll discuss the pros and cons of each one later on.
How logs on Nomad work
By default Nomad writes task logs to the alloc/logs folder shared by all tasks in an allocation, which is stored locally on the client node running it. That makes accessing logs slightly burdensome; it looks like the following:
# get all current allocations for the job
$ nomad job status random-logger-example
ID            = random-logger-example
Name          = random-logger-example
Submit Date   = 2021-04-27T23:28:29+02:00
Type          = service
Priority      = 50
Datacenters   = dc1
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group  Queued  Starting  Running  Failed  Complete  Lost
random      0       0         1        0       0         0

Latest Deployment
ID          = beffde05
Status      = running
Description = Deployment is running

Deployed
Task Group  Desired  Placed  Healthy  Unhealthy  Progress Deadline
random      1        1       0        0          2021-04-27T23:38:29+02:00

Allocations
ID        Node ID   Task Group  Version  Desired  Status   Created  Modified
529b074e  a394f493  random      0        run      running  8s ago   3s ago
# get the logs for an allocation
$ nomad alloc logs 529b074e
As soon as there’s more than one allocation, you have to get the logs for each one, one by one; if you want to see the logs of an old allocation ( assuming it hasn’t been garbage collected already ), you have to run nomad job status with the -all-allocs flag to include non-current versions.
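For example, reusing the allocation from above:
# include allocations from all job versions, not just the current one
$ nomad job status -all-allocs random-logger-example
# follow the logs of a specific allocation ( stdout by default, add -stderr for stderr )
$ nomad alloc logs -f 529b074e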
Specific log options can be configured on multiple levels - per task, per client node, or, in certain cases, most notably Docker, for the whole task driver.
On the task level one can send all logs to a central syslog server (easily adaptable to fluentd, Splunk, etc. ) like so:
task "syslog" {
driver = "docker"
config {
logging {
type = "syslog"
config = {
syslog-address = "tcp://my-syslog-server:10514"
tag = "${NOMAD_TASK_NAME}"
}
}
}
}
This is perfectly fine, except that you have to do it on each and every task, which can quickly become burdensome, especially if your logging configuration changes one day. You also need to be on Docker Engine 20.10+, or use a logging driver that supports writing the logs locally as well as to the remote endpoint; otherwise docker logs and nomad alloc logs are unusable, because the logs get sent directly to the remote endpoint. To avoid repeating that configuration on every task, there is the new Docker driver-wide logging configuration, but then you’re limited to Docker logging drivers only, and, the bigger issue in my opinion, you’re lacking context ( unless the stuff you run detects and specifically logs that context ).
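For reference, the driver-wide configuration goes into the Docker plugin block of the Nomad client configuration and looks roughly like this ( a sketch based on the PR mentioned below - double-check the exact syntax against the docs for your Nomad version ):
plugin "docker" {
  config {
    # default logging options applied to every Docker task on this client ( sketch, syntax may vary )
    logging {
      type = "syslog"
      config = {
        syslog-address = "tcp://my-syslog-server:10514"
      }
    }
  }
}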
The, in my humble opinion, better option is to have something that runs automatically on all client nodes, via a system job, and collects all logs - like what everyone does with their Kubernetes clusters via Promtail, Fluentd/Fluent Bit, the Datadog agent, Splunk Connect, etc. The biggest challenges with that are:
- context - it’s good to know everything about where the logs are from - which task, namespace, client node, etc.
- discovery - how to find all new allocations to ship the logs of?
Recently, there has been some great news concerning those two challenges with regard to Docker-based tasks: this PR on Nomad that adds extra Docker labels with the job, task, task group, node and namespace names, and this PR that enables default logging options for the Docker driver on the Nomad side.
I’ll describe how to collect and ship logs from Nomad, and then briefly go over how to run Loki itself on Nomad.
Log collection
Nomad makes each allocation’s logs available through the API, so ideally our logging agent should be able to connect to it directly. The only mainstream one that does that at the time of writing is:
Filebeat
Filebeat, Elastic’s log collection agent, has an auto-discovery module for Nomad, added experimentally via this PR, which hasn’t made it into a stable release yet. If one wants to use ElasticSearch, or any of the compatible alternatives or companions ( Logstash, Kafka ), that’s great. In theory one can configure Filebeat to output to a file, and then parse that file with another logging agent ( like Vector or Promtail ) and ship the logs somewhere else, but IMHO that seems like too much hassle for little benefit.
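For the curious, based on Filebeat’s existing autodiscover providers, the Nomad one from that PR should end up looking roughly like the following - treat the provider options as assumptions until it lands in a stable release:
filebeat.autodiscover:
  providers:
    # experimental Nomad provider - the option names here are assumptions
    - type: nomad
      hints.enabled: true

output.elasticsearch:
  hosts: ["https://elasticsearch.example:9200"]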
Promtail
Grafana Labs provide a log collection companion to Loki in the form of Promtail, which is lightweight and easy to configure, but much more limited than Filebeat, fluent* or Vector in terms of compatible sources. The biggest downside is that for Docker logs it can’t use Docker labels for extra context, and having only the container name and ID is often insufficient.
Vector
Vector is a very fast and lightweight observability agent by Timber, a logging SaaS; it can collect logs and metrics, transform them, and ship them to a variety of backends. Timber were recently acquired by Datadog, who also have an agent and a logging backend, so they obviously saw some value in them. And, very importantly for this story, Vector’s Docker logs source can use Docker labels to enrich the logs’ context, and there’s a Loki sink ( backend ). So, let’s see about deploying Vector as a system job to collect logs from Docker.
First, we need to allow Vector to access the Docker daemon and configure Nomad to add extra metadata. The easiest and cleanest way to do the former is to pass the Docker socket as a host volume, which needs to be declared at the host (Nomad client) level and then mounted in the Vector job. We can do that read-only, and we can use ACLs to limit who can mount it, to avoid allowing every job to mount it and do whatever it wants.
plugin "docker" {
config {
# extra Docker labels to be set by Nomad on each Docker container with the appropriate value
extra_labels = ["job_name", "task_group_name", "task_name", "namespace", "node_name"]
}
}
client {
host_volume "docker-sock-ro" {
path = "/var/run/docker.sock"
read_only = true
}
}
host_volume "docker-sock-ro" {
policy = "read" # read = read-only, write = read/write, deny to deny
}
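The host_volume rule above is part of a regular Nomad ACL policy, which gets registered with something along these lines ( the policy and file names are just examples ):
$ nomad acl policy apply -description "Read-only access to the Docker socket host volume" docker-sock-ro-read docker-sock-ro-policy.hcl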
Second, actually mount the newly available host volume inside the Vector task:
group "vector" {
...
volume "docker-sock" {
type = "host"
source = "docker-sock"
read_only = true
}
task "vector" {
...
volume_mount {
volume = "docker-sock"
destination = "/var/run/docker.sock"
read_only = true
}
}
}
Third, tell Vector to collect Docker’s logs and send them to Loki, enriching them with the metadata extracted from the Docker labels:
data_dir = "alloc/data/vector/"
[sources.logs]
type = "docker_logs"
# console sink - prints the collected logs to stdout, handy for debugging via nomad alloc logs
[sinks.out]
type = "console"
inputs = [ "logs" ]
encoding.codec = "json"
[sinks.loki]
type = "loki"
inputs = ["logs"]
endpoint = "http://loki.example"
encoding.codec = "json"
healthcheck.enabled = true
# since . is used by Vector to denote a parent-child relationship, and Nomad's Docker labels contain ".",
# we need to escape them twice, once for TOML, once for Vector
labels.job = "{{ label.com\\.hashicorp\\.nomad\\.job_name }}"
labels.task = "{{ label.com\\.hashicorp\\.nomad\\.task_name }}"
labels.group = "{{ label.com\\.hashicorp\\.nomad\\.task_group_name }}"
labels.namespace = "{{ label.com\\.hashicorp\\.nomad\\.namespace }}"
labels.node = "{{ label.com\\.hashicorp\\.nomad\\.node_name }}"
# remove fields that have been converted to labels to avoid having them twice
remove_label_fields = true
And that’s it.
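As a sanity check, the configuration can also be validated locally with Vector’s own CLI ( depending on the version, this may also run the sinks’ healthchecks, in which case Loki needs to be reachable ):
$ vector validate local/vector.toml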
Full example Vector job file with all those components and dynamic sourcing of the Loki address from Consul:
job "vector" {
datacenters = ["dc1"]
# system job, runs on all nodes
type = "system"
update {
min_healthy_time = "10s"
healthy_deadline = "5m"
progress_deadline = "10m"
auto_revert = true
}
group "vector" {
count = 1
restart {
attempts = 3
interval = "10m"
delay = "30s"
mode = "fail"
}
network {
port "api" {
to = 8686
}
}
# docker socket volume
volume "docker-sock" {
type = "host"
source = "docker-sock"
read_only = true
}
ephemeral_disk {
size = 500
sticky = true
}
task "vector" {
driver = "docker"
config {
image = "timberio/vector:0.14.X-alpine"
ports = ["api"]
}
# docker socket volume mount
volume_mount {
volume = "docker-sock"
destination = "/var/run/docker.sock"
read_only = true
}
# Vector won't start unless the configured sinks (backends) are healthy
env {
VECTOR_CONFIG = "local/vector.toml"
VECTOR_REQUIRE_HEALTHY = "true"
}
# resource limits are a good idea because you don't want your log collection to consume all resources available
resources {
cpu = 500 # 500 MHz
memory = 256 # 256MB
}
# template with Vector's configuration
template {
destination = "local/vector.toml"
change_mode = "signal"
change_signal = "SIGHUP"
# overriding the delimiters to [[ ]] to avoid conflicts with Vector's native templating, which also uses {{ }}
left_delimiter = "[["
right_delimiter = "]]"
data=<<EOH
data_dir = "alloc/data/vector/"
[api]
enabled = true
address = "0.0.0.0:8686"
playground = true
[sources.logs]
type = "docker_logs"
# console sink - prints the collected logs to stdout, handy for debugging via nomad alloc logs
[sinks.out]
type = "console"
inputs = [ "logs" ]
encoding.codec = "json"
[sinks.loki]
type = "loki"
inputs = ["logs"]
endpoint = "http://[[ range service "loki" ]][[ .Address ]]:[[ .Port ]][[ end ]]"
encoding.codec = "json"
healthcheck.enabled = true
# since . is used by Vector to denote a parent-child relationship, and Nomad's Docker labels contain ".",
# we need to escape them twice, once for TOML, once for Vector
labels.job = "{{ label.com\\.hashicorp\\.nomad\\.job_name }}"
labels.task = "{{ label.com\\.hashicorp\\.nomad\\.task_name }}"
labels.group = "{{ label.com\\.hashicorp\\.nomad\\.task_group_name }}"
labels.namespace = "{{ label.com\\.hashicorp\\.nomad\\.namespace }}"
labels.node = "{{ label.com\\.hashicorp\\.nomad\\.node_name }}"
# remove fields that have been converted to labels to avoid having the field twice
remove_label_fields = true
EOH
}
service {
check {
port = "api"
type = "http"
path = "/health"
interval = "30s"
timeout = "5s"
}
}
kill_timeout = "30s"
}
}
}
Overall, that’s a pretty good solution to the problem, with a lightweight and fast logging agent, some context, and everything configured from within Nomad; however, it only works for Docker tasks and requires some (light) configuration on each client node.
nomad_follower (+Promtail)
There’s a project called nomad_follower, which uses Nomad’s API to tail allocation logs. It requires no configuration outside of the Nomad system job running it, and compiles all the logs, with metadata and context, into a file which can then be scraped by a logging agent and sent to whatever logging backend you have. It’s a bit more obscure compared to Vector or Filebeat, and debugging it might require diving into the code ( speaking from experience: it didn’t accept my logfmt-formatted logs because it was looking for a timestamp starting within the first 4 characters of each line ( used to deal with multi-line logs ), but mine started after 5, so I had to fork and patch it ). Nonetheless, coupled with a logging agent it’s a fairly decent solution that works for all task types.
Configuration is done via env variables ( like the file to write all logs to, the Nomad/Consul service tag to filter on, etc. ), including Nomad API access/credentials. The logs are stored as JSON, with JSON-formatted logs inside the data field and non-JSON ( like logfmt ) logs in the message field, and there’s a bunch of metadata alongside them ( see the example line after this list ):
- alloc_id
- job_name
- job_meta
- node_name
- service_name
- service_tags
- task_meta
- task_name
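To illustrate, a hand-written example of what a single nomad_follower log line could look like for a task logging in logfmt ( the values are made up and the exact field layout may differ ):
{"alloc_id":"529b074e-...","job_name":"random-logger-example","job_meta":{},"node_name":"client-1","service_name":"random-logger","service_tags":["logging"],"task_meta":{},"task_name":"random","message":"time=\"2021-04-27T23:30:01+02:00\" level=info msg=\"doing something\""}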
group "log-shipping" {
count = 1
restart {
attempts = 2
interval = "30m"
delay = "15s"
mode = "fail"
}
ephemeral_disk {
size = 300
}
task "nomad-follower" {
driver = "docker"
config {
image = "sofixa/nomad_follower:latest"
}
env {
VERBOSE = 4
LOG_TAG = "logging"
LOG_FILE = "${NOMAD_ALLOC_DIR}/nomad-logs.log"
# this is the IP of the docker0 interface
# and Nomad has been explicitly told to listen on it so that Docker tasks can communicate with the API
NOMAD_ADDR = "http://172.17.0.1:4646"
# Nomad ACL token, could be sourced via template from Vault
NOMAD_TOKEN = "xxxx"
}
# resource limits are a good idea because you don't want your log collection to consume all resources available
resources {
cpu = 100
memory = 512
}
}
}
With that running, every allocation on the same node as this job’s allocation whose service has the logging tag will get its logs collected and stored in ${NOMAD_ALLOC_DIR}/nomad-logs.log. Now all we need is something to read, parse and send those logs; considering I’d like to use Loki for storage, Promtail seems like the best option, but of course any of the alternatives could do the job just as well.
Promtail’s configuration is split into scrape configs ( collection ), pipeline stages that parse/transform/extract labels, and generic settings ( Loki address, ports for the healthcheck, etc. ). It has a few ways to scrape logs; the one we need in this case is the static one, which tails a file. We also need to parse the various fields ( via a json pipeline stage ) and mark some of them as labels ( so they’re indexed and we can search by them ). To scrape and parse the logs pre-collected by nomad_follower, we need a configuration similar to this one:
server:
  # port for the healthcheck
  http_listen_port: 3000
  grpc_listen_port: 0

positions:
  filename: ${NOMAD_ALLOC_DIR}/positions.yaml

client:
  url: http://loki.example/loki/api/v1/push

scrape_configs:
  - job_name: local
    static_configs:
      - targets:
          - localhost
        labels:
          job: nomad
          __path__: "${NOMAD_ALLOC_DIR}/nomad-logs.log"
    pipeline_stages:
      # extract the fields from the JSON logs
      - json:
          expressions:
            alloc_id: alloc_id
            job_name: job_name
            job_meta: job_meta
            node_name: node_name
            service_name: service_name
            service_tags: service_tags
            task_meta: task_meta
            task_name: task_name
            message: message
            data: data
      # the following fields are used as labels and are indexed:
      - labels:
          job_name:
          task_name:
          service_name:
          node_name:
          service_tags:
      # an example regex to extract a field called time from within message ( which is for non-JSON formatted logs,
      # so the assumption is that they're in the logfmt format,
      # and a field time= is present with a timestamp, which is the actual timestamp of the log)
      - regex:
          expression: ".*time=\\\"(?P<timestamp>\\S*)\\\"[ ]"
          source: "message"
      - timestamp:
          source: timestamp
          format: RFC3339
A full example system job file, with nomad_follower and Promtail, and a template to dynamically source Loki’s address from within Consul:
job "log-shipping" {
datacenters = ["dc1"]
type = "system"
namespace = "logs"
update {
max_parallel = 1
min_healthy_time = "10s"
healthy_deadline = "3m"
progress_deadline = "10m"
auto_revert = false
}
group "log-shipping" {
count = 1
network {
port "promtail-healthcheck" {
to = 3000
}
}
restart {
attempts = 2
interval = "30m"
delay = "15s"
mode = "fail"
}
ephemeral_disk {
size = 300
}
task "nomad-forwarder" {
driver = "docker"
env {
VERBOSE = 4
LOG_TAG = "logging"
LOG_FILE = "${NOMAD_ALLOC_DIR}/nomad-logs.log"
# this is the IP of the docker0 interface
# and Nomad has been explicitly told to listen on it so that Docker tasks can communicate with the API
NOMAD_ADDR = "http://172.17.0.1:4646"
# Nomad ACL token, could be sourced via template from Vault
NOMAD_TOKEN = "xxxx"
}
config {
image = "sofixa/nomad_follower:latest"
}
# resource limits are a good idea because you don't want your log collection to consume all resources available
resources {
cpu = 100
memory = 512
}
}
task "promtail" {
driver = "docker"
config {
image = "grafana/promtail:2.2.1"
args = [
"-config.file",
"local/config.yaml",
"-print-config-stderr",
]
ports = ["promtail-healthcheck"]
}
template {
data = <<EOH
server:
  http_listen_port: 3000
  grpc_listen_port: 0

positions:
  filename: ${NOMAD_ALLOC_DIR}/positions.yaml

client:
  # Loki's address is sourced dynamically from the Consul catalog
  url: http://{{ range service "loki" }}{{ .Address }}:{{ .Port }}{{ end }}/loki/api/v1/push

scrape_configs:
  - job_name: local
    static_configs:
      - targets:
          - localhost
        labels:
          job: nomad
          __path__: "${NOMAD_ALLOC_DIR}/nomad-logs.log"
    pipeline_stages:
      # extract the fields from the JSON logs
      - json:
          expressions:
            alloc_id: alloc_id
            job_name: job_name
            job_meta: job_meta
            node_name: node_name
            service_name: service_name
            service_tags: service_tags
            task_meta: task_meta
            task_name: task_name
            message: message
            data: data
      # the following fields are used as labels and are indexed:
      - labels:
          job_name:
          task_name:
          service_name:
          node_name:
          service_tags:
      # use a regex to extract a field called time from within message ( which is for non-JSON formatted logs,
      # so the assumption is that they're in the logfmt format,
      # and a field time= is present with a timestamp, which is the actual timestamp of the log)
      - regex:
          expression: ".*time=\\\"(?P<timestamp>\\S*)\\\"[ ]"
          source: "message"
      - timestamp:
          source: timestamp
          format: RFC3339
EOH
destination = "local/config.yaml"
}
# resource limits are a good idea because you don't want your log collection to consume all resources available
resources {
cpu = 500
memory = 512
}
service {
name = "promtail"
port = "promtail-healthcheck"
check {
type = "http"
path = "/ready"
interval = "10s"
timeout = "2s"
}
}
}
}
}
Running Loki
Note: I personally consider it somewhat of an anti-pattern to run your log aggregation system on the same cluster as the one it’s aggregating logs from. If the cluster explodes, you can’t really access the logs to debug what happened and why. For smaller use cases, or if critical failure isn’t probable, it’s perfectly fine and will probably never cause any issues; past a certain point though, I’d recommend splitting it ( and other similarly scoped tools like monitoring) into a separate management cluster, or using the hosted version, which includes a decent free tier (50GB and 14 days retention when it comes to logs).
Like I mentioned last time, even though one can include YAML configuration from Consul in Nomad jobs via the template stanza, I prefer to have as much as possible in the Nomad job file directly, for better versioning and rollbackability ( YAML stored in Consul evolves independently of the Nomad job’s lifecycle, so rolling back the latter won’t do anything about the former ).
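For completeness, the alternative would be a template block pulling the whole file from Consul’s KV store, along these lines ( the KV path is made up ):
template {
  # renders whatever YAML is stored at this ( hypothetical ) Consul KV path
  data        = "{{ key \"configs/loki/local-config.yaml\" }}"
  destination = "local/loki/local-config.yaml"
}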
To run a basic, monolithic Loki service with local storage and unlimited retention, a host volume for said storage, Traefik as a reverse proxy (with TLS and a basic auth middleware attached, since Loki doesn’t do auth itself) and Loki’s YAML configuration file embedded, you need something along these lines:
job "loki" {
datacenters = ["dc1"]
type = "service"
update {
max_parallel = 1
health_check = "checks"
min_healthy_time = "10s"
healthy_deadline = "3m"
progress_deadline = "5m"
}
group "loki" {
count = 1
restart {
attempts = 3
interval = "5m"
delay = "25s"
mode = "delay"
}
network {
port "loki" {
to = 3100
}
}
volume "loki" {
type = "host"
read_only = false
source = "loki"
}
task "loki" {
driver = "docker"
config {
image = "grafana/loki:2.2.1"
args = [
"-config.file",
"local/loki/local-config.yaml",
]
ports = ["loki"]
}
volume_mount {
volume = "loki"
destination = "/loki"
read_only = false
}
template {
data = <<EOH
auth_enabled: false

server:
  http_listen_port: 3100

ingester:
  lifecycler:
    address: 127.0.0.1
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    final_sleep: 0s
  # Any chunk not receiving new logs in this time will be flushed
  chunk_idle_period: 1h
  # All chunks will be flushed when they hit this age, default is 1h
  max_chunk_age: 1h
  # Loki will attempt to build chunks up to 1.5MB, flushing if chunk_idle_period or max_chunk_age is reached first
  chunk_target_size: 1048576
  # Must be greater than index read cache TTL if using an index cache (Default index read cache TTL is 5m)
  chunk_retain_period: 30s
  max_transfer_retries: 0 # Chunk transfers disabled

schema_config:
  configs:
    - from: 2020-10-24
      store: boltdb-shipper
      object_store: filesystem
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/boltdb-shipper-active
    cache_location: /loki/boltdb-shipper-cache
    cache_ttl: 24h # Can be increased for faster performance over longer query periods, uses more disk space
    shared_store: filesystem
  filesystem:
    directory: /loki/chunks

compactor:
  working_directory: /tmp/loki/boltdb-shipper-compactor
  shared_store: filesystem

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 168h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: false
  retention_period: 0s
EOH
destination = "local/loki/local-config.yaml"
}
resources {
cpu = 512
memory = 256
}
service {
name = "loki"
port = "loki"
check {
name = "Loki healthcheck"
port = "loki"
type = "http"
path = "/ready"
interval = "20s"
timeout = "5s"
check_restart {
limit = 3
grace = "60s"
ignore_warnings = false
}
}
tags = [
"traefik.enable=true",
"traefik.http.routers.loki.tls=true",
# the middleware has to be declared somewhere else, we only attach it here
"traefik.http.routers.loki.middlewares=loki-basicauth@file",
]
}
}
}
}
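For reference, the loki-basicauth@file middleware attached in the tags above lives in Traefik’s dynamic configuration ( the file provider, hence @file ). A minimal sketch, with an htpasswd-style placeholder for the user/hash pair:
http:
  middlewares:
    loki-basicauth:
      basicAuth:
        users:
          - "loki:$apr1$xxxxxxxx$yyyyyyyyyyyyyyyyyyyyy"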
Conclusion
So, which logging agent to use with Nomad? As always, it depends. If you have ElasticSearch or any of the compatible alternatives, Filebeat is probably the best option due to the native support. If you only run Docker-based tasks, Vector collecting the Docker daemon’s logs ( with Docker labels for context ) is a pretty decent option. If neither, or if you want more context ( like custom metadata ) than what is available with Vector, nomad_follower is for you.