Prometheus and Aegis

Published on , 1185 words, 5 minutes to read

Last time in the xeiaso dot net cinematic universe:

Unix sockets started to be used to grace the cluster. Things were at peace. Then, a realization came through:

Mara is hmm
<Mara>

What about Prometheus? Doesn't it need a direct line of fire to the service to scrape metrics?

This could not do! Without observability the people of the Discord wouldn't have a livefeed of the infrastructure falling over! This cannot stand! Look, our hero takes action!

Cadey is percussive-maintenance
<Cadey>

I'll just hit it until it works!

In order to help keep an eye on all of the services I run, I use Prometheus for collecting metrics. For an example of the kind of metrics I collect, see here (1). In the configuration that I have, Prometheus runs on a server in my apartment and reaches out to my other machines to scrape metrics over the network. This worked great when I had my major services listen over TCP, I could just point Prometheus at the backend port over my tunnel.

When I started using Unix sockets for hosting my services, this stopped working. It became very clear very quickly that I needed some kind of shim. This shim needed to do the following things:

The Go standard library has a tool for doing reverse proxying in the standard library: net/http/httputil#ReverseProxy. Maybe we could build something with this?

Mara is hmm
<Mara>

The documentation seems to imply it will use the network by default. Wait, what's this Transport field?

type ReverseProxy struct {
  // ...

  // The transport used to perform proxy requests.
  // If nil, http.DefaultTransport is used.
  Transport http.RoundTripper

  // ...
}
Mara is hmm
<Mara>

So a transport is a RoundTripper, which is a function that takes a request and returns a response somehow. It uses http.DefaultTransport by default, which reads from the network. So at a minimum we're gonna need:

  • a ReverseProxy
  • a Transport
  • a dialing function

Right?

Yep! Unix sockets can be used like normal sockets, so all you need is something like this:

func proxyToUnixSocket(w http.ResponseWriter, r *http.Request) {
  name := path.Base(r.URL.Path)

  fname := filepath.Join(*sockdir, name+".sock")
  _, err := os.Stat(fname)
  if os.IsNotExist(err) {
    http.NotFound(w, r)
    return
  }

  ts := &http.Transport{
    Dial: func(_, _ string) (net.Conn, error) {
      return net.Dial("unix", fname)
    },
    DisableKeepAlives: true,
  }

  rp := httputil.ReverseProxy{
    Director: func(req *http.Request) {
      req.URL.Scheme = "http"
      req.URL.Host = "aegis"
      req.URL.Path = "/metrics"
      req.URL.RawPath = "/metrics"
    },
    Transport: ts,
  }
  rp.ServeHTTP(w, r)
}
Mara is hmm
<Mara>

So in this handler:

name := path.Base(r.URL.Path)

fname := filepath.Join(*sockdir, name+".sock")
_, err := os.Stat(fname)
if os.IsNotExist(err) {
  http.NotFound(w, r)
  return
}

ts := &http.Transport{
  Dial: func(_, _ string) (net.Conn, error) {
    return net.Dial("unix", fname)
  },
  DisableKeepAlives: true,
}

You have the socket path built from the URL path, and then you return connections to that path ignoring what the HTTP stack thinks it should point to?

Cadey is enby
<Cadey>

Yep. Then the rest is really just boilerplate:

package main

import (
  "flag"
  "log"
  "net"
  "net/http"
  "net/http/httputil"
  "os"
  "path"
  "path/filepath"
)

var (
  hostport = flag.String("hostport", "[::]:31337", "TCP host:port to listen on")
  sockdir  = flag.String("sockdir", "./run", "directory full of unix sockets to monitor")
)

func main() {
  flag.Parse()

  log.SetFlags(0)
  log.Printf("%s -> %s", *hostport, *sockdir)

  http.DefaultServeMux.HandleFunc("/", proxyToUnixSocket)

  log.Fatal(http.ListenAndServe(*hostport, nil))
}

Now all that's needed is to build a NixOS service out of this:

{ config, lib, pkgs, ... }:
let cfg = config.within.services.aegis;
in
with lib; {
  # Mara\ this describes all of the configuration options for Aegis.
  options.within.services.aegis = {
    enable = mkEnableOption "Activates Aegis (unix socket prometheus proxy)";

    # Mara\ This is the IPv6 host:port that the service should listen on.
    # It's IPv6 because this is $CURRENT_YEAR.
    hostport = mkOption {
      type = types.str;
      default = "[::1]:31337";
      description = "The host:port that aegis should listen for traffic on";
    };

    # Mara\ This is the folder full of unix sockets. In the previous post we
    # mentioned that the sockets should go somewhere like /tmp, however this
    # may be a poor life decision:
    # https://lobste.rs/s/fqqsct/unix_domain_sockets_for_serving_http#c_g4ljpf
    sockdir = mkOption {
      type = types.str;
      default = "/srv/within/run";
      example = "/srv/within/run";
      description =
        "The folder that aegis will read from";
    };
  };

  # Mara\ The configuration that will arise from this module if it's enabled
  config = mkIf cfg.enable {
    # Mara\ Aegis has its own user account to keep things tidy. It doesn't need
    # root to run so we don't give it root.
    users.users.aegis = {
      createHome = true;
      description = "tulpa.dev/cadey/aegis";
      isSystemUser = true;
      group = "within";
      home = "/srv/within/aegis";
    };

    # Mara\ The systemd service that actually runs Aegis.
    systemd.services.aegis = {
      wantedBy = [ "multi-user.target" ];

      # Mara\ These correlate to the [Service] block in the systemd unit.
      serviceConfig = {
        User = "aegis";
        Group = "within";
        Restart = "on-failure";
        WorkingDirectory = "/srv/within/aegis";
        RestartSec = "30s";
      };

      # Mara\ When the service starts up, run this script.
      script = let aegis = pkgs.tulpa.dev.cadey.aegis;
      in ''
        exec ${aegis}/bin/aegis -sockdir="${cfg.sockdir}" -hostport="${cfg.hostport}"
      '';
    };
  };
}
Cadey is enby
<Cadey>

Then I just flicked it on for a server of mine:

within.services.aegis = {
  enable = true;
  hostport = "[fda2:d982:1da2:180d:b7a4:9c5c:989b:ba02]:43705";
  sockdir = "/srv/within/run";
};
Cadey is enby
<Cadey>

And then test it with curl:

$ curl http://[fda2:d982:1da2:180d:b7a4:9c5c:989b:ba02]:43705/printerfacts
# HELP printerfacts_hits Number of hits to various pages
# TYPE printerfacts_hits counter
printerfacts_hits{page="fact"} 15
printerfacts_hits{page="index"} 23
printerfacts_hits{page="not_found"} 17
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.06
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 12
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 5296128
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1617458164.36
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 911777792
Cadey is aha
<Cadey>

And there you go! Now we can make Prometheus point to this and we can save Christmas!


This is another experiment in writing these kinds of posts in more of a Socratic method. I'm trying to strike a balance with a limited pool of stickers while I wait for more stickers/emoji to come in. Feedback is always welcome.

(1): These metrics are not perfect because of the level of caching that Cloudflare does for me.


Facts and circumstances may have changed since publication. Please contact me before jumping to conclusions if something seems wrong or unclear.

Tags: prometheus, o11y