ADR 013: Observability
Changelog
- 20-01-2020: Initial Draft
Status
Proposed
Context
Telemetry is paramount into debugging and understanding what the application is doing and how it is performing. We aim to expose metrics from modules and other core parts of the Cosmos SDK.
In addition, we should aim to support multiple configurable sinks that an operator may choose from. By default, when telemetry is enabled, the application should track and expose metrics that are stored in-memory. The operator may choose to enable additional sinks, where we support only Prometheus for now, as it's battle-tested, simple to setup, open source, and is rich with ecosystem tooling.
We must also aim to integrate metrics into the Cosmos SDK in the most seamless way possible such that metrics may be added or removed at will and without much friction. To do this, we will use the go-metrics library.
Finally, operators may enable telemetry along with specific configuration options. If enabled, metrics
will be exposed via /metrics?format={text|prometheus}
via the API server.
Decision
We will add an additional configuration block to app.toml
that defines telemetry settings:
###############################################################################
### Telemetry Configuration ###
###############################################################################
[telemetry]
# Prefixed with keys to separate services
service-name = {{ .Telemetry.ServiceName }}
# Enabled enables the application telemetry functionality. When enabled,
# an in-memory sink is also enabled by default. Operators may also enabled
# other sinks such as Prometheus.
enabled = {{ .Telemetry.Enabled }}
# Enable prefixing gauge values with hostname
enable-hostname = {{ .Telemetry.EnableHostname }}
# Enable adding hostname to labels
enable-hostname-label = {{ .Telemetry.EnableHostnameLabel }}
# Enable adding service to labels
enable-service-label = {{ .Telemetry.EnableServiceLabel }}
# PrometheusRetentionTime, when positive, enables a Prometheus metrics sink.
prometheus-retention-time = {{ .Telemetry.PrometheusRetentionTime }}
The given configuration allows for two sinks -- in-memory and Prometheus. We create a Metrics
type that performs all the bootstrapping for the operator, so capturing metrics becomes seamless.
// Metrics defines a wrapper around application telemetry functionality. It allows
// metrics to be gathered at any point in time. When creating a Metrics object,
// internally, a global metrics is registered with a set of sinks as configured
// by the operator. In addition to the sinks, when a process gets a SIGUSR1, a
// dump of formatted recent metrics will be sent to STDERR.
type Metrics struct {
memSink *metrics.InmemSink
prometheusEnabled bool
}
// Gather collects all registered metrics and returns a GatherResponse where the
// metrics are encoded depending on the type. Metrics are either encoded via
// Prometheus or JSON if in-memory.
func (m *Metrics) Gather(format string) (GatherResponse, error) {
switch format {
case FormatPrometheus:
return m.gatherPrometheus()
case FormatText:
return m.gatherGeneric()
case FormatDefault:
return m.gatherGeneric()
default:
return GatherResponse{}, fmt.Errorf("unsupported metrics format: %s", format)
}
}
In addition, Metrics
allows us to gather the current set of metrics at any given point in time. An
operator may also choose to send a signal, SIGUSR1, to dump and print formatted metrics to STDERR.
During an application's bootstrapping and construction phase, if Telemetry.Enabled
is true
, the
API server will create an instance of a reference to Metrics
object and will register a metrics
handler accordingly.
func (s *Server) Start(cfg config.Config) error {
// ...
if cfg.Telemetry.Enabled {
m, err := telemetry.New(cfg.Telemetry)
if err != nil {
return err
}
s.metrics = m
s.registerMetrics()
}
// ...
}
func (s *Server) registerMetrics() {
metricsHandler := func(w http.ResponseWriter, r *http.Request) {
format := strings.TrimSpace(r.FormValue("format"))
gr, err := s.metrics.Gather(format)
if err != nil {
rest.WriteErrorResponse(w, http.StatusBadRequest, fmt.Sprintf("failed to gather metrics: %s", err))
return
}
w.Header().Set("Content-Type", gr.ContentType)
_, _ = w.Write(gr.Metrics)
}
s.Router.HandleFunc("/metrics", metricsHandler).Methods("GET")
}
Application developers may track counters, gauges, summaries, and key/value metrics. There is no additional lifting required by modules to leverage profiling metrics. To do so, it's as simple as:
func (k BaseKeeper) MintCoins(ctx sdk.Context, moduleName string, amt sdk.Coins) error {
defer metrics.MeasureSince(time.Now(), "MintCoins")
// ...
}
Consequences
Positive
- Exposure into the performance and behavior of an application