In my last post, I talked about OpenCensus. In this post, I would like to focus on the OpenCensus Service. Read on to learn more!
OpenCensus Agent
As the name implies, the Agent is meant to sit as close to the application as possible. It can be deployed as a binary, sidecar, or daemonset. It should receive all metrics and traces from the application and/or the host it is running on. It can then send the data it receives to one or more destinations.
It was created so that client libraries could send to a destination without receiving errors or backpressure. The idea is that you always deploy the OpenCensus Agent, and when you add instrumentation to your application, you always point it to the Agent. The Agent will receive the data but will not forward it anywhere unless it is configured to do so.
Note that the Agent does essentially no buffering or retry today; it expects its destination(s) to be highly available. While the Agent can send directly to one or more backends, the best practice is to send to the OpenCensus Collector.
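To make that contract concrete, here is a minimal Python sketch of the behavior described above. It is not the Agent's actual implementation: `ToyAgent`, `receive`, and the `dropped` counter are hypothetical names, and counting a failed send as dropped is my assumption about what "no buffering or retry" means in practice.

```python
from typing import Callable, Dict, List, Optional

Span = Dict[str, object]
Exporter = Callable[[List[Span]], None]


class ToyAgent:
    """Hypothetical stand-in for the Agent: accept data, forward without buffering."""

    def __init__(self, exporters: Optional[List[Exporter]] = None):
        # With no exporters configured, received data goes nowhere.
        self.exporters = list(exporters or [])
        self.dropped = 0  # sends lost because a destination was unavailable

    def receive(self, batch: List[Span]) -> None:
        # Always accept: the caller never sees an error or backpressure.
        for export in self.exporters:
            try:
                export(batch)
            except Exception:
                # Assumption: no buffer and no retry, so a failed send is lost.
                self.dropped += 1
```

The point of the sketch is the shape of `receive`: the application's call always returns, whether zero, one, or several destinations are configured, which is why the destinations themselves need to be highly available.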
Configuration
By default, the Agent listens for incoming traffic on port 55678 (OpenCensus format). If you need the Agent to receive other formats, you can specify them in the Agent configuration. For example:
receivers:
  opencensus:
    address: "127.0.0.1:55678"
  zipkin:
    address: "127.0.0.1:9411"
  jaeger:
    jaeger-thrift-tchannel-port: 14267
    jaeger-thrift-http-port: 14268
  prometheus:
    config:
      scrape_configs:
        - job_name: 'caching_cluster'
          scrape_interval: 5s
          static_configs:
            - targets: ['localhost:8889']
By default, the Agent does not send to any destination. Destinations can be specified by configuring one or more exporters. A typical configuration exports to the OpenCensus Collector. For example:
exporters:
  opencensus:
    compression: "gzip"
    endpoint: "oc-collector.default.svc.cluster.local:55678"
OpenCensus Collector
As the name implies, the Collector is meant to sit between the application and the destination(s) but as close to the application as possible (e.g., same datacenter or region). It can be added as a binary or container but should run as a standalone service. It should receive all metrics and traces from the application, typically via the OpenCensus Agent. It can then perform operations against the data before sending it to one or more destinations.
Some of the reasons why the Collector was created include:
- Ability to buffer and retry data at production scale
- Limit egress points
- Intelligent (tail-based) sampling
- Span annotations (e.g. datacenter or region)
- Tag redaction
Note that with intelligent sampling, the Collector should still be as close to the application as possible, but it cannot be in the same datacenter/region unless all of the application's calls happen ONLY within that datacenter/region. This is because the Collector must receive all spans for a given trace in order to make an intelligent sampling decision (more on intelligent sampling in a future post).
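Here is a small Python sketch of why that placement matters. The rule used ("keep a trace if any span has an error") and the span fields `trace_id`, `region`, and `error` are hypothetical, chosen only to illustrate the point; they are not how the Collector represents spans.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

Span = Dict[str, object]


def sampled_trace_ids(spans: Iterable[Span]) -> Set[str]:
    """Keep a trace if any of its spans is marked as an error (illustrative rule)."""
    by_trace: Dict[str, List[Span]] = defaultdict(list)
    for span in spans:
        by_trace[str(span["trace_id"])].append(span)
    return {
        trace_id
        for trace_id, group in by_trace.items()
        if any(s.get("error") for s in group)
    }


# One trace whose spans were produced in two different regions.
trace = [
    {"trace_id": "t1", "name": "frontend", "region": "us-east", "error": False},
    {"trace_id": "t1", "name": "backend", "region": "us-west", "error": True},
]

# A Collector that receives every span sees the error and keeps the trace.
all_spans_decision = sampled_trace_ids(trace)

# A per-region Collector in us-east sees only the frontend span and drops it.
us_east_decision = sampled_trace_ids(s for s in trace if s["region"] == "us-east")
```

With the full trace the decision is to keep `t1`; with only the us-east half the same rule keeps nothing, so the interesting trace would be lost.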
Configuration
Note that the Collector is based on the Agent code base. The receivers and exporters are identical to the Agent's, so the configuration is mostly the same as well. Like the Agent, the Collector listens on port 55678 by default and does not export by default. The main differences from the Agent's configuration relate to the additional features the Collector supports.
The exporting configuration can differ because the Collector supports bounded-buffer retry as well as batching through its queued-exporter. For example:
queued-exporters:
  jaeger: # A friendly name for the processor
    batching:
      enable: false
      # sets the time, in seconds, after which a batch will be sent regardless of size
      timeout: 1
      # number of spans which after hit, will trigger it to be sent
      send-batch-size: 8192
    # num-workers is the number of queue workers that will be dequeuing batches and sending them out (default is 10)
    num-workers: 2
    # queue-size is the maximum number of batches allowed in the queue at a given time (default is 5000)
    queue-size: 100
    # retry-on-failure indicates whether queue processor should retry span batches in case of processing failure (default is true)
    retry-on-failure: true
    # backoff-delay is the amount of time a worker waits after a failed send before retrying (default is 5 seconds)
    backoff-delay: 3s
    # sender-type is the type of sender used by this processor, the default is an invalid sender so it forces one to be specified
    sender-type: jaeger-thrift-http
    # configuration of the selected sender-type, in this example jaeger-thrift-http. Which supports 3 settings:
    # collector-endpoint: address of Jaeger collector thrift-http endpoint
    # headers: a map of any additional headers to be sent with each batch (e.g.: api keys, etc)
    # timeout: the timeout for the sender to consider the operation as failed
    jaeger-thrift-http:
      collector-endpoint: "https://ingest.omnition.io"
      headers: { "x-omnition-api-key": "00000000-0000-0000-0000-000000000001" }
      timeout: 5s
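To show how the queue-size, retry-on-failure, and backoff-delay settings interact, here is a rough Python model of a bounded queue with retry. It is a sketch under my own assumptions, not the Collector's code: `ToyQueuedExporter`, `enqueue`, and `work_once` are hypothetical names, a full queue is assumed to drop the new batch, and a single synchronous worker step stands in for the concurrent workers that num-workers would start.

```python
import queue
import time
from typing import Callable, List

Batch = List[str]


class ToyQueuedExporter:
    """Hypothetical model of a bounded queue with retry; not the Collector's code."""

    def __init__(self, send: Callable[[Batch], None], queue_size: int = 100,
                 retry_on_failure: bool = True, backoff_delay: float = 3.0,
                 sleep: Callable[[float], None] = time.sleep):
        self.send = send
        self.batches = queue.Queue(maxsize=queue_size)  # the bounded buffer
        self.retry_on_failure = retry_on_failure
        self.backoff_delay = backoff_delay
        self.sleep = sleep  # injectable so the example can run without waiting
        self.dropped = 0

    def enqueue(self, batch: Batch) -> bool:
        """Queue a batch; assumption: when the queue is full the batch is dropped."""
        try:
            self.batches.put_nowait(batch)
            return True
        except queue.Full:
            self.dropped += 1
            return False

    def work_once(self) -> bool:
        """One worker step for one batch; returns False if the queue was empty."""
        try:
            batch = self.batches.get_nowait()
        except queue.Empty:
            return False
        try:
            self.send(batch)
        except Exception:
            if self.retry_on_failure:
                self.sleep(self.backoff_delay)  # wait before the batch is retried
                self.enqueue(batch)             # re-queued, still subject to the bound
            else:
                self.dropped += 1
        return True
```

The bound is what keeps memory predictable when a backend is down for a long time: retries are attempted, but only for as many batches as the queue can hold.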
In addition, the Collector offers configuration outside of receivers and exporters. For example, you can add Collector-level tags to spans:
global:
  attributes:
    overwrite: true
    values:
      # values are key value pairs where the value can be an int, float, bool, or string
      some_string: "hello world"
      some_int: 1234
      some_float: 3.14159
      some_bool: false
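As a rough illustration of what the overwrite flag implies, here is a short Python sketch. The function name `apply_global_attributes` is hypothetical, and the semantics are my reading of the setting: with overwrite enabled a Collector-level value replaces a span attribute that has the same key, and with it disabled the span's own value is kept.

```python
from typing import Dict, Union

AttrValue = Union[int, float, bool, str]


def apply_global_attributes(span_attrs: Dict[str, AttrValue],
                            values: Dict[str, AttrValue],
                            overwrite: bool) -> Dict[str, AttrValue]:
    """Return a copy of the span attributes with Collector-level values merged in."""
    merged = dict(span_attrs)  # leave the caller's dictionary untouched
    for key, value in values.items():
        # Assumed semantics: existing keys are replaced only when overwrite is set.
        if overwrite or key not in merged:
            merged[key] = value
    return merged
```

For example, a span that already carries `some_int: 1` would end up with `some_int: 1234` when overwrite is true and keep `1` when it is false, while `some_string` is added in both cases.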
You can also configure intelligent sampling:
sampling:
  mode: tail
  # amount of time from seeing the first span in a trace until making the sampling decision
  decision-wait: 10s
  # maximum number of traces kept in the memory
  num-traces: 10000
  policies:
    # user-defined policy name
    my-string-tag-filter:
      # exporters the policy applies to
      exporters:
        - jaeger
      policy: string-tag-filter
      configuration:
        tag: tag1
        values:
          - value1
          - value2
    my-numeric-tag-filter:
      exporters:
        - zipkin
      policy: numeric-tag-filter
      configuration:
        tag: tag1
        min-value: 0
        max-value: 100
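The two policies in this configuration can be read roughly as follows. This Python sketch is my interpretation, not the Collector's implementation: the function names are hypothetical, spans are simplified to flat dictionaries of tags, a trace is assumed to match when any one of its spans matches, and the numeric range is assumed to be inclusive.

```python
from typing import Dict, List

Span = Dict[str, object]


def string_tag_filter(trace: List[Span], tag: str, values: List[str]) -> bool:
    """Match when any span carries the tag with one of the listed string values."""
    return any(span.get(tag) in values for span in trace)


def numeric_tag_filter(trace: List[Span], tag: str,
                       min_value: float, max_value: float) -> bool:
    """Match when any span carries the tag with a number inside [min, max]."""
    for span in trace:
        value = span.get(tag)
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_number and min_value <= value <= max_value:
            return True
    return False


def exporters_for(trace: List[Span]) -> List[str]:
    """Mirror the two example policies: each one selects the exporter it applies to."""
    selected = []
    if string_tag_filter(trace, "tag1", ["value1", "value2"]):
        selected.append("jaeger")
    if numeric_tag_filter(trace, "tag1", 0, 100):
        selected.append("zipkin")
    return selected
```

Under these assumptions, a trace containing a span with `tag1` set to `value2` would go to the jaeger exporter, one with `tag1` set to `42` would go to zipkin, and a trace matching neither policy would not be exported.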
Finally, the Collector also offers command-line flags to enable functionality (i.e. without manually editing or supplying the configuration). The parameters available today are:
OpenCensus Collector
Usage:
  occollector [flags]
Flags:
      --config string                   Path to the config file
      --debug-processor                 Flag to add a debug processor (combine with log level DEBUG to log incoming spans)
      --health-check-http-port uint     Port on which to run the healthcheck http server. (default 13133)
  -h, --help                            help for occollector
      --http-pprof-port uint            Port to be used by golang net/http/pprof (Performance Profiler), the profiler is disabled if no port or 0 is specified.
      --log-level string                Output level of logs (TRACE, DEBUG, INFO, WARN, ERROR, FATAL) (default "INFO")
      --metrics-level string            Output level of telemetry metrics (NONE, BASIC, NORMAL, DETAILED) (default "BASIC")
      --metrics-port uint               Port exposing collector telemetry. (default 8888)
      --receive-jaeger                  Flag to run the Jaeger receiver (i.e.: Jaeger Collector), default settings: {ThriftTChannelPort:14267 ThriftHTTPPort:14268}
      --receive-oc-trace                Flag to run the OpenCensus trace receiver, default settings: {Port:55678} (default true)
      --receive-zipkin                  Flag to run the Zipkin receiver, default settings: {Port:9411}
      --receive-zipkin-scribe           Flag to run the Zipkin Scribe receiver, default settings: {Address: Port:9410 Category:zipkin}
      --tail-sampling-always-sample     Flag to use a tail-based sampling processor with an always sample policy, unless tail sampling setting is present on configuration file.
Best Practices
You can use the Agent without the Collector and the Collector without the Agent. You can also use the client libraries without the Agent or the Collector. With that said, the best practice is to configure client libraries (whether OpenCensus or not) to point to the OpenCensus Agent, the OpenCensus Agent to point to the OpenCensus Collector, and the OpenCensus Collector to point to the backends of your choice.
Why is this the best practice? Lots of reasons:
- You want to configure the client libraries once and leave them alone; pointing them to the Agent makes this possible without worrying about errors or backpressure
- You want an Agent that is lightweight and only does work when configured to export to a destination
- You want a Collector that can perform advanced processing on the data and can ensure your data reaches its destination securely
In addition, it is recommended you use the OpenCensus Service whether you use the OpenCensus client libraries or not. Why? Lots of reasons:
- Supports the most popular receivers and destinations today
- Is extensible and additional receivers/exporters can easily be added
- Is the only open-source software to support intelligent sampling today
- Works for metrics and traces
- Is vendor-agnostic
Summary
The OpenCensus Service provides many capabilities for collecting metric and trace data. It offers enterprise-grade features, including scale-out, queuing, retry, and redaction support, while being completely open-source and vendor-agnostic. Whether you use the OpenCensus client libraries or not, the OpenCensus Service provides a reliable, performant, and heavily tested foundation for observability data collection.
© 2019 – 2021, Steve Flanders. All rights reserved.