OpenTelemetry for Web Apps in 2026: What to Instrument and What to Skip
OpenTelemetry is the observability standard now. Most tutorials show you how to install the SDK and emit traces. Fewer explain which signals actually matter for web applications and which add noise without helping you debug anything.
Anurag Verma
Six years ago, observability in a typical web agency project meant three things: a Sentry integration for errors, a dashboard in the cloud provider’s console, and some server-side logging. When something broke, you’d look at Sentry, then the provider dashboard, then grep through logs hoping to find the problem. Often you found the symptom, not the cause.
OpenTelemetry changes this by making three signals (traces, metrics, and logs) standardized, correlated, and vendor-neutral. You instrument once, and you can ship the telemetry to any backend. But the “instrument once” part is where most teams run into trouble. They instrument everything, end up with terabytes of data they can’t afford to store, and still can’t answer “why was that request slow?”
This is a guide to instrumentation that’s actually useful.
The Three Signals and When to Use Each
Traces record the path of a request through your system. A trace is a collection of spans: one span per operation, each with a start time, duration, and attributes. A web request might generate spans for the HTTP handler, database query, external API call, and cache lookup. Traces answer: “where did the time go?”
Metrics are numeric values aggregated over time. Request count, error rate, p95 latency, cache hit ratio. Metrics answer: “is the system behaving correctly right now?”
Logs are discrete events with context. An error with stack trace, a user action, a security event. Logs answer: “what happened at this specific moment?”
The practical breakdown for web applications:
- Use traces for debugging slow requests and understanding distributed call patterns.
- Use metrics for alerting and dashboards.
- Use logs for error context and audit trails.
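To make the trace model concrete, here is a minimal sketch of a trace as a flat list of spans, with a helper that ranks operations by total duration. The `SpanRecord` shape, the `timeByOperation` helper, and the sample values are illustrative assumptions, not the OpenTelemetry SDK's own types.

```typescript
// Hypothetical, simplified span shape for illustration (not the SDK's Span type).
interface SpanRecord {
  name: string
  parentName: string | null // null marks the root span
  durationMs: number
}

// Rank the child operations of a trace by total time spent, slowest first.
function timeByOperation(spans: SpanRecord[]): Array<[string, number]> {
  const totals = new Map<string, number>()
  for (const span of spans) {
    if (span.parentName === null) continue // skip the root; it covers everything
    totals.set(span.name, (totals.get(span.name) ?? 0) + span.durationMs)
  }
  return Array.from(totals.entries()).sort((a, b) => b[1] - a[1])
}

// Made-up example trace for one web request.
const exampleTrace: SpanRecord[] = [
  { name: 'GET /orders/:id', parentName: null, durationMs: 420 },
  { name: 'pg.query', parentName: 'GET /orders/:id', durationMs: 35 },
  { name: 'HTTP POST payments-api', parentName: 'GET /orders/:id', durationMs: 310 },
  { name: 'redis.get', parentName: 'GET /orders/:id', durationMs: 2 },
  { name: 'pg.query', parentName: 'GET /orders/:id', durationMs: 40 },
]

const breakdown = timeByOperation(exampleTrace)
// breakdown[0] is ['HTTP POST payments-api', 310]: the external call dominates.
```

In this made-up trace the two database queries add up to 75 ms, so the answer to "where did the time go?" is the outbound payment call, not the database.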
Many tutorials push you to emit all three from every function. For a web application, that’s usually overkill and creates more noise than signal.
Starting With the OTel SDK
For a Node.js application, the setup is straightforward with auto-instrumentation:
// instrumentation.ts: load before anything else
import { NodeSDK } from '@opentelemetry/sdk-node'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http'
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'

// OTEL_EXPORTER_OTLP_ENDPOINT is a base URL; each signal needs its own path appended.
const otlpBase = (process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318').replace(/\/+$/, '')

const sdk = new NodeSDK({
  serviceName: process.env.SERVICE_NAME ?? 'my-api',
  traceExporter: new OTLPTraceExporter({
    url: `${otlpBase}/v1/traces`,
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: `${otlpBase}/v1/metrics`,
    }),
    exportIntervalMillis: 30_000,
  }),
  instrumentations: [
    getNodeAutoInstrumentations({
      '@opentelemetry/instrumentation-fs': { enabled: false }, // too noisy
      '@opentelemetry/instrumentation-dns': { enabled: false }, // rarely useful
    }),
  ],
})

sdk.start()
Load the compiled file before your application starts (this assumes instrumentation.ts is built into dist/ alongside server.js):
// package.json
{
  "scripts": {
    "start": "node --require ./dist/instrumentation.js dist/server.js"
  }
}
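The auto-instrumentations package instruments Node's HTTP client and server, popular frameworks such as Express, database drivers (pg, mysql2, mongodb), and Redis clients automatically. Framework coverage changes between releases, so check the package's list of bundled instrumentations for the framework you use. You get traces for all database queries, outbound HTTP calls, and incoming requests without writing any instrumentation code.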
Disable instrumentation-fs and instrumentation-dns immediately. They generate hundreds of spans per request for file system and DNS lookups that are almost never the cause of a slow request, and they inflate your telemetry costs.
What Auto-Instrumentation Gives You for Free
With the setup above and an Express application, every incoming request automatically gets:
- A root span covering the entire request lifetime
- Child spans for each database query with the SQL statement (sanitized) and duration
- Child spans for outbound HTTP calls with URL, method, and status code
- Child spans for Redis operations
For most applications, this is 80% of what you need for debugging. The traces let you see: how long each database query took, whether a slow response is from the database or an external API call, and which endpoint is responsible for most of your slow requests.
Adding Custom Spans for Business Logic
Auto-instrumentation covers infrastructure. Business logic that takes significant time and isn’t in a library call needs manual spans:
import { trace, SpanStatusCode } from '@opentelemetry/api'

const tracer = trace.getTracer('my-service')

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('process-order', async (span) => {
    span.setAttribute('order.id', orderId)
    try {
      const order = await fetchOrder(orderId) // auto-instrumented DB call

      // Custom span for business logic
      await tracer.startActiveSpan('validate-inventory', async (inventorySpan) => {
        try {
          inventorySpan.setAttribute('item.count', order.items.length)
          const available = await checkInventory(order.items)
          inventorySpan.setAttribute('inventory.available', available)
          return available
        } finally {
          inventorySpan.end() // end the span even if checkInventory throws
        }
      })

      span.setStatus({ code: SpanStatusCode.OK })
    } catch (error) {
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: error instanceof Error ? error.message : 'Unknown error',
      })
      span.recordException(error as Error)
      throw error
    } finally {
      span.end()
    }
  })
}
The attributes on spans are the most important part of custom instrumentation. order.id lets you search traces for a specific order. item.count lets you correlate order size with processing time. Without attributes, traces tell you something was slow but not why.
Custom Metrics That Actually Matter
The HTTP auto-instrumentation already records a request duration histogram, from which request count, latency distribution, and error rate can be derived. The custom metrics worth adding:
Business-level counters: User signups, successful checkouts, failed payment attempts. These aren’t available from infrastructure metrics.
import { metrics } from '@opentelemetry/api'

const meter = metrics.getMeter('my-service')

const checkoutCounter = meter.createCounter('checkout.completed', {
  description: 'Number of successfully completed checkouts',
})

const checkoutValue = meter.createHistogram('checkout.value_usd', {
  description: 'Value of completed checkouts in USD',
  unit: 'USD',
})

async function completeCheckout(cart: Cart) {
  const order = await processPayment(cart)
  checkoutCounter.add(1, {
    'payment.method': cart.paymentMethod,
    'user.type': cart.user.isPremium ? 'premium' : 'standard',
  })
  checkoutValue.record(order.totalUsd, {
    'payment.method': cart.paymentMethod,
  })
  return order
}
Queue depth and processing lag if you run background jobs:
const queueDepthGauge = meter.createObservableGauge('queue.depth', {
  description: 'Number of jobs waiting in queue',
})

queueDepthGauge.addCallback(async (result) => {
  const depth = await getQueueDepth()
  result.observe(depth, { 'queue.name': 'orders' })
})
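The gauge above covers depth only. Processing lag is not shown in the original; one common way to define it is the age of the oldest job still waiting. A minimal sketch under that assumption, where `QueuedJob` and `processingLagSeconds` are hypothetical names and the timestamps are made up:

```typescript
// Hypothetical job shape; a real queue library exposes its own fields.
interface QueuedJob {
  id: string
  enqueuedAtMs: number // epoch milliseconds when the job was added
}

// Lag = age of the oldest waiting job, in seconds. An empty queue has zero lag.
function processingLagSeconds(waiting: QueuedJob[], nowMs: number): number {
  if (waiting.length === 0) return 0
  let oldest = waiting[0].enqueuedAtMs
  for (const job of waiting) {
    if (job.enqueuedAtMs < oldest) oldest = job.enqueuedAtMs
  }
  return Math.max(0, (nowMs - oldest) / 1000)
}

const exampleNowMs = 1_700_000_060_000
const exampleLag = processingLagSeconds(
  [
    { id: 'a', enqueuedAtMs: 1_700_000_000_000 }, // waiting 60 s
    { id: 'b', enqueuedAtMs: 1_700_000_045_000 }, // waiting 15 s
  ],
  exampleNowMs,
)
// exampleLag is 60
```

The value could be reported through a second observable gauge, in the same way as queue.depth. Depth tells you how much work is waiting; lag tells you whether workers are keeping up.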
Cache hit ratios if your performance depends on caching:
const cacheHits = meter.createCounter('cache.hits')
const cacheMisses = meter.createCounter('cache.misses')

async function getCachedUser(userId: string) {
  const cached = await redis.get(`user:${userId}`)
  if (cached) {
    cacheHits.add(1, { 'cache.name': 'users' })
    return JSON.parse(cached)
  }
  cacheMisses.add(1, { 'cache.name': 'users' })
  const user = await db.users.findUnique({ where: { id: userId } })
  await redis.setex(`user:${userId}`, 300, JSON.stringify(user))
  return user
}
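The two counters only become a ratio at query time: hits divided by hits plus misses over the same window. Most backends compute this in their own query language. The sketch below shows the same arithmetic in plain TypeScript, with a hypothetical `hitRatio` helper and made-up counts.

```typescript
// Hit ratio over a window, given the counter increases for that window.
// Returns null when there was no traffic, so a dashboard can show "no data"
// instead of a misleading 0%.
function hitRatio(hits: number, misses: number): number | null {
  const total = hits + misses
  return total === 0 ? null : hits / total
}

const busyRatio = hitRatio(900, 100) // 0.9
const idleRatio = hitRatio(0, 0) // null
```

Keeping hits and misses as separate counters, rather than emitting a precomputed ratio, lets the backend aggregate correctly across instances and time windows.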
What to Skip
These generate significant data volume without corresponding debugging value for most web applications:
Span events for every log line. The OpenTelemetry SDK lets you attach log events to spans. Some implementations add every log statement as a span event. For a request that generates 50 log lines, this creates 50 span events you’ll almost never look at. Keep your structured logs in your logging system (Loki, CloudWatch Logs, etc.) and attach the trace ID to log records for correlation.
Instrumentation-fs spans. Every file system read (including module imports) generates a span. A startup trace can have thousands of spans from module loading alone. Disable this instrumentation.
Fine-grained database spans for bulk operations. If you’re processing 10,000 database records in a loop, instrumenting each operation individually generates 10,000 spans. Use a single span wrapping the loop with an attribute for the record count.
Custom spans for fast operations. Wrapping a 2-millisecond function in a span adds overhead (creating the span object, recording attributes, ending it) and generates data you won’t need. Manual spans are worth it for operations that might be slow or that have meaningful attributes for debugging.
Backends: Where to Send the Telemetry
OpenTelemetry separates the SDK from the backend. You can send to any OTLP-compatible endpoint. The options teams use:
Grafana Cloud (free tier for small volumes) gives you Tempo for traces, Mimir for metrics, and Loki for logs with a unified query interface. Reasonable pricing at scale.
Honeycomb is particularly strong for trace exploration. The query interface for slicing and dicing trace data is the best in the category. More expensive than self-hosted options.
Jaeger (self-hosted) is good for traces if you want to avoid vendor costs. Doesn’t include metrics or logs natively.
Datadog and New Relic accept OTLP data and provide full-stack observability. Higher cost, more features than you need for most web applications.
For a small team doing a few hundred requests per minute, Grafana Cloud’s free tier handles the volume. For teams that need to query traces across days or weeks at high request rates, evaluate cost carefully. Distributed tracing data volume adds up fast.
Sampling: The Production Necessity Nobody Mentions in Tutorials
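Tracing every request in production is expensive at scale. A service handling 1,000 requests per second generates 1,000 traces per second, or 86.4 million per day. Even at a few kilobytes per trace, that is hundreds of gigabytes per day, and it reaches terabytes once traces carry many spans or several services contribute.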
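A back-of-the-envelope version of that estimate, where the 10 KB average trace size is an assumption (real sizes depend on span count and attribute payloads) and `dailyTraceVolumeGb` is a hypothetical helper:

```typescript
// Estimate daily trace volume in gigabytes (decimal GB, 1 GB = 1e9 bytes).
function dailyTraceVolumeGb(
  requestsPerSecond: number,
  avgTraceBytes: number,
  sampleRatio = 1,
): number {
  const secondsPerDay = 86_400
  const bytesPerDay = requestsPerSecond * secondsPerDay * avgTraceBytes * sampleRatio
  return bytesPerDay / 1e9
}

const unsampledGb = dailyTraceVolumeGb(1000, 10_000) // 864 GB per day
const sampledGb = dailyTraceVolumeGb(1000, 10_000, 0.1) // roughly 86.4 GB per day
```

Plugging in your own request rate and a measured average trace size gives a quick sanity check before committing to a backend's pricing tier.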
Sampling reduces the volume. Head-based sampling makes the decision at the start of the trace (before you know if the request will be slow or error). Tail-based sampling makes the decision after the trace completes (you can always keep errors and slow traces).
For most web applications, this configuration covers 80% of use cases:
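import { ParentBasedSampler, TraceIdRatioBasedSampler } from '@opentelemetry/sdk-trace-base'

// Head-based: keep 10% of new traces and follow the parent's decision for child spans.
// This cannot guarantee 100% of errors, because the outcome isn't known when the decision is made.
const sampler = new ParentBasedSampler({
  root: new TraceIdRatioBasedSampler(0.1),
})
// Pass it to the SDK: new NodeSDK({ sampler, ... })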
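Tail-based sampling needs the whole trace before it decides, so it usually runs in a collector or proxy in front of the backend rather than in the SDK (for example, the OpenTelemetry Collector's tail sampling processor, or Honeycomb's Refinery). If your pipeline supports it, configure it there. Tail-based sampling can keep 100% of error traces and slow traces while sampling normal traces at 5-10%.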
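The tail-based policy described above can be sketched as a pure decision function. This is not an SDK or collector API: `CompletedTrace`, `shouldKeep`, and the thresholds are hypothetical, and a real tail sampler has to buffer whole traces before it can decide.

```typescript
// Hypothetical summary of a finished trace, as a tail sampler would see it.
interface CompletedTrace {
  traceId: string // 32 hex characters
  durationMs: number
  hasError: boolean
}

// Keep every error and every slow trace; keep a fixed fraction of the rest.
// The fraction is derived from the trace ID, so the decision is deterministic.
function shouldKeep(t: CompletedTrace, slowMs = 1000, baselineRatio = 0.1): boolean {
  if (t.hasError || t.durationMs >= slowMs) return true
  // Treat the last 8 hex digits of the trace ID as a value in [0, 1).
  const bucket = parseInt(t.traceId.slice(-8), 16) / 0x100000000
  return bucket < baselineRatio
}

const highId = 'ffffffffffffffffffffffffffffffff'
const lowId = '00000000000000000000000000000001'

const keptError = shouldKeep({ traceId: highId, durationMs: 20, hasError: true }) // true
const keptSlow = shouldKeep({ traceId: highId, durationMs: 2500, hasError: false }) // true
const droppedFast = shouldKeep({ traceId: highId, durationMs: 20, hasError: false }) // false
const keptFast = shouldKeep({ traceId: lowId, durationMs: 20, hasError: false }) // true
```

The first two checks are exactly what a head-based sampler cannot do, because neither the error status nor the duration exists when the trace starts.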
A common mistake is to carry the production sampling rate into development and staging, where a dropped trace means you can't inspect the request you just reproduced. Keep sampling at 100% in development and staging.
Correlating Traces with Logs
The highest-value configuration that teams skip: attaching the current trace ID to every log record. This lets you jump from a trace in your observability backend to the logs for that specific request.
import { trace } from '@opentelemetry/api'
// Winston custom format
import { createLogger, format, transports } from 'winston'

const logger = createLogger({
  format: format.combine(
    // Copy the active span's IDs onto every log record
    format((info) => {
      const span = trace.getActiveSpan()
      if (span) {
        const ctx = span.spanContext()
        info.traceId = ctx.traceId
        info.spanId = ctx.spanId
      }
      return info
    })(),
    format.json()
  ),
  transports: [new transports.Console()],
})
With traceId in every log record, you can jump from “this request returned a 500” in your trace view to every log record from that request in your log system.
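To illustrate what that correlation buys you, here is a minimal sketch that pulls the records for one trace out of JSON log lines shaped like the Winston output above. The `logsForTrace` helper and the sample lines are made up for illustration; in practice your log backend's query language does this filtering.

```typescript
// The fields this sketch relies on; other fields pass through untouched.
interface LogRecord {
  traceId?: string
  message?: string
  [key: string]: unknown
}

// Return the parsed log records that belong to a given trace.
// Lines that are not valid JSON are skipped rather than throwing.
function logsForTrace(lines: string[], traceId: string): LogRecord[] {
  const matches: LogRecord[] = []
  for (const line of lines) {
    try {
      const record = JSON.parse(line) as LogRecord
      if (record !== null && typeof record === 'object' && record.traceId === traceId) {
        matches.push(record)
      }
    } catch {
      // ignore non-JSON lines (for example, startup banners)
    }
  }
  return matches
}

const sampleLines = [
  '{"level":"info","message":"order received","traceId":"4bf92f3577b34da6a3ce929d0e0e4736"}',
  '{"level":"info","message":"health check"}',
  'server listening on :3000',
  '{"level":"error","message":"payment declined","traceId":"4bf92f3577b34da6a3ce929d0e0e4736"}',
]

const related = logsForTrace(sampleLines, '4bf92f3577b34da6a3ce929d0e0e4736')
// related holds the "order received" and "payment declined" records
```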
The instrumentation investment that pays back is narrow: auto-instrument everything (databases, HTTP), add custom spans for business-critical operations and slow background jobs, emit a handful of business-level counters, and wire up trace-log correlation. That covers the majority of production debugging you’ll actually do.