Master Guide: Application Error Tracking in EKS using Datadog, DogStatsD, APM, Logs, and Error Tracking
First, a small naming correction: it is DogStatsD, not DogStashD. DogStatsD is Datadog's StatsD-compatible custom metrics service. It is excellent for counting application errors, but it is not a full Sentry replacement by itself. For Sentry-like error debugging, you should combine:
DogStatsD metrics
+ Application logs with stack traces
+ Datadog APM traces
+ Datadog Error Tracking
+ Kubernetes / EKS metadata
+ Unified service tagging
This is the best implementation pattern for your app running inside containers/pods on EKS.
1. What we are trying to build
The target outcome is:
Application error happens
↓
Datadog captures error count, log, trace, stack trace
↓
Datadog links it to service, env, version
↓
Datadog adds Kubernetes context
↓
You can identify the exact pod/container/deployment/node
Final relationship should look like this:
Error Issue
├── service: checkout-api
├── env: prod
├── version: 1.8.4
├── error.type: PaymentTimeoutException
├── endpoint: /api/checkout
├── kube_namespace: prod
├── kube_deployment: checkout-api
├── pod_name: checkout-api-7c9d8f98c9-xz2lp
├── container_name: checkout-api
├── node: ip-10-0-12-25
└── trace_id / log correlation
Datadog's unified service tagging is built around the standard env, service, and version tags, which are used to correlate metrics, traces, logs, containers, and deployment versions. (Datadog Monitoring)
2. High-level architecture
flowchart TD
U[User / Client Request] --> ING[Ingress / ALB / API Gateway]
ING --> SVC[Kubernetes Service]
SVC --> POD[Application Pod in EKS]
POD --> APP[Application Container]
APP -->|DogStatsD custom error metrics| DSD[Datadog Agent DogStatsD]
APP -->|APM traces and exceptions| APM[Datadog Agent APM Receiver]
APP -->|stdout/stderr structured logs| LOGS[Kubernetes Node Log Files]
LOGS --> AGENT[Datadog Agent DaemonSet]
DSD --> AGENT
APM --> AGENT
KUBE[Kubernetes API / Kubelet Metadata] --> AGENT
CLUSTER[Datadog Cluster Agent] --> AGENT
AGENT --> DD[Datadog Platform]
DD --> METRICS[Metrics Explorer / Dashboards]
DD --> LOGEXP[Logs Explorer]
DD --> TRACE[APM Traces / Service Map]
DD --> ERR[Error Tracking]
DD --> MON[Monitors / Alerts]
ERR --> RCA[Root Cause Analysis]
TRACE --> RCA
LOGEXP --> RCA
METRICS --> RCA
3. Sentry to Datadog mapping
| Sentry capability | Datadog equivalent |
|---|---|
| Error issue grouping | Datadog Error Tracking |
| Stack trace | APM error span or structured error log |
| Release/version tracking | version tag |
| Environment | env tag |
| Project/service | service tag |
| Error count | DogStatsD custom metric |
| Request trace | Datadog APM |
| Breadcrumb-style context | Logs, trace spans, custom tags |
| Alert on new error | Error Tracking monitor |
| Alert on error volume | Metric monitor or APM monitor |
| Find pod/container | Kubernetes tags from Datadog Agent |
DogStatsD is useful for custom error counters, but Error Tracking, logs, and APM are what give you the Sentry-like debugging experience.
4. Recommended implementation model
Use four data streams together:
flowchart LR
A[Application Error] --> B[DogStatsD Metric]
A --> C[Structured Error Log]
A --> D[APM Trace / Span Error]
A --> E[Kubernetes Metadata]
B --> F[Dashboards and Metric Alerts]
C --> G[Log Search and Error Tracking]
D --> H[Trace Debugging and Service Map]
E --> I[Pod / Container / Deployment / Node Relationship]
F --> J[Datadog Incident View]
G --> J
H --> J
I --> J
| Data type | Purpose | Example |
|---|---|---|
| DogStatsD metric | Count and alert | app.error.count |
| Error log | Stack trace and message | JSON log with error.stack |
| APM trace | Request path and dependency failure | /checkout → payment-service timeout |
| Kubernetes metadata | Pod/container relationship | pod_name, kube_deployment, kube_namespace |
| Error Tracking issue | Group similar errors | PaymentTimeoutException grouped as one issue |
Datadog Error Tracking groups errors into issues and can alert on new, regressed, or high-impact errors. (Datadog Monitoring)
5. Install Datadog Agent in EKS
Datadog supports installation through Datadog Operator, Helm, or manual DaemonSet. Datadog currently recommends the Operator for Kubernetes because it reduces misconfiguration risk, but Helm is also a very common production approach. (Datadog Monitoring)
For EKS, the standard model is:
Datadog Agent = DaemonSet
Datadog Cluster Agent = Deployment
Application Pod sends logs/traces/metrics to local node Agent
5.1 Create namespace and secret
kubectl create namespace datadog
kubectl -n datadog create secret generic datadog-secret \
--from-literal api-key="$DD_API_KEY"
5.2 Example datadog-values.yaml
This is a practical production-style baseline for EKS application error tracking:
targetSystem: linux
datadog:
  apiKeyExistingSecret: datadog-secret
  # Example: datadoghq.com, datadoghq.eu, us3.datadoghq.com, us5.datadoghq.com, ap1.datadoghq.com
  site: datadoghq.com
  clusterName: eks-prod-apne1-01
  kubeStateMetricsCore:
    enabled: true
  collectEvents: true
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    socketEnabled: true
    portEnabled: false
  dogstatsd:
    originDetection: true
    useSocketVolume: true
    socketPath: /var/run/datadog/dsd.socket
    tagCardinality: orchestrator
  tags:
    - cloud:aws
    - platform:eks
    - owner:devops
clusterAgent:
  enabled: true
  admissionController:
    enabled: true
    mutateUnlabelled: false
agents:
  containers:
    agent:
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          memory: 512Mi
Important notes:
logs.enabled and containerCollectAll allow the Agent to collect container logs. Datadog's Kubernetes log collection docs show enabling features.logCollection.enabled and containerCollectAll with the Operator; the Helm values above express the same intent for Helm-based installs. (Datadog Monitoring)
dogstatsd.originDetection helps the Agent identify which container/pod emitted DogStatsD metrics. Datadog documents that DogStatsD origin detection can tag metrics with the same pod tags as Autodiscovery metrics, but the Agent-side origin detection is not enabled by default unless configured. (Datadog Monitoring)
apm.socketEnabled and dogstatsd.useSocketVolume use Unix Domain Socket communication. For Kubernetes APM, Datadog supports UDS, host IP, or Kubernetes service communication, and recommends UDS for trace submission. (Datadog Monitoring)
5.3 Install or upgrade Agent
helm repo add datadog https://helm.datadoghq.com
helm repo update
helm upgrade --install datadog-agent datadog/datadog \
  -n datadog \
  -f datadog-values.yaml
Verify:
kubectl -n datadog get pods
kubectl -n datadog get ds
kubectl -n datadog get deploy
Expected resources (exact names depend on the Helm release name):
datadog-agent DaemonSet
datadog-cluster-agent Deployment
6. Add unified service tags to your application
This is the most important part for relationship-building.
Every application Deployment should have:
env
service
version
Example:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: prod
  labels:
    tags.datadoghq.com/env: "prod"
    tags.datadoghq.com/service: "checkout-api"
    tags.datadoghq.com/version: "1.8.4"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
        tags.datadoghq.com/env: "prod"
        tags.datadoghq.com/service: "checkout-api"
        tags.datadoghq.com/version: "1.8.4"
      annotations:
        admission.datadoghq.com/enabled: "true"
        ad.datadoghq.com/checkout-api.logs: '[{"source":"java","service":"checkout-api"}]'
    spec:
      containers:
        - name: checkout-api
          image: myrepo/checkout-api:1.8.4
          env:
            - name: DD_ENV
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/env']
            - name: DD_SERVICE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/service']
            - name: DD_VERSION
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/version']
            - name: DD_ENTITY_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
            - name: DD_TRACE_AGENT_URL
              value: "unix:///var/run/datadog/apm.socket"
            - name: DOGSTATSD_SOCKET
              value: "/var/run/datadog/dsd.socket"
          volumeMounts:
            - name: datadog-socket
              mountPath: /var/run/datadog
              readOnly: true
      volumes:
        - name: datadog-socket
          hostPath:
            path: /var/run/datadog
Datadog's Kubernetes unified service tagging documentation recommends applying tags.datadoghq.com/env, tags.datadoghq.com/service, and tags.datadoghq.com/version labels at the Deployment and pod template levels, and exposing them to the container as DD_ENV, DD_SERVICE, and DD_VERSION. (Datadog Monitoring)
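Inside the container, the three variables can be read back and turned into tags. The sketch below is a hypothetical helper (unified_service_tags is not a Datadog API) that uses only the Python standard library; official client libraries may already add these tags on their own, so treat it as an illustration of the mapping rather than something you must write.

```python
import os

# Tag key -> environment variable, as exposed by the Deployment above.
TAG_SOURCES = {"env": "DD_ENV", "service": "DD_SERVICE", "version": "DD_VERSION"}

def unified_service_tags(environ=None):
    """Build key:value tags from DD_ENV, DD_SERVICE and DD_VERSION."""
    environ = os.environ if environ is None else environ
    tags = []
    for tag_key, var_name in TAG_SOURCES.items():
        value = environ.get(var_name)
        if value:  # skip unset or empty variables instead of emitting "env:"
            tags.append(f"{tag_key}:{value}")
    return tags

if __name__ == "__main__":
    print(unified_service_tags())
```

Passing a plain dict instead of os.environ makes the helper easy to check in a unit test.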
7. Understand the exact error-tracking flow
sequenceDiagram
participant User
participant App as App Container
participant DogStatsD as DogStatsD Socket
participant Logs as Container Logs
participant APM as APM Tracer
participant Agent as Datadog Agent
participant DD as Datadog
participant ET as Error Tracking
User->>App: API request
App->>App: Exception occurs
App->>DogStatsD: increment app.error.count
DogStatsD->>Agent: custom metric with tags
App->>Logs: write structured ERROR log with stack trace
Logs->>Agent: collect stdout/stderr logs
App->>APM: mark span as error
APM->>Agent: send trace/span data
Agent->>Agent: attach Kubernetes metadata
Agent->>DD: send metrics, logs, traces
DD->>ET: group similar errors into issues
ET->>DD: issue with service/env/version/pod/container context
8. Implement DogStatsD error metrics
DogStatsD should be used for counting and alerting, not for full stack traces.
Recommended metric:
app.error.count
Recommended tags:
env
service
version
error_type
operation
endpoint
http_status
handled
Avoid high-cardinality tags:
request_id
user_id
session_id
order_id
full_url
full_error_message
stack_trace
pod_name unless intentionally needed
Bad metric design:
app.error.count{user_id:12345,request_id:abc,error_message:payment failed for order 998877}
Good metric design:
app.error.count{
env:prod,
service:checkout-api,
version:1.8.4,
error_type:PaymentTimeoutException,
operation:checkout,
endpoint:/api/checkout,
http_status:500,
handled:false
}
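One way to keep the bad tags out is to filter them in a single place before they reach the metrics client. This is a minimal sketch, assuming a hypothetical helper named build_error_tags and a deny-list taken from the high-cardinality list above; it is not part of any Datadog library.

```python
# Keys that should never become metric tags (from the list above).
HIGH_CARDINALITY_KEYS = {
    "request_id", "user_id", "session_id", "order_id",
    "full_url", "full_error_message", "stack_trace",
}

def build_error_tags(**fields):
    """Return sorted key:value tags, dropping high-cardinality keys and None values."""
    tags = []
    for key, value in sorted(fields.items()):
        if key in HIGH_CARDINALITY_KEYS or value is None:
            continue
        if isinstance(value, bool):
            value = str(value).lower()  # handled:false rather than handled:False
        tags.append(f"{key}:{value}")
    return tags

if __name__ == "__main__":
    print(build_error_tags(error_type="PaymentTimeoutException",
                           operation="checkout", user_id=12345, handled=False))
```

The dropped fields still belong in the error log, where cardinality does not affect custom metric counts.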
8.1 Generic application pattern
try:
    process_request()
except Exception as err:
    dogstatsd.increment(
        "app.error.count",
        tags=[
            "error_type:" + err.class_name,
            "operation:checkout",
            "endpoint:/api/checkout",
            "http_status:500",
            "handled:false"
        ]
    )
    logger.error(
        "Checkout failed",
        error=err,
        stack_trace=true,
        fields={
            "error.kind": err.class_name,
            "error.message": err.message,
            "error.stack": err.stack,
            "operation": "checkout",
            "endpoint": "/api/checkout",
            "http.status_code": 500
        }
    )
    raise
8.2 Python example
import os
import traceback
import logging
from datadog import DogStatsd

logger = logging.getLogger(__name__)
# DogStatsD client over the Unix domain socket mounted from the node Agent
statsd = DogStatsd(
    socket_path=os.getenv("DOGSTATSD_SOCKET", "/var/run/datadog/dsd.socket")
)

def checkout(request):
    try:
        # business logic here
        process_payment(request)
    except Exception as exc:
        error_type = exc.__class__.__name__
        stack = traceback.format_exc()
        # Low-cardinality counter for dashboards and alerts
        statsd.increment(
            "app.error.count",
            tags=[
                f"error_type:{error_type}",
                "operation:checkout",
                "endpoint:/api/checkout",
                "http_status:500",
                "handled:false",
            ],
        )
        # The extra fields only reach Datadog if the logger uses a JSON formatter that emits them
        logger.error(
            "Checkout failed",
            extra={
                "status": "error",
                "error.kind": error_type,
                "error.message": str(exc),
                "error.stack": stack,
                "operation": "checkout",
                "endpoint": "/api/checkout",
                "http.status_code": 500,
            },
        )
        raise
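For reference, what the client puts on the socket for a counter is a short plain-text datagram of the form name:value|c|#tag1,tag2. The sketch below builds that string by hand, purely to show what the Agent receives; format_counter is a hypothetical name, and in real code you would let the client library do this.

```python
def format_counter(name, value=1, tags=None):
    """Format a DogStatsD counter datagram: <name>:<value>|c|#<tag>,<tag>."""
    datagram = f"{name}:{value}|c"
    if tags:
        datagram += "|#" + ",".join(tags)
    return datagram

if __name__ == "__main__":
    print(format_counter("app.error.count", 1,
                         ["operation:checkout", "handled:false"]))
```

Seeing the format makes it clear why stack traces do not belong here: every distinct tag value becomes part of the metric identity.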
8.3 Node.js example
const StatsD = require("hot-shots");
const logger = require("./logger");

// hot-shots defaults to UDP; protocol "uds" is needed for the socket path to be used
const dogstatsd = new StatsD({
  protocol: "uds",
  path: process.env.DOGSTATSD_SOCKET || "/var/run/datadog/dsd.socket"
});

async function checkout(req, res) {
  try {
    await processPayment(req.body);
    res.status(200).send({ status: "ok" });
  } catch (err) {
    // Low-cardinality counter for dashboards and alerts
    dogstatsd.increment("app.error.count", 1, [
      `error_type:${err.name}`,
      "operation:checkout",
      "endpoint:/api/checkout",
      "http_status:500",
      "handled:false"
    ]);
    // Structured error log carrying the stack trace for Error Tracking
    logger.error({
      status: "error",
      message: "Checkout failed",
      "error.kind": err.name,
      "error.message": err.message,
      "error.stack": err.stack,
      operation: "checkout",
      endpoint: "/api/checkout",
      "http.status_code": 500
    });
    throw err;
  }
}
9. Implement structured logs for Error Tracking
This is where you get the Sentry-like stack trace.
For Datadog Error Tracking from backend logs, the log should include:
status = ERROR / CRITICAL / ALERT / EMERGENCY
service
error.kind or error.stack
Datadog documents that backend error logs need either error.kind or a valid error.stack, a service attribute, and an error-level status. For better grouping, include error.message and error.stack. (Datadog Monitoring)
Recommended JSON log:
{
"timestamp": "2026-05-18T10:00:00.000Z",
"status": "error",
"service": "checkout-api",
"env": "prod",
"version": "1.8.4",
"message": "Checkout failed",
"error.kind": "PaymentTimeoutException",
"error.message": "Payment provider timed out",
"error.stack": "PaymentTimeoutException: Payment provider timed out\n at CheckoutService.pay...",
"operation": "checkout",
"endpoint": "/api/checkout",
"http.status_code": 500
}
Recommended log rule:
Application logs should go to stdout/stderr.
Datadog Agent should collect container logs from the node.
Logs should be JSON if possible.
Each error log should contain service, env, version, error.kind, error.message, error.stack.
For Kubernetes, Datadog recommends Agent-based log collection and can collect logs from Kubernetes log files. File-based collection is preferred over Docker socket-based collection for performance and reliability in containerized environments. (Datadog Monitoring)
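The log rule above can be met with the standard library alone. This is a minimal sketch, assuming hypothetical names (ErrorJsonFormatter, format_exception_log) and the field names from the recommended JSON log; production setups often use a dedicated JSON logging library instead.

```python
import json
import logging
import traceback

class ErrorJsonFormatter(logging.Formatter):
    """Render log records as one JSON object per line with error.* fields."""

    def __init__(self, service, env, version):
        super().__init__()
        self.base = {"service": service, "env": env, "version": version}

    def format(self, record):
        payload = dict(self.base)
        payload["status"] = record.levelname.lower()
        payload["message"] = record.getMessage()
        if record.exc_info:
            exc_type, exc_value, exc_tb = record.exc_info
            payload["error.kind"] = exc_type.__name__
            payload["error.message"] = str(exc_value)
            payload["error.stack"] = "".join(
                traceback.format_exception(exc_type, exc_value, exc_tb)
            )
        return json.dumps(payload)

def format_exception_log(formatter, message, exc):
    """Format a single ERROR record for an exception object (handy for tests)."""
    record = logging.LogRecord(
        "app", logging.ERROR, "app.py", 0, message, None,
        (type(exc), exc, exc.__traceback__),
    )
    return formatter.format(record)

if __name__ == "__main__":
    handler = logging.StreamHandler()  # writes to stderr, which the Agent collects
    handler.setFormatter(ErrorJsonFormatter("checkout-api", "prod", "1.8.4"))
    log = logging.getLogger("checkout")
    log.addHandler(handler)
    try:
        raise TimeoutError("Payment provider timed out")
    except TimeoutError:
        log.exception("Checkout failed")
```

Using logger.exception inside the except block is what populates exc_info, and therefore error.kind, error.message, and error.stack.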
10. Implement APM for request-level debugging
APM is what lets you answer:
Which API failed?
Which downstream service failed?
Was it database, cache, third-party API, timeout, or code exception?
Which trace/log belongs to this error?
Flow:
flowchart TD
REQ[Incoming Request /api/checkout] --> SPAN1[checkout-api span]
SPAN1 --> SPAN2[payment-service HTTP call]
SPAN1 --> SPAN3[database query]
SPAN2 --> ERR[Timeout Exception]
ERR --> TRACE[Trace marked as error]
TRACE --> ET[Error Tracking Issue]
TRACE --> LOG[Connected Logs]
TRACE --> POD[Pod and Container Metadata]
Recommended APM environment variables:
env:
  - name: DD_ENV
    value: "prod"
  - name: DD_SERVICE
    value: "checkout-api"
  - name: DD_VERSION
    value: "1.8.4"
  - name: DD_TRACE_AGENT_URL
    value: "unix:///var/run/datadog/apm.socket"
  - name: DD_LOGS_INJECTION
    value: "true"
  - name: DD_RUNTIME_METRICS_ENABLED
    value: "true"
Datadog APM on Kubernetes supports UDS, host IP, or Kubernetes service routing for traces. In containerized environments, sending traces to localhost is usually wrong because the Agent is in another container/pod; for Kubernetes, use UDS, node host IP, Admission Controller injection, or a Kubernetes service pattern. (Datadog Monitoring)
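A quick startup check can catch a missing socket mount before the first trace or metric is silently dropped. The sketch below uses only the Python standard library; the function names are hypothetical and the default paths are simply the ones used in this guide.

```python
import os
import stat

# Socket paths used in this guide; adjust if your Agent is configured differently.
DEFAULT_SOCKETS = {
    "apm": "/var/run/datadog/apm.socket",
    "dogstatsd": "/var/run/datadog/dsd.socket",
}

def is_unix_socket(path):
    """Return True only if path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        return False

def missing_sockets(sockets=None):
    """Return the names of expected Agent sockets that are not available."""
    sockets = DEFAULT_SOCKETS if sockets is None else sockets
    return sorted(name for name, path in sockets.items() if not is_unix_socket(path))

if __name__ == "__main__":
    missing = missing_sockets()
    if missing:
        print("Datadog Agent sockets not found:", ", ".join(missing))
```

Logging the result once at startup is usually enough; it points directly at a missing hostPath volume or an Agent that is not running on the node.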
11. How Error Tracking groups errors
Datadog Error Tracking groups similar errors into issues based on properties such as:
service
error.type / error.kind
error.message
error.stack
top meaningful stack frame
So two errors may become separate issues if they happen in different services or have different error types/stack-frame locations. (Datadog Monitoring)
Example:
checkout-api + PaymentTimeoutException + CheckoutService.pay()
= One Error Tracking issue
payment-service + PaymentTimeoutException + PaymentClient.call()
= Different Error Tracking issue
This is why service, error.kind, and error.stack matter so much.
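To make the idea concrete, here is a toy illustration of grouping by those three properties. It is not Datadog's actual fingerprinting algorithm, which is more involved; the top_frame field and both function names are assumptions made up for this example.

```python
from collections import Counter

def group_key(event):
    """Toy grouping key: service, error kind, and top stack frame."""
    return (event["service"], event["error.kind"], event["top_frame"])

def count_issues(events):
    """Count occurrences per toy grouping key."""
    return Counter(group_key(event) for event in events)

if __name__ == "__main__":
    events = [
        {"service": "checkout-api", "error.kind": "PaymentTimeoutException",
         "top_frame": "CheckoutService.pay()"},
        {"service": "checkout-api", "error.kind": "PaymentTimeoutException",
         "top_frame": "CheckoutService.pay()"},
        {"service": "payment-service", "error.kind": "PaymentTimeoutException",
         "top_frame": "PaymentClient.call()"},
    ]
    for key, occurrences in count_issues(events).items():
        print(key, occurrences)
```

The two checkout-api events collapse into one group, while the payment-service event forms its own, mirroring the example above.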
12. Recommended tag strategy
Mandatory tags
| Tag | Example | Purpose |
|---|---|---|
| env | prod | Separate prod/stage/dev |
| service | checkout-api | Service-level ownership |
| version | 1.8.4 | Release/deployment tracking |
Strongly recommended tags
| Tag | Example | Purpose |
|---|---|---|
| team | payments | Ownership |
| product | motoshare | Product/application grouping |
| component | api | API/worker/consumer grouping |
| operation | checkout | Business flow |
| endpoint | /api/checkout | API route |
| error_type | PaymentTimeoutException | Error classification |
| handled | true/false | Handled vs unhandled error |
| cloud | aws | Cloud provider |
| platform | eks | Runtime platform |
Kubernetes tags Datadog can add
kube_cluster_name
kube_namespace
kube_deployment
kube_replica_set
pod_name
container_name
image_name
image_tag
node
availability_zone
For DogStatsD metrics, be careful with tag cardinality. Datadog notes that for UDP DogStatsD, pod_name is not added by default to avoid creating too many custom metrics, and tag cardinality can be controlled globally or per metric. (Datadog Monitoring)
My recommendation:
Use service/version-level DogStatsD metrics for alerting.
Use logs/APM/Error Tracking for exact pod/container investigation.
Use pod-level metric tagging only when you really need it.
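A rough way to see why pod-level tagging is expensive is to multiply the number of distinct values per tag. The sketch below computes that upper bound; the counts are invented for illustration, and the real number of time series depends on which combinations actually occur.

```python
from math import prod

def max_tag_combinations(distinct_values_per_tag):
    """Upper bound on time series for one metric: product of distinct tag values."""
    return prod(distinct_values_per_tag.values())

if __name__ == "__main__":
    # Hypothetical counts for a single service.
    service_level = {"version": 2, "error_type": 10, "operation": 5, "http_status": 4}
    with_pods = dict(service_level, pod_name=30)
    print(max_tag_combinations(service_level))  # 2 * 10 * 5 * 4
    print(max_tag_combinations(with_pods))      # the same, times 30 pods
```

Adding pod_name multiplies the bound by the number of pods, and pod names change on every rollout, which is why the recommendation keeps pod-level detail in logs and traces.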
13. Complete application telemetry flow
flowchart TD
A[Exception in Application] --> B{Telemetry Type}
B --> C[DogStatsD Counter]
C --> C1[app.error.count]
C1 --> C2[Alert: Error spike by service/version]
B --> D[Structured Error Log]
D --> D1[error.kind]
D --> D2[error.message]
D --> D3[error.stack]
D3 --> D4[Error Tracking Issue]
B --> E[APM Trace]
E --> E1[Trace marked error]
E1 --> E2[Request path]
E2 --> E3[Downstream dependency failure]
B --> F[Kubernetes Metadata]
F --> F1[pod_name]
F --> F2[container_name]
F --> F3[kube_deployment]
F --> F4[node]
C2 --> G[Datadog Incident / Monitor]
D4 --> G
E3 --> G
F4 --> G
14. Build dashboards
14.1 Error count by service
sum:app.error.count{env:prod} by {service}.as_count()
14.2 Error count by version
sum:app.error.count{env:prod,service:checkout-api} by {version}.as_count()
Use this to answer:
Did the new release increase errors?
14.3 Error count by operation
sum:app.error.count{env:prod,service:checkout-api} by {operation}.as_count()
Use this to answer:
Which business flow is failing?
14.4 Error count by error type
sum:app.error.count{env:prod,service:checkout-api} by {error_type}.as_count()
Use this to answer:
Which exception is most common?
14.5 Error count by Kubernetes deployment
sum:app.error.count{env:prod} by {kube_namespace,kube_deployment}.as_count()
Use this to answer:
Which deployment is producing the errors?
14.6 Pod-level view
Only use this if your DogStatsD metric cardinality/tagging supports it:
sum:app.error.count{env:prod,service:checkout-api} by {pod_name}.as_count()
For exact pod-level investigation, I would rely more on logs/APM/Error Tracking because pod-level metrics can create high cardinality and cost/noise.
15. Build monitors and alerts
15.1 Metric monitor: service error spike
sum(last_5m):sum:app.error.count{env:prod,service:checkout-api}.as_count() > 50
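Alert message (the {{service.name}}, {{env.name}}, and {{version.name}} template variables are only filled in when the monitor query groups by those tags, for example by {service,env,version}):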
High application error count detected.
Service: {{service.name}}
Environment: {{env.name}}
Version: {{version.name}}
Check:
- Error Tracking issue
- APM trace
- Logs for error.stack
- Kubernetes pod/container details
15.2 Metric monitor: new version error spike
sum(last_10m):sum:app.error.count{env:prod,service:checkout-api} by {version}.as_count() > 100
Use this after deployments.
15.3 Error Tracking monitor: new issue
Use this for Sentry-like behavior:
Alert when a new backend issue appears for service:checkout-api env:prod
Datadog Error Tracking monitors support alerting on new issues, regressions, and high-impact errors. (Datadog Monitoring)
15.4 APM monitor: error rate
Example logic:
Error rate for checkout-api > 5% during last 5 minutes
Use this for service reliability monitoring.
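The threshold logic itself is simple arithmetic: errors divided by total requests in the window, compared against 5%. The sketch below spells that out with hypothetical function names; Datadog evaluates this for you in the monitor, so this is only to make the rule explicit.

```python
def error_rate(errors, total):
    """Fraction of failed requests; 0.0 when there was no traffic."""
    return 0.0 if total == 0 else errors / total

def breaches_threshold(errors, total, threshold=0.05):
    """True when the error rate is strictly above the threshold (5% by default)."""
    return error_rate(errors, total) > threshold

if __name__ == "__main__":
    print(breaches_threshold(errors=6, total=100))
    print(breaches_threshold(errors=5, total=100))
```

Treating zero traffic as a rate of 0.0 avoids paging on quiet windows; whether that is the right choice depends on how you want to handle missing data.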
16. Recommended alerting strategy
Do not create only one giant alert.
Use layered alerting:
flowchart TD
A[Application Errors] --> B[Metric Alert]
A --> C[Error Tracking New Issue Alert]
A --> D[APM Error Rate Alert]
A --> E[Kubernetes Pod Restart Alert]
B --> F[High volume problem]
C --> G[New code issue]
D --> H[Request failure problem]
E --> I[Runtime/container problem]
F --> J[Incident]
G --> J
H --> J
I --> J
| Alert type | Detects | Best for |
|---|---|---|
| DogStatsD metric alert | Error volume spike | Fast service-level alert |
| Error Tracking alert | New/regressed grouped error | Sentry-like issue detection |
| APM error rate alert | Request failure percentage | API/SLO reliability |
| Log alert | Specific log pattern | Known failure modes |
| Kubernetes alert | CrashLoopBackOff/restarts | Pod/container health |
17. Best practice: use DogStatsD for counters, not stack traces
DogStatsD should answer:
How many errors happened?
Which service/version/operation is failing?
Did errors increase after deployment?
DogStatsD should not answer:
What is the stack trace?
Which line of code failed?
What was the exception body?
What user/request caused this?
Those belong in:
APM
Logs
Error Tracking
Trace/log correlation
18. Best practice: standardize error classification
Create a small taxonomy across all services.
Example:
validation_error
dependency_timeout
database_error
authentication_error
authorization_error
business_rule_error
unexpected_exception
Then tag DogStatsD metrics like this:
error_category:dependency_timeout
error_type:PaymentTimeoutException
operation:checkout
This gives clean dashboards:
Errors by category
Errors by operation
Errors by service
Errors by version
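A taxonomy like this is easiest to enforce through one shared mapping from exception types to categories. The sketch below is a minimal example using built-in Python exceptions as stand-ins; the mapping, classify, and taxonomy_tags are all hypothetical and would be replaced by your own exception classes.

```python
# Ordered (exception type, category) pairs; the first match wins.
CATEGORY_BY_EXCEPTION = [
    (TimeoutError, "dependency_timeout"),
    (PermissionError, "authorization_error"),
    (ValueError, "validation_error"),
]

def classify(exc):
    """Map an exception to a taxonomy category, defaulting to unexpected_exception."""
    for exc_type, category in CATEGORY_BY_EXCEPTION:
        if isinstance(exc, exc_type):
            return category
    return "unexpected_exception"

def taxonomy_tags(exc, operation):
    """Build the three tags recommended above for a DogStatsD error metric."""
    return [
        f"error_category:{classify(exc)}",
        f"error_type:{type(exc).__name__}",
        f"operation:{operation}",
    ]

if __name__ == "__main__":
    print(taxonomy_tags(TimeoutError("payment provider"), "checkout"))
```

Keeping the category list short and fixed is what keeps error_category a low-cardinality tag.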
19. Best practice: release/version tracking
Every deployment should set a unique version.
Good:
version: 1.8.4
version: git-sha-a8f91cd
version: 2026.05.18.1
Bad:
version: latest
version: prod
version: main
Datadog expects version to change with each application deployment so deployment impact can be identified cleanly. (Datadog Monitoring)
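A small check in the deployment pipeline can reject the bad values before they ship. This is a sketch under the assumption that the only rule is "not empty and not one of the mutable names listed above"; is_release_version is a hypothetical helper, and you may want stricter rules such as a semver or git-SHA pattern.

```python
# Values that do not change per deployment (from the "Bad" list above).
MUTABLE_VERSIONS = {"latest", "prod", "main"}

def is_release_version(version):
    """Return True when the version looks unique per deployment."""
    normalized = version.strip().lower()
    return bool(normalized) and normalized not in MUTABLE_VERSIONS

if __name__ == "__main__":
    for candidate in ["1.8.4", "git-sha-a8f91cd", "latest", "main"]:
        print(candidate, is_release_version(candidate))
```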
20. Best practice: log format
Use JSON logs.
Recommended fields:
{
"status": "error",
"service": "checkout-api",
"env": "prod",
"version": "1.8.4",
"message": "Checkout failed",
"error.kind": "PaymentTimeoutException",
"error.message": "Payment provider timed out",
"error.stack": "...",
"operation": "checkout",
"endpoint": "/api/checkout",
"http.method": "POST",
"http.status_code": 500,
"customer_impact": true
}
Avoid logging sensitive data:
password
token
credit card
personal identity data
full request payloads
authorization headers
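Redaction is most reliable when it happens in one place, just before the log line is serialized. The sketch below is a minimal example; the key list and the redact helper are assumptions for illustration, and key-based redaction will not catch secrets embedded inside free-text messages.

```python
# Field names treated as sensitive; extend this for your own payloads.
SENSITIVE_KEYS = {"password", "token", "authorization", "credit_card"}

def redact(fields):
    """Return a copy of fields with sensitive values replaced by a placeholder."""
    return {
        key: "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        for key, value in fields.items()
    }

if __name__ == "__main__":
    print(redact({"operation": "checkout", "Authorization": "Bearer abc", "token": "t-123"}))
```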
21. Best practice: deployment annotation for logs
For each application container, add Datadog log annotation:
annotations:
  ad.datadoghq.com/checkout-api.logs: >
    [{
      "source": "java",
      "service": "checkout-api",
      "tags": ["team:payments","component:api"]
    }]
Use the right source value:
| App language/runtime | source |
|---|---|
| Java | java |
| Node.js | nodejs |
| Python | python |
| Go | go |
| .NET | csharp or configured .NET source |
| Ruby | ruby |
The source tag matters because Datadog's Error Tracking for logs uses language-specific handling, and Datadog recommends ensuring the source tag is properly configured. (Datadog Monitoring)
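If you template manifests from code, the annotation can be generated so that the source always matches the runtime. The sketch below is a hypothetical helper built on the table above (using csharp for .NET); the runtime keys and function name are assumptions, not Datadog identifiers.

```python
import json

# Runtime -> Datadog log source, following the table above.
SOURCE_BY_RUNTIME = {
    "java": "java", "nodejs": "nodejs", "python": "python",
    "go": "go", "dotnet": "csharp", "ruby": "ruby",
}

def log_annotation(container, runtime, service, tags=()):
    """Return the (annotation key, JSON value) pair for a container's log config."""
    key = f"ad.datadoghq.com/{container}.logs"
    config = {"source": SOURCE_BY_RUNTIME[runtime], "service": service}
    if tags:
        config["tags"] = list(tags)
    return key, json.dumps([config])

if __name__ == "__main__":
    key, value = log_annotation("checkout-api", "java", "checkout-api",
                                ["team:payments", "component:api"])
    print(f"{key}: '{value}'")
```

An unknown runtime raises KeyError, which is deliberate: a missing source is better caught at render time than discovered later as ungrouped errors.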
22. Pod/container relationship design
The relationship is built from three places:
flowchart TD
A[Application Deployment Labels] --> D[env/service/version]
B[Datadog Agent Kubernetes Metadata] --> E[pod/container/deployment/node]
C[Application Logs/APM/DogStatsD] --> F[error/trace/metric]
D --> G[Unified Datadog View]
E --> G
F --> G
G --> H[Which service failed?]
G --> I[Which version failed?]
G --> J[Which pod/container failed?]
G --> K[Which node hosted it?]
To make this work:
1. Datadog Agent must run in the cluster.
2. App pods must have unified service tags.
3. Logs/APM/DogStatsD must use the same service/env/version.
4. Error logs must include error.kind/error.stack.
5. APM tracer should inject trace/log correlation where supported.
6. DogStatsD origin detection should be enabled.
23. EKS-specific implementation notes
Standard EKS with EC2 worker nodes
Recommended:
Datadog Agent as DaemonSet
Use UDS for APM
Use UDS for DogStatsD
Collect container logs from nodes
Use Cluster Agent
Use Admission Controller where possible
EKS Fargate
Be careful. EKS Fargate does not behave like normal EC2 worker nodes because you do not manage the underlying node the same way. Datadog's DogStatsD origin detection docs specifically mention shareProcessNamespace:true to assist the Agent for origin detection on EKS Fargate. (Datadog Monitoring)
If you are using Fargate, validate the Datadog deployment pattern separately.
24. End-to-end sample implementation
24.1 Datadog Agent values
targetSystem: linux
datadog:
  apiKeyExistingSecret: datadog-secret
  site: datadoghq.com
  clusterName: eks-prod-apne1-01
  logs:
    enabled: true
    containerCollectAll: true
  apm:
    socketEnabled: true
    portEnabled: false
  dogstatsd:
    originDetection: true
    useSocketVolume: true
    socketPath: /var/run/datadog/dsd.socket
    tagCardinality: orchestrator
  kubeStateMetricsCore:
    enabled: true
  collectEvents: true
clusterAgent:
  enabled: true
  admissionController:
    enabled: true
    mutateUnlabelled: false
24.2 App deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: prod
  labels:
    tags.datadoghq.com/env: "prod"
    tags.datadoghq.com/service: "checkout-api"
    tags.datadoghq.com/version: "1.8.4"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
        tags.datadoghq.com/env: "prod"
        tags.datadoghq.com/service: "checkout-api"
        tags.datadoghq.com/version: "1.8.4"
      annotations:
        admission.datadoghq.com/enabled: "true"
        ad.datadoghq.com/checkout-api.logs: >
          [{
            "source": "java",
            "service": "checkout-api",
            "tags": ["team:payments","component:api"]
          }]
    spec:
      containers:
        - name: checkout-api
          image: myrepo/checkout-api:1.8.4
          env:
            - name: DD_ENV
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/env']
            - name: DD_SERVICE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/service']
            - name: DD_VERSION
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/version']
            - name: DD_ENTITY_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid
            - name: DD_TRACE_AGENT_URL
              value: "unix:///var/run/datadog/apm.socket"
            - name: DD_LOGS_INJECTION
              value: "true"
            - name: DD_RUNTIME_METRICS_ENABLED
              value: "true"
            - name: DOGSTATSD_SOCKET
              value: "/var/run/datadog/dsd.socket"
          volumeMounts:
            - name: datadog-socket
              mountPath: /var/run/datadog
              readOnly: true
      volumes:
        - name: datadog-socket
          hostPath:
            path: /var/run/datadog
25. Validation checklist
25.1 Validate Datadog Agent
kubectl -n datadog get pods
kubectl -n datadog get ds
kubectl -n datadog get deploy
25.2 Check Agent status
kubectl -n datadog exec -it <datadog-agent-pod-name> -c agent -- agent status
Look for:
APM Agent: Running
DogStatsD: Running
Logs Agent: Running
Datadog's APM troubleshooting guide says the Agent status output should show the APM Agent as running; otherwise traces cannot be submitted properly. (Datadog Monitoring)
25.3 Validate app tags
Check pod labels:
kubectl -n prod get pod <pod-name> --show-labels
Expected:
tags.datadoghq.com/env=prod
tags.datadoghq.com/service=checkout-api
tags.datadoghq.com/version=1.8.4
25.4 Validate logs
Generate a test exception, then search logs by:
service:checkout-api env:prod status:error
Expected fields:
error.kind
error.message
error.stack
kube_namespace
pod_name
container_name
25.5 Validate DogStatsD metric
Search metric:
app.error.count
Group by:
service
version
error_type
operation
25.6 Validate APM
Search service:
service:checkout-api env:prod
Expected:
Traces visible
Error traces visible
Service map visible
Trace/log correlation working
25.7 Validate Error Tracking
Search backend issues for:
service:checkout-api env:prod
Expected:
Grouped error issue
Stack trace visible
Occurrences visible
Related logs/traces visible
26. Common problems and fixes
| Problem | Likely cause | Fix |
|---|---|---|
| Error metric appears but no pod/container | DogStatsD origin detection/cardinality issue | Enable origin detection; use UDS; review tag cardinality |
| Error Tracking issue not created | Logs missing error.kind or error.stack | Add structured error fields |
| Logs visible but service name wrong | Missing log annotation or unified tags | Add service in log config and DD_SERVICE |
| APM traces missing | App cannot reach Agent | Use UDS or correct DD_AGENT_HOST; check Agent status |
| Trace/log correlation missing | Log injection not enabled | Enable tracer log injection |
| Too many custom metrics | High-cardinality metric tags | Remove request_id, user_id, pod_name from metrics |
| New release not visible | Static or missing version | Set unique DD_VERSION per deployment |
| Pod error not visible in metric | Pod tag not included for cardinality reasons | Use logs/APM for pod-level RCA or adjust cardinality carefully |
| Logs not collected | Agent log collection disabled | Enable container log collection |
27. Best implementation pattern for your migration
Do not migrate like this:
Sentry → DogStatsD only
That will give weak debugging.
Migrate like this:
Sentry
→ Datadog Error Tracking
→ Datadog APM
→ Datadog Logs
→ DogStatsD custom error metrics
→ Kubernetes metadata correlation
Recommended production pattern:
flowchart TD
A[Sentry Replacement Requirement] --> B[Error Tracking]
A --> C[APM]
A --> D[Logs]
A --> E[DogStatsD Metrics]
B --> F[Grouped Issues]
C --> G[Trace and Dependency RCA]
D --> H[Stack Trace and Context]
E --> I[Fast Error Count Alerts]
F --> J[Service / Env / Version]
G --> J
H --> J
I --> J
J --> K[Kubernetes Pod / Container / Deployment / Node]
28. Final recommended standard
For every service running in EKS, implement this standard:
1. Add Datadog unified service labels:
- tags.datadoghq.com/env
- tags.datadoghq.com/service
- tags.datadoghq.com/version
2. Add application env vars:
- DD_ENV
- DD_SERVICE
- DD_VERSION
- DD_TRACE_AGENT_URL
- DD_LOGS_INJECTION
- DD_ENTITY_ID
3. Enable Datadog Agent features:
- logs
- APM
- DogStatsD
- DogStatsD origin detection
- Kubernetes metadata
- Cluster Agent
4. Application must emit:
- DogStatsD metric: app.error.count
- Structured error log with error.kind/error.message/error.stack
- APM trace/span errors
5. Dashboards should show:
- errors by service
- errors by version
- errors by operation
- errors by error_type
- errors by namespace/deployment
- related pods/containers through logs/APM
6. Alerts should include:
- new Error Tracking issue
- high error count
- high APM error rate
- pod restart/crashloop alerts
29. Final conclusion
The best Datadog design for application error tracking in EKS is:
DogStatsD for custom error counters
Logs for stack traces
APM for request/dependency tracing
Error Tracking for Sentry-like issue grouping
Unified service tagging for service/env/version relationship
Kubernetes metadata for pod/container/node relationship
In short:
DogStatsD tells you how many errors happened.
Logs tell you what exception happened.
APM tells you where in the request path it failed.
Error Tracking groups the issue.
Kubernetes metadata tells you which pod/container/deployment/node caused it.
That combination gives you a clean, production-grade replacement for Sentry while also giving stronger EKS infrastructure correlation than Sentry alone.