Master Guide: Application Error Tracking in EKS using Datadog, DogStatsD, APM, Logs, and Error Tracking

First, a small naming correction: it is DogStatsD, not DogStashD. DogStatsD is Datadog's StatsD-compatible custom metrics service. It is excellent for counting application errors, but it is not a full Sentry replacement by itself. For Sentry-like error debugging, you should combine:

DogStatsD metrics
+ Application logs with stack traces
+ Datadog APM traces
+ Datadog Error Tracking
+ Kubernetes / EKS metadata
+ Unified service tagging

This is the best implementation pattern for your app running inside containers/pods on EKS.


1. What we are trying to build

The target outcome is:

Application error happens
        ↓
Datadog captures error count, log, trace, stack trace
        ↓
Datadog links it to service, env, version
        ↓
Datadog adds Kubernetes context
        ↓
You can identify the exact pod/container/deployment/node

Final relationship should look like this:

Error Issue
  ├── service: checkout-api
  ├── env: prod
  ├── version: 1.8.4
  ├── error.type: PaymentTimeoutException
  ├── endpoint: /api/checkout
  ├── kube_namespace: prod
  ├── kube_deployment: checkout-api
  ├── pod_name: checkout-api-7c9d8f98c9-xz2lp
  ├── container_name: checkout-api
  ├── node: ip-10-0-12-25
  └── trace_id / log correlation

Datadog's unified service tagging is built around the standard env, service, and version tags, which are used to correlate metrics, traces, logs, containers, and deployment versions. (Datadog Monitoring)


2. High-level architecture

flowchart TD
    U[User / Client Request] --> ING[Ingress / ALB / API Gateway]
    ING --> SVC[Kubernetes Service]
    SVC --> POD[Application Pod in EKS]

    POD --> APP[Application Container]

    APP -->|DogStatsD custom error metrics| DSD[Datadog Agent DogStatsD]
    APP -->|APM traces and exceptions| APM[Datadog Agent APM Receiver]
    APP -->|stdout/stderr structured logs| LOGS[Kubernetes Node Log Files]

    LOGS --> AGENT[Datadog Agent DaemonSet]
    DSD --> AGENT
    APM --> AGENT

    KUBE[Kubernetes API / Kubelet Metadata] --> AGENT
    CLUSTER[Datadog Cluster Agent] --> AGENT

    AGENT --> DD[Datadog Platform]

    DD --> METRICS[Metrics Explorer / Dashboards]
    DD --> LOGEXP[Logs Explorer]
    DD --> TRACE[APM Traces / Service Map]
    DD --> ERR[Error Tracking]
    DD --> MON[Monitors / Alerts]

    ERR --> RCA[Root Cause Analysis]
    TRACE --> RCA
    LOGEXP --> RCA
    METRICS --> RCA

3. Sentry to Datadog mapping

| Sentry capability | Datadog equivalent |
| --- | --- |
| Error issue grouping | Datadog Error Tracking |
| Stack trace | APM error span or structured error log |
| Release/version tracking | version tag |
| Environment | env tag |
| Project/service | service tag |
| Error count | DogStatsD custom metric |
| Request trace | Datadog APM |
| Breadcrumb-style context | Logs, trace spans, custom tags |
| Alert on new error | Error Tracking monitor |
| Alert on error volume | Metric monitor or APM monitor |
| Find pod/container | Kubernetes tags from Datadog Agent |

DogStatsD is useful for custom error counters, but Error Tracking, logs, and APM are what give you the Sentry-like debugging experience.


4. Recommended implementation model

Use four data streams together:

flowchart LR
    A[Application Error] --> B[DogStatsD Metric]
    A --> C[Structured Error Log]
    A --> D[APM Trace / Span Error]
    A --> E[Kubernetes Metadata]

    B --> F[Dashboards and Metric Alerts]
    C --> G[Log Search and Error Tracking]
    D --> H[Trace Debugging and Service Map]
    E --> I[Pod / Container / Deployment / Node Relationship]

    F --> J[Datadog Incident View]
    G --> J
    H --> J
    I --> J

| Data type | Purpose | Example |
| --- | --- | --- |
| DogStatsD metric | Count and alert | app.error.count |
| Error log | Stack trace and message | JSON log with error.stack |
| APM trace | Request path and dependency failure | /checkout → payment-service timeout |
| Kubernetes metadata | Pod/container relationship | pod_name, kube_deployment, kube_namespace |
| Error Tracking issue | Group similar errors | PaymentTimeoutException grouped as one issue |

Datadog Error Tracking groups errors into issues and can alert on new, regressed, or high-impact errors. (Datadog Monitoring)


5. Install Datadog Agent in EKS

Datadog supports installation through Datadog Operator, Helm, or manual DaemonSet. Datadog currently recommends the Operator for Kubernetes because it reduces misconfiguration risk, but Helm is also a very common production approach. (Datadog Monitoring)

For EKS, the standard model is:

Datadog Agent = DaemonSet
Datadog Cluster Agent = Deployment
Application Pod sends logs/traces/metrics to local node Agent

5.1 Create namespace and secret

kubectl create namespace datadog

kubectl -n datadog create secret generic datadog-secret \
  --from-literal api-key="$DD_API_KEY"

5.2 Example datadog-values.yaml

This is a practical production-style baseline for EKS application error tracking:

targetSystem: linux

datadog:
  apiKeyExistingSecret: datadog-secret

  # Example: datadoghq.com, datadoghq.eu, us3.datadoghq.com, us5.datadoghq.com, ap1.datadoghq.com
  site: datadoghq.com

  clusterName: eks-prod-apne1-01

  kubeStateMetricsCore:
    enabled: true

  collectEvents: true

  logs:
    enabled: true
    containerCollectAll: true

  apm:
    socketEnabled: true
    portEnabled: false

  dogstatsd:
    originDetection: true
    useSocketVolume: true
    socketPath: /var/run/datadog/dsd.socket
    tagCardinality: orchestrator

  tags:
    - cloud:aws
    - platform:eks
    - owner:devops

clusterAgent:
  enabled: true
  admissionController:
    enabled: true
    mutateUnlabelled: false

agents:
  containers:
    agent:
      resources:
        requests:
          cpu: 200m
          memory: 256Mi
        limits:
          memory: 512Mi

Important notes:

logs.enabled and containerCollectAll allow the Agent to collect container logs. Datadog's Kubernetes log collection docs show enabling features.logCollection.enabled and containerCollectAll with the Operator; the Helm values above express the same intent for Helm-based installs. (Datadog Monitoring)

dogstatsd.originDetection helps the Agent identify which container/pod emitted DogStatsD metrics. Datadog documents that DogStatsD origin detection can tag metrics with the same pod tags as Autodiscovery metrics, but the Agent-side origin detection is not enabled by default unless configured. (Datadog Monitoring)

apm.socketEnabled and dogstatsd.useSocketVolume use Unix Domain Socket communication. For Kubernetes APM, Datadog supports UDS, host IP, or Kubernetes service communication, and recommends UDS for trace submission. (Datadog Monitoring)
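
Because both sockets reach the application through a host mount, a missing mount tends to show up only as silently dropped metrics or traces. A startup check can surface that early. The following is a minimal Python sketch, assuming the socket paths used in this guide; the helper names `is_unix_socket` and `missing_datadog_sockets` are hypothetical and not part of any Datadog library.

```python
from __future__ import annotations

import os
import stat

# Paths assumed from the Helm values in this guide; adjust if yours differ.
DEFAULT_SOCKETS = {
    "dogstatsd": "/var/run/datadog/dsd.socket",
    "apm": "/var/run/datadog/apm.socket",
}


def is_unix_socket(path: str) -> bool:
    """Return True only if the path exists and is a Unix domain socket."""
    try:
        return stat.S_ISSOCK(os.stat(path).st_mode)
    except OSError:
        # Missing path, permission problem, or broken mount.
        return False


def missing_datadog_sockets(sockets: dict[str, str] | None = None) -> list[str]:
    """Return the names of the sockets that are not visible in this container."""
    sockets = DEFAULT_SOCKETS if sockets is None else sockets
    return [name for name, path in sockets.items() if not is_unix_socket(path)]
```

Calling `missing_datadog_sockets()` once at application startup and logging the result is usually enough; an empty list means both sockets are mounted.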

5.3 Install or upgrade Agent

helm repo add datadog https://helm.datadoghq.com
helm repo update

helm upgrade --install datadog-agent datadog/datadog \
  -n datadog \
  -f datadog-values.yaml

Verify:

kubectl -n datadog get pods
kubectl -n datadog get ds
kubectl -n datadog get deploy

Expected resources:

datadog-agent DaemonSet
datadog-agent-cluster-agent Deployment (the name is derived from the Helm release name)

6. Add unified service tags to your application

This is the most important part for relationship-building.

Every application Deployment should have:

env
service
version

Example:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: prod
  labels:
    tags.datadoghq.com/env: "prod"
    tags.datadoghq.com/service: "checkout-api"
    tags.datadoghq.com/version: "1.8.4"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
        tags.datadoghq.com/env: "prod"
        tags.datadoghq.com/service: "checkout-api"
        tags.datadoghq.com/version: "1.8.4"
        # The Admission Controller opt-in is a pod label, not an annotation
        admission.datadoghq.com/enabled: "true"
      annotations:
        ad.datadoghq.com/checkout-api.logs: '[{"source":"java","service":"checkout-api"}]'
    spec:
      containers:
        - name: checkout-api
          image: myrepo/checkout-api:1.8.4
          env:
            - name: DD_ENV
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/env']

            - name: DD_SERVICE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/service']

            - name: DD_VERSION
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/version']

            - name: DD_ENTITY_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid

            - name: DD_TRACE_AGENT_URL
              value: "unix:///var/run/datadog/apm.socket"

            - name: DOGSTATSD_SOCKET
              value: "/var/run/datadog/dsd.socket"

          volumeMounts:
            - name: datadog-socket
              mountPath: /var/run/datadog
              readOnly: true

      volumes:
        - name: datadog-socket
          hostPath:
            path: /var/run/datadog

Datadog's Kubernetes unified service tagging documentation recommends applying tags.datadoghq.com/env, tags.datadoghq.com/service, and tags.datadoghq.com/version labels at the Deployment and pod template levels, and exposing them to the container as DD_ENV, DD_SERVICE, and DD_VERSION. (Datadog Monitoring)


7. Understand the exact error-tracking flow

sequenceDiagram
    participant User
    participant App as App Container
    participant DogStatsD as DogStatsD Socket
    participant Logs as Container Logs
    participant APM as APM Tracer
    participant Agent as Datadog Agent
    participant DD as Datadog
    participant ET as Error Tracking

    User->>App: API request
    App->>App: Exception occurs

    App->>DogStatsD: increment app.error.count
    DogStatsD->>Agent: custom metric with tags

    App->>Logs: write structured ERROR log with stack trace
    Logs->>Agent: collect stdout/stderr logs

    App->>APM: mark span as error
    APM->>Agent: send trace/span data

    Agent->>Agent: attach Kubernetes metadata
    Agent->>DD: send metrics, logs, traces

    DD->>ET: group similar errors into issues
    ET->>DD: issue with service/env/version/pod/container context

8. Implement DogStatsD error metrics

DogStatsD should be used for counting and alerting, not for full stack traces.

Recommended metric:

app.error.count

Recommended tags:

env
service
version
error_type
operation
endpoint
http_status
handled

Avoid high-cardinality tags:

request_id
user_id
session_id
order_id
full_url
full_error_message
stack_trace
pod_name unless intentionally needed

Bad metric design:

app.error.count{user_id:12345,request_id:abc,error_message:payment failed for order 998877}

Good metric design:

app.error.count{
  env:prod,
  service:checkout-api,
  version:1.8.4,
  error_type:PaymentTimeoutException,
  operation:checkout,
  endpoint:/api/checkout,
  http_status:500,
  handled:false
}
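
One way to keep the bad tags out of metrics is to build every tag list through a single helper that drops them. The following is a minimal Python sketch under that assumption; `BLOCKED_TAG_KEYS` and `build_error_tags` are hypothetical names, and the blocked set simply mirrors the high-cardinality list above (pod_name is left out because this guide allows it when intentionally needed).

```python
from __future__ import annotations

# Tag keys this guide recommends keeping off metrics (high cardinality).
BLOCKED_TAG_KEYS = {
    "request_id",
    "user_id",
    "session_id",
    "order_id",
    "full_url",
    "full_error_message",
    "stack_trace",
}


def build_error_tags(**fields: object) -> list[str]:
    """Turn keyword fields into sorted key:value tags, dropping blocked keys."""
    tags = []
    for key, value in sorted(fields.items()):
        if key in BLOCKED_TAG_KEYS or value is None:
            continue
        # Booleans are lowercased so handled=False becomes handled:false.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        tags.append(f"{key}:{text}")
    return tags
```

The returned list can be passed directly as the `tags` argument of a DogStatsD `increment` call. The env, service, and version tags are not built here because the Agent and unified service tagging normally supply them.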

8.1 Generic application pattern

try:
    process_request()
except Exception as err:
    dogstatsd.increment(
        "app.error.count",
        tags=[
            "error_type:" + err.class_name,
            "operation:checkout",
            "endpoint:/api/checkout",
            "http_status:500",
            "handled:false"
        ]
    )

    logger.error(
        "Checkout failed",
        error=err,
        stack_trace=true,
        fields={
            "error.kind": err.class_name,
            "error.message": err.message,
            "error.stack": err.stack,
            "operation": "checkout",
            "endpoint": "/api/checkout",
            "http.status_code": 500
        }
    )

    raise

8.2 Python example

import os
import traceback
import logging
from datadog import DogStatsd

logger = logging.getLogger(__name__)

statsd = DogStatsd(
    socket_path=os.getenv("DOGSTATSD_SOCKET", "/var/run/datadog/dsd.socket")
)

def checkout(request):
    try:
        # business logic here
        process_payment(request)

    except Exception as exc:
        error_type = exc.__class__.__name__
        stack = traceback.format_exc()

        statsd.increment(
            "app.error.count",
            tags=[
                f"error_type:{error_type}",
                "operation:checkout",
                "endpoint:/api/checkout",
                "http_status:500",
                "handled:false",
            ],
        )

        logger.error(
            "Checkout failed",
            extra={
                "status": "error",
                "error.kind": error_type,
                "error.message": str(exc),
                "error.stack": stack,
                "operation": "checkout",
                "endpoint": "/api/checkout",
                "http.status_code": 500,
            },
        )

        raise
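
The same try/except block tends to repeat in every handler, so it can be factored into a decorator. This is a hedged sketch, not a Datadog API: `count_errors` and `RecordingClient` are hypothetical names, and the only assumption about the real client is that it exposes `increment(metric, tags=...)`, as `datadog.DogStatsd` does.

```python
from __future__ import annotations

import functools
from typing import Any, Callable


class RecordingClient:
    """In-memory stand-in for a DogStatsd client, handy for unit tests."""

    def __init__(self) -> None:
        self.calls: list[tuple[str, list[str]]] = []

    def increment(self, metric: str, value: int = 1, tags: list[str] | None = None) -> None:
        self.calls.append((metric, list(tags or [])))


def count_errors(client: Any, operation: str, endpoint: str) -> Callable:
    """Decorator: count any exception as app.error.count, then re-raise it."""

    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args: Any, **kwargs: Any) -> Any:
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                client.increment(
                    "app.error.count",
                    tags=[
                        f"error_type:{type(exc).__name__}",
                        f"operation:{operation}",
                        f"endpoint:{endpoint}",
                        "handled:false",
                    ],
                )
                # Re-raise so APM and the web framework still see the failure.
                raise

        return wrapper

    return decorator
```

In production the `statsd` object from the example above would be passed as `client`; in tests, `RecordingClient` lets you assert on the emitted tags without an Agent.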

8.3 Node.js example

const StatsD = require("hot-shots");
const logger = require("./logger");

const dogstatsd = new StatsD({
  // hot-shots defaults to UDP; "uds" is required to use the socket path
  protocol: "uds",
  path: process.env.DOGSTATSD_SOCKET || "/var/run/datadog/dsd.socket"
});

async function checkout(req, res) {
  try {
    await processPayment(req.body);
    res.status(200).send({ status: "ok" });
  } catch (err) {
    dogstatsd.increment("app.error.count", 1, [
      `error_type:${err.name}`,
      "operation:checkout",
      "endpoint:/api/checkout",
      "http_status:500",
      "handled:false"
    ]);

    logger.error({
      status: "error",
      message: "Checkout failed",
      "error.kind": err.name,
      "error.message": err.message,
      "error.stack": err.stack,
      operation: "checkout",
      endpoint: "/api/checkout",
      "http.status_code": 500
    });

    throw err;
  }
}

9. Implement structured logs for Error Tracking

This is where you get the Sentry-like stack trace.

For Datadog Error Tracking from backend logs, the log should include:

status = ERROR / CRITICAL / ALERT / EMERGENCY
service
error.kind or error.stack

Datadog documents that backend error logs need either error.kind or a valid error.stack, a service attribute, and an error-level status. For better grouping, include error.message and error.stack. (Datadog Monitoring)

Recommended JSON log:

{
  "timestamp": "2026-05-18T10:00:00.000Z",
  "status": "error",
  "service": "checkout-api",
  "env": "prod",
  "version": "1.8.4",
  "message": "Checkout failed",
  "error.kind": "PaymentTimeoutException",
  "error.message": "Payment provider timed out",
  "error.stack": "PaymentTimeoutException: Payment provider timed out\n    at CheckoutService.pay...",
  "operation": "checkout",
  "endpoint": "/api/checkout",
  "http.status_code": 500
}

Recommended log rule:

Application logs should go to stdout/stderr.
Datadog Agent should collect container logs from the node.
Logs should be JSON if possible.
Each error log should contain service, env, version, error.kind, error.message, error.stack.
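
For a Python service, the log rule above can be met with a small formatter on the standard logging module. This is a minimal sketch, assuming DD_SERVICE, DD_ENV, and DD_VERSION are set as shown in the Deployment; `DatadogJsonFormatter` and `configure_json_logging` are hypothetical names, not part of a Datadog library.

```python
import json
import logging
import os
import sys
import traceback
from datetime import datetime, timezone


class DatadogJsonFormatter(logging.Formatter):
    """Render each record as one JSON line with error.* fields when present."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "status": record.levelname.lower(),
            "service": os.getenv("DD_SERVICE", "unknown-service"),
            "env": os.getenv("DD_ENV", "unknown-env"),
            "version": os.getenv("DD_VERSION", "unknown-version"),
            "message": record.getMessage(),
        }
        # exc_info is populated by logger.exception() or exc_info=True.
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc_value, exc_tb = record.exc_info
            payload["error.kind"] = exc_type.__name__
            payload["error.message"] = str(exc_value)
            payload["error.stack"] = "".join(
                traceback.format_exception(exc_type, exc_value, exc_tb)
            )
        return json.dumps(payload)


def configure_json_logging(level: int = logging.INFO) -> None:
    """Send all root-logger output to stdout as JSON lines."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(DatadogJsonFormatter())
    root = logging.getLogger()
    root.handlers = [handler]
    root.setLevel(level)
```

With this in place, calling `logger.exception("Checkout failed")` inside an except block fills the three error fields automatically, so handlers do not need to build them by hand.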

For Kubernetes, Datadog recommends Agent-based log collection and can collect logs from Kubernetes log files. File-based collection is preferred over Docker socket-based collection for performance and reliability in containerized environments. (Datadog Monitoring)


10. Implement APM for request-level debugging

APM is what lets you answer:

Which API failed?
Which downstream service failed?
Was it database, cache, third-party API, timeout, or code exception?
Which trace/log belongs to this error?

Flow:

flowchart TD
    REQ[Incoming Request /api/checkout] --> SPAN1[checkout-api span]
    SPAN1 --> SPAN2[payment-service HTTP call]
    SPAN1 --> SPAN3[database query]
    SPAN2 --> ERR[Timeout Exception]
    ERR --> TRACE[Trace marked as error]
    TRACE --> ET[Error Tracking Issue]
    TRACE --> LOG[Connected Logs]
    TRACE --> POD[Pod and Container Metadata]

Recommended APM environment variables:

env:
  - name: DD_ENV
    value: "prod"

  - name: DD_SERVICE
    value: "checkout-api"

  - name: DD_VERSION
    value: "1.8.4"

  - name: DD_TRACE_AGENT_URL
    value: "unix:///var/run/datadog/apm.socket"

  - name: DD_LOGS_INJECTION
    value: "true"

  - name: DD_RUNTIME_METRICS_ENABLED
    value: "true"

Datadog APM on Kubernetes supports UDS, host IP, or Kubernetes service routing for traces. In containerized environments, sending traces to localhost is usually wrong because the Agent is in another container/pod; for Kubernetes, use UDS, node host IP, Admission Controller injection, or a Kubernetes service pattern. (Datadog Monitoring)


11. How Error Tracking groups errors

Datadog Error Tracking groups similar errors into issues based on properties such as:

service
error.type / error.kind
error.message
error.stack
top meaningful stack frame

So two errors may become separate issues if they happen in different services or have different error types/stack-frame locations. (Datadog Monitoring)

Example:

checkout-api + PaymentTimeoutException + CheckoutService.pay()
= One Error Tracking issue
payment-service + PaymentTimeoutException + PaymentClient.call()
= Different Error Tracking issue

This is why service, error.kind, and error.stack matter so much.
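
To make the grouping idea concrete, here is a toy model in Python. It is purely illustrative and is not Datadog's actual fingerprinting algorithm: it only shows why changing the service or the top stack frame yields a different issue. The function name `illustrative_issue_key` is hypothetical.

```python
import hashlib


def illustrative_issue_key(service: str, error_kind: str, top_frame: str) -> str:
    """Hash the three properties into a short key. Purely illustrative."""
    raw = "|".join([service, error_kind, top_frame])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


# Same service, error type, and frame: the occurrences share one key.
first = illustrative_issue_key("checkout-api", "PaymentTimeoutException", "CheckoutService.pay")
second = illustrative_issue_key("checkout-api", "PaymentTimeoutException", "CheckoutService.pay")

# Different service and frame: a separate key, even for the same error type.
other = illustrative_issue_key("payment-service", "PaymentTimeoutException", "PaymentClient.call")
```

Here `first` and `second` are equal while `other` differs, which matches the two-issue example above.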


12. Recommended tag strategy

Mandatory tags

| Tag | Example | Purpose |
| --- | --- | --- |
| env | prod | Separate prod/stage/dev |
| service | checkout-api | Service-level ownership |
| version | 1.8.4 | Release/deployment tracking |

Strongly recommended tags

| Tag | Example | Purpose |
| --- | --- | --- |
| team | payments | Ownership |
| product | motoshare | Product/application grouping |
| component | api | API/worker/consumer grouping |
| operation | checkout | Business flow |
| endpoint | /api/checkout | API route |
| error_type | PaymentTimeoutException | Error classification |
| handled | true/false | Handled vs unhandled error |
| cloud | aws | Cloud provider |
| platform | eks | Runtime platform |

Kubernetes tags Datadog can add

kube_cluster_name
kube_namespace
kube_deployment
kube_replica_set
pod_name
container_name
image_name
image_tag
node
availability_zone

For DogStatsD metrics, be careful with tag cardinality. Datadog notes that for UDP DogStatsD, pod_name is not added by default to avoid creating too many custom metrics, and tag cardinality can be controlled globally or per metric. (Datadog Monitoring)

My recommendation:

Use service/version-level DogStatsD metrics for alerting.
Use logs/APM/Error Tracking for exact pod/container investigation.
Use pod-level metric tagging only when you really need it.

13. Complete application telemetry flow

flowchart TD
    A[Exception in Application] --> B{Telemetry Type}

    B --> C[DogStatsD Counter]
    C --> C1[app.error.count]
    C1 --> C2[Alert: Error spike by service/version]

    B --> D[Structured Error Log]
    D --> D1[error.kind]
    D --> D2[error.message]
    D --> D3[error.stack]
    D3 --> D4[Error Tracking Issue]

    B --> E[APM Trace]
    E --> E1[Trace marked error]
    E1 --> E2[Request path]
    E2 --> E3[Downstream dependency failure]

    B --> F[Kubernetes Metadata]
    F --> F1[pod_name]
    F --> F2[container_name]
    F --> F3[kube_deployment]
    F --> F4[node]

    C2 --> G[Datadog Incident / Monitor]
    D4 --> G
    E3 --> G
    F4 --> G

14. Build dashboards

14.1 Error count by service

sum:app.error.count{env:prod} by {service}.as_count()

14.2 Error count by version

sum:app.error.count{env:prod,service:checkout-api} by {version}.as_count()

Use this to answer:

Did the new release increase errors?

14.3 Error count by operation

sum:app.error.count{env:prod,service:checkout-api} by {operation}.as_count()

Use this to answer:

Which business flow is failing?

14.4 Error count by error type

sum:app.error.count{env:prod,service:checkout-api} by {error_type}.as_count()

Use this to answer:

Which exception is most common?

14.5 Error count by Kubernetes deployment

sum:app.error.count{env:prod} by {kube_namespace,kube_deployment}.as_count()

Use this to answer:

Which deployment is producing the errors?

14.6 Pod-level view

Only use this if your DogStatsD metric cardinality/tagging supports it:

sum:app.error.count{env:prod,service:checkout-api} by {pod_name}.as_count()

For exact pod-level investigation, I would rely more on logs/APM/Error Tracking because pod-level metrics can create high cardinality and cost/noise.


15. Build monitors and alerts

15.1 Metric monitor: service error spike

sum(last_5m):sum:app.error.count{env:prod,service:checkout-api}.as_count() > 50

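Alert message (the {{service.name}}, {{env.name}}, and {{version.name}} variables are filled in only when the monitor query groups by those tags, for example by {service,env,version}):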

High application error count detected.

Service: {{service.name}}
Environment: {{env.name}}
Version: {{version.name}}

Check:
- Error Tracking issue
- APM trace
- Logs for error.stack
- Kubernetes pod/container details

15.2 Metric monitor: new version error spike

sum(last_10m):sum:app.error.count{env:prod,service:checkout-api} by {version}.as_count() > 100

Use this after deployments.

15.3 Error Tracking monitor: new issue

Use this for Sentry-like behavior:

Alert when a new backend issue appears for service:checkout-api env:prod

Datadog Error Tracking monitors support alerting on new issues, regressions, and high-impact errors. (Datadog Monitoring)

15.4 APM monitor: error rate

Example logic:

Error rate for checkout-api > 5% during last 5 minutes

Use this for service reliability monitoring.
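
The arithmetic behind that rule is errors divided by total requests over the window. The sketch below works through it in Python; the function names and the sample counts are hypothetical, and the real evaluation happens inside the Datadog monitor, not in application code.

```python
def error_rate_percent(errors: int, requests: int) -> float:
    """Error rate as a percentage; 0.0 when there was no traffic."""
    if requests <= 0:
        return 0.0
    return 100.0 * errors / requests


def breaches_threshold(errors: int, requests: int, threshold_percent: float = 5.0) -> bool:
    """True when the error rate is strictly above the threshold."""
    return error_rate_percent(errors, requests) > threshold_percent


# 60 failed requests out of 1000 in the last 5 minutes is a 6% error rate.
sample_rate = error_rate_percent(60, 1000)
sample_alert = breaches_threshold(60, 1000)
```

A rate-based alert complements the count-based monitors above: 50 errors is serious at 200 requests but negligible at 200,000.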


16. Recommended alerting strategy

Do not create only one giant alert.

Use layered alerting:

flowchart TD
    A[Application Errors] --> B[Metric Alert]
    A --> C[Error Tracking New Issue Alert]
    A --> D[APM Error Rate Alert]
    A --> E[Kubernetes Pod Restart Alert]

    B --> F[High volume problem]
    C --> G[New code issue]
    D --> H[Request failure problem]
    E --> I[Runtime/container problem]

    F --> J[Incident]
    G --> J
    H --> J
    I --> J

| Alert type | Detects | Best for |
| --- | --- | --- |
| DogStatsD metric alert | Error volume spike | Fast service-level alert |
| Error Tracking alert | New/regressed grouped error | Sentry-like issue detection |
| APM error rate alert | Request failure percentage | API/SLO reliability |
| Log alert | Specific log pattern | Known failure modes |
| Kubernetes alert | CrashLoopBackOff/restarts | Pod/container health |

17. Best practice: use DogStatsD for counters, not stack traces

DogStatsD should answer:

How many errors happened?
Which service/version/operation is failing?
Did errors increase after deployment?

DogStatsD should not answer:

What is the stack trace?
Which line of code failed?
What was the exception body?
What user/request caused this?

Those belong in:

APM
Logs
Error Tracking
Trace/log correlation

18. Best practice: standardize error classification

Create a small taxonomy across all services.

Example:

validation_error
dependency_timeout
database_error
authentication_error
authorization_error
business_rule_error
unexpected_exception

Then tag DogStatsD metrics like this:

error_category:dependency_timeout
error_type:PaymentTimeoutException
operation:checkout

This gives clean dashboards:

Errors by category
Errors by operation
Errors by service
Errors by version
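
A taxonomy only stays consistent if the mapping lives in one place. The following Python sketch shows one possible shape; the mapping from built-in exception classes to categories is an assumption made for illustration, and `categorize` and `taxonomy_tags` are hypothetical names. A real service would map its own exception classes.

```python
from __future__ import annotations

# Ordered (exception class, category) pairs; the first match wins.
# The pairing of built-in exceptions to categories is illustrative only.
CATEGORY_RULES: list[tuple[type[BaseException], str]] = [
    (TimeoutError, "dependency_timeout"),
    (PermissionError, "authorization_error"),
    (ValueError, "validation_error"),
]


def categorize(exc: BaseException) -> str:
    """Return the taxonomy category for an exception."""
    for exc_class, category in CATEGORY_RULES:
        if isinstance(exc, exc_class):
            return category
    return "unexpected_exception"


def taxonomy_tags(exc: BaseException, operation: str) -> list[str]:
    """Build the three DogStatsD tags shown above for one exception."""
    return [
        f"error_category:{categorize(exc)}",
        f"error_type:{type(exc).__name__}",
        f"operation:{operation}",
    ]
```

Keeping the rules as an ordered list means a more specific class can be placed ahead of its parent class when both are mapped.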

19. Best practice: release/version tracking

Every deployment should set a unique version.

Good:

version: 1.8.4
version: git-sha-a8f91cd
version: 2026.05.18.1

Bad:

version: latest
version: prod
version: main

Datadog expects version to change with each application deployment so deployment impact can be identified cleanly. (Datadog Monitoring)
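
A CI step can reject the bad values before they reach the cluster. This is a minimal Python sketch of such a check; `MUTABLE_VERSIONS` contains only the bad examples listed above, and `is_trackable_version` is a hypothetical helper name.

```python
from __future__ import annotations

# Values that do not change between deployments, taken from the list above.
MUTABLE_VERSIONS = {"latest", "prod", "main"}


def is_trackable_version(version: str | None) -> bool:
    """True when the version is non-empty and not a moving label."""
    cleaned = (version or "").strip().lower()
    return bool(cleaned) and cleaned not in MUTABLE_VERSIONS
```

Running this against the value destined for DD_VERSION and failing the pipeline on False keeps static versions out of production.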


20. Best practice: log format

Use JSON logs.

Recommended fields:

{
  "status": "error",
  "service": "checkout-api",
  "env": "prod",
  "version": "1.8.4",
  "message": "Checkout failed",
  "error.kind": "PaymentTimeoutException",
  "error.message": "Payment provider timed out",
  "error.stack": "...",
  "operation": "checkout",
  "endpoint": "/api/checkout",
  "http.method": "POST",
  "http.status_code": 500,
  "customer_impact": true
}

Avoid logging sensitive data:

password
token
credit card
personal identity data
full request payloads
authorization headers
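
One way to enforce this is to pass log fields through a redaction step before they are serialized. The sketch below is a simple key-based filter in Python; the key names in `SENSITIVE_KEYS` are assumptions chosen to match the list above, and `redact` is a hypothetical helper. It does not detect secrets embedded inside free-text values.

```python
from __future__ import annotations

# Field names treated as sensitive; an assumed list, extend it for your payloads.
SENSITIVE_KEYS = {"password", "token", "authorization", "credit_card", "card_number"}
REDACTED = "[REDACTED]"


def redact(fields: dict) -> dict:
    """Return a copy with sensitive keys masked, recursing into nested dicts."""
    clean = {}
    for key, value in fields.items():
        if str(key).lower() in SENSITIVE_KEYS:
            clean[key] = REDACTED
        elif isinstance(value, dict):
            clean[key] = redact(value)
        else:
            clean[key] = value
    return clean
```

Because the function returns a copy, the original request data stays untouched for the business logic.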

21. Best practice: deployment annotation for logs

For each application container, add Datadog log annotation:

annotations:
  ad.datadoghq.com/checkout-api.logs: >
    [{
      "source": "java",
      "service": "checkout-api",
      "tags": ["team:payments","component:api"]
    }]

Use the right source value:

| App language/runtime | source |
| --- | --- |
| Java | java |
| Node.js | nodejs |
| Python | python |
| Go | go |
| .NET | csharp or configured .NET source |
| Ruby | ruby |

The source tag matters because Datadog's Error Tracking for logs uses language-specific handling, and Datadog recommends ensuring the source tag is properly configured. (Datadog Monitoring)


22. Pod/container relationship design

The relationship is built from three places:

flowchart TD
    A[Application Deployment Labels] --> D[env/service/version]
    B[Datadog Agent Kubernetes Metadata] --> E[pod/container/deployment/node]
    C[Application Logs/APM/DogStatsD] --> F[error/trace/metric]

    D --> G[Unified Datadog View]
    E --> G
    F --> G

    G --> H[Which service failed?]
    G --> I[Which version failed?]
    G --> J[Which pod/container failed?]
    G --> K[Which node hosted it?]

To make this work:

1. Datadog Agent must run in the cluster.
2. App pods must have unified service tags.
3. Logs/APM/DogStatsD must use the same service/env/version.
4. Error logs must include error.kind/error.stack.
5. APM tracer should inject trace/log correlation where supported.
6. DogStatsD origin detection should be enabled.

23. EKS-specific implementation notes

Standard EKS with EC2 worker nodes

Recommended:

Datadog Agent as DaemonSet
Use UDS for APM
Use UDS for DogStatsD
Collect container logs from nodes
Use Cluster Agent
Use Admission Controller where possible

EKS Fargate

Be careful. EKS Fargate does not behave like normal EC2 worker nodes because you do not manage the underlying node the same way. Datadog's DogStatsD origin detection docs specifically mention shareProcessNamespace:true to assist the Agent for origin detection on EKS Fargate. (Datadog Monitoring)

If you are using Fargate, validate the Datadog deployment pattern separately.


24. End-to-end sample implementation

24.1 Datadog Agent values

targetSystem: linux

datadog:
  apiKeyExistingSecret: datadog-secret
  site: datadoghq.com
  clusterName: eks-prod-apne1-01

  logs:
    enabled: true
    containerCollectAll: true

  apm:
    socketEnabled: true
    portEnabled: false

  dogstatsd:
    originDetection: true
    useSocketVolume: true
    socketPath: /var/run/datadog/dsd.socket
    tagCardinality: orchestrator

  kubeStateMetricsCore:
    enabled: true

  collectEvents: true

clusterAgent:
  enabled: true
  admissionController:
    enabled: true
    mutateUnlabelled: false

24.2 App deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-api
  namespace: prod
  labels:
    tags.datadoghq.com/env: "prod"
    tags.datadoghq.com/service: "checkout-api"
    tags.datadoghq.com/version: "1.8.4"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: checkout-api
  template:
    metadata:
      labels:
        app: checkout-api
        tags.datadoghq.com/env: "prod"
        tags.datadoghq.com/service: "checkout-api"
        tags.datadoghq.com/version: "1.8.4"
        # The Admission Controller opt-in is a pod label, not an annotation
        admission.datadoghq.com/enabled: "true"
      annotations:
        ad.datadoghq.com/checkout-api.logs: >
          [{
            "source": "java",
            "service": "checkout-api",
            "tags": ["team:payments","component:api"]
          }]
    spec:
      containers:
        - name: checkout-api
          image: myrepo/checkout-api:1.8.4
          env:
            - name: DD_ENV
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/env']

            - name: DD_SERVICE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/service']

            - name: DD_VERSION
              valueFrom:
                fieldRef:
                  fieldPath: metadata.labels['tags.datadoghq.com/version']

            - name: DD_ENTITY_ID
              valueFrom:
                fieldRef:
                  fieldPath: metadata.uid

            - name: DD_TRACE_AGENT_URL
              value: "unix:///var/run/datadog/apm.socket"

            - name: DD_LOGS_INJECTION
              value: "true"

            - name: DD_RUNTIME_METRICS_ENABLED
              value: "true"

            - name: DOGSTATSD_SOCKET
              value: "/var/run/datadog/dsd.socket"

          volumeMounts:
            - name: datadog-socket
              mountPath: /var/run/datadog
              readOnly: true

      volumes:
        - name: datadog-socket
          hostPath:
            path: /var/run/datadog

25. Validation checklist

25.1 Validate Datadog Agent

kubectl -n datadog get pods
kubectl -n datadog get ds
kubectl -n datadog get deploy

25.2 Check Agent status

kubectl -n datadog exec -it <datadog-agent-pod-name> -c agent -- agent status

Look for:

APM Agent: Running
DogStatsD: Running
Logs Agent: Running

Datadog's APM troubleshooting guide says the Agent status output should show the APM Agent as running; otherwise traces cannot be submitted properly. (Datadog Monitoring)

25.3 Validate app tags

Check pod labels:

kubectl -n prod get pod <pod-name> --show-labels

Expected:

tags.datadoghq.com/env=prod
tags.datadoghq.com/service=checkout-api
tags.datadoghq.com/version=1.8.4
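
The same check can be scripted across many pods. The sketch below, in Python, inspects the JSON that `kubectl get pod <pod-name> -o json` prints; the helper names `missing_unified_labels` and `check_pod_json` are hypothetical, and the script only reads the pod's metadata.labels field.

```python
from __future__ import annotations

import json

REQUIRED_LABELS = (
    "tags.datadoghq.com/env",
    "tags.datadoghq.com/service",
    "tags.datadoghq.com/version",
)


def missing_unified_labels(pod: dict) -> list[str]:
    """Return the unified service tagging labels that are absent or empty."""
    labels = (pod.get("metadata") or {}).get("labels") or {}
    return [name for name in REQUIRED_LABELS if not labels.get(name)]


def check_pod_json(text: str) -> list[str]:
    """Parse kubectl's JSON output for one pod and report missing labels."""
    return missing_unified_labels(json.loads(text))
```

Feeding the kubectl output for each pod into `check_pod_json` and failing on a non-empty list gives a quick audit of which workloads still lack the three labels.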

25.4 Validate logs

Generate a test exception, then search logs by:

service:checkout-api env:prod status:error

Expected fields:

error.kind
error.message
error.stack
kube_namespace
pod_name
container_name

25.5 Validate DogStatsD metric

Search metric:

app.error.count

Group by:

service
version
error_type
operation

25.6 Validate APM

Search service:

service:checkout-api env:prod

Expected:

Traces visible
Error traces visible
Service map visible
Trace/log correlation working

25.7 Validate Error Tracking

Search backend issues for:

service:checkout-api env:prod

Expected:

Grouped error issue
Stack trace visible
Occurrences visible
Related logs/traces visible

26. Common problems and fixes

| Problem | Likely cause | Fix |
| --- | --- | --- |
| Error metric appears but no pod/container | DogStatsD origin detection/cardinality issue | Enable origin detection; use UDS; review tag cardinality |
| Error Tracking issue not created | Logs missing error.kind or error.stack | Add structured error fields |
| Logs visible but service name wrong | Missing log annotation or unified tags | Add service in log config and DD_SERVICE |
| APM traces missing | App cannot reach Agent | Use UDS or correct DD_AGENT_HOST; check Agent status |
| Trace/log correlation missing | Log injection not enabled | Enable tracer log injection |
| Too many custom metrics | High-cardinality metric tags | Remove request_id, user_id, pod_name from metrics |
| New release not visible | Static or missing version | Set unique DD_VERSION per deployment |
| Pod error not visible in metric | Pod tag not included for cardinality reasons | Use logs/APM for pod-level RCA or adjust cardinality carefully |
| Logs not collected | Agent log collection disabled | Enable container log collection |

27. Best implementation pattern for your migration

Do not migrate like this:

Sentry → DogStatsD only

That will give weak debugging.

Migrate like this:

Sentry
  → Datadog Error Tracking
  → Datadog APM
  → Datadog Logs
  → DogStatsD custom error metrics
  → Kubernetes metadata correlation

Recommended production pattern:

flowchart TD
    A[Sentry Replacement Requirement] --> B[Error Tracking]
    A --> C[APM]
    A --> D[Logs]
    A --> E[DogStatsD Metrics]

    B --> F[Grouped Issues]
    C --> G[Trace and Dependency RCA]
    D --> H[Stack Trace and Context]
    E --> I[Fast Error Count Alerts]

    F --> J[Service / Env / Version]
    G --> J
    H --> J
    I --> J

    J --> K[Kubernetes Pod / Container / Deployment / Node]

28. Final recommended standard

For every service running in EKS, implement this standard:

1. Add Datadog unified service labels:
   - tags.datadoghq.com/env
   - tags.datadoghq.com/service
   - tags.datadoghq.com/version

2. Add application env vars:
   - DD_ENV
   - DD_SERVICE
   - DD_VERSION
   - DD_TRACE_AGENT_URL
   - DD_LOGS_INJECTION
   - DD_ENTITY_ID

3. Enable Datadog Agent features:
   - logs
   - APM
   - DogStatsD
   - DogStatsD origin detection
   - Kubernetes metadata
   - Cluster Agent

4. Application must emit:
   - DogStatsD metric: app.error.count
   - Structured error log with error.kind/error.message/error.stack
   - APM trace/span errors

5. Dashboards should show:
   - errors by service
   - errors by version
   - errors by operation
   - errors by error_type
   - errors by namespace/deployment
   - related pods/containers through logs/APM

6. Alerts should include:
   - new Error Tracking issue
   - high error count
   - high APM error rate
   - pod restart/crashloop alerts

29. Final conclusion

The best Datadog design for application error tracking in EKS is:

DogStatsD for custom error counters
Logs for stack traces
APM for request/dependency tracing
Error Tracking for Sentry-like issue grouping
Unified service tagging for service/env/version relationship
Kubernetes metadata for pod/container/node relationship

In short:

DogStatsD tells you how many errors happened.
Logs tell you what exception happened.
APM tells you where in the request path it failed.
Error Tracking groups the issue.
Kubernetes metadata tells you which pod/container/deployment/node caused it.

That combination gives you a clean, production-grade replacement for Sentry while also giving stronger EKS infrastructure correlation than Sentry alone.
