Logging Policy

Introduction

This Logging Policy was born out of an internal effort at Blueground to rethink observability as our tech footprint kept expanding. With more services, more data, and more engineers involved, we needed a consistent way to capture logs that were actually useful, not just noise.

What started as a set of internal guidelines and reference snippets grew into a full policy we now use across our teams. We decided to open-source it in case others find themselves facing similar challenges.

We hope you’ll find it useful whether you adopt it as-is, adapt parts of it, or simply take inspiration for your own approach to logging.

Purpose

This article outlines standardized logging practices that engineering teams can adopt to ensure consistency and operational excellence across applications.

Log analysis is an essential tool for monitoring, troubleshooting, and auditing software systems. The effectiveness of these processes hinges on the standardization of log formats and practices. This guide establishes the logging standards for applications at our organization, ensuring uniformity and efficacy across our technology stack.

Logs are the stream of aggregated, time-ordered events collected from the output streams of all running processes and backing services. Logs in their raw form are typically a text or structured format with one event per line. Logs have no fixed beginning or end, but flow continuously as long as the app is operating.

Terms of Use

This Logging Policy was originally created for internal use within Blueground. We are open-sourcing it in the hope that it’s useful to others, whether as a ready-to-use baseline or as inspiration for your own practices.
By using this Logging Policy (including the documentation and reference implementations), you agree to the following:

  1. Free to Use and Adapt

    • You may copy, modify, and use this Logging Policy in your own projects.

    • You may not redistribute it as a product, package, or service under your own name. Sharing links to this repository or referencing it in your work is always welcome.

  2. No Warranty

    • This Logging Policy is provided "as is", without warranty of any kind.

    • It was designed around our own systems and may not fit every use case.

  3. No Liability

    • We are not responsible for any issues, damages, or losses that may result from your use of this Logging Policy.

    • You assume all responsibility for adapting and applying it in your own environment.

  4. Attribution

    • If you use or adapt significant portions of this policy, a simple attribution back to this repository or mention of its source is appreciated (but not required).

The 10 commandments of logging

This policy fully endorses the 10 commandments of logging - which also happens to be a fun read 😆

https://www.masterzen.fr/2013/01/13/the-10-commandments-of-logging

What to log

Keep a balance between too much and too little logging.

Make sure you always log:

  • Application initialization

  • Logging initialization

  • Incoming requests

  • Outgoing requests

  • "Use case" initiation/completion

  • Application errors

  • Input and output validation failures

  • Authentication successes and failures

  • Authorization failures

  • Session management failures

  • Privilege elevation successes and failures

  • Other higher-risk events, like data import and export or bulk updates

Consider logging other events that address these use cases:

  • Troubleshooting

  • Monitoring and performance improvement

  • Testing

  • Understanding user behavior

  • Security and auditing

Log Format: JSON

JSON by default

All first-party applications should output their logs in JSON format, without exception, while running with their default configuration (typically in production and staging deploys).

{
  "timestamp": "1714131374032",
  "logger": "foo.bar",
  "msg": "This is a log message",
  "host": "i-f823e12ac",
  "service": "example-application",
  "source": "k8s/spring-boot",
  "status": "info",
}

Example: log message in JSON

Why a structured Log Format?

βœ… Allows for a custom log data model

βœ… Easy parsing

Semi-structured log formats like Syslog and the Common Log Format, once favored for their compact size and human readability, present challenges due to error-prone parsing, often handled through Grok expressions. Although their smaller byte size was advantageous when storage costs were high, this benefit has diminished. Modern log management providers, such as Datadog or Loggly, charge based on the number of indexed log events (not their size*). Moreover, they address readability issues by rendering JSON logs in easily interpretable formats, making structured formats more appealing despite their larger size. Finally, semi-structured log formats typically make assumptions about the data being logged, while JSON allows us to customize our log data model.


* Actually, they do charge for bytes of ingested data, but these costs are an order of magnitude lower than those of indexing.

TEXT in development

While JSON may be perfect for tools and log management providers, it makes it hard for humans to pinpoint and read the actual log message. In development mode, applications should take a hybrid approach and display each log event as follows:

 <TIMESTAMP> <STATUS> [<THREAD>] [<LOGGER>] <MSG>
 <attributes in JSON or key-value pairs>
  • TIMESTAMP: the log timestamp in the user/system date format/timezone

  • STATUS: the log level/status as a string, e.g. INFO, WARN

  • THREAD: the name of the thread that emitted the log (where applicable)

  • LOGGER: the code component that emits the log

  • MSG: the log message

[Fri Apr 26 2024 14:36:17] | INFO | This is a log message
{
  "host": "i-f823e12ac",
  "service": "example-application",
  "source": "k8s/spring-boot"
}

Example: log message in TEXT + JSON

[Fri Apr 26 2024 14:36:17] | INFO | This is a log message
  host=i-f823e12ac
  service=example-application
  source=k8s/spring-boot

Example: log message in TEXT + Key-Value pairs

Use delimiters and/or colors.

Use a delimiter to separate between the different parts of the log message.

[Fri Apr 26 2024 14:36:17] | INFO | This is a log message

Or color code them using ANSI color escape sequences.

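For illustration, here is a minimal Kotlin sketch of status color-coding with ANSI escape sequences (the colorize helper and its color choices are our own illustration, not part of any logging library):

fun colorize(status: String): String {
    val color = when (status.uppercase()) {
        "ERROR", "FATAL" -> "\u001B[31m" // red
        "WARN" -> "\u001B[33m"           // yellow
        "INFO" -> "\u001B[32m"           // green
        else -> "\u001B[37m"             // white
    }
    return "$color$status\u001B[0m"      // reset color at the end
}

For example, colorize("INFO") renders INFO in green on any ANSI-capable terminal.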

Log Routing: STDOUT

A cloud-native app NEVER concerns itself with routing or storage of its output stream.

It should not attempt to write to or manage log files. Instead, each running process writes its event stream, unbuffered, to stdout. During local development, the developer will view this stream in the foreground of their terminal to observe the app's behavior.

In staging or production deploys, each process stream will be captured by the execution environment (Datadog agent for apps running in K8S), collated together with all other streams from the app, and routed to one or more final destinations for viewing and long-term archival. These archival destinations are not visible to or configurable by the app and instead are completely managed by the execution environment.

The stream can be sent to a log indexing and analysis system - we currently use Datadog - or a general-purpose data warehousing system such as Hadoop/Hive. These systems allow for great power and flexibility for introspecting an app's behavior over time, including:

  • Finding specific events in the past.

  • Large-scale graphing of trends (such as requests per minute).

  • Active alerting according to user-defined heuristics (such as an alert when the quantity of errors per minute exceeds a certain threshold).

What if I want to collect logs in a file during development?

Inside a terminal? Use tee to view logs on stdout but also store to a file

$ ./run-my-app | tee .logs/log.out

Inside an IDE?

Then, you should have the option to save the output to a file. For example, IntelliJ IDEA lets you save the console output to a file through its run configuration settings.

Log Levels

Applications should emit log events mapping to one or more of the following log levels.

Name  | Severity | Description
------|----------|------------
FATAL | High     | The service/app is going to stop or become unusable now. An engineer should definitely look into this soon.
ERROR | Mid      | Fatal for a particular request, but the service/app continues servicing other requests. An engineer should look at this soon(ish).
WARN  | Low      | A note on something that should probably be looked at by an engineer eventually.
INFO  | None     | Detail on regular operation.
DEBUG | None     | Anything else, i.e., it is too verbose to be included at the "info" level.
TRACE | None     | Logging from external libraries used by your app or very detailed application logging.

What do TRACE log messages look like?

The following types of messages are probably appropriate at the TRACE level:


Entering <class name>.<method name>, <argument name>: <argument value>, [<argument name>: <argument value>]
Method <class name>.<method name> completed [, returning: <return value>]
<class name>.<method name>: <description of some action being taken, complete with context information>
<class name>.<method name>: <description of some calculated value, or decision made, complete with context information>
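As a hypothetical Kotlin/slf4j illustration of these patterns (class, method, and value names are made up):

import org.slf4j.LoggerFactory

private val logger = LoggerFactory.getLogger("BookingService")

fun createBooking(request: String): String {
    logger.trace("Entering BookingService.createBooking, request: {}", request)
    val bookingId = "BKG-1" // hypothetical computed value
    logger.trace("BookingService.createBooking: resolved booking id {}", bookingId)
    logger.trace("Method BookingService.createBooking completed, returning: {}", bookingId)
    return bookingId
}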

Log Level Mapping

Once deployed, the application logs will be sent to our log management provider (Datadog).

For each log event to be mapped to the correct status on Datadog, you can do one of the following:

  1. Configure a log status remapper to map your custom log level to Datadog's log status

  2. Use the following codification (recommended)

{
  // Note that the name of the property 
  // is "status" not "level"
  "status": "TRACE|DEBUG|INFO|WARN|ERROR|FATAL"
}

For reference, here's how Datadog remaps each incoming status value:

  • Integers from 0 to 7 map to the Syslog severity standards

  • Strings beginning with emerg or f (case-insensitive) map to emerg (fatal)

  • Strings beginning with a (case-insensitive) map to alert

  • Strings beginning with c (case-insensitive) map to critical

  • Strings beginning with e (case-insensitive) that do not match emerg map to error

  • Strings beginning with w (case-insensitive) map to warning

  • Strings beginning with n (case-insensitive) map to notice

  • Strings beginning with i (case-insensitive) map to info

  • Strings beginning with d, trace or verbose (case-insensitive) map to debug

  • Strings beginning with o or s, or matching OK or Success (case-insensitive) map to OK

  • All others map to info

Log Data Model

Adopting a standardized data model for logging offers several key benefits:

  • Unified Understanding: Establishes a common framework for what constitutes a log event, ensuring clarity across all teams.

  • Clear Semantics: Provides log attributes with well-defined meanings, removing ambiguity and promoting consistent interpretation.

  • Enhanced Troubleshooting: Facilitates more effective and efficient problem-solving across applications in a microservices-based architecture.

Embracing a "Shift Left" approach, we aim for our log events to conform to this model right from their origin. In cases where direct adoption is impractical or complex, a Datadog log pipeline should be utilized to transform the log events into this data model.

Model definition

Consider a log event as a collection of attributes, each akin to a field in a flattened JSON object.

LOG EVENT = ATTRIBUTE1 + ATTRIBUTE2 + ...

Accordingly, the data model is described as a series of attribute specifications.

To maximize efficiency and leverage established practices, we build upon Datadog's default set of standard log attributes and split those attributes into functional domains like Network, HTTP, and more.

Reserved attributes

A "fixed" set of attributes that bear special semantics on Datadog. Reserved attributes are automatically identified and parsed.

Path | Type | Required | Description
-----|------|----------|------------
host | string | Automatically added by the Datadog Agent | The name of the originating host as defined in metrics. Datadog automatically retrieves corresponding host tags from the matching host in Datadog and applies them to your logs. The Agent sets this value automatically. Example: myapp-5c74d5d5d4-trqg8
source | string | Automatically added by the Datadog Agent | This corresponds to the integration name: the technology from which the log originated. When it matches an integration name, Datadog automatically installs the corresponding parsers and facets (for example, nginx or PostgreSQL). Example: nodejs, java
status | string | Always | This corresponds to the level of a log. It is used to define patterns and has a dedicated layout in the Datadog Log UI. Example: warn, info
service | string | Automatically added by the Datadog Agent | The name of the application or service generating the log events. It is used to switch from Logs to APM, so make sure you define the same value when you use both products. Example: myapp
trace_id | string | Automatically added by the Datadog Agent | This corresponds to the Trace ID used for traces. It is used to correlate your log with its trace. Example: 6631532b000121232131241412412
message | string | Always | By default, Datadog ingests the value of the message attribute as the body of the log entry. That value is then highlighted and displayed in Live Tail, where it is indexed for full-text search.
date | string or number | Always | The log event creation timestamp. Example: 2024-04-30T11:06:20.812538642Z

Source code attributes

This attribute set helps identify the origin of the log event in the source code.

Path | Type | Required | Description
-----|------|----------|------------
logger.name, name | string | Always | The name of the logger. Example: TaskHandler
logger.thread_name | string | Multithreaded apps | The name of the current thread when the log is fired. Example: task_consumer_1
logger.method_name | string | Optional | The class method name. Example: handleTask
logger.version | string | Optional | The version of the logger. Example: 1.2.3

Network attributes

These attributes are related to the data used in network communication.
All fields and metrics are prefixed by network

Path | Type | Required | Description
-----|------|----------|------------
network.bytes_read | number | Network requests | Total number of bytes transmitted from the client to the server when the log is emitted. Example: 2048
network.bytes_written | number | Network requests | Total number of bytes transmitted from the server to the client when the log is emitted. Example: 2048
network.client.external_ip, network.client.ip | string | Network requests | The IP address of the original client that initiated the inbound connection. Example: 192.11.22.02
network.client.external_port, network.client.port | string | Network requests | The port of the original client that initiated the connection. Example: 1903
network.client.internal_ip | string | Network requests | The IP address of the internal host proxying the connection. Typically, a load balancer or another pod in the same k8s cluster. Example: 10.244.0.22
network.destination.ip | string | Network requests | For outbound connections, the destination IP. Example: 192.11.22.02
network.destination.port | number | Network requests | The remote port number of the outbound connection. Example: 443

Applications should try to include those attributes in their network-related requests, both incoming and outgoing. See the full list of supported network and geolocation attributes.

Error attributes

These attributes are related to error-specific data and are required for all error events.

Path | Type | Required | Description
-----|------|----------|------------
error.kind | string | Mapped by Datadog | The error type or kind (or code in some cases). E.g. BadRequest or CardError
error.message | string | Mapped by Datadog | A concise, human-readable, one-line message explaining the event. E.g. Booking duration violates minimum stay rules
error.stack | string | Errors | The stack trace or complementary information about the error.

HTTP attributes

Required for all events that log HTTP requests. As they describe the HTTP request itself, they should not be propagated to downstream log events (MDC) or logged out of context.

Path | Type | Required | Description
-----|------|----------|------------
http.url | string | HTTP requests | The URL of the HTTP request.
http.referer | string | HTTP requests | HTTP header field that identifies the address of the webpage that linked to the resource being requested.
http.method | string | HTTP requests | The HTTP method of the request, e.g. GET, POST.
http.status_code | string | HTTP requests | The HTTP response status code.
http.useragent | string | HTTP requests | The User-Agent header received with the request.
http.version | string | HTTP requests | The version of HTTP used for the request.
http.url_details.host | string | HTTP requests | The HTTP host part of the URL.
http.url_details.port | number | HTTP requests | The HTTP port part of the URL.
http.url_details.path | string | HTTP requests | The HTTP path part of the URL.
http.url_details.queryString | object | HTTP requests | The HTTP query string parts of the URL decomposed as query params key/value attributes.
http.url_details.scheme | string | HTTP requests | The protocol name of the URL (HTTP or HTTPS).
http.useragent_details.os.family | string | HTTP requests | The OS family reported by the User-Agent.
http.useragent_details.browser.family | string | HTTP requests | The Browser family reported by the User-Agent.
http.useragent_details.device.family | string | HTTP requests | The Device family reported by the User-Agent.

Performance attributes

Required in any log that describes a task whose performance we are interested in. E.g., an HTTP request, a DB operation or any I/O operation, a CPU crunching computation, etc.

Path | Type | Required | Description
-----|------|----------|------------
duration | number (nanoseconds) | When applicable | A duration of any kind in nanoseconds: HTTP response time, database query time, latency, and so on. ⭐ Remap any durations within logs to this attribute because Datadog displays and uses it as a default measure for trace search.

User attributes

All execution flows starting from a user action should capture the user information in their log events.

Path | Type | Logged | Description
-----|------|--------|------------
usr.id | string | When part of a user request | The user identifier, e.g. 12752; 0 for unknown users
usr.name | string | When part of a user request | The user's name, e.g. John Doe, Unknown
usr.email | string | When part of a user request | The user's email, e.g. john@foo.bar

Domain attributes

Domain attributes capture COMMON domain-specific information. By common we mean that those attributes are organization-wide and can be found across services and apps. Domain attributes are typically provided via a context propagation mechanism like MDC.

Path | Type | Logged | Description
-----|------|--------|------------
domain.market | string | When applicable | The organization market as a 3-letter city code, e.g. NYC, IST, DXB
domain.property.id | number | When applicable | The property id, e.g. 46245
domain.property.code | string | When applicable | The property code, e.g. NYC-345
domain.booking.id | number | When applicable | The booking id, e.g. 663145
domain.booking.code | string | When applicable | The booking code, e.g. NYC-3405
domain.booking.version | number | When applicable | The booking version, e.g. 1
domain.task.id | number | When applicable | The task id, e.g. 663145
domain.* | string | When applicable | Any domain-specific attribute related to the log event

Service attributes

Service attributes are service-specific domain attributes. To avoid name "collisions" that may prevent us from using features such as Log Facets, the service attributes are placed under a [service] key instead of domain.

Why both domain.* and [service].* attributes?


If service-specific domain attributes were placed under domain, we would get values under the same key with totally different semantics.

Example:


For Hermes, channel may be slack, while for PCM it may be apartments.com. By placing channel under hermes.channel and pcm.channel respectively, we get the isolation required to build effective filters and facets.

Path | Type | Logged | Description
-----|------|--------|------------
[service].* | any | When applicable | A service-specific domain attribute. E.g. app1_name.channel, app2_name.paymentGateway, app3_name.channel, app4_name.priceStrategy

Other attributes

Path | Type | Logged | Description
-----|------|--------|------------
correlation_id | string | Always | A trace ID that is decoupled from APM. As such, it should be present in all log events regardless of whether the event is sampled by APM. For traces starting from an external incoming HTTP request, it should default to X-Amzn-Trace-Id. See ALB request tracing.
entrypoint | string | Always | Identifies the "port" through which the request reached the application, e.g. http/api, kafka/consumer, background_job
clientinfo.id | string | service-to-service | The ID of the client service. Typically, the pod UID when referring to a k8s service.
clientinfo.name | string | service-to-service | The name of the client service, e.g. hermes
team | string | Always | The team responsible for interpreting/monitoring this log event, e.g. pricing or pms/bookings

How to generate the correlation_id?

Format: version-timestamp-uniqueid where:

  • version: 1

  • timestamp: Epoch in seconds

  • uniqueid: KSUID
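Putting the format together, here is a minimal Kotlin sketch. Note that the policy calls for a KSUID as the unique id; java.util.UUID is used below only as a stand-in, since no specific KSUID library is mandated here:

import java.util.UUID

fun newCorrelationId(): String {
    val version = 1
    val timestamp = System.currentTimeMillis() / 1000 // epoch in seconds
    // Stand-in for a KSUID; swap in your KSUID generator of choice
    val uniqueId = UUID.randomUUID().toString().replace("-", "")
    return "$version-$timestamp-$uniqueId"
}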

How to propagate the correlation_id?

JVM

Use Mapped Diagnostic Context

Node.js

Use AsyncLocalStorage - example with Express
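For the JVM case, a minimal sketch of reading the x-correlation-id header into the MDC, assuming the jakarta.servlet API and the newCorrelationId helper sketched above:

import jakarta.servlet.Filter
import jakarta.servlet.FilterChain
import jakarta.servlet.ServletRequest
import jakarta.servlet.ServletResponse
import jakarta.servlet.http.HttpServletRequest
import org.slf4j.MDC

class CorrelationIdFilter : Filter {
    override fun doFilter(req: ServletRequest, res: ServletResponse, chain: FilterChain) {
        val incoming = (req as? HttpServletRequest)?.getHeader("x-correlation-id")
        MDC.put("correlation_id", incoming ?: newCorrelationId()) // accept or mint a new id
        try {
            chain.doFilter(req, res)
        } finally {
            MDC.remove("correlation_id") // avoid leaking the id across pooled threads
        }
    }
}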

The log message

Generic

Standard log events should use the following generic message format in the default configuration.

[SCOPE] TEXT (key=value)*
  • SCOPE: Optional context in square brackets. Typically, the name of a use case, command, or edge case being handled. E.g. [CreateBooking], [UpdateUser], [SendEmailJob]. The context is redundant when it's semantically equivalent to the logger.name

  • TEXT: The human-readable description of the log event. See Writing style for details.

  • (key=value)*: Key/Value pairs go at the end and are enclosed in parentheses.
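For example (all names and values below are illustrative):

[CreateBooking] Booking created (bookingId=663145) (market=NYC)
[SendEmailJob] Email dispatched (template=booking_confirmation) (recipient=12752)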

Writing style

To maintain consistency and readability, adhere to the following guidelines for the tone, style, and tense of log messages:

  1. Tone: Use a neutral and professional tone. Avoid slang, jargon, or overly technical language that may not be easily understood by all team members.

  2. Style: Be concise and clear. Ensure that each log message provides sufficient context to understand the issue without being verbose. Use structured formats to facilitate easy parsing and analysis.

  3. Tense: Use appropriate tenses based on the action being logged. Present tense can be used for ongoing actions, past tense for completed actions, and present perfect for recent completions relevant to the current context.

    • Ongoing Actions (Present Tense): Use the present tense for actions that are currently happening.

      • Example: Starting the payment processing

    • Completed Actions (Past Tense): Use past tense for actions that have already been completed.

      • Example: Payment processed successfully

    • Recent Completions (Present Perfect): Use present perfect tense for actions that happened in the past but are still relevant (with a result in the present):

      • Example: File has been deleted

HTTP

HTTP log events follow a particular message format that allows for very fast eye scanning among hundreds of events. Since applications typically both accept and issue HTTP requests, we employ a slightly different format between the two.

INCOMING REQUEST

[req] METHOD PATH 

Should we log incoming requests or only responses?


Recommendation: No by default, but make it available under a feature flag.

  • Logging incoming requests == increased logging management costs.

  • Logging incoming requests == increased log verbosity

  • Logging incoming requests == ability to identify poisonous requests that may hang the server before it gets a chance to log a response. Without logging incoming requests, poisonous requests could go "stealth".

OUTGOING RESPONSE

[res] METHOD PATH STATUS_CODE STATUS_TEXT (duration)

OUTGOING REQUEST (e.g., Axios, fetch, RestClient)

-> [req] [host] METHOD PATH

INCOMING RESPONSE

<- [res] [host] METHOD PATH STATUS_CODE STATUS_TEXT (duration)

Examples:

# INCOMING REQUEST
[req] GET /foo/bar?qux=baz
[req] POST /foo/bar

# OUTGOING RESPONSE
[res] GET /foo/bar?qux=baz 200 OK (0.17s)
[res] POST /foo/bar 201 CREATED (0.52s)

# OUTGOING REQUEST
-> [req] [foo-svc] GET /foo/bar?qux=baz
-> [req] [exchangerate.com] GET /api/rates

# INCOMING RESPONSE
<- [res] [foo-svc] GET /foo/bar?qux=baz 200 OK (0.2s)
<- [res] [exchangerate.com] GET /api/rates 200 OK (0.3s)

GraphQL

GraphQL requests are somewhat special HTTP requests:

  • They cannot be differentiated by the HTTP path, which is the same for all, typically /graphql

  • Are highly polymorphic

    • The same query may list different attributes

    • The same request may contain multiple queries

For the reasons above, to fully capture GraphQL requests in our log events, we add extra, specific metadata under the graphql JSON property.

  • operationType: Whether it is a query, mutation, or subscription.

  • operationName: The name of the query or mutation being executed.

  • operationBody: The actual GraphQL query or mutation string.

  • variables: Any variables passed along with the query.

  • responseTimeMs: The time taken to execute the query, in milliseconds.

  • responseStatus: Success or error status of the query execution.

  • errors: Detailed error messages if the query fails.

  • performanceMetrics: Optional, but can include detailed timing information for various parts of the query execution.


import { z } from "zod"

export const schema = z.object({
  graphql: z.object({
    operations: z.array(
      z.object({
        operationType: z.union([
          z.literal("query"), 
          z.literal("mutation"), 
          z.literal("subscription")
        ]),
        operationName: z.string(),
        operationBody: z.string(),
        variables: z.object({}).passthrough(),
        responseTimeMs: z.number(),
        responseStatus: z.string(),
        errors: z.array(z.string()).nullable(),
        performanceMetrics: z.object({
          parsingTimeMs: z.number(),
          validationTimeMs: z.number(),
          executionTimeMs: z.number()
        }).optional()
      })
    )
  })
})

graphql zod schema

Example:

{
  // ...
  "msg": "POST /graphql 200 OK",
  "graphql": {
    "operations": [
      {
        "operationType": "query",
        "operationName": "getUser",
        "operationBody": "query getUser($id: ID!) { user(id: $id) { id, name, email } }",
        "variables": {
          "id": "user_456"
        },
        "responseTimeMs": 125,
        "responseStatus": "success",
        "errors": null,
        "performanceMetrics": {
          "parsingTimeMs": 10,
          "validationTimeMs": 5,
          "executionTimeMs": 110
        }
      },
      {
        "operationType": "mutation",
        "operationName": "createUser",
        "operationBody": "mutation createUser($name: String!, $email: String!) { id }",
        "variables": {
          "name": "Joe",
          "email": "joe@test.com"
        },
        "responseTimeMs": 125,
        "responseStatus": "success",
        "errors": null,
        "performanceMetrics": {
          "parsingTimeMs": 10,
          "validationTimeMs": 5,
          "executionTimeMs": 110
        }
      }
    ]
  }
}

Best Practices

Correlation IDs

All apps should accept and propagate correlation IDs from all entry points to all exit points.

HTTP

  1. Inbound HTTP requests should set the MDC from the x-correlation-id HTTP header

  2. Outbound HTTP requests should set the x-correlation-id HTTP header from the MDC

Kafka

  1. Message consumers should set the MDC from the x-correlation-id Message header

  2. Message producers should set the Message x-correlation-id header from the MDC

RabbitMQ / AMQP

  1. Message consumers should set the MDC from the correlation-id Message property

  2. Message producers should set the Message correlation-id property from the MDC
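As a minimal Kotlin sketch of the Kafka case, assuming the standard Kafka client API and the correlation_id MDC key used throughout this policy:

import org.apache.kafka.clients.consumer.ConsumerRecord
import org.apache.kafka.clients.producer.ProducerRecord
import org.slf4j.MDC

// Producer side: copy the MDC value into the message header
fun <K, V> withCorrelationId(record: ProducerRecord<K, V>): ProducerRecord<K, V> {
    MDC.get("correlation_id")?.let { id ->
        record.headers().add("x-correlation-id", id.toByteArray(Charsets.UTF_8))
    }
    return record
}

// Consumer side: restore the MDC from the message header
fun setMdcFrom(record: ConsumerRecord<*, *>) {
    val header = record.headers().lastHeader("x-correlation-id")
    header?.value()?.let { MDC.put("correlation_id", String(it, Charsets.UTF_8)) }
}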

Writing a good log message

  1. Follow a consistent format: See the log message format.

  2. Be clear and concise: Ensure that your log messages are easy to understand. Use simple and clear language to describe what is happening in the system.

  3. Include Relevant Context: Provide enough context to make the log message useful for debugging. Include information such as user IDs, request IDs, or relevant state information. See logging attributes.

Some examples:

❌ Error in module X
βœ… Failed to connect to database in user authentication module. (userID=12345) (action=login)
❌ User not found
βœ… User not found during password reset request. (userID=98765) (requestID=abc123)
❌ Order failed
βœ… Order processing failed due to invalid payment method. (orderID=54321) (paymentMethod=Bitcoin)
❌ Payment error
βœ… Payment failed. (userID=12345) (orderID=98765) (errorCode=PAYMENT_DECLINED)

Don't log at the wrong layer

Log where you have the right context for the type of log event you want to create.

This is best described through an example:

πŸ˜• Bad


fun validateAddress(address: Address): Boolean {
    if (address.street.isNullOrBlank()) {
        // We don't have enough context here since validateAddress 
        // may be used in a dozen places.
        logger.info("Street can't be blank")
        return false
    }

    if (address.zipCode.length < 5) {
        // We don't have enough context here since validateAddress 
        // may be used in a dozen places.
        logger.info("Zip code should be at least 5 chars")
        return false
    }

    return true
}

fun createProperty(property: PropertyDetails) {
    val isAddressValid = validateAddress(property.address)
    if (!isAddressValid) {
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val isAddressValid = validateAddress(profile.address)
    if (!isAddressValid) {
        throw InvalidAddressError()
    }
}

😐 Better


fun validateAddress(address: Address): Boolean {
    if (address.street.isNullOrBlank()) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Street can't be blank. (street=${address.street})")
        return false
    }

    if (address.zipCode.length < 5) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Zip code should be at least 5 chars. (zipCode=${address.zipCode})")
        return false
    }

    return true
}

fun createProperty(property: PropertyDetails) {
    val isAddressValid = validateAddress(property.address)
    if (!isAddressValid) {
        logger.info("[CreateProperty] Validation failed. Invalid address ({})", property.address)
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val isAddressValid = validateAddress(profile.address)
    if (!isAddressValid) {
        logger.info("[CreateUserProfile] Validation failed. Invalid address ({})", profile.address)
        throw InvalidAddressError()
    }
}

πŸ‘Œ Best


fun validateAddress(address: Address): AddressValidation {
    if (address.street.isNullOrBlank()) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Street can't be blank. (street=${address.street})")
        return AddressValidation.INVALID_STREET
    }

    if (address.zipCode.length < 5) {
        // Only a trace log makes sense here to troubleshoot
        // validateAddress itself if needed
        logger.trace("Zip code should be at least 5 chars. (zipCode=${address.zipCode})")
        return AddressValidation.INVALID_ZIP
    }

    return AddressValidation.OK
}

fun createProperty(property: PropertyDetails) {
    val addressValidation = validateAddress(property.address)
    if (addressValidation != AddressValidation.OK) {
        logger.info("[CreateProperty] Address validation failed (type={})", addressValidation)
        throw InvalidAddressError()
    }
}

fun createUserProfile(profile: ProfileDetails) {
    val addressValidation = validateAddress(profile.address)
    if (addressValidation != AddressValidation.OK) {
        logger.info("[CreateUserProfile] Address validation failed (type={})", addressValidation)
        throw InvalidAddressError()
    }
}

Logging errors

When to log errors

  • Log all global exceptions e.g. via an UncaughtExceptionHandler on the JVM

  • Log at the layer where you are handling the error

  • Use log and rethrow with caution
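For the first point, a minimal JVM sketch using a default uncaught exception handler (the logger name is illustrative):

import org.slf4j.LoggerFactory

fun installGlobalExceptionLogger() {
    val logger = LoggerFactory.getLogger("GlobalExceptionHandler")
    Thread.setDefaultUncaughtExceptionHandler { thread, throwable ->
        // The thread is terminating, so treat this as a fatal-level event
        logger.error("Uncaught exception on thread ${thread.name}", throwable)
    }
}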

Log and rethrow


try {
  // Some code that might throw an exception
  throw KaboomError("Simulated error")
} catch (e: RuntimeException) {
  // Transforming and rethrowing the exception
  val transformedException = BangError("Kaboom", e)
  logger.error("Kaboom: ${e.message}", e)
  throw transformedException
}


Why you should avoid

  • Duplicate logging

  • Loss of context if not done properly

  • Performance/cost (if on a hot path)


Why you may want it

  • If you need to log the error at different log levels

  • If you need to make sure you add contextual info at a lower level

  • If you want to transform an exception into a more meaningful one that fits the abstraction of the current layer.

What to log as an error

  • Unhandled exceptions

  • Unexpected Errors

    • Database Errors

    • Failed calls to external APIs or microservices.

    • Critical business operations that fail, such as order processing or payment transactions.

  • Resource Failures

    • Insufficient resources (e.g., memory, disk space).

    • Failures to read or write critical files.

    • Network connectivity issues affecting application performance.

  • HTTP 500s should also be logged as errors

How to log errors with slf4j


logger.error("Failed to do something: ${e.message}", e)

How to log errors with pino


logger.error({ err }, `Failed to do something: ${err.message}`)

Data Privacy & Log Redaction

When it comes to PII (SPII), we can generally split data into two groups.

  • Known: Where you know exactly the type and origin. E.g. User.email

  • Unknown: Where you know it may be personal information but don't know the exact type or key. E.g. metadata or request.headers

Out of the box, our logs are redacted on Datadog: the following types of personal or sensitive information are detected and redacted automatically:

  • Phone number

  • Email

  • Address

  • DoB

  • Passport/ID number

  • Social Security Number

  • Card numbers

  • JSON web tokens

  • API keys (AWS, Slack, etc)

  • Access tokens and secrets

IP and MAC addresses are deliberately not redacted, for defensive security purposes.

With that in mind, all applications should adhere to the following best practices:

  • Don't include non-useful data in your logs (just because it's there).

  • Mask known data when you want complete control over how it is redacted from within the application, e.g. mask 60% of the email address but leave the domain intact (see the sketch after this list).

  • Leave unknown data to be redacted by Datadog's security scanner.

  • Use RBAC for logs to limit access to known PII that you must include in your logs for effective troubleshooting/security.
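As referenced above, a minimal Kotlin sketch of in-application email masking (the maskEmail helper and the 60/40 split are our own illustration, not a library function):

fun maskEmail(email: String): String {
    val at = email.indexOf('@')
    if (at <= 0) return "***" // not a well-formed address; mask everything
    val local = email.substring(0, at)
    val keep = maxOf(1, (local.length * 0.4).toInt()) // keep ~40%, mask ~60%
    return local.take(keep) + "*".repeat(local.length - keep) + email.substring(at)
}

For example, maskEmail("john.doe@foo.bar") returns "joh*****@foo.bar", leaving the domain intact.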

Loggers & Reference Implementations